Data perturbation for outlier detection ensembles | Proceedings of the 26th International Conference on Scientific and Statistical Database Management
Article No.: 13, Pages 1 - 12
Abstract
Outlier detection and ensemble learning are well-established research directions in data mining, yet the application of ensemble techniques to outlier detection has rarely been studied. Building an ensemble requires learning diverse models and combining them in an appropriate way. We propose data perturbation as a new technique to induce diversity in individual outlier detectors, as well as a rank accumulation method for combining the individual outlier rankings into an outlier detection ensemble. In an extensive evaluation, we study the impact, potential, and shortcomings of this new approach for outlier detection ensembles. We show that this ensemble can significantly improve over weakly performing base methods.
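The general idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the noise model, the base detector, or the exact rank accumulation rule, so everything concrete here is an assumption made for illustration only: Gaussian noise scaled per attribute, a k-nearest-neighbor distance score as the base detector, and a plain sum of per-member ranks as the combination. The function names and parameters (`knn_distance_scores`, `ranks_from_scores`, `perturbation_ensemble`, `noise_scale`) are hypothetical and are not taken from the paper.

```python
import numpy as np


def knn_distance_scores(X, k=5):
    """Assumed base detector: distance to the k-th nearest neighbor."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the zero self-distance,
    # so column k holds the distance to the k-th nearest neighbor.
    return np.sort(dists, axis=1)[:, k]


def ranks_from_scores(scores):
    """Convert scores to ranks; rank 0 is the most outlying object."""
    order = np.argsort(-scores, kind="stable")
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(len(scores))
    return ranks


def perturbation_ensemble(X, n_members=20, noise_scale=0.1, k=5, seed=0):
    """Sum the ranks produced by the base detector on noisy copies of X.

    Assumption: noise is Gaussian with a standard deviation equal to
    noise_scale times the per-attribute standard deviation of X.
    A lower accumulated rank means a more outlying object.
    """
    rng = np.random.default_rng(seed)
    sigma = noise_scale * X.std(axis=0)
    total = np.zeros(len(X))
    for _ in range(n_members):
        # Each ensemble member sees a differently perturbed copy of the data.
        X_noisy = X + rng.normal(0.0, 1.0, size=X.shape) * sigma
        total += ranks_from_scores(knn_distance_scores(X_noisy, k))
    return total
```

Calling `perturbation_ensemble(X)` on an `(n, d)` array returns one accumulated rank per object, and sorting the objects by that value in ascending order gives the ensemble ranking. Summing ranks rather than raw scores avoids having to normalize scores across members, which is one reason a rank-based combination is a natural choice in a sketch like this.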
Published In
SSDBM '14: Proceedings of the 26th International Conference on Scientific and Statistical Database Management
June 2014
417 pages
Copyright © 2014 ACM.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
Published: 30 June 2014
Qualifiers
- Research-article
Conference
SSDBM '14
Acceptance Rates
SSDBM '14 Paper Acceptance Rate 26 of 71 submissions, 37%;
Overall Acceptance Rate 56 of 146 submissions, 38%