Abstract
Uncertain data streams are generated in many Web applications. The uncertainty in these streams makes anomaly detection from sensor data far more challenging. In this article, we present a novel framework that supports anomaly detection in uncertain data streams. The proposed framework adopts the wavelet soft-thresholding method to remove noise and errors from the data streams. Based on the refined data streams, we develop effective period pattern recognition and feature extraction techniques to improve computational efficiency. We then use classification methods to detect anomalies in the corrected data stream. Experiments on several real datasets show that the proposed approach achieves high anomaly detection accuracy.
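As a rough illustration of the wavelet soft-thresholding step mentioned above, the following minimal sketch denoises a signal in pure Python. The abstract does not specify a wavelet family, decomposition depth, or threshold rule, so the Haar wavelet, the `levels` parameter, and the universal threshold with a median-based noise estimate are all assumptions made here for illustration, and the function names are hypothetical rather than taken from the article.

```python
import math
from statistics import median


def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values within [-lam, lam] become 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def haar_forward(signal):
    """One level of the orthonormal Haar transform (even-length input)."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert one level of the orthonormal Haar transform."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out


def denoise(signal, levels=2):
    """Haar wavelet denoising with soft-thresholded detail coefficients.

    Assumptions (not from the article): Haar wavelet, universal threshold
    sigma * sqrt(2 ln n), sigma estimated from the finest detail level.
    """
    n = len(signal)
    if levels < 1 or n % (2 ** levels) != 0:
        raise ValueError("levels must be >= 1 and divide the length as 2**levels")
    approx = list(signal)
    details = []  # details[0] holds the finest-scale coefficients
    for _ in range(levels):
        approx, detail = haar_forward(approx)
        details.append(detail)
    # Noise scale from the median absolute value of the finest details.
    sigma = median(abs(d) for d in details[0]) / 0.6745
    lam = sigma * math.sqrt(2.0 * math.log(n))
    # Reconstruct from the coarsest level back to the finest.
    for detail in reversed(details):
        shrunk = [soft_threshold(d, lam) for d in detail]
        approx = haar_inverse(approx, shrunk)
    return approx
```

Calling `denoise(samples, levels=2)` on a list whose length is a multiple of 4 returns a list of the same length. In a streaming setting this would presumably be applied window by window, which is again an assumption rather than something the abstract states.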
Index Terms
Supervised Anomaly Detection in Uncertain Pseudoperiodic Data Streams