Abstract
We propose Predictable Outliers in Data-trendS (PODS), a method that, given a collection of temporal datasets, derives data-driven explanations for outliers by identifying meaningful relationships between them. First, we formalize the notion of meaningfulness, which so far has been framed only informally in terms of explainability. Next, since outliers are rare and it is therefore difficult to determine whether their relationships are meaningful, we develop a new criterion that checks whether these relationships could have been predicted from non-outliers, i.e., whether we could have seen the outlier relationships coming. Finally, because searching for meaningful outlier relationships between every pair of datasets in a large data collection is computationally infeasible, we propose an indexing strategy that prunes irrelevant comparisons across datasets, making the approach scalable. We present an experimental evaluation using real datasets and different baselines, which demonstrates the effectiveness, robustness, and scalability of our approach.
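To make the predictability idea concrete, the following is a minimal illustrative sketch and not the PODS criterion itself, which the abstract does not specify. It assumes z-score outlier flagging, a simple least-squares line fitted on non-outlier time steps, and a residual tolerance; the function names (`zscore_outliers`, `fit_line`, `outliers_predictable`) and both thresholds are hypothetical choices made for illustration.

```python
from statistics import mean, stdev

def zscore_outliers(series, threshold=3.0):
    """Indices whose z-score exceeds the threshold (an assumed detector)."""
    mu, sigma = mean(series), stdev(series)
    if sigma == 0:
        return set()
    return {i for i, v in enumerate(series) if abs(v - mu) / sigma > threshold}

def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def outliers_predictable(x, y, threshold=3.0, tolerance=3.0):
    """True if co-occurring outliers follow the trend learned from non-outliers.

    Assumption: 'predictable' means every joint outlier lies within
    `tolerance` residual standard deviations of the non-outlier fit.
    """
    out_x, out_y = zscore_outliers(x, threshold), zscore_outliers(y, threshold)
    joint = out_x & out_y
    if not joint:
        return False
    inliers = [i for i in range(len(x)) if i not in out_x | out_y]
    slope, intercept = fit_line([x[i] for i in inliers], [y[i] for i in inliers])
    residuals = [y[i] - (slope * x[i] + intercept) for i in inliers]
    spread = stdev(residuals)
    return all(abs(y[i] - (slope * x[i] + intercept)) <= tolerance * spread
               for i in joint)

# Synthetic aligned series: y is roughly 2x + 1, with one extreme step at index 12.
x = [float(i % 5) for i in range(30)]
x[12] = 40.0
y = [2 * v + 1 + 0.1 * ((i % 3) - 1) for i, v in enumerate(x)]
y_off_trend = list(y)
y_off_trend[12] = -60.0  # still an outlier, but it breaks the non-outlier trend

print(outliers_predictable(x, y))
print(outliers_predictable(x, y_off_trend))
```

In this toy setting, the first pair has a joint outlier that extends the trend seen in ordinary time steps, while the second has a joint outlier that contradicts it. A real system would need a more careful outlier detector and model than the ones assumed here.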
Effective Discovery of Meaningful Outlier Relationships