research-article
Open Access

Effective Discovery of Meaningful Outlier Relationships

Published: 12 June 2020
Abstract

We propose Predictable Outliers in Data-trendS (PODS), a method that, given a collection of temporal datasets, derives data-driven explanations for outliers by identifying meaningful relationships between them. First, we formalize the notion of meaningfulness, which so far has been informally framed in terms of explainability. Next, since outliers are rare and it is difficult to determine whether their relationships are meaningful, we develop a new criterion that does so by checking if these relationships could have been predicted from non-outliers, i.e., whether we could see the outlier relationships coming. Finally, searching for meaningful outlier relationships between every pair of datasets in a large data collection is computationally infeasible. To address that, we propose an indexing strategy that prunes irrelevant comparisons across datasets, making the approach scalable. We present the results of an experimental evaluation using real datasets and different baselines, which demonstrates the effectiveness, robustness, and scalability of our approach.
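The abstract does not spell out the criterion itself, so the following is only a minimal sketch of the general idea of "seeing an outlier relationship coming", not the PODS algorithm. It assumes two aligned numeric series, a modified z-score outlier detector with a cutoff of 3.5, a straight-line model fitted on non-outlier points only, and a tolerance of three residual standard deviations. All function names and thresholds here are hypothetical choices made for illustration.

```python
from statistics import mean, median


def mad_outliers(values, threshold=3.5):
    """Flag points whose modified z-score exceeds `threshold` (assumed cutoff)."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are constant; cannot fit a line")
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def predictable_outlier_fraction(xs, ys, tolerance=3.0):
    """Fraction of joint outliers that a model fitted on non-outliers predicts.

    Returns None when there are too few non-outliers or no joint outliers.
    """
    out_x, out_y = mad_outliers(xs), mad_outliers(ys)
    rows = list(zip(xs, ys, out_x, out_y))
    inliers = [(x, y) for x, y, ox, oy in rows if not (ox or oy)]
    joint = [(x, y) for x, y, ox, oy in rows if ox and oy]
    if len(inliers) < 3 or not joint:
        return None

    in_x, in_y = zip(*inliers)
    slope, intercept = fit_line(in_x, in_y)

    # Residual spread of the non-outlier fit (two fitted parameters).
    residuals = [y - (intercept + slope * x) for x, y in inliers]
    spread = (sum(r * r for r in residuals) / (len(residuals) - 2)) ** 0.5

    # A joint outlier counts as "predictable" if it lies close to the
    # line extrapolated from the non-outlier points.
    hits = sum(
        abs(y - (intercept + slope * x)) <= tolerance * spread for x, y in joint
    )
    return hits / len(joint)
```

Under these assumptions, a fraction close to 1 suggests the outlier relationship follows the trend learned from ordinary points, while a fraction close to 0 suggests the co-occurrence is not explained by that trend. A linear model is used here only because it keeps the sketch short.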

Published in

ACM/IMS Transactions on Data Science, Volume 1, Issue 2
May 2020
169 pages
ISSN: 2691-1922
DOI: 10.1145/3403596

Copyright © 2020 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 12 June 2020
• Online AM: 7 May 2020
• Accepted: 1 February 2020
• Revised: 1 December 2019
• Received: 1 October 2019

Qualifiers

• research-article
• Research
• Refereed
