research-article

CheckCell: data debugging for spreadsheets

Published: 15 October 2014

Abstract

Testing and static analysis can help root out bugs in programs, but not in data. This paper introduces data debugging, an approach that combines program analysis and statistical analysis to automatically find potential data errors. Since it is impossible to know a priori whether data is erroneous, data debugging instead locates data that has a disproportionate impact on the computation. Such data is either very important or wrong. Data debugging is especially useful in the context of data-intensive programming environments that intertwine data with programs in the form of queries or formulas.

We present the first data debugging tool, CheckCell, an add-in for Microsoft Excel. CheckCell identifies cells that have an unusually high impact on the spreadsheet's computations. We show that CheckCell is both analytically and empirically fast and effective. We show that it successfully finds injected typographical errors produced by a generative model trained with data entry from 169,112 Mechanical Turk tasks. CheckCell is more precise and efficient than standard outlier detection techniques. CheckCell also automatically identifies a key flaw in the infamous Reinhart and Rogoff spreadsheet.
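The core idea in the abstract, flagging inputs that have a disproportionate impact on a computation, can be illustrated with a small sketch. This is not CheckCell's algorithm, which the abstract does not spell out. It is a simplified leave-one-out stand-in: each input is swapped for the median of the other inputs, and the resulting change in the formula's output is taken as that input's impact. The function names `impact_scores` and `flag_high_impact` and the `ratio` threshold are hypothetical choices made for illustration.

```python
from statistics import mean, median


def impact_scores(values, formula):
    """For each input, measure how far the formula output moves when that
    input is replaced by the median of the remaining inputs.
    (Illustrative stand-in, not the method used by CheckCell.)"""
    baseline = formula(values)
    scores = []
    for i in range(len(values)):
        rest = values[:i] + values[i + 1:]
        substituted = values[:i] + [median(rest)] + values[i + 1:]
        scores.append(abs(formula(substituted) - baseline))
    return scores


def flag_high_impact(values, formula, ratio=5.0):
    """Return indices whose impact exceeds `ratio` times the median impact.
    The threshold `ratio` is an assumed, arbitrary cutoff."""
    scores = impact_scores(values, formula)
    typical = median(scores)
    return [i for i, s in enumerate(scores) if s > ratio * typical and s > 0]


# A column of readings where 13.0 was mistyped as 130.0 (index 3),
# feeding a spreadsheet-style AVERAGE formula.
cells = [10.0, 12.0, 11.0, 130.0, 9.0, 13.0]
suspects = flag_high_impact(cells, mean)
```

As the abstract notes, a flagged value is either very important or wrong, so a sketch like this can only surface candidates for a person to review; it cannot decide which case applies.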


Published in

ACM SIGPLAN Notices, Volume 49, Issue 10
OOPSLA '14
October 2014
907 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2714064
Editor: Andy Gill

OOPSLA '14: Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications
October 2014
946 pages
ISBN: 9781450325851
DOI: 10.1145/2660193

Copyright © 2014 ACM

Publisher

Association for Computing Machinery
New York, NY, United States
