Abstract
Testing and static analysis can help root out bugs in programs, but not in data. This paper introduces data debugging, an approach that combines program analysis and statistical analysis to automatically find potential data errors. Since it is impossible to know a priori whether data are erroneous, data debugging instead locates data that have a disproportionate impact on the computation. Such data are either very important or wrong. Data debugging is especially useful in data-intensive programming environments that intertwine data with programs in the form of queries or formulas.
We present the first data debugging tool, CheckCell, an add-in for Microsoft Excel. CheckCell identifies cells that have an unusually high impact on the spreadsheet's computations. We show, both analytically and empirically, that CheckCell is fast and effective. It successfully finds injected typographical errors produced by a generative model trained on data entry from 169,112 Mechanical Turk tasks. CheckCell is more precise and efficient than standard outlier detection techniques. CheckCell also automatically identifies a key flaw in the infamous Reinhart and Rogoff spreadsheet.
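The idea of ranking inputs by their impact on a computation can be illustrated with a minimal sketch. This is not CheckCell's algorithm, which the abstract does not spell out; it is a simple substitution-based stand-in. The function names `impact_scores` and `flag_high_impact`, the `factor` threshold, and the sample data are all hypothetical and chosen only for illustration.

```python
from statistics import mean, median


def impact_scores(values, formula):
    """Score each input by how far the formula's output moves when that
    input is replaced with the median of the remaining inputs.
    (Illustrative assumption, not the method used by CheckCell.)"""
    baseline = formula(values)
    scores = []
    for i in range(len(values)):
        others = values[:i] + values[i + 1:]
        # Substitute a "typical" value for input i and recompute.
        trial = values[:i] + [median(others)] + values[i + 1:]
        scores.append(abs(formula(trial) - baseline))
    return scores


def flag_high_impact(values, formula, factor=5.0):
    """Return indices whose impact exceeds `factor` times the median impact.
    The threshold rule is a hypothetical choice for this sketch."""
    scores = impact_scores(values, formula)
    typical = median(scores)
    return [i for i, s in enumerate(scores) if s > 0 and s > factor * typical]


# Hypothetical column of cells feeding an AVERAGE-style formula;
# 140.0 stands in for a mistyped 14.0.
cells = [12.0, 15.0, 11.0, 140.0, 13.0, 14.0]
flagged = flag_high_impact(cells, mean)
print(flagged)
```

A flagged cell is only a candidate for review: as the abstract notes, a high-impact value may be an error or may simply be important, so the final judgment stays with the user.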
CheckCell: data debugging for spreadsheets
OOPSLA '14: Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications