Abstract
Spreadsheets are among the most widely used programming environments and are commonly deployed in domains such as finance, where errors can have catastrophic consequences. We present a static analysis designed specifically to find spreadsheet formula errors. The analysis exploits the rectangular character of spreadsheets: it uses an information-theoretic approach to identify formulas that are especially surprising disruptions to nearby rectangular regions. We also present ExceLint, an implementation of this analysis for Microsoft Excel. We demonstrate that ExceLint is fast and effective: across a corpus of 70 spreadsheets, it takes a median of 8 seconds per spreadsheet and significantly outperforms the state-of-the-art analysis.
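To give a rough feel for what "surprising disruption" can mean in information-theoretic terms, the following is a minimal sketch and not ExceLint's actual algorithm. It assumes, purely for illustration, that each formula in a column has already been reduced to a relative-reference pattern string (R1C1-style), and it flags the cell whose replacement by a neighbor's pattern lowers the Shannon entropy of the column the most. The function names `entropy` and `most_surprising` and the sample column are hypothetical.

```python
import math
from collections import Counter

def entropy(patterns):
    """Shannon entropy (in bits) of the distribution of formula patterns."""
    counts = Counter(patterns)
    total = len(patterns)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def most_surprising(patterns):
    """Return (index, entropy_reduction, suggested_pattern) for the single
    cell whose replacement by an adjacent cell's pattern reduces the entropy
    of the column the most, or None if all cells already agree."""
    base = entropy(patterns)
    best = None
    for i, p in enumerate(patterns):
        for j in (i - 1, i + 1):  # only consider the immediate neighbors
            if 0 <= j < len(patterns) and patterns[j] != p:
                trial = patterns[:i] + [patterns[j]] + patterns[i + 1:]
                gain = base - entropy(trial)
                if best is None or gain > best[1]:
                    best = (i, gain, patterns[j])
    return best

# Hypothetical column: every cell sums the two cells to its left,
# except one cell that references only a single cell.
column = ["RC[-2]+RC[-1]"] * 5 + ["RC[-2]"] + ["RC[-2]+RC[-1]"] * 4
result = most_surprising(column)
print(result)
```

In this toy column the odd cell at index 5 is the one reported, because making it match its neighbors collapses the column to a single pattern. A real analysis would work over two-dimensional regions and a richer notion of formula similarity; this sketch only illustrates the entropy-reduction intuition.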
ExceLint: automatically finding spreadsheet formula errors