Research Article · Open Access

PlanAlyzer: assessing threats to the validity of online experiments

Published: 10 October 2019

Abstract

Online experiments have become a ubiquitous aspect of design and engineering processes within Internet firms. As the scale of experiments has grown, so has the complexity of their design and implementation. In response, firms have developed software frameworks for designing and deploying online experiments. Ensuring that experiments in these frameworks are correctly designed and that their results are trustworthy---referred to as internal validity---can be difficult. Currently, verifying internal validity requires manual inspection by someone with substantial expertise in experimental design.

We present the first approach for statically checking the internal validity of online experiments. Our checks are based on well-known problems that arise in experimental design and causal inference. Our analyses target PlanOut, a widely deployed, open-source experimentation framework that uses a domain-specific language to specify and run complex experiments. We have built a tool called PlanAlyzer that checks PlanOut programs for a variety of threats to internal validity, including failures of randomization, treatment assignment, and causal sufficiency. PlanAlyzer uses its analyses to automatically generate contrasts, a key type of information required to perform valid statistical analyses over the results of these experiments. We demonstrate PlanAlyzer's utility on a corpus of PlanOut scripts deployed in production at Facebook, and we evaluate its ability to identify threats to validity on a mutated subset of this corpus. PlanAlyzer has both precision and recall of 92% on the mutated corpus, and 82% of the contrasts it generates match hand-specified data.
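The randomization failures PlanAlyzer checks for arise because frameworks like PlanOut assign treatments deterministically, by hashing an experimental unit (such as a user id) together with a salt. The sketch below illustrates that idea in Python; it is a minimal illustration of hash-based assignment, not PlanOut's or PlanAlyzer's actual code, and the function name `uniform_choice` and its arguments are chosen here for exposition.

```python
import hashlib

def uniform_choice(unit, salt, choices):
    """Deterministically map an experimental unit (e.g. a user id) to a
    treatment arm, in the style of hash-based random operators.

    Hashing the (salt, unit) pair means the same unit always receives the
    same assignment, while distinct salts yield independent randomizations;
    reusing a salt across parameters is one way randomization can fail."""
    digest = hashlib.sha1(f"{salt}.{unit}".encode()).hexdigest()
    index = int(digest[:15], 16) % len(choices)
    return choices[index]

# Same unit and salt -> stable assignment, a property internal validity relies on.
assert uniform_choice(42, "button_color", ["blue", "green"]) == \
       uniform_choice(42, "button_color", ["blue", "green"])
```

A static analysis in the spirit of PlanAlyzer can then reason about which program variables flow from such operators, and flag assignments whose value depends on non-random data.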


Supplemental Material

a182-tosch: Presentation at OOPSLA '19



Published in

Proceedings of the ACM on Programming Languages, Volume 3, Issue OOPSLA (October 2019), 2077 pages.
EISSN: 2475-1421
DOI: 10.1145/3366395
Copyright © 2019 Owner/Author.
Publisher: Association for Computing Machinery, New York, NY, United States.

