Abstract
Online experiments have become a ubiquitous aspect of design and engineering processes within Internet firms. As the scale of experiments has grown, so has the complexity of their design and implementation. In response, firms have developed software frameworks for designing and deploying online experiments. Ensuring that experiments in these frameworks are correctly designed and that their results are trustworthy---referred to as internal validity---can be difficult. Currently, verifying internal validity requires manual inspection by someone with substantial expertise in experimental design.
We present the first approach for statically checking the internal validity of online experiments. Our checks are based on well-known problems that arise in experimental design and causal inference. Our analyses target PlanOut, a widely deployed, open-source experimentation framework that uses a domain-specific language to specify and run complex experiments. We have built a tool called PlanAlyzer that checks PlanOut programs for a variety of threats to internal validity, including failures of randomization, treatment assignment, and causal sufficiency. PlanAlyzer uses its analyses to automatically generate contrasts, a key type of information required to perform valid statistical analyses over the results of these experiments. We demonstrate PlanAlyzer's utility on a corpus of PlanOut scripts deployed in production at Facebook, and we evaluate its ability to identify threats to validity on a mutated subset of this corpus. PlanAlyzer has both precision and recall of 92% on the mutated corpus, and 82% of the contrasts it generates match hand-specified data.
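PlanOut-style frameworks assign treatments deterministically, by hashing a unit identifier (e.g., a user id) together with a per-parameter salt. The sketch below is plain Python, not the PlanOut API (the function name, salts, and parameter values are illustrative), showing the assignment mechanism and one randomization failure of the kind PlanAlyzer's checks target: reusing a salt across two parameters, which makes their assignments perfectly correlated.

```python
import hashlib

def assign(unit, salt, choices):
    # Deterministic hash-based assignment, in the style of
    # experimentation frameworks such as PlanOut: hashing the
    # (salt, unit) pair makes assignment reproducible without
    # storing per-user state.
    digest = hashlib.sha1(f"{salt}.{unit}".encode()).hexdigest()
    return choices[int(digest, 16) % len(choices)]

# Assignment is stable: the same unit always receives the same arm.
arm = assign(42, "button_color", ["red", "blue"])

# A randomization failure a static check can flag: reusing one salt
# for two parameters correlates their assignments perfectly, so the
# two "factors" are not independently randomized and contrasts
# between their combinations are invalid.
correlated = all(
    (assign(u, "shared", ["red", "blue"]) == "red")
    == (assign(u, "shared", ["buy", "subscribe"]) == "buy")
    for u in range(1000)
)
```

Because both parameters here reduce the same hash modulo two, every unit that sees `"red"` also sees `"buy"`: the red/buy vs. blue/subscribe contrast is estimable, but the red/subscribe cell is never populated, which is exactly the sort of design defect that is invisible at runtime and detectable statically.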
PlanAlyzer: assessing threats to the validity of online experiments