Abstract
Regression formulas are a domain-specific language adopted by several R packages for describing an important and useful class of statistical models: hierarchical linear regressions. Formulas are succinct, expressive, and clearly popular, so are they a useful addition to probabilistic programming languages? And what do they mean? We propose a core calculus of hierarchical linear regression, in which regression coefficients are themselves defined by nested regressions (unlike in R). We explain how our calculus captures the essence of the formula DSL found in R. We describe the design and implementation of Fabular, a version of the Tabular schema-driven probabilistic programming language, enriched with formulas based on our regression calculus. To the best of our knowledge, this is the first formal description of the core ideas of R's formula notation, the first development of a calculus of regression formulas, and the first demonstration of the benefits of composing regression formulas and latent variables in a probabilistic programming language.
- D. Bates, M. Mächler, B. Bolker, and S. Walker. Fitting Linear Mixed-Effects Models using lme4. ArXiv, 2014. arXiv:1406.5823 {stat.CO}. S. Bhat, J. Borgström, A. D. Gordon, and C. V. Russo. Deriving probability density functions from probabilistic functional programs. In N. Peterman and S. Smolka, editors, Tools and Algorithms for the Construction and Analysis of Systems (TACAS’13), volume 7795 of Lecture Notes in Computer Science, pages 508–522. Springer, 2013. Google Scholar
Digital Library
- J. Borgström, A. D. Gordon, M. Greenberg, J. Margetson, and J. V. Gael. Measure transformer semantics for Bayesian machine learning. Logical Methods in Computer Science, 9(3), 2013. Preliminary version at ESOP’11. J. Borgström, A. D. Gordon, L. Ouyang, C. Russo, A. Ścibior, and M. Szymczak. Fabular: Regression formulas as probabilistic programming. Technical Report MSR–TR–2015–83, Microsoft Research, 2015.Google Scholar
- V. Dorie. Mixed Methods for Mixed Models. PhD thesis, Columbia University, 2014.Google Scholar
- A. Gelman and J. Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2007.Google Scholar
- W. R. Gilks, A. Thomas, and D. J. Spiegelhalter. A language and program for complex Bayesian modelling. The Statistician, 43:169–178, 1994.Google Scholar
Cross Ref
- N. Goodman, V. K. Mansinghka, D. M. Roy, K. Bonawitz, and J. B. Tenenbaum. Church: a language for generative models. In Uncertainty in Artificial Intelligence (UAI’08), pages 220–229. AUAI Press, 2008.Google Scholar
- N. D. Goodman. The principles and practice of probabilistic programming. In Principles of Programming Languages (POPL’13), pages 399–402, 2013. Google Scholar
Digital Library
- A. D. Gordon, M. Aizatulin, J. Borgström, G. Claret, T. Graepel, A. Nori, S. Rajamani, and C. Russo. A model-learner pattern for Bayesian reasoning. In Principles of Programming Languages (POPL’13), 2013. Google Scholar
Digital Library
- A. D. Gordon, T. Graepel, N. Rolland, C. V. Russo, J. Borgström, and J. Guiver. Tabular: a schema-driven probabilistic programming language. In Principles of Programming Languages (POPL’14), 2014a. A. D. Gordon, T. A. Henzinger, A. V. Nori, and S. K. Rajamani. Probabilistic programming. In Future of Software Engineering (FOSE 2014), pages 167–181, 2014b. A. D. Gordon, C. V. Russo, M. Szymczak, J. Borgström, N. Rolland, T. Graepel, and D. Tarlow. Probabilistic programs as spreadsheet queries. In J. Vitek, editor, Programming Languages and Systems (ESOP 2015), volume 9032 of Lecture Notes in Computer Science, pages 1–25. Springer, 2015. Google Scholar
Digital Library
- R. Hahn. Statistical formula notation in R. URL http: //faculty.chicagobooth.edu/richard.hahn/teaching/ FormulaNotation.pdf. O. Kiselyov and C. Shan. Embedded probabilistic programming. In Conference on Domain-Specific Languages, volume 5658 of Lecture Notes in Computer Science, pages 360–384. Springer, 2009. Google Scholar
Digital Library
- D. Lunn, C. Jackson, N. Best, A. Thomas, and D. Spiegelhalter. The BUGS Book. CRC Press, 2013.Google Scholar
- V. Mansinghka, D. Selsam, and Y. Perov. Venture: a higher-order probabilistic programming platform with programmable inference. CoRR, 2014. arXiv:1404.0099v1 {cs.AI}. B. Milch, B. Marthi, S. J. Russell, D. Sontag, D. L. Ong, and A. Kolobov. Statistical Relational Learning, chapter BLOG: Probabilistic Models with Unknown Objects. MIT Press, 2007.Google Scholar
- T. Minka, J. Winn, J. Guiver, and A. Kannan. Infer.NET 2.3, Nov. 2009. Software available from http://research.microsoft.com/ infernet. T. P. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology, 2001. Google Scholar
Digital Library
- F. Morandat, B. Hill, L. Osvald, and J. Vitek. Evaluating the design of the R language - objects and functions for data analysis. In J. Noble, editor, ECOOP 2012 - Object-Oriented Programming, volume 7313 of Lecture Notes in Computer Science, pages 104–131. Springer, 2012. Google Scholar
Digital Library
- A. V. Nori, C.-K. Hur, S. K. Rajamani, and S. Samuel. R2: An efficient MCMC sampler for probabilistic programs. In Conference on Artificial Intelligence. AAAI, July 2014.Google Scholar
- B. Paige and F. Wood. A compilation target for probabilistic programming languages. In ICML, 2014.Google Scholar
- A. Pfeffer. Figaro: An object-oriented probabilistic programming language. Technical report, Charles River Analytics, 2009.Google Scholar
- R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2015. URL http://www.R-project.org/. S. R. Riedel, S. Singh, V. Srikumar, T. Rocktäschel, L. Visengeriyeva, and J. Noessner. WOLFE: strength reduction and approximate programming for probabilistic programming. In Statistical Relational Artificial Intelligence (StarAI 2014), volume WS-14-13 of AAAI Technical Report. The AAAI Press, 2014.Google Scholar
- Stan Development Team. Stan: A C++ library for probability and sampling, version 2.2, 2014a. URL http://mc-stan.org/. Stan Development Team. RStan: the R interface to Stan, version 2.5.0, 2014b. URL http://mc-stan.org/rstan.html. D. H. Stern, R. Herbrich, and T. Graepel. Matchbox: large scale online Bayesian recommendations. In J. Quemada, G. León, Y. S. Maarek, and W. Nejdl, editors, Proceedings of the 18th International Conference on World Wide Web (WWW 2009), pages 111–120. ACM, 2009. Google Scholar
Digital Library
- S. E. Whaley, M. Sigman, C. Neumann, N. Bwibo, D. Guthrie, R. E. Weiss, S. Alber, and S. P. Murphy. The impact of dietary intervention on the cognitive development of Kenyan school children. The Journal of Nutrition, 133(11):3965S–3971S, 2003.Google Scholar
Cross Ref
- F. Wood, J. W. van de Meent, and V. Mansinghka. A new approach to probabilistic programming inference. In Proceedings of the 17th International conference on Artificial Intelligence and Statistics, volume 33 of JMLR Workshop and Conference Proceedings, 2014.Google Scholar
- arXiv:1403.0504v2 {cs.AI}.Google Scholar
Index Terms
Fabular: regression formulas as probabilistic programming
Recommendations
Fabular: regression formulas as probabilistic programming
POPL '16: Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming LanguagesRegression formulas are a domain-specific language adopted by several R packages for describing an important and useful class of statistical models: hierarchical linear regressions. Formulas are succinct, expressive, and clearly popular, so are they a ...
An Efficient Algorithm for the MSAE and the MMAE Regression Problems
In the past quarter century, the minimum sum of absolute errors (MSAE) regression and the minimization of the maximum absolute error (MMAE) regression have attracted much attention as alternatives to the popular least squares regression. For the ...
Obtaining a threshold for the stewart index and its extension to ridge regression
AbstractThe linear regression model is widely applied to measure the relationship between a dependent variable and a set of independent variables. When the independent variables are related to each other, it is said that the model presents collinearity. ...






Comments