ABSTRACT

Datasets that are identical across a number of statistical properties, yet produce dissimilar graphs, are frequently used to illustrate the importance of graphical representations when exploring data. This paper presents a novel method for generating such datasets, along with several examples. Our technique differs from previous approaches in that new datasets are iteratively generated from a seed dataset through random perturbations of individual data points, and can be directed towards a desired outcome through a simulated annealing optimization strategy. Our method has the benefit of being agnostic to the particular statistical properties that are to remain constant between the datasets, and allows for control over the graphical appearance of the resulting output.
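
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the choice of preserved statistics (means, standard deviations, Pearson correlation), the two-decimal rounding used to decide that statistics are "identical", the circular target shape, the linear cooling schedule, and all function names (`summary`, `circle_distance`, `perturb`, `anneal`) are hypothetical choices made for the example.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def summary(points, decimals=2):
    """Statistics to hold constant, rounded (assumption: two decimals)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    stats = (statistics.mean(xs), statistics.mean(ys),
             statistics.pstdev(xs), statistics.pstdev(ys), pearson(xs, ys))
    return tuple(round(s, decimals) for s in stats)


def circle_distance(point, center=(0.0, 0.0), radius=2.0):
    """Hypothetical target shape: distance from a point to a circle."""
    return abs(math.hypot(point[0] - center[0], point[1] - center[1]) - radius)


def perturb(points, target, fitness, temperature, rng, step=0.1, max_tries=100):
    """Move one random point slightly; keep the move only if the rounded
    statistics still match and the move is either an improvement or is
    accepted by the annealing temperature."""
    for _ in range(max_tries):
        i = rng.randrange(len(points))
        old = points[i]
        new = (old[0] + rng.gauss(0, step), old[1] + rng.gauss(0, step))
        if fitness(new) < fitness(old) or rng.random() < temperature:
            trial = points[:i] + [new] + points[i + 1:]
            if summary(trial) == target:
                return trial
    return points  # no acceptable move found; leave the dataset unchanged


def anneal(seed_points, fitness, iterations=2000,
           t_start=0.4, t_end=0.01, rng_seed=0):
    """Iteratively perturb a seed dataset while cooling the temperature
    linearly (assumed schedule) from t_start to t_end."""
    rng = random.Random(rng_seed)
    target = summary(seed_points)
    points = list(seed_points)
    for k in range(iterations):
        temperature = t_start + (t_end - t_start) * k / max(iterations - 1, 1)
        points = perturb(points, target, fitness, temperature, rng)
    return points
```

Because `summary` is passed nowhere else and is only compared for equality, swapping in a different set of statistics requires changing that one function, which reflects the abstract's point that the method is agnostic to the properties being preserved. Likewise, a different `fitness` function steers the points towards a different graphical appearance.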