ABSTRACT

Datasets that are identical across a number of statistical properties, yet produce dissimilar graphs, are frequently used to illustrate the importance of graphical representations when exploring data. This paper presents a novel method for generating such datasets, along with several examples. Our technique differs from previous approaches in that new datasets are iteratively generated from a seed dataset through random perturbations of individual data points, and can be directed towards a desired outcome through a simulated annealing optimization strategy. Our method has the benefit of being agnostic to the particular statistical properties that are to remain constant between the datasets, and allows for control over the graphical appearance of the resulting output.
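
The iterative procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation. The choice of statistics (means, standard deviations, Pearson correlation), the two-decimal rounding tolerance, the circular target shape, the linear cooling schedule, and all names and default values (`summary`, `same_stats`, `circle_distance`, `anneal`, `step_sd`, `t_start`, `t_end`) are assumptions made for illustration.

```python
import math
import random
import statistics


def summary(xs, ys):
    """Mean and standard deviation of each axis, plus Pearson correlation."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return (mx, my, sx, sy, cov / (sx * sy))


def same_stats(a, b, decimals=2):
    """True when every statistic agrees after rounding (assumed tolerance)."""
    return all(round(p, decimals) == round(q, decimals) for p, q in zip(a, b))


def circle_distance(x, y, center, radius):
    """Distance from a point to the nearest point on the target circle."""
    return abs(math.hypot(x - center[0], y - center[1]) - radius)


def anneal(xs, ys, center=(0.0, 0.0), radius=1.5, steps=20000,
           step_sd=0.1, t_start=0.4, t_end=0.01, seed=0):
    """Perturb one point at a time, keeping the seed's summary statistics."""
    rng = random.Random(seed)
    xs, ys = list(xs), list(ys)
    target = summary(xs, ys)  # statistics of the seed dataset
    for step in range(steps):
        # Linear cooling: fewer "bad" moves are accepted as the run proceeds.
        temp = t_start + (t_end - t_start) * step / steps
        i = rng.randrange(len(xs))
        old_x, old_y = xs[i], ys[i]
        new_x = old_x + rng.gauss(0, step_sd)
        new_y = old_y + rng.gauss(0, step_sd)
        improves = (circle_distance(new_x, new_y, center, radius)
                    < circle_distance(old_x, old_y, center, radius))
        # Accept moves toward the shape, or any move with probability `temp`.
        if not (improves or rng.random() < temp):
            continue
        xs[i], ys[i] = new_x, new_y
        # Revert the move if it changes any statistic at the chosen precision.
        if not same_stats(summary(xs, ys), target):
            xs[i], ys[i] = old_x, old_y
    return xs, ys
```

Calling `anneal(xs, ys)` on a seed scatter returns a new pair of coordinate lists whose rounded statistics match the seed's, because every accepted move is checked against the seed's statistics rather than against the previous iteration. Swapping `circle_distance` for another fitness function, or `summary` for a different set of statistics, changes the target appearance or the preserved properties without touching the loop.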
Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing