CHI Conference Proceedings, DOI: 10.1145/3025453.3025912
research-article
Honorable Mention

Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing

ABSTRACT

Datasets that are identical across a number of statistical properties, yet produce dissimilar graphs, are frequently used to illustrate the importance of graphical representations when exploring data. This paper presents a novel method for generating such datasets, along with several examples. Our technique differs from previous approaches in that new datasets are generated iteratively from a seed dataset through random perturbations of individual data points, and can be directed towards a desired outcome through a simulated annealing optimization strategy. Our method is agnostic to the particular statistical properties that are to remain constant between the datasets, and allows control over the graphical appearance of the resulting output.
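The loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the preserved statistics, the target shape, the cooling schedule, or the acceptance rule, so all of these are assumptions made here for illustration: means, standard deviations, and Pearson correlation rounded to two decimals; a circle as target; linear cooling; and acceptance of a worse move with probability equal to the temperature. The names `summary`, `distance_to_circle`, and `anneal` are hypothetical.

```python
import math
import random
import statistics


def summary(points, decimals=2):
    """Rounded mean and standard deviation of x and y, plus Pearson correlation.

    Assumption: these five statistics, compared after rounding, are the ones
    that must stay constant. The method is agnostic to this choice.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in points) / (len(points) - 1)
    r = cov / (sx * sy)
    return tuple(round(v, decimals) for v in (mx, my, sx, sy, r))


def distance_to_circle(points, cx, cy, radius):
    """Mean distance of the points from a circle outline (lower is better)."""
    return sum(abs(math.hypot(x - cx, y - cy) - radius)
               for x, y in points) / len(points)


def anneal(seed_points, fitness, steps=5000, step_size=0.1,
           t_start=0.4, t_end=0.01, rng=None):
    """Perturb one point at a time, keeping the rounded statistics fixed."""
    rng = rng or random.Random(0)
    target_stats = summary(seed_points)
    current = list(seed_points)
    current_fit = fitness(current)
    best, best_fit = current, current_fit
    for i in range(steps):
        # Assumed linear cooling from t_start down towards t_end.
        temperature = t_start + (t_end - t_start) * i / steps
        idx = rng.randrange(len(current))
        x, y = current[idx]
        candidate = list(current)
        candidate[idx] = (x + rng.gauss(0, step_size),
                          y + rng.gauss(0, step_size))
        candidate_fit = fitness(candidate)
        # Accept improvements always, worse moves with probability = temperature.
        improves = candidate_fit < current_fit
        if not (improves or rng.random() < temperature):
            continue
        # Reject any move that changes the rounded summary statistics.
        if summary(candidate) != target_stats:
            continue
        current, current_fit = candidate, candidate_fit
        if current_fit < best_fit:
            best, best_fit = current, current_fit
    return best
```

To try it, pass a list of `(x, y)` tuples as the seed and a fitness function such as `lambda pts: distance_to_circle(pts, 5.0, 5.0, 3.0)`. The returned dataset has the same rounded statistics as the seed by construction, because every accepted candidate is checked against them; how closely it approaches the target shape depends on the number of steps and the step size.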
