kD-STR: A Method for Spatio-Temporal Data Reduction and Modelling

Published: 18 May 2021

Abstract

Analysing and learning from spatio-temporal datasets is an important process in many domains, including transportation, healthcare and meteorology. In particular, data collected by sensors in the environment allows us to understand and model the processes acting within the environment. Recently, the volume of spatio-temporal data collected has increased significantly, presenting several challenges for data scientists. Methods are therefore needed to reduce the quantity of data that needs to be processed in order to analyse and learn from spatio-temporal datasets. In this article, we present the \(k\)-Dimensional Spatio-Temporal Reduction method (\(k\)D-STR) for reducing the quantity of data used to store a dataset whilst enabling multiple types of analysis on the reduced dataset. \(k\)D-STR uses hierarchical partitioning to find spatio-temporal regions of similar instances, and models the instances within each region to summarise the dataset. We demonstrate the generality of \(k\)D-STR with three datasets exhibiting different spatio-temporal characteristics and present results for a range of data modelling techniques. Finally, we compare \(k\)D-STR with other techniques for reducing the volume of spatio-temporal data. Our results demonstrate that \(k\)D-STR is effective in reducing spatio-temporal data and generalises to datasets that exhibit different properties.
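
The partition-then-model idea in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: the abstract does not specify how regions are found or which models are fitted, so the sketch assumes a simple top-down split on value variance and a constant (mean) model per region. The function name `reduce_region`, the `max_variance` threshold and the synthetic sensor readings are all hypothetical.

```python
from statistics import mean, pvariance


def reduce_region(instances, max_variance):
    """Recursively partition (x, t, value) instances.

    Returns a list of (x_min, x_max, t_min, t_max, mean_value) summaries,
    one per region. Assumed simplification: a region is "similar enough"
    when the population variance of its values is at most max_variance.
    """
    xs = [i[0] for i in instances]
    ts = [i[1] for i in instances]
    values = [i[2] for i in instances]
    summary = (min(xs), max(xs), min(ts), max(ts), mean(values))

    # Stop splitting once the region is homogeneous (or cannot be split).
    if len(instances) == 1 or pvariance(values) <= max_variance:
        return [summary]

    # Split the wider dimension (space or time) at its midpoint.
    axis = 0 if (max(xs) - min(xs)) >= (max(ts) - min(ts)) else 1
    lo, hi = (min(xs), max(xs)) if axis == 0 else (min(ts), max(ts))
    mid = (lo + hi) / 2
    left = [i for i in instances if i[axis] <= mid]
    right = [i for i in instances if i[axis] > mid]
    if not left or not right:
        # All instances share one location and time; keep a single region.
        return [summary]
    return reduce_region(left, max_variance) + reduce_region(right, max_variance)


# Synthetic data: 4 sensor positions x 4 time steps, two distinct value levels.
data = [(x, t, 10 if x < 2 else 50) for x in range(4) for t in range(4)]
regions = reduce_region(data, max_variance=1.0)

print(f"{len(data)} instances reduced to {len(regions)} region summaries")
for region in regions:
    print(region)
```

In this sketch the reduced dataset is the list of region summaries, each storing a bounding box in space and time plus one model parameter, in place of the raw instances. Swapping the mean for a richer per-region model would trade storage for reconstruction accuracy.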

Published in

ACM/IMS Transactions on Data Science, Volume 2, Issue 3
August 2021
302 pages
ISSN: 2691-1922
DOI: 10.1145/3465442

Copyright © 2021 Association for Computing Machinery.

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 18 May 2021
• Accepted: 1 November 2020
• Revised: 1 October 2020
• Received: 1 June 2019

Qualifiers

• research-article
• Refereed
