Abstract
Analysing and learning from spatio-temporal datasets is an important process in many domains, including transportation, healthcare and meteorology. In particular, data collected by sensors in the environment allows us to understand and model the processes acting within the environment. Recently, the volume of spatio-temporal data collected has increased significantly, presenting several challenges for data scientists. Methods are therefore needed to reduce the quantity of data that needs to be processed in order to analyse and learn from spatio-temporal datasets. In this article, we present the \(k\)-Dimensional Spatio-Temporal Reduction method (\(k\)D-STR) for reducing the quantity of data used to store a dataset whilst enabling multiple types of analysis on the reduced dataset. \(k\)D-STR uses hierarchical partitioning to find spatio-temporal regions of similar instances, and models the instances within each region to summarise the dataset. We demonstrate the generality of \(k\)D-STR with three datasets exhibiting different spatio-temporal characteristics and present results for a range of data modelling techniques. Finally, we compare \(k\)D-STR with other techniques for reducing the volume of spatio-temporal data. Our results demonstrate that \(k\)D-STR is effective in reducing spatio-temporal data and generalises to datasets that exhibit different properties.
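The partition-then-model idea described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not specify the partitioning or modelling procedure, so the code below assumes a simple recursive median split over one spatial and one temporal coordinate, and a mean value as the per-region model. The function name `reduce_instances`, the `max_variance` threshold and the sample data are all hypothetical.

```python
from statistics import mean, pvariance


def reduce_instances(instances, max_variance=1.0):
    """Split (x, t, value) instances into regions and keep one mean per region.

    Illustrative assumption only: a recursive median split stands in for the
    hierarchical partitioning, and the region mean stands in for the model.
    """
    if not instances:
        return []
    values = [v for _, _, v in instances]
    # Stop when the region is homogeneous enough or cannot be split further.
    if len(instances) == 1 or pvariance(values) <= max_variance:
        xs = [x for x, _, _ in instances]
        ts = [t for _, t, _ in instances]
        return [{
            "x": (min(xs), max(xs)),   # spatial extent of the region
            "t": (min(ts), max(ts)),   # temporal extent of the region
            "model": mean(values),     # one stored value replaces all instances
            "count": len(instances),
        }]
    # Split along the dimension (0 = space, 1 = time) with the larger extent.
    extents = [
        max(p[d] for p in instances) - min(p[d] for p in instances)
        for d in (0, 1)
    ]
    dim = 0 if extents[0] >= extents[1] else 1
    ordered = sorted(instances, key=lambda p: p[dim])
    mid = len(ordered) // 2
    return (reduce_instances(ordered[:mid], max_variance)
            + reduce_instances(ordered[mid:], max_variance))


# Hypothetical sensor readings: two spatial clusters with distinct values.
data = [
    (0, 0, 10.0), (1, 0, 10.0), (0, 1, 10.0), (1, 1, 10.0),
    (10, 0, 50.0), (11, 0, 50.0), (10, 1, 50.0), (11, 1, 50.0),
]
regions = reduce_instances(data)
for region in regions:
    print(region)
```

In this toy case eight instances are summarised by two regions, each storing only its bounds and a single modelled value. A richer model per region (for example a regression over space and time) would trade storage for lower reconstruction error.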
kD-STR: A Method for Spatio-Temporal Data Reduction and Modelling