ABSTRACT
Before scientists can analyze, publish, or share their data, they often need to determine how their datasets are related. Determining relationships helps scientists identify the most complete version of a dataset, detect versions of datasets that complement each other, and find datasets that overlap. In previous work, we showed how observable relationships between two datasets help scientists recall their original derivation connection. While that work helped with identifying relationships between two datasets, it is infeasible for scientists to apply it to every possible pair in a large collection of datasets. To handle larger numbers of datasets, we are extending our methodology with a relationship-prediction system, ReDiscover, a tool that identifies the pairs in a collection of datasets that are most likely related, along with the relationship between them. We report on the initial design of ReDiscover, which applies machine-learning methods such as Conditional Random Fields and Support Vector Machines to the relationship-discovery problem. Our preliminary evaluation shows that ReDiscover predicted relationships with an average accuracy of 87%.
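To make the pairwise-prediction idea concrete, the following is a minimal sketch of how candidate dataset pairs might be scored before being handed to a classifier such as a Support Vector Machine. The abstract does not specify ReDiscover's features, so the feature names (`column_jaccard`, `row_jaccard`, `row_containment`), the dataset layout, and the `threshold` parameter are all illustrative assumptions, not the system's actual design.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_features(d1, d2):
    """Hypothetical features for one dataset pair.

    Each dataset is a dict with 'columns' (list of str) and
    'rows' (list of tuples). These features are assumptions made
    for illustration; they are not taken from the paper.
    """
    rows1, rows2 = set(d1["rows"]), set(d2["rows"])
    smaller = min(len(rows1), len(rows2))
    return {
        "column_jaccard": jaccard(d1["columns"], d2["columns"]),
        "row_jaccard": jaccard(rows1, rows2),
        # Fraction of the smaller dataset's rows found in the larger one.
        "row_containment": len(rows1 & rows2) / smaller if smaller else 0.0,
    }


def candidate_pairs(datasets, threshold=0.5):
    """Keep pairs whose schemas overlap enough, most row-similar first."""
    scored = []
    for (n1, d1), (n2, d2) in combinations(sorted(datasets.items()), 2):
        feats = pair_features(d1, d2)
        if feats["column_jaccard"] >= threshold:
            scored.append((n1, n2, feats))
    return sorted(scored, key=lambda t: t[2]["row_jaccard"], reverse=True)


# Toy collection: v2 extends v1 with one extra row; 'other' is unrelated.
datasets = {
    "v1": {"columns": ["site", "date", "temp"],
           "rows": [("A", 1, 10.0), ("B", 1, 11.0)]},
    "v2": {"columns": ["site", "date", "temp"],
           "rows": [("A", 1, 10.0), ("B", 1, 11.0), ("C", 1, 12.0)]},
    "other": {"columns": ["gene", "expr"], "rows": [("g1", 0.5)]},
}
pairs = candidate_pairs(datasets)
```

In this toy collection only the `v1`/`v2` pair survives the schema filter, and its full row containment hints that one dataset is a more complete version of the other. A trained classifier would take such feature vectors as input and predict the relationship label.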
Towards automated prediction of relationships among scientific datasets



