DOI: 10.1145/2791347.2791358 (SSDBM conference proceedings)
research-article

Towards automated prediction of relationships among scientific datasets

Published: 29 June 2015

ABSTRACT

Before scientists can analyze, publish, or share their data, they often need to determine how their datasets are related. Determining relationships helps scientists identify the most complete version of a dataset, detect versions of datasets that complement each other, and find datasets that overlap. In previous work, we showed how observable relationships between two datasets help scientists recall their original derivation connection. While that work helped with identifying relationships between two datasets, it is infeasible for scientists to apply it to all possible pairs in a large collection of datasets. To handle larger numbers of datasets, we are extending our methodology with a relationship-prediction system, ReDiscover, a tool that identifies the pairs in a collection of datasets that are most likely related, along with the relationship between them. We report on the initial design of ReDiscover, which applies machine-learning methods such as Conditional Random Fields and Support Vector Machines to the relationship-discovery problem. Our preliminary evaluation shows that ReDiscover predicted relationships with an average accuracy of 87%.

