ABSTRACT
Curated databases are databases that are populated and updated with a great deal of human effort. Most reference works that one traditionally found on the reference shelves of libraries -- dictionaries, encyclopedias, gazetteers etc. -- are now curated databases. Since it is now easy to publish databases on the web, there has been an explosion in the number of new curated databases used in scientific research. The value of curated databases lies in the organization and the quality of the data they contain. Like the paper reference works they have replaced, they usually represent the efforts of a dedicated group of people to produce a definitive description of some subject area.
Curated databases present a number of challenges for database research. The topics of annotation, provenance, and citation are central, because curated databases are heavily cross-referenced with, and include data from, other databases, and much of the work of a curator is annotating existing data. Evolution of structure is important because these databases often evolve from semistructured representations, and because they have to accommodate new scientific discoveries. Much of the work in these areas is in its infancy, but it is beginning to suggest new research for both theory and practice. We discuss some of this research and emphasize the need to find appropriate models of the processes associated with curated databases.