Abstract
Awareness of data and information quality issues has grown rapidly in light of the critical role played by the quality of information in our data-intensive, knowledge-based economy. Research in the past two decades has produced a large body of data quality knowledge and has expanded our ability to solve many data and information quality problems. In this article, we present an overview of the evolution and current landscape of data and information quality research. We introduce a framework to characterize the research along two dimensions: topics and methods. Representative papers are cited for purposes of illustrating the issues addressed and the methods used. We also identify and discuss challenges to be addressed in future research.
Overview and Framework for Data and Information Quality Research