DOI: 10.1145/1265530.1265538
Article

Decision trees for entity identification: approximation algorithms and hardness results

Published: 11 June 2007

ABSTRACT

We consider the problem of constructing decision trees for entity identification from a given relational table. The input is a table containing information about a set of entities over a fixed set of attributes and a probability distribution over the set of entities that specifies the likelihood of the occurrence of each entity. The goal is to construct a decision tree that identifies each entity unambiguously by testing the attribute values such that the average number of tests is minimized. This classical problem finds such diverse applications as efficient fault detection, species identification in biology, and efficient diagnosis in the field of medicine. Prior work mainly deals with the special case where the input table is binary and the probability distribution over the set of entities is uniform. We study the general problem involving arbitrary input tables and arbitrary probability distributions over the set of entities. We consider a natural greedy algorithm and prove an approximation guarantee of O(r_K log N), where N is the number of entities and K is the maximum number of distinct values of an attribute. The value r_K is a suitably defined Ramsey number, which is at most log K. We show that it is NP-hard to approximate the problem within a factor of Ω(log N), even for binary tables (i.e., K = 2). Thus, for the case of binary tables, our approximation algorithm is optimal up to constant factors (since r_2 = 2). In addition, our analysis indicates a possible way of resolving a Ramsey-theoretic conjecture by Erdős.
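To make the problem setup concrete, the following is a minimal Python sketch of a greedy decision-tree builder together with the objective, the expected number of tests. The abstract does not spell out the greedy rule, so the rule used here (at each node, test the attribute that separates the largest number of entity pairs) is an assumption, as are the names `split`, `separated_pairs`, `build_tree`, and `expected_tests` and the toy table. It should be read as an illustration of the problem, not as the authors' algorithm.

```python
from collections import defaultdict


def split(entities, table, attr):
    """Group entities by their value on attribute index `attr`."""
    groups = defaultdict(list)
    for e in entities:
        groups[table[e][attr]].append(e)
    return dict(groups)


def separated_pairs(groups):
    """Count entity pairs that end up in different groups."""
    n = sum(len(g) for g in groups.values())
    same = sum(len(g) * (len(g) - 1) // 2 for g in groups.values())
    return n * (n - 1) // 2 - same


def build_tree(entities, table):
    """Return a leaf (an entity name) or a node (attr, {value: subtree}).

    Assumed greedy rule: pick the attribute separating the most pairs.
    """
    if len(entities) == 1:
        return entities[0]
    n_attrs = len(table[entities[0]])
    best = max(range(n_attrs),
               key=lambda a: separated_pairs(split(entities, table, a)))
    groups = split(entities, table, best)
    if len(groups) == 1:
        raise ValueError("entities cannot be told apart by any attribute")
    return (best, {v: build_tree(g, table) for v, g in groups.items()})


def expected_tests(tree, probs, depth=0):
    """Average number of tests: sum of probs[e] * depth of leaf e."""
    if not isinstance(tree, tuple):
        return probs[tree] * depth
    _, children = tree
    return sum(expected_tests(c, probs, depth + 1) for c in children.values())


# Hypothetical toy input: three entities, two binary attributes (K = 2).
table = {"x": (0, 0), "y": (1, 0), "z": (1, 1)}
probs = {"x": 0.5, "y": 0.25, "z": 0.25}

tree = build_tree(list(table), table)
cost = expected_tests(tree, probs)
print(tree, cost)
```

In this toy input both attributes separate two of the three pairs, so the tie is broken in favor of the first attribute; entity `x` is identified after one test and `y` and `z` after two, giving an expected cost of 0.5·1 + 0.25·2 + 0.25·2 = 1.5 tests.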

      • Published in

        PODS '07: Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
        June 2007
        328 pages
        ISBN:9781595936851
        DOI:10.1145/1265530

        Copyright © 2007 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 476 of 1,835 submissions, 26%
