ABSTRACT
We consider the problem of constructing decision trees for entity identification from a given relational table. The input is a table containing information about a set of entities over a fixed set of attributes, together with a probability distribution over the entities that specifies how likely each entity is to occur. The goal is to construct a decision tree that identifies each entity unambiguously by testing attribute values, such that the average number of tests is minimized. This classical problem has diverse applications, including efficient fault detection, species identification in biology, and efficient diagnosis in medicine. Prior work mainly deals with the special case where the input table is binary and the probability distribution over the entities is uniform. We study the general problem involving arbitrary input tables and arbitrary probability distributions over the entities. We consider a natural greedy algorithm and prove an approximation guarantee of O(r_K log N), where N is the number of entities and K is the maximum number of distinct values of an attribute. The value r_K is a suitably defined Ramsey number, which is at most log K. We show that it is NP-hard to approximate the problem within a factor of Ω(log N), even for binary tables (i.e., K = 2). Thus, for binary tables, our approximation algorithm is optimal up to constant factors (since r_2 = 2). In addition, our analysis indicates a possible way of resolving a Ramsey-theoretic conjecture of Erdős.
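The abstract does not state the greedy rule itself, so the following is only a minimal Python sketch under stated assumptions. It assumes that the greedy step picks the attribute whose split separates the largest total weight of entity pairs, where a pair's weight is the product of the two entities' probabilities. The names `greedy_tree`, `expected_tests`, `table`, and `probs` are hypothetical and chosen for illustration; the rule analyzed in the paper may differ in its details.

```python
def greedy_tree(entities, table, probs, attributes):
    """Build an identification tree for a non-empty list of entity ids.

    A leaf is an entity id (a string); an internal node is a tuple
    (attribute, {value: subtree}). The selection rule below is an
    assumption made for this sketch, not taken from the source.
    """
    if len(entities) == 1:
        return entities[0]
    best_attr, best_score, best_groups = None, -1.0, None
    for attr in attributes:
        # Partition the remaining entities by their value on this attribute.
        groups = {}
        for e in entities:
            groups.setdefault(table[e][attr], []).append(e)
        if len(groups) < 2:
            continue  # this attribute separates nothing here
        weights = [sum(probs[e] for e in g) for g in groups.values()]
        total = sum(weights)
        # Proportional to the total weight of pairs placed in different groups.
        score = total * total - sum(w * w for w in weights)
        if score > best_score:
            best_attr, best_score, best_groups = attr, score, groups
    if best_attr is None:
        raise ValueError("some entities cannot be told apart by any attribute")
    children = {
        value: greedy_tree(group, table, probs, attributes)
        for value, group in best_groups.items()
    }
    return (best_attr, children)


def expected_tests(tree, probs, depth=0):
    """Average number of tests: sum over entities of probability times leaf depth."""
    if not isinstance(tree, tuple):
        return probs[tree] * depth
    _, children = tree
    return sum(expected_tests(c, probs, depth + 1) for c in children.values())


# Toy input (made up for illustration): three entities, two binary attributes.
table = {
    "a": {"x": 0, "y": 0},
    "b": {"x": 1, "y": 0},
    "c": {"x": 1, "y": 1},
}
probs = {"a": 0.5, "b": 0.25, "c": 0.25}
tree = greedy_tree(["a", "b", "c"], table, probs, ["x", "y"])
cost = expected_tests(tree, probs)
```

On this toy table the sketch tests `x` first, which isolates the most likely entity `a` after one test, and then uses `y` to separate `b` from `c`, giving an average of 1.5 tests.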
Index Terms
Decision trees for entity identification: approximation algorithms and hardness results