ABSTRACT
A frequent subgraph of a given collection of graphs is a graph that is isomorphic to a subgraph of at least as many graphs in the collection as a given threshold. Frequent subgraphs generalize frequent itemsets and arise in various contexts, from bioinformatics to the Web. Since the space of frequent subgraphs is typically extremely large, research in graph mining has focused on special types of frequent subgraphs that can be orders of magnitude smaller in number, yet encapsulate the space of all frequent subgraphs. Maximal frequent subgraphs (i.e., the ones not properly contained in any frequent subgraph) constitute the most useful such type.
In this paper, we embark on a comprehensive investigation of the computational complexity of mining maximal frequent subgraphs. Our study is carried out by considering the effect of three different parameters: possible restrictions on the class of graphs; a fixed bound on the threshold; and a fixed bound on the number of desired answers. We focus on specific classes of connected graphs: general graphs, planar graphs, graphs of bounded degree, and graphs of bounded tree-width (trees being a special case). Moreover, each class has two variants: the one in which the nodes are unlabeled, and the one in which they are uniquely labeled. We delineate the complexity of the enumeration problem for each of these variants by determining when it is solvable in (total or incremental) polynomial time and when it is NP-hard. Specifically, for the labeled classes, we show that bounding the threshold yields tractability but, in most cases, bounding the number of answers does not, unless P=NP; an exception is the case of labeled trees, where bounding either of these two parameters yields tractability. The state of affairs turns out to be quite different for the unlabeled classes. The main (and most challenging to prove) result concerns unlabeled trees: we show NP-hardness, even if the input consists of two trees, and both the threshold and the number of desired answers are equal to just two. In other words, we establish that the following problem is NP-complete: given two unlabeled trees, do they have more than one maximal subtree in common?
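To make the definitions concrete, here is a minimal sketch for the uniquely labeled variant only. It is not the algorithm from the paper. It assumes that each graph is given as a frozenset of edges over distinct node labels, that a pattern is a connected, non-empty set of edges (single-node patterns are ignored), and that the threshold is at least 1. Under unique labels, a pattern occurs in a graph exactly when its edge set is contained in that graph's edge set, so every frequent connected pattern lies inside a connected component of the intersection of some `threshold` input graphs. The names `edge`, `components`, and `maximal_frequent` are hypothetical helpers introduced for illustration.

```python
from itertools import combinations


def edge(u, v):
    """Normalize an undirected edge to a sorted pair of labels."""
    return (u, v) if u <= v else (v, u)


def components(edges):
    """Split an edge set into connected components, each a frozenset of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen = set()
    result = []
    for start in adj:
        if start in seen:
            continue
        # Depth-first search collecting the nodes reachable from `start`.
        seen.add(start)
        stack = [start]
        nodes = set()
        while stack:
            node = stack.pop()
            nodes.add(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        # Both endpoints of an edge are in the same component,
        # so testing one endpoint is enough.
        result.append(frozenset(e for e in edges if e[0] in nodes))
    return result


def maximal_frequent(graphs, threshold):
    """Maximal connected edge sets contained in at least `threshold` graphs.

    Assumes uniquely labeled nodes and threshold >= 1. The number of
    groups examined is C(len(graphs), threshold), which is polynomial
    only when the threshold is a fixed constant.
    """
    candidates = set()
    for group in combinations(graphs, threshold):
        common = frozenset.intersection(*group)
        candidates.update(components(common))
    # Keep only candidates not properly contained in another candidate.
    return {c for c in candidates if not any(c < d for d in candidates)}


# Three small connected graphs over the labels a..e (illustrative data).
g1 = frozenset({edge("a", "b"), edge("b", "c"), edge("c", "d")})
g2 = frozenset({edge("a", "b"), edge("b", "e"), edge("e", "d"), edge("c", "d")})
g3 = frozenset({edge("a", "b"), edge("b", "c")})

result = maximal_frequent([g1, g2, g3], threshold=2)
```

With a threshold of 2, the single edge a-b is frequent but not maximal, because the path a-b-c also occurs in two graphs (g1 and g3). The edge c-d, shared by g1 and g2, is maximal on its own. The unlabeled variants are harder because containment must be tested up to isomorphism, and this sketch does not attempt that.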