ABSTRACT
This paper presents a methodology for expert-guided analysis of large data sets, including large text corpora. Its main ingredient is an algorithm for semi-supervised data clustering with cluster size constraints, which implements several improvements over existing constrained k-means clustering algorithms. First, it allows for a larger set of user-defined cluster size constraints of different types (lower- and upper-bound constraints). Second, it allows predefined constraints to be dynamically re-assigned to clusters during iterative cluster computation and optimization, which improves the results of constrained clustering. Third, it supports expert-guided cluster optimization by combining constrained clustering with data visualization, which gives the expert finer-grained control over the clustering process and further improves the quality of the resulting clustering solutions. Incorporating data visualization into the clustering process allows the user to select reference points that act as constraint anchors during iterative cluster computation. The proposed semi-supervised constrained clustering methodology has been implemented in the service-oriented data mining environment Orange4WS and evaluated on different document corpora.
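To make the idea of lower- and upper-bound cluster size constraints concrete, the following is a minimal sketch of a k-means loop whose assignment step respects per-cluster size bounds. It is not the algorithm from the paper: the assignment uses a simple greedy heuristic rather than an optimal (for example, network-flow based) assignment, each bound is fixed to one cluster index instead of being dynamically re-assigned, and no visualization or anchor points are involved. The function names `assign_with_size_bounds` and `constrained_kmeans` and all parameters are hypothetical, chosen for illustration only.

```python
import math
import random


def assign_with_size_bounds(points, centers, lower, upper):
    """Greedily assign points to centers under per-cluster size bounds.

    Illustrative heuristic only; it does not guarantee a minimum-cost assignment.
    """
    n, k = len(points), len(centers)
    if any(lo > up for lo, up in zip(lower, upper)):
        raise ValueError("each lower bound must not exceed its upper bound")
    if not (sum(lower) <= n <= sum(upper)):
        raise ValueError("size bounds are infeasible for this data set")
    # All (distance, point index, cluster index) triples, closest first.
    pairs = sorted(
        (math.dist(p, c), i, j)
        for i, p in enumerate(points)
        for j, c in enumerate(centers)
    )
    labels = [None] * n
    sizes = [0] * k
    # Pass 1: fill every cluster up to its lower bound.
    for _, i, j in pairs:
        if labels[i] is None and sizes[j] < lower[j]:
            labels[i] = j
            sizes[j] += 1
    # Pass 2: place the remaining points without exceeding any upper bound.
    for _, i, j in pairs:
        if labels[i] is None and sizes[j] < upper[j]:
            labels[i] = j
            sizes[j] += 1
    return labels


def constrained_kmeans(points, k, lower, upper, iters=20, seed=0):
    """Alternate size-bounded assignment and centroid updates."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    labels = None
    for _ in range(iters):
        new_labels = assign_with_size_bounds(points, centers, lower, upper)
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # New center is the coordinate-wise mean of the members.
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (5.0, 5.0), (5.1, 4.9)]
    # Force two clusters of exactly three points each.
    labels, centers = constrained_kmeans(data, k=2, lower=[3, 3], upper=[3, 3])
    print(labels, centers)
```

With both bounds set to 3, each cluster must end up with exactly three of the six points, even though unconstrained k-means would naturally split this data four to two. This shows how size constraints can override the purely distance-based grouping.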
Index Terms
- Semi-supervised constrained clustering: an expert-guided data analysis methodology