DOI: 10.5555/1884293.1884317

Semi-supervised constrained clustering: an expert-guided data analysis methodology

Published: 30 August 2010

ABSTRACT

This paper presents a methodology for expert-guided analysis of large data sets, including large text corpora. Its main ingredient is an algorithm for semi-supervised data clustering with cluster size constraints, which implements several improvements over existing constrained k-means clustering algorithms. First, it allows for a larger set of user-defined cluster size constraints of different types (lower- and upper-bound constraints). Second, it allows for dynamic re-assignment of predefined constraints to clusters during iterative cluster computation and optimization, thus improving the results of constrained clustering. Third, it allows for expert-guided cluster optimization, achieved by combining constrained clustering with data visualization; this gives the expert finer-grained control over the clustering process and further improves the quality of the resulting clustering solutions. Incorporating data visualization into the clustering process allows the user to select referential points, which act as constraint anchors in the course of iterative cluster computation. The proposed semi-supervised constrained clustering methodology has been implemented in the service-oriented data mining environment Orange4WS and evaluated on different document corpora.
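
To make the notion of lower- and upper-bound cluster size constraints concrete, the following is a minimal sketch of a k-means loop whose assignment step respects per-cluster size bounds. It is not the algorithm proposed in the paper: the greedy two-phase assignment is a simple heuristic chosen for illustration, and the names `assign_with_size_bounds` and `constrained_kmeans`, as well as the `lower` and `upper` parameters, are hypothetical. Dynamic re-assignment of constraints and expert-selected referential points are not modeled here.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_with_size_bounds(points, centroids, lower, upper):
    """Greedy heuristic (an assumption, not the paper's method):
    assign each point to a cluster so that cluster j ends up with
    between lower[j] and upper[j] points."""
    n, k = len(points), len(centroids)
    if any(lo > up for lo, up in zip(lower, upper)):
        raise ValueError("each lower bound must not exceed its upper bound")
    if sum(lower) > n or sum(upper) < n:
        raise ValueError("size bounds are infeasible for this data set")

    # Every (distance, point index, cluster index) triple, cheapest first.
    pairs = sorted(
        (sq_dist(p, c), i, j)
        for i, p in enumerate(points)
        for j, c in enumerate(centroids)
    )
    labels = [None] * n
    sizes = [0] * k

    # Phase 1: fill each cluster up to its lower bound with its cheapest
    # still-unassigned points.
    for _, i, j in pairs:
        if labels[i] is None and sizes[j] < lower[j]:
            labels[i] = j
            sizes[j] += 1

    # Phase 2: send every remaining point to the nearest cluster that
    # still has room under its upper bound.
    for _, i, j in pairs:
        if labels[i] is None and sizes[j] < upper[j]:
            labels[i] = j
            sizes[j] += 1
    return labels


def constrained_kmeans(points, k, lower, upper, iters=20, seed=0):
    """Alternate size-bounded assignment and centroid updates."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    labels = None
    for _ in range(iters):
        new_labels = assign_with_size_bounds(points, centroids, lower, upper)
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids
```

For example, calling `constrained_kmeans(points, 2, lower=[3, 3], upper=[5, 5])` on eight two-dimensional points forces both clusters to hold between three and five points, even when unconstrained k-means would split them six to two. The greedy assignment always produces a feasible labeling when the bounds are feasible, but it does not guarantee a minimum-cost assignment.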

Published in

PRICAI'10: Proceedings of the 11th Pacific Rim international conference on Trends in artificial intelligence
August 2010
715 pages
ISBN: 3642152457
Editors: Byoung-Tak Zhang, Mehmet A. Orgun
Publisher: Springer-Verlag, Berlin, Heidelberg