DOI: 10.1145/1830252.1830264
research-article

Document classification utilising ontologies and relations between documents

Online: 24 July 2010

ABSTRACT

Two major types of relational information can be utilized in automatic document classification as background information: relations between terms, such as ontologies, and relations between documents, such as web links or citations in articles. We introduce a model where a traditional bag-of-words type classifier is gradually extended to utilize both of these information types. The experiments with data from the Finnish National Archive show that classification accuracy improves from 70% to 74% when the General Finnish Ontology YSO is used as background information, without using relations between documents.
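The idea of extending a bag-of-words representation with ontology information can be illustrated with a minimal sketch. This is not the authors' model: the toy ontology `BROADER`, the helper names, the `concept:` feature prefix, and the down-weighting factor `weight` are all assumptions made for illustration, and nothing here is taken from YSO. The sketch covers only the term-relation side, not relations between documents.

```python
from collections import Counter

# Hypothetical toy ontology: term -> broader concept (NOT taken from YSO).
BROADER = {
    "invoice": "financial document",
    "receipt": "financial document",
    "financial document": "document",
}


def bag_of_words(text):
    """Plain bag-of-words: lowercase, split on whitespace, count tokens."""
    return Counter(text.lower().split())


def ancestors(term, broader=BROADER):
    """Yield every broader concept reachable from `term`, stopping on cycles."""
    seen = set()
    while term in broader and broader[term] not in seen:
        term = broader[term]
        seen.add(term)
        yield term


def expand_with_ontology(bow, weight=0.5, broader=BROADER):
    """Return a copy of `bow` with broader concepts added as extra features.

    Each occurrence of a term contributes `weight` to every one of its
    broader concepts (an assumed weighting scheme, chosen for simplicity).
    """
    expanded = Counter(bow)
    for term, count in bow.items():
        for concept in ancestors(term, broader):
            expanded["concept:" + concept] += weight * count
    return expanded


features = expand_with_ontology(bag_of_words("Invoice receipt invoice"))
```

Under these assumptions, two documents that mention different terms ("invoice" versus "receipt") still share the `concept:financial document` feature, which is one way background knowledge could help a classifier generalise across related vocabulary.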


Published in

MLG '10: Proceedings of the Eighth Workshop on Mining and Learning with Graphs
July 2010
185 pages
ISBN: 9781450302142
DOI: 10.1145/1830252

Copyright © 2010 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Online: 24 July 2010
• Published: 24 July 2010
