ABSTRACT
Two major types of relational information can be used as background information in automatic document classification: relations between terms, such as ontologies, and relations between documents, such as web links or citations in articles. We introduce a model in which a traditional bag-of-words classifier is gradually extended to use both of these information types. Experiments on data from the Finnish National Archive show that classification accuracy improves from 70% to 74% when the General Finnish Ontology YSO is used as background information, without using relations between documents.
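The abstract does not say how the ontology is combined with the bag-of-words features. As one illustration only, the sketch below expands term counts with the broader concepts of each term, discounted per level. The toy `BROADER` mapping, the `concept:` feature prefix, and the `weight` parameter are all assumptions made for this example; they are not taken from YSO or from the paper's model.

```python
from collections import Counter

# Hypothetical toy ontology: term -> broader concepts (not taken from YSO).
BROADER = {
    "spruce": ["tree"],
    "pine": ["tree"],
    "tree": ["plant"],
}


def bag_of_words(text):
    """Plain term counts for a whitespace-tokenised, lower-cased text."""
    return Counter(text.lower().split())


def expand_with_ontology(counts, broader, weight=0.5):
    """Add each term's broader concepts, discounted by `weight` per level.

    This is an assumed expansion scheme for illustration, not the
    method described in the paper.
    """
    features = Counter()
    for term, count in counts.items():
        features[term] += count
        level_weight = weight
        frontier = broader.get(term, [])
        seen = {term}  # guards against cycles in the ontology
        while frontier:
            next_frontier = []
            for concept in frontier:
                if concept in seen:
                    continue
                seen.add(concept)
                features["concept:" + concept] += count * level_weight
                next_frontier.extend(broader.get(concept, []))
            frontier = next_frontier
            level_weight *= weight
    return features


features = expand_with_ontology(bag_of_words("pine spruce pine"), BROADER)
```

With this scheme, two documents that mention different tree species still share the `concept:tree` feature, which is the kind of term relation a plain bag-of-words representation cannot express.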
Document classification utilising ontologies and relations between documents



