An EDU-Based Approach for Thai Multi-Document Summarization and Its Application

Published: 30 January 2015

Abstract

Because Thai text has no explicit word, phrase, or sentence boundaries, multi-document summarization of Thai faces challenges in unit segmentation, unit selection, duplication elimination, and evaluation dataset construction. In this article, we introduce Thai Elementary Discourse Units (TEDUs) and their derivatives, called Combined TEDUs (CTEDUs), and then present a three-stage method for Thai multi-document summarization: unit segmentation, unit-graph formulation, and unit selection and summary generation. To evaluate the proposed method, we conduct experiments on 50 sets of Thai news articles with manually constructed reference summaries. Measured by ROUGE-1, ROUGE-2, and ROUGE-SU4, the results show that (1) TEDU-based summarization outperforms paragraph-based summarization; (2) the proposed graph-based TEDU weighting with importance-based selection achieves the best performance; and (3) considering unit duplication and recalculating weights help improve summary quality.
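The second and third stages (unit-graph formulation, then unit selection with duplication handling) can be illustrated with a minimal sketch. It is not the article's actual algorithm: it assumes the units are already segmented into token lists, uses Jaccard overlap as the edge weight, a PageRank-style iteration as the unit weighting, and a fixed similarity threshold to skip near-duplicates. The function names `jaccard`, `rank_units`, and `select_units` and all parameter values are hypothetical, and the weight recalculation mentioned in the abstract is not modeled.

```python
from typing import List, Sequence


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaccard overlap between two token sequences, treated as sets."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def rank_units(units: List[List[str]], damping: float = 0.85,
               iterations: int = 50) -> List[float]:
    """Assumed weighting: PageRank-style scores on a unit similarity graph."""
    n = len(units)
    if n == 0:
        return []
    # Edge weight between two different units is their token overlap.
    sim = [[0.0 if i == j else jaccard(units[i], units[j]) for j in range(n)]
           for i in range(n)]
    out_sum = [sum(row) for row in sim]
    weights = [1.0 / n] * n
    for _ in range(iterations):
        new_weights = []
        for i in range(n):
            # Each neighbor j passes on its weight in proportion to sim[j][i].
            incoming = sum(weights[j] * sim[j][i] / out_sum[j]
                           for j in range(n) if out_sum[j] > 0)
            new_weights.append((1 - damping) / n + damping * incoming)
        weights = new_weights
    return weights


def select_units(units: List[List[str]], weights: List[float],
                 max_units: int = 2, dup_threshold: float = 0.5) -> List[int]:
    """Greedy selection by weight that skips units too similar to chosen ones."""
    order = sorted(range(len(units)), key=lambda i: weights[i], reverse=True)
    chosen: List[int] = []
    for i in order:
        if len(chosen) >= max_units:
            break
        if all(jaccard(units[i], units[c]) < dup_threshold for c in chosen):
            chosen.append(i)
    return sorted(chosen)  # restore original unit order for readability


# Toy, already-segmented units (English tokens stand in for Thai units).
units = [
    ["flood", "hits", "bangkok"],
    ["flood", "hits", "bangkok", "today"],
    ["government", "sends", "aid"],
    ["aid", "for", "flood", "victims"],
]
weights = rank_units(units)
summary_indices = select_units(units, weights)
```

In this toy input the first two units overlap heavily, so the duplication check keeps at most one of them in the selected set, whichever order the weights produce.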

