Abstract
Because Thai text lacks explicit word, phrase, and sentence boundaries, Thai multi-document summarization faces challenges in unit segmentation, unit selection, duplication elimination, and evaluation dataset construction. In this article, we introduce Thai Elementary Discourse Units (TEDUs) and their derivatives, called Combined TEDUs (CTEDUs), and then present a three-stage method for Thai multi-document summarization: unit segmentation, unit-graph formulation, and unit selection with summary generation. To examine the performance of the proposed method, we conduct experiments on 50 sets of Thai news articles with manually constructed reference summaries. Measured by ROUGE-1, ROUGE-2, and ROUGE-SU4, the experimental results show that (1) TEDU-based summarization outperforms paragraph-based summarization; (2) the proposed graph-based TEDU weighting with importance-based selection achieves the best performance; and (3) considering unit duplication and recalculating weights help improve summary quality.
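As a rough illustration of the last two stages, the sketch below builds a similarity graph over already-segmented units, weights the units with a PageRank-style iteration, and selects the highest-weighted units while skipping near-duplicates. It is not the article's algorithm: the cosine similarity, the damping factor, the `max_sim` threshold, and the function names `rank_units` and `select_units` are all assumptions made for illustration, and the duplicate check is a simplified stand-in for the weight recalculation mentioned above. Segmentation into units is assumed to have been done beforehand, so each unit is given as a list of tokens.

```python
from collections import Counter
import math


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-token vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_units(units, damping=0.85, iterations=50):
    """Weight units with a PageRank-style iteration (assumed scheme)."""
    n = len(units)
    bags = [Counter(u) for u in units]
    # Edge weight between two distinct units is their cosine similarity.
    sim = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    out = [sum(row) for row in sim]
    weights = [1.0 / n] * n
    for _ in range(iterations):
        weights = [
            (1 - damping) / n
            + damping * sum(weights[j] * sim[j][i] / out[j]
                            for j in range(n) if out[j] > 0)
            for i in range(n)
        ]
    return weights


def select_units(units, weights, k=2, max_sim=0.5):
    """Pick up to k units by weight, skipping near-duplicates of chosen ones."""
    bags = [Counter(u) for u in units]
    chosen = []
    for i in sorted(range(len(units)), key=lambda idx: -weights[idx]):
        if all(cosine(bags[i], bags[j]) < max_sim for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return sorted(chosen)  # restore original order for readability


# Hypothetical, pre-segmented units (English tokens stand in for Thai).
units = [
    ["flood", "hits", "bangkok"],
    ["flood", "hits", "bangkok", "today"],
    ["government", "sends", "aid"],
    ["aid", "arrives", "bangkok"],
]
weights = rank_units(units)
summary_ids = select_units(units, weights, k=2)
print([" ".join(units[i]) for i in summary_ids])
```

In this toy input the first two units are near-duplicates, so at most one of them can appear in the selected set; a fuller treatment would also lower the weights of the remaining units after each selection.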
An EDU-Based Approach for Thai Multi-Document Summarization and Its Application