Abstract
The lack of computational support has significantly slowed the automatic understanding of endangered languages. In this paper, we take Nyushu (simplified Chinese: 女书; literally "women's writing") as a case study and present the first computational approach that combines Computer Vision and Natural Language Processing techniques to understand an endangered language in depth. We developed an end-to-end system that reads a scanned handwritten Nyushu article, segments it into characters, links them to standard characters, and then translates the article into Mandarin Chinese. We propose several novel methods to address the challenges introduced by noisy input and low resources, including Nyushu-specific feature selection for character segmentation and linking, and machine translation based on a character-linking lattice. The end-to-end performance indicates that the approach is promising and that the system can serve as a standard benchmark.
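To make the idea of a character-linking lattice more concrete, the following is a minimal sketch and not the system described in the paper. It assumes that each segmented glyph comes with several candidate standard characters and a linking probability, and that a downstream model scores adjacent characters. The function `best_path`, the scorer `toy_bigram`, the lattice `example_lattice`, and all probabilities are hypothetical and chosen only for illustration. The sketch covers only the choice of a path through the lattice, not translation into Mandarin Chinese.

```python
import math

def best_path(lattice, bigram_logprob):
    """Viterbi search over a character-linking lattice (illustrative only).

    lattice: list of positions; each position is a list of
             (candidate, link_probability) pairs with probability > 0.
    bigram_logprob: function (prev, cur) -> log-probability of cur after prev.
    Returns the highest-scoring sequence of candidates.
    """
    if not lattice:
        return []
    # scores[c] = (best log-score of any path ending in c, that path)
    scores = {c: (math.log(p), [c]) for c, p in lattice[0]}
    for position in lattice[1:]:
        new_scores = {}
        for cand, p in position:
            # Pick the best predecessor for this candidate.
            best_score, best_prev_path = max(
                ((s + bigram_logprob(prev, cand), path)
                 for prev, (s, path) in scores.items()),
                key=lambda t: t[0],
            )
            new_scores[cand] = (best_score + math.log(p), best_prev_path + [cand])
        scores = new_scores
    return max(scores.values(), key=lambda t: t[0])[1]

# Hypothetical context model: the pair ("B", "D") is likely, all others are not.
def toy_bigram(prev, cur):
    return math.log(0.9) if (prev, cur) == ("B", "D") else math.log(0.1)

# Hypothetical lattice: two glyphs, each with two linking candidates.
example_lattice = [
    [("A", 0.6), ("B", 0.4)],
    [("C", 0.5), ("D", 0.5)],
]

result = best_path(example_lattice, toy_bigram)
```

In this toy case the top-ranked link for the first glyph is "A", but the context score makes the path through "B" and "D" the best overall. This is the general motivation for passing a lattice of candidates downstream rather than committing to a single noisy linking decision.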
From Image to Translation: Processing the Endangered Nyushu Script