research-article

From Image to Translation: Processing the Endangered Nyushu Script

Published: 16 May 2016

Abstract

The lack of computational support has significantly slowed the automatic understanding of endangered languages. In this paper, we take Nyushu (simplified Chinese: 女书; literally: “women’s writing”) as a case study and present the first computational approach that combines Computer Vision and Natural Language Processing techniques to understand an endangered language. We developed an end-to-end system that reads a scanned handwritten Nyushu article, segments it into characters, links those characters to standard characters, and then translates the article into Mandarin Chinese. We propose several novel methods to address the challenges introduced by noisy input and low resources, including Nyushu-specific feature selection for character segmentation and linking, and machine translation based on a character linking lattice. The end-to-end performance indicates that the approach is promising and that the system can serve as a standard benchmark.
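As a rough illustration of what decoding over a character linking lattice can look like in general, the sketch below keeps several candidate standard characters per segmented glyph and picks the sequence that best balances linking confidence against a contextual score. It is not the paper's implementation: the function name `best_path`, the toy bigram table, the `default_logp` back-off value, and the placeholder symbols "A" to "D" are all assumptions made for illustration.

```python
import math


def best_path(lattice, bigram_logp, default_logp=-5.0):
    """Viterbi search over a toy character linking lattice.

    lattice: list of positions; each position is a list of
        (candidate, link_prob) pairs with link_prob > 0.
    bigram_logp: dict mapping (prev, cur) to a log-probability
        (a hypothetical stand-in for a real language or translation model).
    default_logp: assumed back-off score for unseen bigrams.
    """
    if not lattice:
        return []
    # scores[c] = (best log-score of any path ending in c, that path)
    scores = {c: (math.log(p), [c]) for c, p in lattice[0]}
    for position in lattice[1:]:
        new_scores = {}
        for cur, p in position:
            # Extend every surviving path by `cur` and keep the best one.
            total, path = max(
                (
                    (s + bigram_logp.get((prev, cur), default_logp) + math.log(p), prev_path)
                    for prev, (s, prev_path) in scores.items()
                ),
                key=lambda item: item[0],
            )
            new_scores[cur] = (total, path + [cur])
        scores = new_scores
    return max(scores.values(), key=lambda item: item[0])[1]


# Placeholder symbols stand in for candidate standard characters.
lattice = [[("A", 0.6), ("B", 0.4)], [("C", 0.5), ("D", 0.5)]]
bigram_logp = {("B", "D"): -0.1, ("A", "C"): -3.0}
decoded = best_path(lattice, bigram_logp)
```

In this toy input the top-ranked linking candidate at the first position ("A") loses to "B" once the contextual score is added, which is the motivation for passing a lattice rather than a single best guess to the downstream step.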

      • Published in

        ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 15, Issue 4
        June 2016
        173 pages
        ISSN: 2375-4699
        EISSN: 2375-4702
        DOI: 10.1145/2915955

        Copyright © 2016 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 16 May 2016
        • Accepted: 1 December 2015
        • Revised: 1 October 2015
        • Received: 1 May 2015
        Qualifiers

        • research-article
        • Research
        • Refereed
