research-article
Open Access

Toward a Sustainable Handling of Interlinear-Glossed Text in Language Documentation

Published: 23 April 2021

Abstract

While the amount of digitally available data on the world’s languages is steadily increasing, with more and more languages being documented, only a small proportion of the language resources produced are sustainable. Data reuse is often difficult due to idiosyncratic formats and the neglect of standards that could help to increase the comparability of linguistic data. The sustainability problem is clearly reflected in the current practice of handling interlinear-glossed text, one of the crucial resources produced in language documentation. Although large collections of glossed texts have been produced so far, the current practice of data handling makes data reuse difficult. To address this problem, we propose a first framework for the computer-assisted, sustainable handling of interlinear-glossed text resources. Building on recent standardization proposals for word lists and structural datasets, combined with state-of-the-art methods for automated sequence comparison in historical linguistics, we show how our workflow can be used to lift a collection of interlinear-glossed texts in Qiang (an endangered language spoken in Sichuan, China), and how the lifted data can assist linguists in their research.

Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 20, Issue 2
March 2021
313 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3454116

Copyright © 2021 held by the owner/author(s). Publication rights licensed to ACM.

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 23 April 2021
• Online AM: 7 May 2020
• Revised: 1 March 2020
• Accepted: 1 March 2020
• Received: 1 November 2019


      Qualifiers

      • research-article
      • Refereed