Abstract
While the amount of digitally available data on the world’s languages is steadily increasing, with more and more languages being documented, only a small proportion of the language resources produced are sustainable. Data reuse is often difficult due to idiosyncratic formats and a neglect of standards that could help to increase the comparability of linguistic data. The sustainability problem is clearly reflected in the handling of interlinear-glossed text, one of the crucial resources produced in language documentation: although large collections of glossed texts have been produced, current data-handling practice makes their reuse difficult. To address this problem, we propose a first framework for the computer-assisted, sustainable handling of interlinear-glossed text resources. Building on recent standardization proposals for word lists and structural datasets, combined with state-of-the-art methods for automated sequence comparison in historical linguistics, we show how our workflow can be used to lift a collection of interlinear-glossed texts in Qiang (an endangered language spoken in Sichuan, China), and how the lifted data can assist linguists in their research.
Toward a Sustainable Handling of Interlinear-Glossed Text in Language Documentation