Abstract
Over the past decade, a mixture of optical character recognition and manual input has produced a growing corpus of Tibetan literature available as e-texts in Unicode format. With such a corpus in place, the text analytics techniques that have been applied to English and other modern languages may now be applied to Tibetan. In this work, we narrow our focus to a modest portion of that literature: the Mind-section literature of the Tibetan tradition of the Great Perfection. We use text analytics tools based on machine learning techniques to investigate a number of questions of interest to scholars of this and related traditions of the Great Perfection. It has been necessary for us to participate in every stage of this process: corpus identification and text edition selection, rendering the texts as Unicode e-texts using both optical character recognition and manual entry, data cleaning and transformation, implementation of software for text analysis, and interpretation of results. For this reason, we hope this study can serve as a model for researchers working on other low-resource languages who are just beginning to approach the problem of providing text analytics for their language.
Applying Text Analytics to the Mind-section Literature of the Tibetan Tradition of the Great Perfection