DOI: 10.5555/1603899.1603923

Cross-entropy and linguistic typology

Published: 11 January 1998

ABSTRACT

The idea of "familial relationships" among languages is well-established and accepted, although some controversies persist in a few specific instances. By painstakingly recording and identifying regularities and similarities and comparing these to the historical record, linguists have been able to produce a general "family tree" incorporating most natural languages.

We suggest here that much of this tree can be automatically determined by a complementary technique of distributional analysis. Recent work by Farach et al. (1995) and Juola (1997) suggests that Kullback-Leibler divergence (or cross-entropy) can be meaningfully measured from small samples, in some cases as small as only 20 or so words. Using these techniques, we define and measure a distance function between translations of a small corpus (c. 70 words per sample) covering much of the accepted Indo-European family, and reconstruct a relationship tree by hierarchical cluster analysis. The resulting tree shows remarkable similarity to the accepted Indo-European family tree; we read this as evidence both for the immense power of this measurement technique and for the validity of this kind of mechanical similarity judgement in identifying typological relationships. Furthermore, this technique is in theory sensitive to different sorts of relationships than the more common word-list-based methods, and may help illuminate language relationships from a different direction.
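The pipeline the abstract describes — estimate cross-entropies from small samples, symmetrise them into a distance, then build a tree by hierarchical clustering — can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the character-bigram model with add-alpha smoothing, the `divergence_distance` symmetrisation, and the naive average-linkage clustering are all our assumptions about one plausible realisation.

```python
import math
from collections import Counter

def bigram_model(text, alpha=1.0):
    """Character-bigram model with add-alpha smoothing (an assumption;
    the paper's actual entropy estimator differs)."""
    pairs = Counter(zip(text, text[1:]))
    ctx = Counter(text[:-1])
    vocab_size = len(set(text))
    def prob(a, b):
        # Counter returns 0 for unseen keys, so unseen bigrams
        # still get a small smoothed probability.
        return (pairs[(a, b)] + alpha) / (ctx[a] + alpha * vocab_size)
    return prob

def cross_entropy(sample, model_text):
    """Average bits per bigram of `sample` under a model fit to `model_text`."""
    prob = bigram_model(model_text)
    pairs = list(zip(sample, sample[1:]))
    return -sum(math.log2(prob(a, b)) for a, b in pairs) / len(pairs)

def divergence_distance(x, y):
    """Symmetrised, KL-divergence-like distance: the extra cost of
    coding each sample with the other's model rather than its own."""
    return ((cross_entropy(x, y) - cross_entropy(x, x)) +
            (cross_entropy(y, x) - cross_entropy(y, y))) / 2

def build_tree(named_texts):
    """Naive average-linkage agglomerative clustering; returns the
    reconstructed family tree as nested tuples of sample names."""
    nodes = [(name, [text]) for name, text in named_texts]
    def link(a, b):
        return sum(divergence_distance(x, y)
                   for x in a for y in b) / (len(a) * len(b))
    while len(nodes) > 1:
        # merge the closest pair of clusters
        i, j = min(((i, j) for i in range(len(nodes))
                    for j in range(i + 1, len(nodes))),
                   key=lambda ij: link(nodes[ij[0]][1], nodes[ij[1]][1]))
        (na, ta), (nb, tb) = nodes[i], nodes[j]
        nodes = [n for k, n in enumerate(nodes)
                 if k not in (i, j)] + [((na, nb), ta + tb)]
    return nodes[0][0]
```

With toy English and German snippets standing in for the paper's c. 70-word translations, identical texts come out at distance zero and the two English-like samples merge before the German one, mirroring the clustering behaviour the abstract reports at much larger scale.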

References

  1. Ronald Eaton Asher and J. M. Y. Simpson, editors. 1994. The Encyclopedia of Language and Linguistics. Pergamon, Oxford.
  2. Christopher M. Bishop. 1995. Neural Networks for Pattern Recognition. Clarendon Press, Oxford.
  3. William Bright, editor. 1992. International Encyclopedia of Linguistics. Oxford University Press, Oxford.
  4. Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, Jennifer C. Lai, and Robert L. Mercer. 1992. An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1).
  5. David Crystal. 1987. The Cambridge Encyclopedia of Language. Cambridge University Press, Cambridge, UK.
  6. Martin Farach, Michiel Noordewier, Serap Savari, Larry Shepp, Abraham Wyner, and Jacob Ziv. 1995. On the entropy of DNA: Algorithms and measurements based on memory and rapid convergence. In Proceedings of the 6th Annual Symposium on Discrete Algorithms (SODA95). ACM Press.
  7. Edward Finegan and Niko Besnier. 1987. Language, Its Structure and Use. Harcourt Brace Jovanovich, San Diego.
  8. Peter Forster, Alfred Toth, and Hans-Juergen Bandelt. In press. Phylogenetic network analysis of word lists. Journal of Quantitative Linguistics.
  9. H. A. Gleason. 1955. Introduction to Descriptive Linguistics. Holt, Rinehart and Winston, New York.
  10. Patrick Juola. 1997. What can we do with small corpora? Document categorization via cross-entropy. In Proceedings of an Interdisciplinary Workshop on Similarity and Categorization, Edinburgh, UK. Department of Artificial Intelligence, University of Edinburgh.
  11. Donald A. Ringe. 1992. On Calculating the Factor of Chance in Language Comparison, volume 82 of Transactions of the American Philosophical Society. American Philosophical Society.
  12. Claude Elwood Shannon. 1948. A mathematical theory of communication. Bell System Technical Journal, 27:379--423.
  13. Claude Elwood Shannon. 1951. Prediction and entropy of printed English. Bell System Technical Journal, 30:50--64.
  14. Morris Swadesh. 1955. Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics, 21:121--37.
  15. Tandy Warnow. 1997. Mathematical approaches to comparative linguistics. Proceedings of the National Academy of Sciences of the USA, 94:6585--90.
  16. Abraham J. Wyner. In press. Entropy estimation and patterns.

  • Published in

    NeMLaP3/CoNLL '98: Proceedings of the Joint Conferences on New Methods in Language Processing and Computational Natural Language Learning
    January 1998
    332 pages
    ISBN: 0725806346

    Publisher

    Unknown

    Publication History

    • Published: 11 January 1998

    Qualifiers

    • research-article
