short-paper

Native Language Identification of Fluent and Advanced Non-Native Writers

Published: 11 April 2020

Abstract

Native Language Identification (NLI) aims to identify the native language of an author by analyzing text samples that the author has written in a non-native language. Most existing studies investigate this task for educational applications such as second language acquisition, and they require learner corpora. This article performs NLI in the challenging context of user-generated content (UGC), where the authors are fluent and advanced non-native speakers of a second language. Existing NLI studies with UGC (i) rely on content-specific or social-network features and may not generalize to other domains and datasets, (ii) are unable to capture the variations of language-usage patterns within a text sample, and (iii) have no outlier-handling mechanism. Moreover, because many people have acquired second languages other than English as a result of economic and immigration policies, there is a need to gauge the applicability of NLI with UGC to other languages. Unlike existing solutions, we define a topic-independent feature space, which makes our solution generalizable to other domains and datasets. Based on this feature space, we present a solution that mitigates the effect of outliers in the data and helps capture the variations of language-usage patterns within a text sample. Specifically, we represent each text sample as a point set and identify the top-k stylistically similar text samples (SSTs) from the corpus. We then apply a probabilistic k-nearest-neighbors classifier to the identified top-k SSTs to predict the native languages of the authors. To conduct experiments, we create three new corpora, each written in a different language: English, French, and German. Our experimental studies show that our solution outperforms competitive methods and achieves more than 80% accuracy across the three languages.
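
The pipeline described in the abstract (point sets, top-k SSTs, a probabilistic nearest-neighbor vote) can be illustrated with a minimal sketch. Several parts are assumptions, not details taken from the abstract: each text sample is assumed to be already converted into a NumPy array whose rows are feature vectors (for example, one per text fragment); the modified Hausdorff distance is assumed as the point-set distance; and a simple distance-weighted vote stands in for the probabilistic k-nearest-neighbors classifier. The function names, the `beta` parameter, and the toy data are hypothetical.

```python
import numpy as np


def modified_hausdorff(a, b):
    """Modified Hausdorff distance between two point sets (rows are points)."""
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    # Mean nearest-neighbor distance in each direction; keep the larger one.
    # Averaging (rather than taking a max) damps the effect of outlier points.
    return max(d.min(axis=1).mean(), d.min(axis=0).mean())


def top_k_ssts(query, corpus, k):
    """Indices and distances of the k corpus samples closest to the query."""
    dists = np.array([modified_hausdorff(query, sample) for sample in corpus])
    order = np.argsort(dists)[:k]
    return order, dists[order]


def predict_native_language(query, corpus, labels, k=3, beta=1.0):
    """Distance-weighted vote over the top-k SSTs, normalized to probabilities.

    This is a simplified stand-in (an assumption) for a probabilistic
    k-nearest-neighbors classifier, not the authors' exact formulation.
    """
    idx, dists = top_k_ssts(query, corpus, k)
    weights = np.exp(-beta * dists)
    scores = {}
    for i, w in zip(idx, weights):
        scores[labels[i]] = scores.get(labels[i], 0.0) + w
    total = sum(scores.values())
    probs = {lang: s / total for lang, s in scores.items()}
    return max(probs, key=probs.get), probs


# Toy data: four text samples, each a point set of two 2-D feature vectors.
corpus = [
    np.array([[0.0, 0.0], [0.0, 1.0]]),
    np.array([[0.1, 0.0], [0.0, 1.1]]),
    np.array([[5.0, 5.0], [5.0, 6.0]]),
    np.array([[5.2, 5.0], [5.0, 6.1]]),
]
labels = ["de", "de", "fr", "fr"]  # hypothetical native-language labels
query = np.array([[0.05, 0.0], [0.0, 0.9]])

lang, probs = predict_native_language(query, corpus, labels, k=3)
print(lang, probs)
```

Because the vote is restricted to the k closest samples and weighted by distance, a distant sample that happens to enter the top-k contributes very little to the final probabilities.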

