Abstract
Native Language Identification (NLI) aims to identify the native language of an author by analyzing text the author has written in a non-native language. Most existing studies investigate this task for educational applications such as second language acquisition and therefore require learner corpora. This article performs NLI in the challenging context of user-generated content (UGC), where authors are fluent and advanced non-native speakers of a second language. Existing NLI studies with UGC (i) rely on content-specific or social-network features and may not generalize to other domains and datasets, (ii) are unable to capture the variations of language-usage patterns within a text sample, and (iii) lack any outlier-handling mechanism. Moreover, since a sizable number of people have acquired non-English second languages owing to economic and immigration policies, there is a need to gauge the applicability of NLI with UGC to other languages. Unlike existing solutions, we define a topic-independent feature space, which makes our solution generalizable to other domains and datasets. Based on this feature space, we present a solution that mitigates the effect of outliers in the data and helps capture the variations of language-usage patterns within a text sample. Specifically, we represent each text sample as a point set and identify the top-k stylistically similar text samples (SSTs) from the corpus. We then apply a probabilistic k-nearest-neighbors classifier to the identified top-k SSTs to predict the native languages of the authors. To conduct experiments, we create three new corpora, each written in a different language: English, French, and German. Our experimental studies show that our solution outperforms competitive methods and achieves more than 80% accuracy across languages.
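The pipeline described in the abstract (point-set representation, top-k retrieval, probabilistic vote) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several pieces are assumptions made only for illustration: the abstract does not name the point-set distance, so the modified Hausdorff distance is used here; the two-dimensional feature vectors are toy values rather than the article's topic-independent features; and the inverse-distance-weighted vote is a simplified stand-in for the probabilistic k-nearest-neighbors classifier. All function names are hypothetical.

```python
import math
from collections import defaultdict


def modified_hausdorff(a, b):
    """Modified Hausdorff distance between two point sets (assumed measure).

    Each point set is a list of equal-length numeric tuples, for example one
    stylistic feature vector per fragment of a text sample.
    """
    def directed(x, y):
        # Mean, over the points of x, of the distance to the closest point of y.
        return sum(min(math.dist(p, q) for q in y) for p in x) / len(x)

    return max(directed(a, b), directed(b, a))


def top_k_similar(query, corpus, k):
    """Return the k (distance, label) pairs closest to the query point set.

    corpus is a list of (point_set, native_language_label) pairs.
    """
    scored = [(modified_hausdorff(query, points), label) for points, label in corpus]
    scored.sort(key=lambda pair: pair[0])
    return scored[:k]


def predict_proba(neighbors, eps=1e-9):
    """Turn the retrieved neighbors into label probabilities.

    Simplified stand-in: each neighbor votes with weight 1 / (distance + eps),
    and the weights are normalized so they sum to 1.
    """
    weights = defaultdict(float)
    for dist, label in neighbors:
        weights[label] += 1.0 / (dist + eps)
    total = sum(weights.values())
    return {label: weight / total for label, weight in weights.items()}


# Toy corpus: four text samples, each a point set of two hypothetical feature vectors.
corpus = [
    ([(0.10, 0.20), (0.20, 0.10)], "French"),
    ([(0.15, 0.25), (0.10, 0.10)], "French"),
    ([(0.90, 0.80), (0.80, 0.90)], "German"),
    ([(0.85, 0.95), (0.90, 0.90)], "German"),
]
query = [(0.12, 0.18), (0.20, 0.20)]

neighbors = top_k_similar(query, corpus, k=3)
probabilities = predict_proba(neighbors)
predicted = max(probabilities, key=probabilities.get)
print(predicted, probabilities)
```

Because every fragment contributes its own point, a sample whose style shifts between fragments is compared fragment by fragment instead of being collapsed into a single averaged vector. Averaging the nearest-point distances also limits how much a single outlying fragment can affect the result.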
Native Language Identification of Fluent and Advanced Non-Native Writers