Abstract
Machine translation has been a major motivation for development in natural language processing. Despite the burgeoning achievements in creating more efficient machine translation systems, thanks to deep learning methods, parallel corpora have remained indispensable for progress in the field. In an attempt to create parallel corpora for the Kurdish language, in this article, we describe our approach to retrieving potentially alignable news articles from multi-language websites and manually aligning them across dialects and languages based on lexical similarity and transliteration of scripts. We present a corpus containing 12,327 translation pairs in the two major dialects of Kurdish, Sorani and Kurmanji. We also provide 1,797 and 650 translation pairs in English-Kurmanji and English-Sorani, respectively. The corpus is publicly available under the CC BY-NC-SA 4.0 license.1
1 INTRODUCTION
For over half a century, machine translation has been one of the most studied subjects in natural language processing (NLP) [12, 22]. Although the operating principles of machine translation have constantly improved, from rule-based methods to statistical and neural network approaches, parallel corpora have remained essential components for efficiently addressing the complexity of human language in the translation task. A parallel corpus contains translation pairs in two languages or dialects that can be used for training translation models and learning the alignment of words and their placement within phrases. Creating such a resource is a tedious and time-consuming task that requires thorough linguistic knowledge of the source and target languages. Oftentimes, lack of financial support further constrains the development of such resources for less-resourced languages, particularly Kurdish [8].
Multi-language news websites often provide similar content in different languages or dialects based on the same news source. Although the choices of the translators and editors determine how the original article is narrated in each language or dialect, such related news articles usually overlap significantly. Recently, parallel corpus filtering and alignment of text crawled from the Web have gained more attention in the machine translation community [27, 29, 43, 45].
In the same vein, we create a parallel corpus for the Kurdish language by collecting news articles from some of the multilingual Kurdish news websites. Relying on key elements of a news article, such as date of publication, topic, and image URL, our approach filters articles at the document level. Given the diversity of the alphabets in our case, i.e., the Arabic-based Kurdish alphabet for content in Sorani and the Latin-based alphabet for English and Kurmanji, we also use transliteration to calculate basic string similarities. The most similar headlines of the filtered documents are then provided to native annotators who verify the relatedness of the news articles. This way, we could collect 1,452 Sorani-Kurmanji, 282 English-Sorani, and 277 English-Kurmanji articles. Following this step, the content of the relevant articles is automatically extracted and manually aligned at the sentence level, yielding 12,327, 1,797, and 650 translation pairs in Sorani-Kurmanji, Kurmanji-English, and Sorani-English, respectively.
The rest of the article is organized as follows: We first provide a description of the previous work in the creation and alignment of parallel corpora and also present the available resources for Kurdish in Section 2. In Section 3, we briefly describe some of the grammatical aspects of Kurdish and English that are important in translation. Section 4 presents our approach on how the data is retrieved and aligned. Our parallel corpus is evaluated in Section 5. The article is concluded in Section 6.
2 RELATED WORK
In the early days of the Web, Resnik and Smith [40] discussed the use of Web content for developing parallel corpora. Where digitized translated literature or other documents that could form the basis of parallel corpora are absent or scarce, Web content has become a significant resource for parallel corpus development. The literature reports on such use in various cases, particularly for less-resourced languages [13, 37, 38]. For instance, Inoue et al. [23] develop an Arabic-Japanese parallel corpus based on news articles that are then manually aligned at the sentence level. That said, given the diversity of themes in Web content, the representativeness of a corpus built from it can become an issue [46]. Regardless, news content, whether online or paper-based, has remained one of the main sources for parallel corpus development [18, 23, 36, 50].
Regarding the Kurdish language, efforts have increased recently to create language resources, such as lexicographical resources [5], monolingual corpora [1, 16], dialect corpora [32], and even a folkloric corpus [4]. These have improved the situation that was reported by Hassani [21]. Moreover, the construction of inter-dialectal resources for Kurdish has been of interest previously. Hassani [20] studies the application of word-by-word translation for translating Kurmanji to Sorani using a bi-dialectal dictionary. The study aims to evaluate the efficiency of the method in the absence of parallel corpora. Although the experiments show a reasonable outcome, the study reports unnaturalness in the translation.
Fewer resources include Kurdish parallel texts. The Tanzil corpus,2 which is a compilation of Quran translations, various Bible translations,3 the TED corpus4 [11], and KurdNet, the Kurdish WordNet [7], provide translations in Sorani Kurdish. Ahmadi and Masoud [6] use these resources to create machine translation systems for Sorani Kurdish and report many performance issues related to the quality of the available parallel data. Kurmanji, in contrast, has received more attention in machine translation. For instance, Ataman [9] reports on the creation of a parallel corpus for Kurmanji-Turkish-English. Moreover, Google Translate,5 the Google translation service, includes Kurmanji in its list of languages. Although the underlying resources are not openly available, we believe that crowd-sourcing efforts contribute to such services.
In this work, we report our effort to create a parallel corpus for the Sorani and Kurmanji dialects of Kurdish and, as a preliminary step, Sorani-English and Kurmanji-English parallel corpora, based on the content of Kurdish news websites.
3 KURDISH LANGUAGE
3.1 Alphabets and Dialects
Some scholars categorize Kurdish as a dialect continuum in which intelligibility varies from region to region [19]. Generally, Kurdish is believed to have three main dialects: Northern Kurdish (Kurmanji), Central Kurdish (Sorani), and Southern Kurdish [34]. These three dialects are spoken by 20–30 million speakers in the Kurdish regions of Iraq, Iran, Turkey, and Syria [5]. Many multi-dialect languages, such as Arabic or Chinese, have mutually unintelligible dialects, but they usually also have a standard form that regulates communication among the speakers. Regarding Kurdish, although the standardization of the language, in both written and spoken forms, has been widely discussed, there is still no consensus among scholars or speakers [26]. As a result, the language is written in many scripts, mainly Arabic-based and Latin-based, and each dialect is used as a distinct language in the media [20, 47]. Table 1 provides the alphabets used for writing Kurdish in a comparative way.
3.2 Vocabulary
The lexical diversity and richness of Kurdish have been previously attested by many lexicographers [10, 15, 30, 31, 44]. This diversity is such that the vocabulary may vary from one village to another. Moreover, being in contact with many regional languages, especially Arabic, Persian, Turkish, and Armenian, and with local languages, particularly Zazaki and Gorani, almost all Kurdish dialects have absorbed many lexical borrowings [14]. Because Kurdish has an oral tradition of narrating poetry and prose, lexicographers have also considered oral literature as a source of vocabulary [4]. In addition, there is an ongoing struggle to develop modern technical terminologies for the language.
Regarding Kurdish lexicographic resources, Reference [5] surveys the current state of Kurdish lexicography and states that, despite the scarcity of electronic resources for Kurdish, there are over 71 dictionaries and terminological resources for Kurdish, not all of which are retro-digitized.
3.3 Grammar
Despite the lexical similarity between the dialects of Kurdish, there are differences when it comes to grammar, particularly due to morphological constructions. Sorani tends to have a more complex morphological construction, while Kurmanji is less inflected. For instance, passive voice in Sorani is derived from the transitive verbs, while in Kurmanji, passive voice has a simpler construction where a compound is created by adding the auxiliary verb hatin “to come” to the transitive verb without any major morphological modification [49]. In addition, Sorani has a full article marking system where nouns are marked as definite, indefinite, demonstrative in singular and plural forms, while articles in Kurmanji are marked only in definite and demonstrative cases [24].
Unlike Sorani and English, Kurmanji has two grammatical genders, feminine and masculine, which implies grammatical agreement, particularly in Izafa (also known as Ezafe) constructions [42]. The Izafa construction refers to the use of a grammatical particle to form noun phrases or adjective phrases. This particle is realized as -ê, -ekî, -a, -eke, and -ên in Kurmanji and as -î and -e in Sorani [41, 48]. Although the particle is not translated in adjective phrases, e.g., xanîyê biçûk “the small house,” in noun phrases it is usually translated as “of,” e.g., xanîyê wî mirovî “the house of that man.”
Table 2 provides some of the major grammatical characteristics of Kurmanji, Sorani, and English. Both Kurdish dialects have a subject-object-verb order for present tenses and intransitive verbs and an agent-object-verb order for transitive verbs in the past tense. The property whereby the subject of an intransitive verb is treated like the object (patient) of a transitive verb, here in the past tenses, is known as ergativity and exists in Kurdish [25]. Unlike Kurmanji Kurdish, which uses the oblique case of pronouns for this purpose, Sorani Kurdish uses only pronominal clitics to mark such an alignment [17].
| Language | Word order | Passive | Gender | Case | Alignment |
|---|---|---|---|---|---|
| Kurmanji Kurdish | S-O-V | periphrastic with hatin (to come) [48] | feminine, masculine [48] | nominative, oblique, Izafa, vocative [48] | nominative–accusative, only in past transitive ergative–absolutive [33] |
| Sorani Kurdish | S-O-V | morphological [49] | no gender [49] | nominative, locative, vocative [35] | nominative–accusative, only in past transitive ergative–absolutive [25] |
| English | S-V-O | periphrastic | no gender | nominative, oblique, genitive only for personal pronouns | nominative–accusative |
Table 2. A Comparison of the Sorani and Kurmanji Dialects of Kurdish with English
It is worth mentioning that variations exist among Sorani subdialects, particularly the dialects that are categorized as Northern Sorani in Reference [34], which make use of oblique cases and grammatical gender to some extent.
4 METHODOLOGY
Multilingual news websites contain a large number of articles in various languages that can be considered a potentially parallel corpus. However, none of the major Kurdish news agencies listed in Table 3 explicitly links identical articles across languages, e.g., by using reference keys, an identical URL schema, or a news code. Moreover, only a few of them provide the same content in various languages. For instance, the English articles on BasNews differ in content and topic from the Kurdish ones.
| agency | languages |
|---|---|
| Rûdaw | Sorani, Kurmanji, English, Arabic, Turkish |
| Voice of America | Sorani, Kurmanji, English, Turkish, and many more |
| Kurdistan24 | Sorani, Kurmanji, English, Arabic, Turkish, Persian |
| KNN | Sorani, English, Arabic |
| Firat News Agency | Sorani, Kurmanji, Zazaki, Gorani, English, Arabic, Turkish, Persian, German, Russian, Spanish |
| Bianet | Kurmanji, English, Turkish |
| BasNews | Sorani, Kurmanji, English, Arabic, Turkish, Persian |
| KurdPa | Sorani, Kurmanji, English, Persian |
| GulanMedia | Sorani, Kurmanji, English, Arabic |
| NRT | Sorani, English, Arabic |
| SaharTV | Sorani, Kurmanji, English, Persian |
Table 3. List of News Agencies Providing Content in Kurdish and Their Content Management Status
In this section, we describe our approach, which is illustrated in Figure 1, to create a parallel corpus of Sorani, Kurmanji and English. We refer to these three as languages for ease of reference.
Fig. 1. Our approach to automatically retrieve identical news articles.
4.1 Data Crawling
As the first step, we crawl the content of news websites. Our selection criteria are the editorial quality of the articles, the accessibility of the data for automatic scraping, and, more importantly, multilingualism. Therefore, we selected Firat News Agency (ANF), BasNews (BN), and KurdPa (KP). Despite the remarkable number of articles published on Rûdaw and Kurdistan24, we could not include those websites due to crawling restrictions. Moreover, our findings regarding the alignment of Voice of America were not satisfactory due to the sparsity of topics across languages.
Once the news articles are crawled, we clean the HTML files and extract the following information from each page:
- tag: a list of the tags used for identifying the article. For this purpose, the bashakan, cat-links, and keywords tags were originally used in BN, KP, and VOA, respectively. In the case of ANF, we used the page hyperlink structure to extract the topic and used it as a tag.
- original_link: the original link to the article on the website.
- dialect: the dialect of the article, retrieved using the link schema, usually so for Sorani and ku for Kurmanji.
- entry-title: the news headline.
- entry-lead: the news sub-headline, if provided.
- date: the publication date of the article. We unified all the date formats based on the Gregorian calendar given the variety of calendars, e.g., Kurdish or Persian calendars.
- entry-content: a list containing paragraphs, i.e., <p>, provided in the content of each news article. The content of our target websites is originally marked with the <entry-content> tag.
- imgs: assuming that relevant news articles link to the same multimedia content with the same hyperlink, we retrieve the hyperlinks associated with the <img> tags within the body of the article.
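As an illustration of the extraction step described above, the following minimal sketch collects paragraph text and image hyperlinks from a cleaned page using only the Python standard library. The `Article` record, the `ArticleParser` class, and the `parse_article` helper are hypothetical names introduced here, not part of the released tooling, and the sketch ignores site-specific markup such as the container around the article body.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Article:
    """One crawled page; field names loosely mirror the extracted fields (hypothetical record)."""
    original_link: str
    dialect: str
    paragraphs: list = field(default_factory=list)
    imgs: list = field(default_factory=list)


class ArticleParser(HTMLParser):
    """Collects the text of every <p> element and the src of every <img> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self.imgs = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buf = []
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.imgs.append(src)

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            text = "".join(self._buf).strip()
            if text:  # skip empty paragraphs
                self.paragraphs.append(text)
            self._in_p = False


def parse_article(html, link, dialect):
    """Parse one page into an Article; the dialect code is supplied by the caller."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return Article(original_link=link, dialect=dialect,
                   paragraphs=parser.paragraphs, imgs=parser.imgs)
```

A real crawler would also need to read the headline, tags, and date, whose markup differs from one website to another.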
In addition to the HTML tags, in some cases, we could use JSON-LD and the meta tags, i.e.,
4.2 Corpus Filtering
Given two sets of articles of the same news website in two languages, we consider two articles alignable if they have at least one common tag and identical publication dates with the exact month and year. Intuitively speaking, two articles published in two different years with two different tags (topics) are less likely to contain the same content. In addition to this, we also use
Moreover, as several news articles could be published with the same tags within the same date range, we further filter the candidate articles by comparing the headlines. To do so, we calculate the similarity of the headlines based on a simple string sequence matching scorer. In the case of Sorani, as it is written in the Arabic-based alphabet, we first transliterate the Sorani text, using Wergor [2], into the Latin-based script that is used for Kurmanji and English.
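The headline comparison can be sketched as follows. This is an assumption-laden illustration: the text does not name the scorer, so Python's `difflib.SequenceMatcher` stands in for "a simple string sequence matching scorer", and `toy_transliterate` uses a four-letter table invented for this example in place of the Wergor transliteration rules.

```python
from difflib import SequenceMatcher

# Toy character table for illustration only; this is NOT the Wergor rule set.
TOY_TABLE = {
    "\u06a9": "k",  # Arabic-script kaf as used in Sorani
    "\u0648": "u",  # waw
    "\u0631": "r",  # reh
    "\u062f": "d",  # dal
}


def toy_transliterate(text):
    """Map Arabic-script characters to Latin ones; unknown characters pass through."""
    return "".join(TOY_TABLE.get(ch, ch) for ch in text)


def headline_similarity(a, b):
    """Similarity in [0, 1] between two Latin-script headlines, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()
```

With such a scorer, a transliterated Sorani headline and a Kurmanji headline that share most of their characters receive a score close to 1, while unrelated headlines score much lower.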
As the final step, among the candidate headlines, we retrieve the top five most similar headlines in the other language. These headlines are then provided in spreadsheets to native annotators who determine if two headlines correspond to the same news content using a drop-down list. If two headlines are literal translations and refer to the same content, then they are specified as equivalent. However, this is not always the case, as some headlines are paraphrases and rewritten in such a way that they attract the readers’ attention. In such cases where two headlines refer to the same content but are not literal translations, they are annotated as possible. Although we do not consider such headlines as a translation pair, they are essential to retrieve relevant contents. In the cases where the headlines do not provide sufficient information to decide their relatedness, annotators are asked to check the crawled data in the two languages manually. Figure 2 in Appendix A illustrates an annotation example in Kurmanji and English.
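Put together, the document-level filter and the retrieval of the five most similar headlines might look like the sketch below. The dictionary keys (`title`, `tag`, `date`) and the function names are hypothetical, the date condition is read here as "same month and year", headlines are assumed to be in Latin script already, and `difflib.SequenceMatcher` is again only a stand-in for the scorer.

```python
from datetime import date
from difflib import SequenceMatcher
from heapq import nlargest


def alignable(a, b):
    """Document-level filter: at least one shared tag and the same month and year."""
    same_month = (a["date"].year, a["date"].month) == (b["date"].year, b["date"].month)
    return same_month and bool(set(a["tag"]) & set(b["tag"]))


def top_candidates(source, targets, k=5):
    """Return up to k alignable target articles, most similar headline first."""
    pool = [t for t in targets if alignable(source, t)]

    def score(t):
        return SequenceMatcher(None, source["title"].lower(), t["title"].lower()).ratio()

    return nlargest(k, pool, key=score)
```

The returned candidates correspond to the rows an annotator would see for one source headline before marking each pair as equivalent or possible.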
4.3 Content Alignment
As the result of the previous steps, a list of the alignable articles of the same news website in two languages is available. Using the aligned headlines, we collect their contents, i.e., the content of
In the manual alignment task, we extract translation pairs based on the following guidelines:
(1) the length of the sentences or phrases should be within a reasonable range. If too long, then they are to be split into smaller phrases;
(2) idiomatic translations are validated as long as they do not add to the size of the sentence significantly;
(3) if the translation of a sentence is provided in many separate sentences or phrases, then the annotator is allowed to merge the sentences to create a valid translation pair;
(4) if two sentences can be validated with slight modifications, such as punctuation marks or digits, then the annotator is allowed to edit the content.
5 EVALUATION
Table 4 presents basic statistics of the corpus where the whole number of crawled articles and the number of retrieved articles among them are provided. We also specify the number of articles that are retrieved using multimedia hyperlinks using
In all the translation pairs, 17 to 20 tokens are on average present in each sentence. In contrast, the average number of tokens in Tanzil, TED, and KurdNet corpora is, respectively, around 25, 70, and 6. As such, we believe that our resources are comparatively better when it comes to automatic alignment.
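For reference, an average sentence length of this kind can be computed as in the following sketch, which assumes whitespace tokenization; the function name is introduced here for illustration only.

```python
def average_tokens(sentences):
    """Mean number of whitespace-separated tokens per sentence (0.0 for no input)."""
    counts = [len(s.split()) for s in sentences]
    return sum(counts) / len(counts) if counts else 0.0
```

A language-specific tokenizer would likely yield different counts for morphologically complex word forms.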
In addition to the basic statistics, we used Moses [28] to test and evaluate the use of the corpus in statistical machine translation. We divided the corpus into two randomly selected sets: 90% as a training set and 10% as a test set. The training set received a higher percentage because of the relatively small size of our corpora. We prepared the random selection scripts in such a way that the whole experiment is reproducible. We trained Moses according to its recommended procedures.7 We also tested the accuracy of the system following the Moses guideline, which provides the BLEU [39] evaluation based on the test set. Table 5 presents the BLEU scores for the Sorani-English, Kurmanji-English, and Sorani-Kurmanji data.
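A reproducible random 90/10 split of this kind can be sketched as follows. The function name and the seed value are assumptions made for this example; the selection scripts used in the experiment are not reproduced here.

```python
import random


def split_corpus(pairs, test_ratio=0.1, seed=42):
    """Shuffle translation pairs with a fixed seed and return (train, test)."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]
```

Because the generator is seeded, calling the function twice on the same list of pairs returns the same training and test sets, which is what makes the experiment repeatable.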
In addition to a considerable amount of data, the performance of the baseline system relies on other important tasks, particularly tokenization. Given that in our baseline system sentences are tokenized based on spaces, we believe that the performance could be improved significantly with a language-specific tokenization tool for Kurdish, such as the one described in Reference [3]. This task is of importance due to the morphologically complex word forms in Kurdish that make alignment of sentences challenging.
6 CONCLUSION AND FUTURE WORK
In this article, we report our efforts in creating a parallel corpus for the Kurdish language as a less-resourced language. Given that manual translation is an expensive and tedious task, we used the content of multilingual Kurdish news websites to extract potentially alignable Sorani, Kurmanji, and English sentences in a semi-automatic manner. The candidate sentences are then provided to native speakers to validate if they are translation pairs. This way, the task of translation is carried out as an annotation task. Our corpus contains 12,327 Sorani-Kurmanji, 1,797 Kurmanji-English, and 650 Sorani-English translation pairs.
As material for machine translation, we believe that our resource can pave the way for further developments in Kurdish machine translation. To facilitate the alignment of news articles, we also propose that a referencing mechanism be embedded within each news article so that corresponding texts can be linked more easily in the future. Our approach could also be used to extend the current corpus or to create new corpora for the other dialects of Kurdish. Furthermore, machine translation and its related tasks remain important future work for Kurdish, especially using more advanced techniques that rely on neural network methods.
A APPENDIX
Fig. A.2. An example of the alignment of headlines. For each headline in English (left column), the five most similar headlines among the filtered Kurmanji headlines are provided. Using the drop-down list in the middle column, the annotator determines whether two headlines are literal translations by selecting equivalent, or whether they are not literal translations but correspond to each other by selecting possible.
Fig. A.3. Examples of good translation pairs in our corpus.
Footnotes
5 https://translate.google.com/.
6 https://wanthalf.saga.cz/intertext.
7 According to the Moses Baseline system: http://www.statmt.org/moses/?n=Moses.Baseline.
- [1] 2019. Developing a fine-grained corpus for a less-resourced language: The case of Kurdish. In Proceedings of the ACL Widening Natural Language Processing Workshop (WiNLP ACL’19).
- [2] 2019. A rule-based Kurdish text transliteration system. Asian Low-Resour. Lang. Inf. Process. 18, 2 (2019), 18:1–18:8.
- [3] 2020. A tokenization system for the Kurdish language. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects. International Committee on Computational Linguistics (ICCL), 114–127. Retrieved from https://aclanthology.org/2020.vardial-1.11.
- [4] 2020. A Corpus of the Sorani Kurdish folkloric lyrics. In Proceedings of the 1st Joint Spoken Language Technologies for Under-resourced Languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL) Workshop at the 12th International Conference on Language Resources and Evaluation (LREC).
- [5] 2019. Towards electronic lexicography for the Kurdish language. In Proceedings of the eLex Conference. 881–906.
- [6] 2020. Towards machine translation for the Kurdish language. In Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages. Association for Computational Linguistics, 87–98. Retrieved from https://aclanthology.org/2020.loresmt-1.12.
- [7] 2014. Towards building Kurdnet, the Kurdish Wordnet. In Proceedings of the 7th Global Wordnet Conference. 1–6.
- [8] 2012. Toward computational processing of less resourced languages: Primarily experiments for Moroccan Amazigh language. In Theory and Applications for Advanced Text Mining, Ch 9, Shigeaki Sakurai (Eds.).
- [9] 2018. Bianet: A parallel news corpus in Turkish, Kurdish and English. arXiv preprint arXiv:1805.05095 (2018).
- [10] 2009. Ferheng: Kurdî, Kurdî (Kurdish-Kurdish dictionary) (Kurmanji). Vol. 2. Avesta.
- [11] 2012. WIT3: Web inventory of transcribed and translated talks. In Conference of European Association for Machine Translation. 261–268.
- [12] 2012. Theoretical overview of machine translation. In Proceedings of the 4th International Conference on Web and Information Technologies (ICWIT’12). 160–169.
- [13] 2020. Development of a Guarani-Spanish parallel corpus. In Proceedings of The 12th Language Resources and Evaluation Conference. 2629–2633.
- [14] 2020. FERHENGA BIRÛSKÎ Kurmanji - English Dictionary Volume One: A - L. Transnational Press London. Retrieved from https://books.google.ie/books?id=dVrIDwAAQBAJ.
- [15] 2003. Kurdish-English Dictionary. Yale University Press.
- [16] 2013. Building a test collection for Sorani Kurdish. In ACS International Conference on Computer Systems and Applications (AICCSA’13). IEEE, 1–7.
- [17] 2013. Sorani Kurdish versus Kurmanji Kurdish: An empirical comparison. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vol. 2. 300–305.
- [18] 2005. Assembling a parallel corpus from RSS news feeds. In Proceedings of the MT Summit X.
- [19] 2002. Kurdish linguistics: A brief overview. STUF - Language Typology and Universals 1, 55 (2002), 3–14.
- [20] 2017. Kurdish interdialect machine translation. In Proceedings of the 4th Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial’17). 63–72.
- [21] 2018. BLARK for multi-dialect languages: Towards the Kurdish BLARK. Lang. Resour. Eval. 52, 2 (2018), 625–644.
- [22] 2005. Current commercial machine translation systems and computer-based translation tools: System types and their uses. Int. J. Translat. 17, 1–2 (2005), 5–38.
- [23] 2018. A parallel corpus of Arabic-Japanese news articles. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC’18).
- [24] 2014. On the linguistic history of Kurdish. Kurd. Stud. 2, 2 (2014), 123–142.
- [25] 2014. On the syntax of ergativity in Kurdish. Poznan Stud. Contemp. Ling. 50, 3 (2014), 231–271.
- [26] 2015. Kurdish dialect continuum, as a standardization solution. Int. J. Kurd. Stud. 1, 1 (2015), 27–39.
- [27] 2019. Findings of the WMT 2019 shared task on parallel corpus filtering for low-resource conditions. In Proceedings of the 4th Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2). 54–72.
- [28] 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics, 177–180.
- [29] 2018. Findings of the WMT 2018 shared task on parallel corpus filtering. In Proceedings of the 3rd Conference on Machine Translation: Shared Task Papers. 726–739.
- [30] 2012. University of Kurdistan Dictionary: Persian-Kurdish. Vol. 3. University of Kurdistan, Sanandaj Iran.
- [31] 2018. University of Kurdistan Dictionary: Kurdish-Kurdish-Persian. Vol. 4. University of Kurdistan, Sanandaj Iran.
- [32] 2016. Subdialectal differences in Sorani Kurdish. In Proceedings of the 3rd Workshop on NLP for Similar Languages, Varieties and Dialects (VARDIAL’16). 89–96.
- [33] 1997. Clause combining, ergativity, and coreferent deletion in Kurmanji. Stud. Lang. Int. J. spons. Found. “Found. Lang.” 21, 3 (1997), 613–653.
- [34] 2017. Revisiting Kurdish dialect geography: Preliminary findings from the Manchester Database. Retrieved from http://kurdish.humanities.manchester.ac.uk/wp-content/uploads/2017/07/PDF-Revisiting-Kurdish-dialect-geography.pdf.
- [35] 2007. Kurdish morphology. Morphol. Asia Afr. 2 (2007), 1021–1049.
- [36] 2020. Content-equivalent translated parallel news corpus and extension of domain adaptation for NMT. In Proceedings of the 12th Language Resources and Evaluation Conference. 3616–3622.
- [37] 2020. JParaCrawl: A large scale web-based English-Japanese parallel corpus. In Proceedings of the 12th Language Resources and Evaluation Conference. European Language Resources Association, 3603–3609. Retrieved from https://www.aclweb.org/anthology/2020.lrec-1.443.
- [38] 2020. Constructing a bilingual corpus of parallel tweets. In Proceedings of the 13th Workshop on Building and Using Comparable Corpora. 14–21.
- [39] 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311–318.
- [40] 2003. The web as a parallel corpus. Comput. Ling. 29, 3 (2003), 349–380.
- [41] 2018. Constraints on Izāfa in Sorani Kurdish. In Theses and Dissertations–Linguistics 31. University of Kentucky. Retrieved from https://uknowledge.uky.edu/ltt_etds/31.
- [42] 2007. The Ezafe as a head-marking inflectional affix: Evidence from Persian and Kurmanji Kurdish. In Aspects of Iranian Linguistics: Papers in Honor of Mohammad Reza Bateni. Cambridge Scholars LTD, 339–361. Retrieved from https://halshs.archives-ouvertes.fr/halshs-00673182.
- [43] 2019. Parallel corpus filtering based on fuzzy string matching. In Proceedings of the 4th Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2). 289–293.
- [44] 1991. Hanbana Borina: Kurdish-Persian Dictionary. Vol. 2. Soroush, Tehran.
- [45] 2020. Effectively aligning and filtering parallel corpora under sparse data conditions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. 182–190.
- [46] 2000. Building the Croatian-English parallel corpus. In Proceedings of the 2nd International Conference on Language Resources and Evaluation. 523–530.
- [47] 2019. Spreading of the Kurdish language dialects and writing systems used in the Middle East. Bull. Georg. Natl. Acad. Sci 13, 1 (2019).
- [48] 2006. Kurmanji Kurdish:-A Reference Grammar with Selected Readings. Harvard University.
- [49] 2006. Sorani Kurdish–A Reference Grammar with Selected Readings. Harvard University.
- [50] 2014. TLAXCALA: A multilingual corpus of independent news. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC’14). European Language Resources Association (ELRA), 3689–3692. Retrieved from http://www.lrec-conf.org/proceedings/lrec2014/pdf/1134_Paper.pdf.
- [51] 2014. Aligning parallel texts with InterText. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC’14). 1875–1879.