research-article
Open Access

Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus

Published: 29 April 2022


Abstract

Machine translation has been a major driver of development in natural language processing. Despite the burgeoning achievements in creating more efficient machine translation systems, thanks to deep learning methods, parallel corpora have remained indispensable for progress in the field. In an attempt to create parallel corpora for the Kurdish language, in this article, we describe our approach to retrieving potentially alignable news articles from multi-language websites and manually aligning them across dialects and languages based on lexical similarity and transliteration of scripts. We present a corpus containing 12,327 translation pairs in the two major dialects of Kurdish, Sorani and Kurmanji. We also provide 1,797 and 650 translation pairs in English-Kurmanji and English-Sorani, respectively. The corpus is publicly available under the CC BY-NC-SA 4.0 license.


1 INTRODUCTION

For over half a century, machine translation has been one of the most well-studied subjects in natural language processing (NLP) [12, 22]. Although the operating principles of machine translation have constantly improved, from rule-based methods to statistical and neural network approaches, parallel corpora have remained essential components for efficiently addressing the complexity of human language in the translation task. A parallel corpus contains translation pairs in two languages or dialects that can be used for training translation models and learning the alignment of words and their placement within phrases. Creating such a resource is a tedious and time-consuming task that requires thorough linguistic knowledge of the source and target languages. Oftentimes, a lack of financial support further constrains the development of such resources for less-resourced languages, particularly Kurdish [8].

Multi-language news websites often provide similar content in different languages or dialects based on the same news source. Although the choices of translators and editors determine how the original article is narrated differently in two languages or dialects, such related news articles usually overlap significantly. Recently, parallel corpus filtering and the alignment of crawled web text have gained more attention in the machine translation community [27, 29, 43, 45].

In the same vein, we create a parallel corpus for the Kurdish language by collecting news articles from some of the multilingual Kurdish news websites. Relying on key elements of a news article, such as date of publication, topic, and image URL, our approach filters articles at the document level. Given the diversity of the alphabets in our case, i.e., the Arabic-based Kurdish alphabet for content in Sorani and the Latin-based alphabet for English and Kurmanji, we also use transliteration to calculate basic string similarities. The most similar headlines of the filtered documents are then provided to native annotators who verify the relatedness of the news articles. In this way, we collected 1,452 Sorani-Kurmanji, 282 English-Sorani, and 277 English-Kurmanji articles. Following this step, the content of the relevant articles is automatically extracted and manually aligned at the sentence level, yielding 12,327, 1,797, and 650 translation pairs in Sorani-Kurmanji, Kurmanji-English, and Sorani-English, respectively.

The rest of the article is organized as follows: We first provide a description of the previous work in the creation and alignment of parallel corpora and also present the available resources for Kurdish in Section 2. In Section 3, we briefly describe some of the grammatical aspects of Kurdish and English that are important in translation. Section 4 presents our approach on how the data is retrieved and aligned. Our parallel corpus is evaluated in Section 5. The article is concluded in Section 6.


2 RELATED WORK

In the early days of the Web, Resnik and Smith [40] discussed its use for developing parallel corpora. Where digitized translated literature or other documents that could usually form the basis of parallel corpora are absent or scarce, Web content has become a significant resource for developing parallel corpora. The literature reports on the use of Web content for this purpose in various cases, particularly for less-resourced languages [13, 37, 38]. For instance, Inoue et al. [23] develop an Arabic-Japanese parallel corpus based on news articles that are then manually aligned at the sentence level. That said, given the diversity of themes in Web content, the representativeness of a corpus developed from it could become an issue [46]. Regardless, news content, whether online or paper-based, has remained one of the main sources for parallel corpus development [18, 23, 36, 50].

Regarding the Kurdish language, efforts to create language resources have increased recently, including lexicographical resources [5], monolingual corpora [1, 16], dialect corpora [32], and even a folkloric corpus [4]. These have improved the situation reported by Hassani [21]. Moreover, the construction of inter-dialectal resources for Kurdish has been of interest previously. Hassani [20] studies the application of word-by-word translation for translating Kurmanji to Sorani using a bi-dialectal dictionary. The study aims to evaluate the efficiency of the method in the absence of parallel corpora. Although the experiments show reasonable outcomes, the study reports unnaturalness in the translation.

There are fewer resources that include Kurdish parallel texts. The Tanzil corpus, which is a compilation of Quran translations, various Bible translations, the TED corpus [11], and KurdNet, the Kurdish WordNet [7], provide translations in Sorani Kurdish. Ahmadi and Masoud [6] use these resources to create machine translation systems for Sorani Kurdish; they report many issues in the performance of such systems related to the quality of the available parallel data. Kurmanji, in contrast, has received more attention in the machine translation realm. For instance, Ataman [9] reports on the creation of a parallel corpus for Kurmanji-Turkish-English. Moreover, Google Translate, the Google translation service, includes Kurmanji in its list of languages. Although the underlying resources are not openly available, we believe that crowd-sourcing projects contribute to such services.

We report our endeavor to create a parallel corpus for the Sorani and Kurmanji dialects of Kurdish and, as a preliminary effort, Sorani-English and Kurmanji-English parallel corpora, based on the content of Kurdish news websites.


3 KURDISH LANGUAGE

3.1 Alphabets and Dialects

Some scholars categorize Kurdish as a dialect continuum in which intelligibility varies from region to region [19]. Generally, Kurdish is believed to have three main dialects: Northern Kurdish (Kurmanji), Central Kurdish (Sorani), and Southern Kurdish [34]. These three dialects are spoken by 20–30 million speakers in the Kurdish regions of Iraq, Iran, Turkey, and Syria [5]. While many multi-dialect languages with mutually unintelligible dialects exist, such as Arabic or Chinese, they usually have a standard form that regulates communication among speakers. Regarding Kurdish, although the standardization of the language, in both written and spoken forms, has been widely discussed, there is still no consensus among scholars or speakers [26]. As a result, the language is written in many scripts, mainly Arabic-based and Latin-based, and each dialect is used as a distinct language in the media [20, 47]. Table 1 compares the alphabets used for writing Kurdish.


Table 1. A Comparison of the Arabic- and Latin-based Alphabets of Kurdish

  • Variations are specified with “/”.

3.2 Vocabulary

The lexical diversity and richness of Kurdish have been attested by many lexicographers [10, 15, 30, 31, 44]. This diversity is such that the vocabulary may vary from one village to another. Moreover, through contact with many regional languages, especially Arabic, Persian, Turkish, and Armenian, and with local languages, particularly Zazaki and Gorani, almost all Kurdish dialects have absorbed many lexical borrowings as well [14]. Given the Kurdish oral tradition of narrating poetry and prose, oral literature has been considered a source of vocabulary by lexicographers [4]. In addition, there is an ongoing struggle to develop modern technical terminologies for the language.

Regarding Kurdish lexicographic resources, Ahmadi et al. [5] survey the current state of Kurdish lexicography and state that, despite the scarcity of resources in electronic form, there are over 71 dictionaries and terminological resources for Kurdish, not all of which have been retro-digitized.

3.3 Grammar

Despite the lexical similarity between the dialects of Kurdish, there are differences when it comes to grammar, particularly due to morphological constructions. Sorani tends to have a more complex morphological construction, while Kurmanji is less inflected. For instance, passive voice in Sorani is derived from the transitive verbs, while in Kurmanji, passive voice has a simpler construction where a compound is created by adding the auxiliary verb hatin “to come” to the transitive verb without any major morphological modification [49]. In addition, Sorani has a full article marking system where nouns are marked as definite, indefinite, demonstrative in singular and plural forms, while articles in Kurmanji are marked only in definite and demonstrative cases [24].

Regarding grammatical gender, unlike Sorani and English, Kurmanji has two genders, i.e., feminine and masculine, which implies grammatical agreement, particularly in Izafe (also known as Ezafe or Izafa) constructions [42]. The Izafe construction refers to the use of a grammatical particle to form noun phrases or adjective phrases. The forms of this particle in Kurmanji and Sorani are, respectively, -ê, -ekî, -a, -eke, -ên, and -î, -e [41, 48]. In adjective phrases, the particle is not translated, e.g., xanîyê biçûk “the small house,” whereas in noun phrases it is usually translated as “of,” e.g., xanîyê wî mirovî “the house of that man.”

Table 2 provides some of the major grammatical characteristics of Kurmanji, Sorani, and English. Both Kurdish dialects have a subject-object-verb alignment for present tenses and intransitive verbs and an agent-object-verb alignment for transitive verbs in the past tense. The morphosyntactic property whereby the subject of intransitive verbs is marked in the same way as the object (patient) of transitive verbs in the past tenses is known as ergativity and also exists in Kurdish [25]. Unlike Kurmanji Kurdish, which uses the oblique case of pronouns for this purpose, Sorani Kurdish only uses different pronominal clitics to express such an alignment [17].

Language | Word order | Passive | Gender | Case | Alignment
Kurmanji Kurdish | S-O-V | periphrastic with hatin (to come) [48] | feminine, masculine [48] | nominative, oblique, Izafa, vocative [48] | nominative–accusative, only in past transitive ergative–absolutive [33]
Sorani Kurdish | S-O-V | morphological [49] | no gender [49] | nominative, locative, vocative [35] | nominative–accusative, only in past transitive ergative–absolutive [25]
English | S-V-O | periphrastic | no gender | nominative, oblique, genitive only for personal pronouns | nominative–accusative

Table 2. A Comparison of the Sorani and Kurmanji Dialects of Kurdish with English

It is worth mentioning that variations exist among Sorani subdialects, particularly the dialects categorized as Northern Sorani in Reference [34], which make use of oblique cases and grammatical gender to some extent.


4 METHODOLOGY

Multilingual news websites contain a large number of articles in various languages that can be considered a potentially parallel corpus. However, among the major Kurdish news agencies listed in Table 3, none explicitly links identical articles across languages, e.g., by using reference keys, an identical URL schema, or a news code. Moreover, only a few of them provide the same content in various languages. For instance, the English articles on BasNews differ in content and topic from the Kurdish ones.

Agency | Languages
Rûdaw | Sorani, Kurmanji, English, Arabic, Turkish
Voice of America | Sorani, Kurmanji, English, Turkish, and many more
Kurdistan24 | Sorani, Kurmanji, English, Arabic, Turkish, Persian
KNN | Sorani, English, Arabic
Firat News Agency | Sorani, Kurmanji, Zazaki, Gorani, English, Arabic, Turkish, Persian, German, Russian, Spanish
Bianet | Kurmanji, English, Turkish
BasNews | Sorani, Kurmanji, English, Arabic, Turkish, Persian
KurdPa | Sorani, Kurmanji, English, Persian
Gulan Media | Sorani, Kurmanji, English, Arabic
NRT | Sorani, English, Arabic
Sahar TV | Sorani, Kurmanji, English, Persian

Table 3. List of News Agencies Providing Content in Kurdish and Their Content Management Status

In this section, we describe our approach, which is illustrated in Figure 1, to create a parallel corpus of Sorani, Kurmanji and English. We refer to these three as languages for ease of reference.

Fig. 1. Our approach to automatically retrieve identical news articles.

4.1 Data Crawling

As the first step, we crawl the content of news websites. Our selection criteria are the editorial quality of the articles, the accessibility of the data for automatic scraping, and, more importantly, multilingualism. Therefore, we selected Firat News Agency (ANF), BasNews (BN), and KurdPa (KP). Despite the remarkable number of articles published on Rûdaw and Kurdistan 24, we could not include those websites due to crawling restrictions. Moreover, our findings regarding the alignment of Voice of America (VOA) articles were not satisfactory due to the sparsity of topics across languages.

Once the news articles are crawled, we clean the HTML files and extract the following information from each page:

  • tag: a list of the tags used for identifying the article. For this purpose, the bashakan, cat-links, and keywords tags were originally used in BN, KP, and VOA, respectively. In the case of ANF, we used the page hyperlink structure to extract the topic and used it as a tag.

  • original_link: the original link to the article on the website

  • dialect: the dialect of the article retrieved using the link schema, usually so for Sorani and ku for Kurmanji

  • entry-title: the news headline

  • entry-lead: the news sub-headline, if provided

  • date: the publication date of the article. We unified all the date formats based on the Gregorian calendar given the variety of calendars, e.g., Kurdish or Persian calendars

  • entry-content: a list containing the paragraphs, i.e., <p>, provided in the content of each news article. The content of our target websites is originally marked with the <entry-content> tag.

  • imgs: Assuming that relevant news articles link to the same multimedia content with the same hyperlink, we retrieve the hyperlinks associated with the <img> tags within the body of the article.

In addition to the HTML tags, in some cases, we could use JSON-LD and the meta tags, i.e., <meta>, to retrieve further instances. Ultimately, the news articles of each website are normalized and categorized by dialect and language in JSON format.
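The extraction step described above can be illustrated with a minimal sketch. It assumes only the Python standard library and a simplified page in which paragraphs are plain <p> elements; the class name ArticleParser and the function to_record are hypothetical, and the actual scraper may differ per website.

```python
import json
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collects <p> paragraphs and <img> sources from one article page."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # text of each non-empty <p> element
        self.imgs = []        # value of each <img src="...">
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buffer = []
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.imgs.append(src)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def to_record(html, original_link, dialect, date, tags, title):
    """Builds one JSON-serializable record using the field names listed above."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return {
        "tag": tags,
        "original_link": original_link,
        "dialect": dialect,
        "entry-title": title,
        "date": date,  # assumed to be already converted to a Gregorian ISO string
        "entry-content": parser.paragraphs,
        "imgs": parser.imgs,
    }
```

A record built this way can be written out with json.dumps(record, ensure_ascii=False) so that the Arabic-based script stays readable in the JSON file.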

4.2 Corpus Filtering

Given two sets of articles from the same news website in two languages, we consider two articles alignable if they have at least one tag in common and their publication dates match in month and year. Intuitively, two articles published in different years under different tags (topics) are less likely to contain the same content. In addition, we use the imgs field to filter the articles: if two articles link to the same image, we consider them potentially alignable.
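As a rough illustration of this document-level filter, the following sketch treats each article as a dictionary with the tag, date, and imgs fields from Section 4.1. The function name is_alignable, the use of datetime.date values, and the reading of the date criterion as "same month and year" are assumptions made for the example.

```python
from datetime import date


def is_alignable(a, b):
    """Returns True if two article records pass the document-level filter."""
    # Criterion 1: at least one common tag and the same publication month and year.
    shared_tag = bool(set(a["tag"]) & set(b["tag"]))
    same_month = (a["date"].year, a["date"].month) == (b["date"].year, b["date"].month)
    # Criterion 2: both articles link to at least one identical image URL.
    shared_img = bool(set(a["imgs"]) & set(b["imgs"]))
    return (shared_tag and same_month) or shared_img
```

Pairs that pass this check are only candidates; the headline comparison and the manual annotation decide whether they are kept.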

Moreover, as several news articles could be published with the same tags within the same date range, we further filter the candidate articles by comparing their headlines. To do so, we calculate the similarity of the headlines using a simple string sequence matching scorer. As Sorani is written in the Arabic-based alphabet, we first transliterate the Sorani text, using Wergor [2], into the Latin-based script that is used for Kurmanji and English.
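A headline scorer of this kind might look like the sketch below. It uses difflib.SequenceMatcher from the Python standard library as one possible sequence matching scorer; the text does not name the scorer that was actually used. The three-letter table TOY_TABLE is a hypothetical stand-in for a real transliteration system such as Wergor, whose interface is not reproduced here.

```python
from difflib import SequenceMatcher

# Hypothetical three-letter mapping used only for this example;
# a real system covers the full alphabet and context-dependent letters.
TOY_TABLE = {"ب": "b", "ا": "a", "ن": "n"}


def toy_to_latin(text):
    """Replaces each mapped Arabic-based letter with its Latin counterpart."""
    return "".join(TOY_TABLE.get(ch, ch) for ch in text)


def headline_similarity(sorani_headline, latin_headline, to_latin=toy_to_latin):
    """Scores two headlines in [0, 1] after bringing both into Latin script."""
    left = to_latin(sorani_headline).casefold()
    right = latin_headline.casefold()
    return SequenceMatcher(None, left, right).ratio()
```

Case folding is a design choice of this sketch, since the Arabic-based script has no letter case while Kurmanji and English headlines do.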

As the final step, among the candidate headlines, we retrieve the top five most similar headlines in the other language. These headlines are then provided in spreadsheets to native annotators who determine if two headlines correspond to the same news content using a drop-down list. If two headlines are literal translations and refer to the same content, then they are specified as equivalent. However, this is not always the case, as some headlines are paraphrases and rewritten in such a way that they attract the readers’ attention. In such cases where two headlines refer to the same content but are not literal translations, they are annotated as possible. Although we do not consider such headlines as a translation pair, they are essential to retrieve relevant contents. In the cases where the headlines do not provide sufficient information to decide their relatedness, annotators are asked to check the crawled data in the two languages manually. Figure 2 in Appendix A illustrates an annotation example in Kurmanji and English.
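The retrieval of the five most similar headlines could be sketched as follows, assuming that all headlines are already in Latin script. The function name top_candidates and the choice of difflib.SequenceMatcher as the scorer are assumptions made for illustration.

```python
import heapq
from difflib import SequenceMatcher


def top_candidates(headline, candidates, k=5):
    """Returns the k candidate headlines most similar to `headline`, best first."""

    def score(candidate):
        return SequenceMatcher(None, headline.casefold(), candidate.casefold()).ratio()

    return heapq.nlargest(k, candidates, key=score)
```

Each source headline and its ranked candidates would then form one block of rows in the annotation spreadsheet, with the annotator choosing equivalent or possible for the matching row.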

4.3 Content Alignment

As a result of the previous steps, a list of the alignable articles of the same news website in two languages is available. Using the aligned headlines, we collect their contents, i.e., the content of <entry-lead> and <entry-content>, and provide them in two separate files where paragraphs and articles are separated by one and two new lines, respectively. These files are then provided to the native annotators who extract parallel sentences and phrases in the two languages using InterText [51]. InterText is an editor for aligning parallel texts and provides a wide range of editing functions such as merge, split, and positioning.
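The file layout described above can be sketched as follows, assuming that "one new line" means a single line break between paragraphs and "two new lines" means an empty line between articles. The function names are hypothetical.

```python
def serialize_articles(articles):
    """Joins paragraphs with one line break and articles with two.

    `articles` is a list of articles, each given as a list of paragraph strings.
    """
    blocks = []
    for paragraphs in articles:
        cleaned = [p.strip() for p in paragraphs if p.strip()]
        blocks.append("\n".join(cleaned))
    return "\n\n".join(blocks) + "\n"


def parse_articles(text):
    """Inverse of serialize_articles for text produced by it."""
    return [block.split("\n") for block in text.strip().split("\n\n")]
```

Writing the two languages with the same article order keeps the files roughly aligned at the article level before the manual sentence alignment begins.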

In the manual alignment task, we extract translation pairs based on the following guidelines:

(1) the length of the sentences or phrases should be within a reasonable range. If too long, then they are to be split into smaller phrases;

(2) idiomatic translations are validated as long as they do not add to the size of the sentence significantly;

(3) if the translation of a sentence is provided in many separate sentences or phrases, then the annotator is allowed to merge the sentences to create a valid translation pair;

(4) if two sentences can be validated with slight modifications, such as punctuation marks or digits, then the annotator is allowed to edit the content.


5 EVALUATION

Table 4 presents basic statistics of the corpus, including the total number of crawled articles and the number of articles retrieved among them. We also specify the number of articles that are retrieved through multimedia hyperlinks, i.e., <img>.


Table 4. Statistics of the Kurmanji (kmr), Sorani (ckb), and English (eng) Articles Used to Create Our Parallel Corpus

  • <img> refers to the articles retrieved through the image URLs in the HTML source code.

Across all translation pairs, each sentence contains 17 to 20 tokens on average. In contrast, the average number of tokens in the Tanzil, TED, and KurdNet corpora is around 25, 70, and 6, respectively. As such, we believe that our resources are comparatively better suited to automatic alignment.

In addition to the basic statistics, we used Moses [28] to test and evaluate the use of the corpus in statistical machine translation. We divided the corpus into two sets: 90% as a training set and 10% as a test set. The training set received a higher percentage because of the relatively small size of our corpora. The sets were selected randomly, and we prepared the random selection scripts so that the whole experiment is reproducible. We trained Moses according to its recommended procedures. We also tested the accuracy of the system following the Moses guideline, which provides the BLEU [39] evaluation based on the test set. Table 5 presents the BLEU scores for the Sorani-English, Kurmanji-English, and Sorani-Kurmanji data.
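A reproducible 90/10 random split can be sketched as below. This is not the selection script used for the experiments; the function name split_pairs, the default seed, and the rounding of the test-set size are assumptions.

```python
import random


def split_pairs(pairs, test_ratio=0.1, seed=0):
    """Shuffles translation pairs with a fixed seed and returns (train, test)."""
    shuffled = list(pairs)                 # copy so the input order is untouched
    random.Random(seed).shuffle(shuffled)  # local generator keeps the split repeatable
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]
```

Fixing the seed means that rerunning the script on the same list of pairs yields the same training and test sets, which is what makes the BLEU comparison repeatable.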

Baseline system | BLEU
Sorani-Kurmanji | 17.08
Sorani-English | 17.74
Kurmanji-English | 11.06

Table 5. Results of a Baseline Statistical Machine Translation System Trained on Our Parallel Corpus

In addition to a considerable amount of data, the performance of the baseline system relies on other important tasks, particularly tokenization. Given that in our baseline system sentences are tokenized based on spaces, we believe that the performance could be improved significantly with a language-specific tokenization tool for Kurdish, such as the one described in Reference [3]. This task is of importance due to the morphologically complex word forms in Kurdish that make alignment of sentences challenging.


6 CONCLUSION AND FUTURE WORK

In this article, we report our efforts in creating a parallel corpus for the Kurdish language as a less-resourced language. Given that manual translation is an expensive and tedious task, we used the content of multilingual Kurdish news websites to extract potentially alignable Sorani, Kurmanji, and English sentences in a semi-automatic manner. The candidate sentences are then provided to native speakers to validate if they are translation pairs. This way, the task of translation is carried out as an annotation task. Our corpus contains 12,327 Sorani-Kurmanji, 1,797 Kurmanji-English, and 650 Sorani-English translation pairs.

We believe that our resource, as material for machine translation, can pave the way for further developments in Kurdish machine translation. To facilitate the alignment of news articles, we also propose that a referencing mechanism be embedded within each news article so that corresponding texts can be linked more easily in the future. We would also like to suggest our approach as a way to further extend the current corpus or create new corpora for the other dialects of Kurdish. Furthermore, machine translation and its related tasks remain important future work for Kurdish, especially using more advanced techniques that rely on neural network methods.

A APPENDIX

Fig. A.2. An example of the alignment of headlines. For each headline in English (left column), the five most similar headlines among the filtered Kurmanji headlines are provided. Using the drop-down list in the middle column, the annotator determines whether two headlines are literal translations by selecting equivalent, or whether they are not literal translations but correspond to each other by selecting possible.

Fig. A.3. Examples of good translation pairs in our corpus.


REFERENCES

  [1] Abdulrahman Roshna, Hassani Hossein, and Ahmadi Sina. 2019. Developing a fine-grained corpus for a less-resourced language: The case of Kurdish. In Proceedings of the ACL Widening Natural Language Processing Workshop (WiNLP ACL’19).
  [2] Ahmadi Sina. 2019. A rule-based Kurdish text transliteration system. Asian Low-Resour. Lang. Inf. Process. 18, 2 (2019), 18:1–18:8.
  [3] Ahmadi Sina. 2020. A tokenization system for the Kurdish language. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects. International Committee on Computational Linguistics (ICCL), 114–127. Retrieved from https://aclanthology.org/2020.vardial-1.11.
  [4] Ahmadi Sina, Hassani Hossein, and Abedi Kamaladdin. 2020. A Corpus of the Sorani Kurdish folkloric lyrics. In Proceedings of the 1st Joint Spoken Language Technologies for Under-resourced Languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL) Workshop at the 12th International Conference on Language Resources and Evaluation (LREC).
  [5] Ahmadi Sina, Hassani Hossein, and McCrae John P. 2019. Towards electronic lexicography for the Kurdish language. In Proceedings of the eLex Conference. 881–906.
  [6] Ahmadi Sina and Masoud Maraim. 2020. Towards machine translation for the Kurdish language. In Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages. Association for Computational Linguistics, 87–98. Retrieved from https://aclanthology.org/2020.loresmt-1.12.
  [7] Aliabadi Purya, Ahmadi Mohammad Sina, Salavati Shahin, and Esmaili Kyumars Sheykh. 2014. Towards building Kurdnet, the Kurdish Wordnet. In Proceedings of the 7th Global Wordnet Conference. 1–6.
  [8] Allah Fadoua Ataa and Boulaknadel Siham. 2012. Toward computational processing of less resourced languages: Primarily experiments for Moroccan Amazigh language. In Theory and Applications for Advanced Text Mining, Ch 9, Shigeaki Sakurai (Ed.).
  [9] Ataman Duygu. 2018. Bianet: A parallel news corpus in Turkish, Kurdish and English. arXiv preprint arXiv:1805.05095 (2018).
  [10] Bedirxan Celadet Ali and Keskin Abdullah. 2009. Ferheng: Kurdî, Kurdî (Kurdish-Kurdish dictionary) (Kurmanji). Vol. 2. Avesta.
  [11] Cettolo Mauro, Girardi Christian, and Federico Marcello. 2012. WIT3: Web inventory of transcribed and translated talks. In Conference of European Association for Machine Translation. 261–268.
  [12] Chéragui Mohamed Amine. 2012. Theoretical overview of machine translation. In Proceedings of the 4th International Conference on Web and Information Technologies (ICWIT’12). 160–169.
  [13] Chiruzzo Luis, Amarilla Pedro, Ríos Adolfo, and Lugo Gustavo Giménez. 2020. Development of a Guarani-Spanish parallel corpus. In Proceedings of The 12th Language Resources and Evaluation Conference. 2629–2633.
  [14] Chyet M. L. 2020. FERHENGA BIRÛSKÎ Kurmanji - English Dictionary Volume One: A - L. Transnational Press London. Retrieved from https://books.google.ie/books?id=dVrIDwAAQBAJ.
  [15] Chyet Michael L. and Schwartz Martin. 2003. Kurdish-English Dictionary. Yale University Press.
  [16] Esmaili Kyumars Sheykh, Eliassi Donya, Salavati Shahin, Aliabadi Purya, Mohammadi Asrin, Yosefi Somayeh, and Hakimi Shownem. 2013. Building a test collection for Sorani Kurdish. In ACS International Conference on Computer Systems and Applications (AICCSA’13). IEEE, 1–7.
  [17] Esmaili Kyumars Sheykh and Salavati Shahin. 2013. Sorani Kurdish versus Kurmanji Kurdish: An empirical comparison. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vol. 2. 300–305.
  [18] Fry John. 2005. Assembling a parallel corpus from RSS news feeds. In Proceedings of the MT Summit X.
  [19] Haig Geoffrey and Matras Yaron. 2002. Kurdish linguistics: A brief overview. STUF - Language Typology and Universals 1, 55 (2002), 3–14.
  [20] Hassani Hossein. 2017. Kurdish interdialect machine translation. In Proceedings of the 4th Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial’17). 63–72.
  [21] Hassani Hossein. 2018. BLARK for multi-dialect languages: Towards the Kurdish BLARK. Lang. Resour. Eval. 52, 2 (2018), 625–644.
  [22] Hutchins John. 2005. Current commercial machine translation systems and computer-based translation tools: System types and their uses. Int. J. Translat. 17, 1–2 (2005), 5–38.
  [23] Inoue Go, Habash Nizar, Matsumoto Yuji, and Aoyama Hiroyuki. 2018. A parallel corpus of Arabic-Japanese news articles. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC’18).
  [24] Jügel Thomas. 2014. On the linguistic history of Kurdish. Kurd. Stud. 2, 2 (2014), 123–142.
  [25] Karimi Yadgar. 2014. On the syntax of ergativity in Kurdish. Poznan Stud. Contemp. Ling. 50, 3 (2014), 231–271.
  [26] Khalid Hewa Salam. 2015. Kurdish dialect continuum, as a standardization solution. Int. J. Kurd. Stud. 1, 1 (2015), 27–39.
  [27] Koehn Philipp, Guzmán Francisco, Chaudhary Vishrav, and Pino Juan. 2019. Findings of the WMT 2019 shared task on parallel corpus filtering for low-resource conditions. In Proceedings of the 4th Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2). 54–72.
  [28] Koehn Philipp, Hoang Hieu, Birch Alexandra, Callison-Burch Chris, Federico Marcello, Bertoldi Nicola, Cowan Brooke, Shen Wade, Moran Christine, Zens Richard, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics, 177–180.
  [29] Koehn Philipp, Khayrallah Huda, Heafield Kenneth, and Forcada Mikel L. 2018. Findings of the WMT 2018 shared task on parallel corpus filtering. In Proceedings of the 3rd Conference on Machine Translation: Shared Task Papers. 726–739.
  [30] Rohani Majed M. 2012. University of Kurdistan Dictionary: Persian-Kurdish. Vol. 3. University of Kurdistan, Sanandaj Iran.
  31. [31] Rohani Majed M.. 2018. University of Kurdistan Dictionary: Kurdish-Kurdish-Persian. Vol. 4. University of Kurdistan, Sanandaj Iran.Google ScholarGoogle Scholar
  32. [32] Malmasi Shervin. 2016. Subdialectal differences in Sorani Kurdish. In Proceedings of the 3rd Workshop on NLP for Similar Languages, Varieties and Dialects (VARDIAL’16). 8996.Google ScholarGoogle Scholar
  33. [33] Matras Yaron. 1997. Clause combining, ergativity, and coreferent deletion in Kurmanji. Stud. Lang. Int. J. spons. Found. “Found. Lang.” 21, 3 (1997), 613653.Google ScholarGoogle Scholar
  34. [34] Matras Yaron. 2017. Revisiting Kurdish dialect geography: Preliminary findings from the Manchester Database.Google ScholarGoogle Scholar
  35. [35] McCarus Ernst M.. 2007. Kurdish morphology. Morphol. Asia Afr. 2 (2007), 10211049. http://kurdish.humanities.manchester.ac.uk/wp-content/uploads/2017/07/PDF-Revisiting-Kurdish-dialect-geography.pdf.Google ScholarGoogle ScholarCross RefCross Ref
  36. [36] Mino Hideya, Tanaka Hideki, Ito Hitoshi, Goto Isao, Yamada Ichiro, and Tokunaga Takenobu. 2020. Content-equivalent translated parallel news corpus and extension of domain adaptation for NMT. In Proceedings of the 12th Language Resources and Evaluation Conference. 36163622.Google ScholarGoogle Scholar
  37. [37] Morishita Makoto, Suzuki Jun, and Nagata Masaaki. 2020. JParaCrawl: A large scale web-based English-Japanese parallel corpus. In Proceedings of the 12th Language Resources and Evaluation Conference. European Language Resources Association, 3603–3609. Retrieved from https://www.aclweb.org/anthology/2020.lrec-1.443.
  38. [38] Mubarak Hamdy, Hassan Sabit, and Abdelali Ahmed. 2020. Constructing a bilingual corpus of parallel tweets. In Proceedings of the 13th Workshop on Building and Using Comparable Corpora. 14–21.
  39. [39] Papineni Kishore, Roukos Salim, Ward Todd, and Zhu Wei-Jing. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311–318.
  40. [40] Resnik Philip and Smith Noah A. 2003. The web as a parallel corpus. Comput. Ling. 29, 3 (2003), 349–380.
  41. [41] Salehi Ali. 2018. Constraints on Izāfa in Sorani Kurdish. In Theses and Dissertations–Linguistics 31. University of Kentucky. Retrieved from https://uknowledge.uky.edu/ltt_etds/31.
  42. [42] Samvelian Pollet. 2007. The Ezafe as a head-marking inflectional affix: Evidence from Persian and Kurmanji Kurdish. In Aspects of Iranian Linguistics: Papers in Honor of Mohammad Reza Bateni, Karimi S., Samiian V., and Stillo D. (Eds.). Cambridge Scholars LTD, 339–361. Retrieved from https://halshs.archives-ouvertes.fr/halshs-00673182.
  43. [43] Sen Sukanta, Ekbal Asif, and Bhattacharyya Pushpak. 2019. Parallel corpus filtering based on fuzzy string matching. In Proceedings of the 4th Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2). 289–293.
  44. [44] Sharafkandi Abdolrahman (Hejar). 1991. Hanbana Borina: Kurdish-Persian Dictionary. Vol. 2. Soroush, Tehran.
  45. [45] Steingrímsson Steinthór, Loftsson Hrafn, and Way Andy. 2020. Effectively aligning and filtering parallel corpora under sparse data conditions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. 182–190.
  46. [46] Tadić Marko. 2000. Building the Croatian-English parallel corpus. In Proceedings of the 2nd International Conference on Language Resources and Evaluation. 523–530.
  47. [47] Tavadze Givi. 2019. Spreading of the Kurdish language dialects and writing systems used in the Middle East. Bull. Georg. Natl. Acad. Sci. 13, 1 (2019).
  48. [48] Thackston Wheeler M. 2006. Kurmanji Kurdish: A Reference Grammar with Selected Readings. Harvard University.
  49. [49] Thackston Wheeler M. 2006. Sorani Kurdish: A Reference Grammar with Selected Readings. Harvard University.
  50. [50] Toral Antonio. 2014. TLAXCALA: A multilingual corpus of independent news. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC’14). European Language Resources Association (ELRA), 3689–3692. Retrieved from http://www.lrec-conf.org/proceedings/lrec2014/pdf/1134_Paper.pdf.
  51. [51] Vondřička Pavel. 2014. Aligning parallel texts with InterText. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC’14). 1875–1879.

            • Published in

              ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 21, Issue 5
              September 2022
              486 pages
              ISSN:2375-4699
              EISSN:2375-4702
              DOI:10.1145/3533669

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 29 April 2022
              • Online AM: 3 February 2022
              • Accepted: 1 January 2022
              • Revised: 1 November 2021
              • Received: 1 October 2020
              Qualifiers

              • research-article
              • Refereed