A CNN-Based Arabic Diacritic Symbol Recognition System Using Domain Adaptation

The recognition of Arabic diacritic symbols is essential for accurate comprehension of Arabic texts, yet most existing Optical Character Recognition (OCR) systems lack this capability. This paper proposes a novel approach for Arabic diacritic recognition using a domain adaptation method applied to the AlexNet deep convolutional neural network architecture. A custom dataset of synthetic Arabic diacritic images was generated, incorporating variations in font styles, sizes, slant angles, and perspective distortions. The model was trained using this dataset, and a domain adaptation technique was employed to improve its generalization across diverse diacritic styles and handwriting variations. We evaluated the proposed model on the testing dataset consisting of a variety of fonts and distortions; the result was an overall accuracy of 97%, indicating that the model can achieve a high accuracy for most diacritic categories, with only a few misclassifications.


INTRODUCTION
Arabic is a language of significant cultural, historical, and linguistic importance, spoken by millions of people across the globe. With a rich and complex script, the written form of Arabic incorporates diacritic symbols, which play a pivotal role in conveying precise pronunciation and semantic nuances.
Even without diacritics, native Arabic speakers can, in most cases, tell which diacritics belong to the letters of a word based on the context of the sentence in which they are used. However, diacritics are still used in many domains, such as education and historical books; thus, they will always be a foundational part of the Arabic language that cannot be dismissed. Consequently, diacritic recognition is indispensable for accurate comprehension of Arabic texts, especially in crucial domains such as education, communication, and language processing.
Optical Character Recognition (OCR) systems have advanced in recent years, revolutionizing the digitization of written texts. However, most current OCR systems for Arabic text lack the capability to recognize diacritic symbols (see Part 2). Because diacritic symbols are subtle, they can be confused with other Arabic symbols, such as dots, or dismissed as noise. Thus, there is a need for research and development in the field of Arabic diacritic recognition.
In this paper, we propose and evaluate a novel approach for improving current OCR systems for Arabic by developing a recognition model that specifically focuses on the identification and interpretation of Arabic diacritic symbols. We specifically propose a deep-learning-based approach which uses the AlexNet architecture [11] trained on a large, custom, and computer-generated dataset of Arabic diacritic symbols. Furthermore, we also incorporate a custom CNN model that integrates a domain adaptation method to enhance the model's generalization capability across diverse diacritic styles and handwriting variations.
This paper is organized as follows: Part 2 provides an overview of related work on OCR. Part 3 presents the proposed method, including details of the dataset and training. In Part 4, we analyze and discuss the performance of the method. General conclusions are presented in Part 5.

RELATED WORK
Although there are many related works on OCR for the Arabic language, most focus primarily on recognizing and segmenting only the words and letters of Arabic. Very few research papers address the segmentation of Arabic diacritics, and even fewer introduce a recognition method for them.
The only recent and relevant paper to propose a method to segment diacritics is from Sheikh et al. [17]. They proposed a region-based approach that can segment diacritics from their attached letters. However, their method is rather simple and prone to errors, as other symbols, such as the dots of letters, can be mistaken for diacritics.
When it comes to creating OCR models for Arabic, there are broadly three main approaches: (1) traditional methods, (2) natural-language-processing-based methods, and (3) vision-based machine learning methods.

Traditional Image Processing Methods
Many papers implement Arabic OCR systems using traditional methods. Traditional methods in OCR refer to the use of conventional image processing techniques and feature extraction approaches. They work well for simple, general-purpose tasks. However, when dealing with complex and diverse patterns, such as handwritten texts, they can be inaccurate.
Qaroush et al. [15] proposed a traditional approach that uses a vertical projection method incorporating statistical and topological features. While they achieved an accuracy of 97.51%, their method was tested only on digital computer fonts and was not tested on handwritten fonts. Al Ghamdi [1] used a traditional method based on a novel feature extraction approach that took into account certain unique features of Arabic words, such as baseline detection; these features were then input to a long decision tree for classification. The classification accuracy was reported to be 77.3%.

Natural Language Processing Methods
Natural Language Processing (NLP) methods offer a unique approach to addressing errors in existing OCR systems by leveraging linguistic context to correct inaccuracies. These "hybrid" systems combine OCR and NLP techniques to enhance OCR accuracy and provide more contextually relevant results. Unlike traditional methods that may struggle with complex character recognition, NLP-based approaches harness the power of language understanding to improve OCR output. Although these methods have shown promise in certain applications, they also come with specific challenges when it comes to Arabic diacritics.
Aliwy and Al-Sadawi [4] proposed an NLP-based Arabic OCR system that can correct the errors of an OCR system using NLP, based on the context of the word in the sentence. They reported an average correction rate of approximately 7.96%. Following a similar approach, Doush [6] was able to reduce the error rate of a hybrid OCR system from 14.95% to 14.42%. However, in order to apply a similar approach to diacritic OCR, we would also need to create another NLP model capable of taking diacritics into consideration; this task is computationally taxing, and it remains unclear whether any potential performance gains in recognition would justify the extra computational cost.
Various papers have suggested an approach that would generate diacritics (a task called diacritization) by using NLP methods to predict the diacritics of a word based on how the word was used in the context of the text. Masmoudi et al. [13] proposed two approaches: one rule-based and the other based on conditional random fields (CRFs). The model had a character error rate of 10.47%. Al Sallab et al. [2] introduced a novel model called Confused Subset Resolution, which used sub-classifiers to resolve confusions. The model had a character diacritization accuracy of 86.3%. Elshafei et al. [7] proposed a more traditional method for automatic diacritization using a Hidden Markov Model and the Viterbi algorithm; they reported a 72% accuracy. Fadel et al. [8] proposed a diacritization method that used a feed-forward neural network and a recurrent neural network; they reported an accuracy of over 95%. However, it is important to note that these diacritization methods were proposed for text that did not originally contain diacritics. In an OCR setting, it would be redundant to use a predictive, generative NLP method to generate the diacritics when they already exist in the original text. To make matters worse, the accuracies of these generative models are generally lower than those of recognition models.

Vision-Based Machine Learning Methods
Machine learning methods have revolutionized OCR systems, demonstrating remarkable accuracy and robustness in character recognition tasks. These methods have surpassed traditional approaches in handling complex datasets and achieving high accuracy levels. In recent research, several machine learning-based OCR models have been proposed, each showcasing impressive results on specific datasets. Darwish and Elzoghaly [5] reported an accuracy of 98.8% using both a Genetic Algorithm and the Fuzzy K-Nearest Neighbor classifier. Unfortunately, the model was tested only on digital computer fonts, and the paper mentions that the model performs poorly against handwritten fonts.
Alghyaline [3] used a model based on Yolo v4; an accuracy of 95.7% was reported for character recognition. Wagaa et al. [18] recently proposed a deep-learning-based Arabic OCR method that used a custom CNN. They achieved accuracies of 98.48% and 91.24% on the AHCD and Hijja datasets, respectively.
One of the popular deep learning models used for OCR tasks is AlexNet, which has demonstrated promising results in languages that utilize Arabic letters. KO and Poruran [10] conducted a study using AlexNet-based and GoogleNet-based neural networks. Their experiments yielded impressive accuracies of 96.3% and 94.7%, respectively, highlighting the effectiveness of deep learning for character recognition in Arabic-based languages.
In the context of Urdu OCR, Rasheed et al. [16] proposed a model that combined AlexNet with a Support Vector Machine (SVM) for training. To enhance the model's performance and prevent overfitting, they applied data augmentation techniques to their dataset. As a result, they achieved a remarkable classification accuracy of 97.08% for recognizing Urdu characters, further reinforcing the advantage of deep learning-based OCR models.
While AlexNet has demonstrated remarkable performance in general image classification tasks, its generic nature limits its ability to fully exploit the unique features of diacritics and OCR in general. To address this limitation and further enhance the recognition model, the incorporation of an additional method became essential.
Extensive research and experimentation led to the exploration of Domain Adaptation as a valuable tool for the recognition model. Ganin and Lempitsky [9] introduced the concept of Domain Adaptation as a versatile approach capable of identifying domain-invariant features within sets of images. Later, Wan et al. [19] applied Domain Adaptation in the context of an English OCR system, highlighting its potential application for character recognition systems. They proposed a multi-dimensional domain adaptation framework that effectively aligns feature representations across different scenes, thereby improving the model's generalization capability. Their framework uses a domain classifier (to predict the source of the data) and a gradient reversal layer (to encourage the model's feature representations to be domain-invariant). By combining these techniques, their model was reported to be capable of recognizing objects and patterns across multiple scenes, even in the presence of significant domain shifts.
Inspired by these previous works, in the following section, we present details of our deep-learning-based diacritic recognition model, which also makes use of domain adaptation. By incorporating domain adaptation techniques, we are able to enhance the model's ability to recognize diacritic symbols across diverse font styles and handwriting variations, effectively reducing the domain gap. This adaptation is particularly crucial as our model is intended to handle a wide range of digital computer fonts and handwritten fonts, making it more practical and robust in real-world scenarios.
Previous research showed that while traditional image processing methods are useful for simple tasks, they can struggle with the complexity of handwritten texts. NLP-based methods have proven useful for many different applications, but for Arabic diacritic recognition they have yet to show results that are more accurate than those of deep-learning methods. While Yolo v4 is a powerful object detection model widely used for various computer vision applications, it is primarily designed for detecting and localizing objects in images, making it more suitable for tasks like object detection and segmentation. On the other hand, AlexNet was designed and optimized for image classification tasks and has demonstrated promising results specifically in character recognition tasks, which aligns well with the objectives of OCR for Arabic-based languages.

METHODOLOGY

Diacritic Selection
In this study, our recognition model specifically targeted 14 diacritics, as shown in Figure 1. The selection of these diacritics is deliberate, as they encompass the diacritic symbols that most OCR systems struggle to recognize accurately. Moreover, these diacritics represent the most frequently used diacritics in the Arabic language.
It is worth noting that some research papers [12] treat the dots that form part of certain letters as diacritics. While this perspective is subject to debate, current OCR systems already possess the ability to recognize these letters independently, without the need for further segmentation into "diacritic dots." As such, there is no requirement to classify these types of diacritics separately.
Another aspect that may be considered a diacritic is a letter that can function as a "diacritic letter" in certain combinations. However, given that this diacritic only appears in a limited number of letters (three in total), it is more practical to include these letters within an Arabic letter recognition model, rather than a dedicated diacritic recognition model. By doing so, we significantly simplify the detection process while preserving the accurate recognition of these specific letter-diacritic combinations.

Dataset Generation
The dataset used for training and evaluating our model is generated using a custom Python script. The script creates synthetic Arabic diacritic images with variations in font styles, slant angles, and perspective distortions (see Figure 2). The dataset generation script reads fonts from the "fonts" folder, which should contain TrueType (.ttf) or OpenType (.otf) font files. It then generates diacritic symbols for each Arabic letter, creating multiple variations for each letter to increase dataset diversity. To ensure the model's robustness and ability to handle diverse scenarios, we incorporated a total of 92 fonts, consisting of 46 digital computer fonts and 46 handwritten fonts. For each variation, the following steps are performed:
• A blank grayscale image of size 128x128 pixels is created with a black background.
• A font size of 12 is selected, and the corresponding Arabic letter is drawn in white using the chosen font.
• Random perspective distortions, slant angles, and skewness distortions are applied to create realistic variations of the diacritic symbol.
• Film grain is added to create subtle texture variations and improve model robustness.
• The resulting image is saved in JPEG format (introducing JPEG artifacts as an additional realistic variation), then converted back to PNG, and the label information is stored in a CSV file.
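As an illustration, the slant and film-grain stages of the steps above can be sketched with NumPy. This is a minimal sketch, not the paper's actual generation script: the "glyph" here is a placeholder white square rather than a diacritic rendered from a TTF/OTF font, and the transform implementations are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Blank 128x128 grayscale canvas with a placeholder white "glyph"
# (the real script renders an Arabic diacritic using a chosen font).
img = np.zeros((128, 128), dtype=np.float32)
img[56:72, 56:72] = 255.0

def slant(image, angle_deg):
    """Horizontal shear that imitates a slanted pen angle."""
    shear = np.tan(np.radians(angle_deg))
    h, w = image.shape
    out = np.zeros_like(image)
    for y in range(h):
        offset = int(round(shear * (y - h / 2)))
        for x in range(w):
            src = x - offset
            if 0 <= src < w:
                out[y, x] = image[y, src]
    return out

def film_grain(image, intensity, rng):
    """Additive uniform noise, clipped to the valid intensity range."""
    noise = rng.uniform(-intensity, intensity, size=image.shape)
    return np.clip(image + noise, 0.0, 255.0)

# Slant angle drawn from the -5..5 degree range, grain intensity up to 10,
# matching the ranges described in the augmentation section.
aug = film_grain(slant(img, rng.uniform(-5, 5)), intensity=10, rng=rng)
```

The real pipeline additionally applies perspective and skew transforms and a JPEG round-trip, which are easier to express with an image library such as Pillow.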

Dataset Description
The dataset consists of Arabic diacritic images with 14 distinct classes, where each class represents a specific diacritic symbol. For each diacritic, we generated 20 variations per font to diversify the dataset and represent different font styles, sizes, slant angles, and perspective distortions. Although the number of variations can be varied, we decided on 20 variations per diacritic to achieve a total dataset size of around 25,000 images. This target size was selected to reduce overfitting and to ensure that any accuracy loss that could occur would not be due to a small dataset.
The dataset was split into training and testing sets in a ratio of approximately 2/3 to 1/3, respectively. Out of the total 92 fonts available, the training set includes 62 unique fonts, while the testing set comprises 30 unique fonts (see Figure 3). By using distinct fonts in each set, we ensure that the model is exposed to unseen font styles during testing, making it more robust and practical for real-world scenarios.
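The unseen-font split described above can be sketched as follows. This is a sketch with placeholder font identifiers standing in for the 92 real font files:

```python
import random

# Placeholder identifiers standing in for the 92 actual font files.
fonts = [f"font_{i:02d}" for i in range(92)]

random.seed(42)
random.shuffle(fonts)

# 62 fonts for training, the remaining 30 held out for testing, so the
# test set contains only font styles the model never saw during training.
train_fonts = fonts[:62]
test_fonts = fonts[62:]
```

Splitting by font (rather than by image) is what guarantees that test-time fonts are genuinely unseen.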

Data Preprocessing
Before inputting the images into the model, the dataset images were preprocessed via the following steps to ensure consistency and enhance model performance.
Grayscale Conversion and Resizing. The images in the dataset were initially generated in grayscale to simplify the recognition task and reduce computational complexity. Grayscale images are known to capture the essential features of the characters, making them suitable for diacritic recognition. These grayscale images were then resized to a standardized resolution of 128x128 pixels. The selected resolution balances model complexity and computational efficiency while preserving crucial details.
Normalization. To ensure a consistent range of pixel intensity values, normalization was applied to bring all pixel values into the range [0,1]. This normalization helps with model convergence during training.
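The normalization step can be expressed in a few lines (a sketch assuming NumPy arrays as the image representation):

```python
import numpy as np

def preprocess(img_uint8):
    """Scale 8-bit grayscale pixel values into [0, 1] as float32."""
    return img_uint8.astype(np.float32) / 255.0

# A fully white 128x128 image maps to an array of ones.
x = preprocess(np.full((128, 128), 255, dtype=np.uint8))
```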

Data Augmentation Techniques
Perspective Distortions. Perspective distortions were introduced to simulate variations in viewpoint and angles. The amount of perspective distortion was controlled by a randomly generated factor, ranging from 0 to 0.05. This small range helps to keep the distortions within realistic bounds, ensuring that the generated images appear natural and representative of human variations.
Slant Variations. Slant variations were applied to imitate handwriting variations where characters may be slightly tilted. The slant angle is randomly set within a range of -5 to 5 degrees. Like perspective distortions, the chosen range allows for small but realistic slant variations.
Skewing and Blur. Skewing and blur were introduced to emulate real-world writing conditions, accounting for minor imperfections and variations in handwritten diacritic symbols. The skewing effect is achieved by applying a perspective transformation to the image; a transformation matrix is used to define the skew effect, with its values set within a range of -0.05 to 0.05. A Gaussian blur is applied to the image, and the radius parameter (standard deviation) is set to a random value within the range of 0 to 0.2. Both methods use small ranges to avoid extreme deformations and maintain legibility.
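The Gaussian blur stage can be sketched as a separable NumPy convolution. This is a sketch, not the paper's implementation (which presumably uses an image-library blur); the 3-sigma kernel radius is an assumption.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur; sigma is the kernel standard deviation."""
    if sigma <= 0:
        return img.copy()
    radius = max(1, int(round(3 * sigma)))  # assumed 3-sigma cutoff
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()
    # Convolve rows, then columns (a 2-D Gaussian kernel is separable).
    rows = np.apply_along_axis(np.convolve, 1, img, kernel, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="same")

# sigma drawn from the 0..0.2 range described above; fixed here for clarity.
blurred = gaussian_blur(np.ones((32, 32)), sigma=0.2)
```

With sigma capped at 0.2, the kernel is nearly an impulse, so the blur only softens edges slightly, which matches the stated goal of maintaining legibility.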
Film Grain. Film grain was added to the images to enhance model robustness and simulate the texture variations present in scanned or real-world handwritten texts. The intensity of the grain was set randomly between 0 and 10, and the grain size varied from 1 to 6. These values strike a balance between adding realistic texture and avoiding excessive noise.

JPEG Compression Artifacts. To mimic the imperfections introduced by image compression, random JPEG compression artifacts were introduced into the images during data augmentation. The quality level of the JPEG compression varied randomly from 50 to 100. This allows the model to handle the variations introduced by image compression commonly found in real-world scenarios.

Model Architecture
The proposed model for Arabic diacritic recognition is based on the AlexNet architecture, a deep convolutional neural network known for its success in image classification tasks. The AlexNet model was adapted for grayscale image inputs and consists of three convolutional layers with ReLU activation, each followed by a max-pooling layer. The features extracted by the convolutional layers are fed into a flattening layer and then diverge into two classifiers: one for diacritic classification and one for domain classification (see Figure 4).
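The two-headed architecture described above can be sketched in PyTorch. This is a sketch under the stated description (three conv/ReLU/max-pool stages, a flatten, and two classification heads); the channel widths and kernel sizes are illustrative assumptions, not the paper's exact values.

```python
import torch
import torch.nn as nn

class DiacriticNet(nn.Module):
    """AlexNet-style feature extractor with diacritic and domain heads."""

    def __init__(self, num_classes=14):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),                       # 128 -> 64
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 64 -> 32
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 32 -> 16
            nn.Flatten(),
        )
        self.class_head = nn.Linear(128 * 16 * 16, num_classes)
        self.domain_head = nn.Linear(128 * 16 * 16, 2)  # source vs. target

    def forward(self, x):
        h = self.features(x)
        # In the full model, h would pass through a gradient reversal
        # layer before reaching the domain head.
        return self.class_head(h), self.domain_head(h)

model = DiacriticNet()
logits, domains = model(torch.zeros(2, 1, 128, 128))
```

Both heads share the same flattened features, which is what lets the domain adaptation objective shape the representation used by the diacritic classifier.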
The AlexNet architecture has demonstrated outstanding performance in image classification due to its ability to learn hierarchical features from raw pixel inputs [16][10]. However, in the context of Arabic diacritic recognition, there exists a significant domain gap between the digital computer dataset (fonts) and real-world data, such as handwritten diacritic symbols in various styles. This domain shift can adversely impact the model's performance when applied to real-world scenarios.
To address the domain adaptation challenge, we extend the AlexNet model by incorporating a domain adaptation method. The domain adaptation technique aims to reduce the domain gap between the synthetic and real-world data distributions, enhancing the model's ability to generalize across different font styles and handwriting variations.
The domain adaptation method involves the addition of a domain classifier to the network. This classifier is responsible for predicting the source of the data, distinguishing between samples from the training set and the testing set. While the domain classifier learns to predict the domain, the shared feature extractor is simultaneously trained to degrade the domain classifier's accuracy, and the main diacritic classification task is optimized for maximum accuracy. This seemingly contradictory pair of objectives is achieved using a gradient reversal layer.
During training, the gradient reversal layer leaves the features unchanged in the forward pass but multiplies the gradients flowing back from the domain classifier by a negative constant (−λ), effectively flipping the gradient updates during the backpropagation process. The value of λ controls the strength of the gradient reversal operation; larger values of λ strengthen the domain adaptation process. This makes the features learned by the model more domain-invariant, as the model is encouraged to produce similar feature representations for both digital computer fonts and handwritten fonts. Consequently, the model becomes less sensitive to domain shift, leading to improved recognition performance when applied to real-world diacritic symbols. The gradient reversal operation R_λ can be expressed as:

R_λ(h) = h (forward pass),   ∂R_λ(h)/∂h = −λI (backward pass)

where λ is the chosen constant and h represents the features extracted from the previous layer. In simple terms, if L_d denotes the loss of the domain classifier G_d, then the gradient ∂L_d/∂h that flows back from G_d into the feature extractor during backpropagation is replaced by −λ ∂L_d/∂h.
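A minimal, framework-agnostic sketch of the gradient reversal operation is given below; real implementations hook this into an autograd engine (e.g., as a custom backward function), but the forward/backward behavior is just:

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; flips and scales gradients backward."""

    def __init__(self, lam):
        self.lam = lam  # the constant lambda from the text

    def forward(self, h):
        return h  # features pass through unchanged

    def backward(self, grad_domain_loss):
        # The gradient arriving from the domain classifier is multiplied
        # by -lambda, pushing the feature extractor toward features the
        # domain classifier cannot separate (domain-invariant features).
        return -self.lam * grad_domain_loss

grl = GradientReversal(lam=0.1)
h = np.ones(4)
g = grl.backward(np.full(4, 2.0))  # gradient of 2.0 becomes -0.2
```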
By incorporating the domain adaptation method, our diacritic recognition model not only performs well on the dataset but also demonstrates enhanced generalization capability across diverse font styles and handwriting variations, ensuring its practical applicability in various real-world OCR scenarios.

Model Training
The model was trained using the Adam optimizer with a learning rate of 0.001 and cross-entropy loss as the loss function. To strike a balance between computational efficiency and model performance, we selected a batch size of 16. Given the large number of images in our training dataset, a batch size of 16 enables us to leverage the benefits of mini-batch gradient descent, effectively updating the model's parameters with reduced computation time per iteration.
Regarding the number of epochs, we conducted experiments to determine the optimal value. After extensive analysis, we found that the recognition model achieved relatively stable accuracy after approximately 15 epochs. The accuracy tended to fluctuate slightly, but there were no significant differences beyond this point. Therefore, to ensure sufficient training and capture any subtle improvements, we set the number of epochs to 20. This decision accounts for the large dataset size, where a greater number of epochs may lead to overfitting, while a smaller number of epochs could result in an undertrained model.

After training, the model produces two outputs: the class output and the domain output. To correctly evaluate the performance of the model, we need a method to measure its total loss. To this end, a domain adaptation hyperparameter α was used to control the importance of the domain adaptation loss during training. The domain adaptation technique aims to align the feature representations of digital computer fonts and handwritten fonts, effectively reducing the domain gap. The value of α determines the relative influence of the domain adaptation loss compared to the diacritic classification loss. The total loss can be expressed as:

L_total = L_class + α · L_domain

After conducting extensive experiments, we chose an α value of 0.1. This value strikes a balance between the importance of the domain adaptation objective and the diacritic classification objective. A smaller value of α may result in insufficient adaptation, leading to limited improvement in domain generalization. Conversely, a larger value of α may overshadow the diacritic classification task, hindering the model's ability to accurately recognize diacritic symbols. By setting α to 0.1, we encourage the model to adapt its feature representations to be domain-invariant while maintaining a strong focus on diacritic classification performance.
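The weighted combination of the two objectives can be sketched as follows; the loss values here are placeholders, and the weighting factor (called `alpha` in the code) is the domain adaptation hyperparameter from the text:

```python
def total_loss(class_loss, domain_loss, alpha=0.1):
    """Combine the two objectives; alpha weights the domain adaptation loss."""
    return class_loss + alpha * domain_loss

# Placeholder per-batch loss values, purely illustrative.
loss = total_loss(class_loss=0.8, domain_loss=0.5)
```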
For the domain adaptation process, we are required to choose the constant λ that controls the overall strength of the gradient reversal. After multiple experiments, we chose a λ value of 0.1. We found that, because the variations between individual diacritic images are small, the domain shift that occurs during training is relatively small as well, and a large λ value therefore tends to reduce recognition accuracy.

RESULTS
The trained model was evaluated on the 8,400 images in the testing set of our database. We evaluate the recognition performance in terms of precision, recall, F1-score, and overall accuracy, given by:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 × Precision × Recall / (Precision + Recall)
Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP and TN denote the number of true positive and true negative predictions, respectively, and FP and FN denote the number of false positive and false negative predictions, respectively. Table 1 shows the recognition performance on the testing dataset (including results for individual diacritics) in terms of these performance metrics. As shown in Table 1, the overall accuracy is 0.97, indicating that the proposed model can achieve excellent recognition capability across diverse diacritic styles and handwriting variations.
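These four metrics follow directly from the prediction counts; a sketch with illustrative counts (not the paper's actual confusion-matrix values):

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Illustrative counts for a single diacritic class.
acc = accuracy(tp=90, tn=5, fp=3, fn=2)
```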
Figure 5 shows the confusion matrix of the model on the testing set. From this confusion matrix, we can observe that the model generally achieved high accuracy for most diacritic categories, with only a few misclassifications.
One possible cause of misclassification is the complexity of and similarity between diacritics. The diacritics with the most misclassifications were Shadda with Damma, Shadda with Alef, and Shadda with Dammatan (see Figure 6). In general, compared to other diacritics, these diacritics consist of relatively complex shapes with similar features. Thus, it is not surprising that the recognition model would exhibit some confusion between them.
Another possible cause of misclassification is the presence of subtle strokes that can be degraded at small font sizes. In both cases, the difference between the confused diacritics is a single additional stroke. Misclassification can occur when the font size is very small: since the additional stroke is small and subtle, at small font sizes it sits very close to the other stroke, and the model has a difficult time determining whether it is seeing one slightly thick stroke or two subtle strokes.
Despite these occasional misclassifications, the overall results showcase the effectiveness of the proposed model, along with the integration of the domain adaptation technique. The high accuracy and robustness demonstrated by the model make it a powerful tool for accurately comprehending Arabic texts.
Since there are very few research papers that focus on diacritic recognition for Arabic text, it is challenging to compare our model directly with others.Most existing OCR models for Arabic text do not handle diacritics, which makes benchmarking against popular Arabic OCR datasets like the Hijja dataset, APTI [15], and the AHCD dataset [14] difficult, as they only contain Arabic letters without diacritics.
Our model's evaluation on the custom-generated dataset shows promising results.It achieves a reasonable overall accuracy and performs well on specific diacritic categories, as indicated by precision, recall, and F1-scores.However, it is important to note that the dataset used for evaluation is synthetic and may not fully represent the diverse variations present in real-world Arabic text.Real-world Arabic text includes a wide range of fonts, handwriting styles, and diacritic placements, making the problem more challenging.

CONCLUSION
The recognition of Arabic diacritic symbols is crucial for accurate comprehension of Arabic texts, especially in domains such as education, communication, and language processing. However, existing OCR systems for Arabic text often lack the capability to recognize diacritics, highlighting the need for research and development in this area.
In this paper, we proposed a novel approach for Arabic diacritic recognition by incorporating a domain adaptation method into the AlexNet architecture, a deep convolutional neural network known for image classification tasks. We generated a custom dataset of synthetic Arabic diacritic images with variations in font styles, sizes, slant angles, and perspective distortions. Our model was trained using this dataset and used a domain adaptation method to enhance its generalization capability across diverse diacritic styles and handwriting variations.
Our model's evaluation on the custom-generated dataset showed promising results of 97% accuracy, but its true performance in real-world scenarios remains to be tested. Unfortunately, the lack of benchmark datasets with diacritic annotations poses a challenge in directly comparing our results with those of other research papers. Additionally, to fully integrate our diacritic recognition model into an Arabic OCR system, we would need a diacritic segmentation model to segment diacritics from letters before combining the outputs.
In conclusion, this research lays the groundwork for further exploration in the field of Arabic diacritic recognition. Future work should focus on creating benchmark datasets with diacritic annotations to enable direct comparisons among different diacritic recognition models. Additionally, efforts should be made to develop diacritic segmentation models that can work seamlessly with diacritic recognition models, ultimately leading to more comprehensive Arabic OCR systems capable of accurately processing diacritic-rich Arabic texts.

Figure 1 :
Figure 1: Selected diacritics for the recognition model.

Figure 2 :
Figure 2: Sample of the dataset, including digital computer fonts and handwritten fonts.

Figure 3 :
Figure 3: Diagram that showcases how the dataset is split.

Figure 4 :
Figure 4: CNN model architecture with domain adaptation.
Figure 7 demonstrates this scenario; shown are the Fatha and Fathatan diacritics [Figure 7 (a)], and the Shadda with Kasra and Shadda with Kasratan diacritics [Figure 7 (b)], which represent diacritics that have trouble being classified due to an additional stroke.

Figure 5 :
Figure 5: CNN with Domain Adaptation Confusion Matrix.

Figure 7 :
Figure 7: Diacritics that have trouble being classified due to an additional stroke. Case A (a) and case B (b).

Table 1 :
CNN with Domain Adaptation Classification Report.