
An Arabic Manuscript Regions Detection, Recognition and Its Applications for OCRing

Published: 13 February 2023


Abstract

The problem of Region of Interest (RoI) detection in document layout analysis and document recognition has recently become an essential topic in OCRing systems. Arabic manuscript layout analysis and OCRing recognition using language detection, document category, and RoI with Keras and TensorFlow are state-of-the-art topics that should be investigated. This article investigates the problem of Arabic manuscript recognition with respect to OCRing-based recognition. A new framework architecture is proposed that integrates the Fast Gradient Sign Method (FGSM), implemented with Keras and TensorFlow, with adversarial image generation during the training procedure. The article also aims to improve OCRing accuracy through image enhancement, alignment, layout analysis, and recognition using deep learning in a multilingual system. RoI detection is performed using a custom deep learning model trained with bounding box regression in Keras and TensorFlow. We investigate an extension of the Page Segmentation Method (PSM) that enhances the OCRing parameter modes and improves the accuracy of the Arabic OCRing system through a reinforcement strategy. The article therefore achieves a significant improvement in OCRing results due to three parameters: language identification, document category, and RoI type (table, title, paragraph, figure, and list). The model is based on a "region proposal algorithm," the basis of CNN object detectors, to find the number of RoIs. The proposed framework performs three distinctive tasks: (1) a CNN architecture for adversarial training, (2) an implementation of the FGSM with Keras and TensorFlow, and (3) an adversarial training script combining the CNN and the FGSM. Experiments on an Arabic manuscript dataset, including Arabic text, English/Arabic documents, and Latin digit datasets, demonstrate the accuracy of the proposed method.
Moreover, the proposed framework performs well and succeeds in defending against adversarial attacks and adversarial images. The experimental results on our collected dataset illustrate the novelty of the proposed framework over existing PSM methods, which can be extended and updated to improve the quality of the OCRing system. The results show that extending the PSM with RoI types, language ID, and document/manuscript category can improve OCRing accuracy. The experimental results also show significant performance by the new framework, with accuracy reaching 99% compared to related methods.


1 INTRODUCTION

The Arabic language was the center of knowledge from the 8th to the 14th centuries, a period known as the Islamic golden age of knowledge [1, 2]. As a result, millions of Arabic manuscripts on a wide variety of topics are scattered across many countries in the world. Due to the physical degradation of the paper media and the decay of the ink, studying and maintaining these manuscripts is challenging. Classification and categorization of these historical manuscripts are needed to understand cultural and historical references [2].

Recently, Deep Convolutional Neural Networks (DCNN) have been used in various vision tasks [3]. In real applications, recognizing visual objects such as handwriting and musical notes is cast as a sequence recognition problem.

Standard computer vision techniques for object detection (textual or non-textual) are mainly classified into textual-object and non-textual-object approaches. In textual-object detection, sliding windows and a subsequent recognition classifier are used for OCRing [4].

We focus on design methods such as Hierarchical Agglomerative Clustering (HAC) to cluster each Region of Interest (RoI) object type, and on the Fast Gradient Sign Method and its use for adversarial image generation in OCRing systems [5].

Thus, this article is organized as follows: Section 2 introduces the related works in this domain. Section 3 explains the proposed framework and describes the different design stages for Arabic manuscripts and the new Fast Gradient Sign Method (FGSM); the details of the FGSM are described in that section. Section 4 describes the strategic methodology used to segment and recognize RoIs and evaluates the output with experimental indicators for Arabic manuscript recognition. Section 5 discusses the experimental results. Conclusions and future work are presented in Section 6.

This article raises and answers the following questions:

  • How can preprocessing dramatically improve Arabic OCR?

  • Are there additional parameters to expand the page segmentation modes (PSM) to enhance the OCRing accuracy?


2 LITERATURE REVIEWS AND RELATED WORKS

Language modeling based on n-gram statistics improves the recognition accuracy of handwritten and printed texts [6]. This work introduced a recurrent connectionist language model to improve LSTM-based Arabic text recognition in videos.

In general, a language model is used to estimate word probabilities in order to correct errors related to character confusion or incorrect segmentation [7]. Other works on Arabic text detection and recognition within video frames, based on the LSTM, can also be found [6, 8].

There are many online Arabic OCRing tools, such as i2OCR,1 OCRNow,2 ABBYY,3 SimpleOCR,4 OCR-Text Scanner (for Android), CamScanner,5 and Sakhr.6 Most of these OCRing tools have limited accuracy for different categories of manuscript types.

Recently, an OCR correction approach based on self-supervised deep learning was presented [9]. This correction targets medical terminology for medical report generation. Accordingly, a domain-specific dataset of medical documents was used for evaluation.

Many existing datasets are prepared and used for OCRing and image processing challenges. Some of these datasets were announced in several ICDAR challenges [10, 11]. Few of them are available and used in layout analysis [12]. In addition to document layout analysis, further work has recently been done to understand the content of documents, such as table detection [13] from images using deep learning approaches [14, 15].

Generally speaking, the number and size of Arabic document datasets available for layout analysis are limited.

Before we can investigate our Arabic OCR, document layout analysis should first be implemented using document metadata, especially the writing style (language script), the skew angle (orientation), and the style of the documents. The writing style, or script style, refers to the writer's own style and depends on the language.

Many OCR systems become more accurate after documents are improved during the preprocessing stage. Before the whole OCR process (automatic segmentation and document analysis), we need to clean up the entire images. Noise elimination is therefore a proper auxiliary heuristic process for removing noise from the images.

Computer vision and image processing methods are used to clean up the input images during the preprocessing phase. Basic methods such as distance transforms, thresholding, and morphological operations can be applied to clean up input images in the preprocessing phase of the Arabic OCR engine.
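As an illustration of the thresholding and morphological operations mentioned above, the following minimal sketch (assuming NumPy is available; the helper names are ours, not the article's implementation) binarizes a grayscale page and erodes isolated noise pixels:

```python
import numpy as np

def binarize(img, threshold=128):
    """Global threshold: dark ink becomes foreground (1), paper becomes 0."""
    return (img < threshold).astype(np.uint8)

def erode(binary, k=3):
    """Naive k x k morphological erosion: a pixel survives only if its
    whole neighborhood is foreground, which removes salt noise."""
    pad = k // 2
    padded = np.pad(binary, pad, mode="constant")
    out = np.zeros_like(binary)
    for i in range(binary.shape[0]):
        for j in range(binary.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].min()
    return out
```

In practice a library routine (e.g., OpenCV's erosion) would replace the explicit loops; the sketch only shows the cleanup principle.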

Geometric layout structure combined with Arabic OCR approaches has been used to recognize these images [16, 17]. Recently, document analysis approaches based on deep learning have become available and can be trained for layout analysis of English documents [18].

Segmentation is a technique to extract regions from a scanned document [19]. Segmentation techniques basically fall into four categories: Hough Transform [20], projection profiles [21], smearing [22], and segmentation based on connected components.

An important work presents a new model as a backend computation that supports Arabic Document Information Retrieval (ADIR) with OCR services. Different services that support OCRing, such as document analysis and information retrieval, including dataset preparation, annotation, and recognition, are discussed [23].

Tesseract uses Page Segmentation Modes (PSM)7 to analyze the layout of the image/document. There are 14 PSM modes for improving OCRing accuracy. The default PSM parameter value is 7, which treats the image/document as a single text line. During layout analysis, the OCRing system needs to pass the PSM mode. Table 1 illustrates the 14 modes of the page segmentation operations [24, 25].

Table 1.
Mode Value   Mode Description
0            Orientation and script detection (OSD) only.
1            Automatic page segmentation with OSD.
2            Automatic page segmentation, but no OSD or OCR.
3            Fully automatic page segmentation, without OSD.
4            Assume a single column of text of variable sizes.
5            Assume a single uniform block of vertically aligned text.
6            Assume a single uniform block of text.
7            Treat the image as a single text line.
8            Treat the image as a single word.
9            Treat the image as a single word in a circle.
10           Treat the image as a single character.
11           Sparse text: find as much text as possible.
12           Sparse text with OSD.
13           Raw line: treat the image as a single text line.
Table 1. Tesseract Page Segmentation Modes (PSM) [4, 5]

There is also an additional parameter to work with, known as the OCR Engine Mode (OEM). The OEM includes two OCR engines: (1) the legacy engine and (2) the LSTM engine. Four modes can be selected within the OCR engine mode. Table 2 illustrates the four modes of the Tesseract OCR engines [25].

Table 2.
OEM Parameter Mode   OEM Parameter Description
0                    Legacy engine only.
1                    LSTM engine only.
2                    Legacy + LSTM engines.
3                    Default, based on what is available.

Table 2. Tesseract OCR Engine Modes (OEM)
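The PSM and OEM values in Tables 1 and 2 are passed to the engine as a configuration string. The sketch below is only illustrative: the `tesseract_config` helper and the `ara` language code are our assumptions, and the commented-out call assumes pytesseract plus a Tesseract installation with Arabic trained data.

```python
def tesseract_config(psm=7, oem=3, lang="ara"):
    """Compose the (config, lang) pair handed to the Tesseract engine,
    validating the mode ranges from Tables 1 and 2."""
    if not 0 <= psm <= 13:
        raise ValueError("PSM mode must be in 0..13")
    if not 0 <= oem <= 3:
        raise ValueError("OEM mode must be in 0..3")
    return f"--psm {psm} --oem {oem}", lang

config, lang = tesseract_config(psm=4, oem=1)  # single column, LSTM engine
# import pytesseract
# text = pytesseract.image_to_string(image, lang=lang, config=config)
```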


3 THE PROPOSED FRAMEWORK ARCHITECTURE

The framework architecture of the proposed work consists of three layers: the CNN layer, the FGSM with Keras and TensorFlow layer, and the transcription layer. The CNN layer extracts features from each input image during training on the Arabic MNIST dataset or any other dataset. Although the CNN is composed of diverse types of network construction (CNN and RNN), it can be trained jointly using one loss function (mean squared error (MSE), categorical cross-entropy, or binary cross-entropy).

Our second layer includes the FGSM for the proposed adversarial image generation. This layer also trains the CNN on the MNIST dataset and demonstrates how the FGSM can be used to fool the trained CNN into making incorrect predictions.

Preprocessing is an essential phase for improving the image quality and the orientation of the input images. The most important step of preprocessing is object rotation, or RoI deskewing. RoI deskewing refers to finding the rotation angle of a piece of text (paragraph, line, or word) in an image, so that we can present a correctly oriented object to the OCRing system and obtain higher OCRing accuracy. In this case, the concepts of orientation, text script detection, text script orientation, and correct script text orientation are applied. We therefore need an optional parameter for fine-grained control over the RoI filtering process, to estimate the rotation angle of the input image RoIs and the predicted script of the text in that image. This mechanism is known as orientation and script detection [27–29].

This is possible after scanning the input document, aligning it, and finding the locations of form fields using computer vision/image processing techniques. After that, we use the Arabic OCR engine to recognize the entire text.

3.1 Fast Gradient Sign Method (FGSM)

Adversarial attacks may be performed during the recognition stage of OCRing systems. We therefore need to guard against such attacks by using the FGSM. The implementation of the FGSM is based on Keras and TensorFlow. From this point, the CNN is trained using the Arabic MNIST dataset. The FGSM works according to the following procedure (see Figure 1).

Fig. 1.

Fig. 1. The fast gradient sign method procedure.

The Arabic OCR is capable of conducting automatic segmentation and page layout analysis (parsing text from complex backgrounds) during OCR processing.

Many methods are used to collect and quantify the contents of a manuscript, via a series of texture descriptors, black-box color descriptors, and shape image descriptors. Figure 2 shows classic manuscript classification using hand-crafted features and the corresponding deep learning approach using convolutional networks.

Fig. 2.

Fig. 2. The classical layout analysis using handcrafted classification.

A large Arabic handwritten digits dataset is used. It is composed of 70,000 images in BMP format: the training set includes 60,000 images and the testing set includes 10,000 images. The dataset was written by 700 participants, each of whom wrote each Arabic digit 20 times. Documents were scanned at 300 dpi, and some noise was added manually.

3.2 Overall Segmentation

Page segmentation (or layout analysis), with its two internal processes (Region of Interest detection and text segmentation), is a prerequisite for Arabic document layout analysis. Accurate segmentation of each region (RoI), line, word, and character allows us to obtain high accuracy in any Arabic OCRing system. Due to complications in some types of Arabic documents, there are complications in regions (text or non-text), lines, and word styles. To overcome these complications, the proposed work first handles RoI segments, then text-line segments, and finally word or character segments. In addition, the impact of the page segmentation will be evaluated and computed.

The proposed layout analysis receives the input documents (images) and applies preprocessing computer vision tasks to align the input image with the corresponding ground truth. Then, the model detects the existence of Arabic text in the aligned document. The RoI locations are then located, extracted, and passed into the Arabic OCR with the LSTM deep learning recognition engine. Figure 2 shows the full dataflow diagram of the proposed pipeline for the Arabic layout analysis.

The overall proposed model of Arabic layout analysis (or page segmentation) is described as follows. The input of the model is the image-based document after the preprocessing module has been applied. First, the document is analyzed to obtain the language and the document type (document category) by providing the ground-truth metadata. Second, each type/domain of image has threshold parameters. The document is therefore analyzed to obtain RoIs, and each region is classified into text and non-text regions. The text region is segmented into lines of text, and each segmented line is segmented into separate words (or characters if needed). The text region is then sent to the text recognition process.

To overcome the lack of ground-truth data, we collected the ADLA document collection from open-access websites. Most of the documents in ADLA are provided as JPG or PNG; therefore, we created the ground truth in XML or HTML format. Figure 3 illustrates an example of the structure representation together with the ground-truth representation. Accordingly, the content of the JPG/PNG and the XML/HTML have similar or equivalent format structures.

Fig. 3.

Fig. 3. Analyzing the image and matching the layout with the XML representation to generate annotation of page layout.

The methodology used to create the ground truth is as follows: (a) an annotation process for an example Arabic image, segmenting the main text box and the different line text boxes (surrounding boxes); (b) bounding boxes of the equivalent segmented lines, i.e., the generated file creates the ground truth by assigning a bounding box to each written line.

Each region in the document can be associated with the corresponding field in the ground truth template. Accordingly, each location of the input document is known from the ground truth.

3.3 Ground-truth Dataset Annotation

Many Arabic manuscripts include additional handwritten knowledge comments (especially Arabic numerals in historical documents). Transferring this knowledge from these manuscripts into digital form is therefore very important. This knowledge is stored and registered on the manuscripts themselves, and it could be important to transfer it to others who can read and understand it. Figure 4 provides samples of Arabic scripts with different orientations and sample writers.

Fig. 4.

Fig. 4. Samples of ancient Arabic manuscripts: an Arabic manuscript with 14 lines per page, text written in Naskh script in black ink, and marginal knowledge notes (17th century [30]).

Nowadays, many tools are available [31, 32] to perform labelling (object labelling or classification) and segmentation of manuscript images, which helps us to create datasets. However, a problem arises when dataset information is not available (or not easily accessible) for different manuscript domains. It is therefore essential to develop tools that can help generate ground-truth data from Arabic websites.

We use and adapt a form-labeling tool to create our ground-truth dataset for the five categories of Arabic manuscripts. This tool can be used to create and modify existing labels (annotations) of Arabic or Latin images. The tool8 requires the Tkinter and JSON libraries. With it, we can draw rectangles and polygons, make boxes tight, and assign box types. The tool was also updated to add an additional language feature, which we contributed back (see Figure 5).

Fig. 5.

Fig. 5. A form labeling tool to create our Arabic ground-truth for two Arabic images of early printed document.


4 REGION OF INTEREST (ROI) STRATEGIC DETECTION

As mentioned in the literature review, RoI or region detection is a technique to extract regions from a scanned document. RoI segmentation is basically categorized into four different categories: Hough Transform [26], projection profiles [23, 28], smearing [29], and segmentation based on connected components. In our case, we first used image processing and vertical projection profiles, but some characters overlapped between lines, so we segmented lines using connected-components techniques, which yields line segments with fewer overlapping characters.
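To make the connected-components idea concrete, here is a minimal sketch (pure Python; the helper is our illustration rather than the article's implementation) that labels 4-connected foreground regions of a binarized image, where each resulting region would correspond to a line fragment or character blob:

```python
from collections import deque

def connected_components(binary):
    """Label 4-connected foreground regions of a binary grid (list of lists).
    Returns a dict: label -> list of (row, col) foreground pixels."""
    h, w = len(binary), len(binary[0])
    labels = [[0] * w for _ in range(h)]
    regions, next_label = {}, 1
    for r in range(h):
        for c in range(w):
            if binary[r][c] and not labels[r][c]:
                # Breadth-first flood fill from this unlabeled pixel.
                queue = deque([(r, c)])
                labels[r][c] = next_label
                regions[next_label] = []
                while queue:
                    y, x = queue.popleft()
                    regions[next_label].append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and binary[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
                next_label += 1
    return regions
```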

The training module is used to extract and create features after segmenting and extracting RoIs (bounding boxes) from a large set of Arabic documents and storing each RoI as a separate labeled image. The label contains the text, which makes this dataset much easier to use. Extracting training documents can be automated through a smart computerized tool or done manually with an assistant tool. The RoI layout analysis can be boxed using (1) the whole image ("a full page within its margins"), (2) paragraphs, (3) lines, (4) words, or (5) characters (see Figure 6). Accordingly, each RoI region has a separate algorithm to deal with it, and a trial-and-error approach is used to determine the kernel size of each algorithm.

Fig. 6.

Fig. 6. RoIs' layout analysis segmentation for four types of Arabic manuscripts.

Therefore, a simple procedure is applied as the following algorithm.

Object or RoI detection is performed using a custom deep learning model trained with bounding box regression in Keras and TensorFlow. This model is based on a "region proposal algorithm," the basis of CNN object detectors, to find the number of RoIs. The algorithm performs a selective search over any document or image to identify potential RoIs. These regions are used to extract output features from a pre-trained CNN, which are then fed into a final classifier (such as an SVM). In our work, the locations of the RoIs are treated as bounding boxes, and the SVM predicts the class label for each RoI bounding box. The detector thus produces bounding-box and class-label predictions (document type) for the RoIs in an image of the dataset. The network architecture is adapted according to the following procedure:

We use the ADLA dataset, composed of 300 images and the corresponding bounding-box coordinates (a CSV file). An additional dataset can also be used. The annotated Arabic images of the dataset generate the equivalent XML/HTML tagging codes using an annotation algorithm to identify each segment of the RoI (separate lines or separate words). Hence, an automated evaluation tool is needed to evaluate how well an image page from the dataset is annotated [35]. Missing metadata information, for example historical and editorial details in the PubLayNet dataset, reduces the annotation quality from 99% to 90% [18].

An adaptive approach ensembles two classifiers, one based on Arabic image manuscripts and a second based on Arabic textual content, to classify and support the ground truth.

4.1 Language Detection and RoIs Orientation

Building a model to enhance OCRing should start with language detection for each RoI; that is the holy grail of enhancing the Arabic OCRing process. Little work has been done on language detection within OCRing systems, and language detection and recognition in natural Arabic manuscripts has not received attention. There is no dataset that covers language detection during OCRing. Algorithm 4 describes the language detection.

Figure 7 illustrates examples of Arabic manuscript images. The first is a paragraph of Arabic text from our translated book "Distributed Systems9" at the King Abdulaziz University center. The second image is the same paragraph of Arabic text, rotated 90 degrees clockwise.

Fig. 7.

Fig. 7. Top left: the Arabic text document. Middle: the same Arabic text rotated clockwise. Bottom: another normal English text. Last: the same paragraph of English text, this time rotated clockwise.

4.2 Document Type and RoIs Classification

The proposed solution trains the collaborative classifier on the annotated (ground-truth) dataset. Empirical results indicate that this combination achieves better accuracy. As a result, we implemented two classifiers: an Arabic Document Layout Analysis (ADLA) classifier that analyzes the layout of Arabic documents, and an Arabic text classifier that considers their textual content. Figure 8 shows the training process of these two classifiers based on the ADLA dataset.

Fig. 8.

Fig. 8. Training the classifiers based on the dataset.

Each document within the training set is converted into RoI images, one image per RoI, based on the dataset's labeled-category methodology. These labeled RoI categories are fed as training data to a convolutional neural network classifier. Large volumes of training data are required to train the CNN's visual features. Accordingly, CNN-based image classifiers depend on transfer learning, with low-level pre-trained features and high-level layers trained on the large dataset of interest. The VGG-16 architecture (trained on ImageNet) is therefore used as a feature extractor by excluding the final layer [36]. Following this scenario, a fully connected layer is appended to the CNN, with an output layer whose size equals the number of language categories (Ƚ1). The weights of this final layer are trained using the labeled data.

The next step is the Arabic text classifier, which recognizes the text from the documents (or RoIs). A tokenization process represents the text as an n-gram vector using term frequency and inverse document frequency (TF-IDF). Then an SVM based on the sparse vector representation is trained on the resulting labeled feature vectors.
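A minimal sketch of the TF-IDF vectorization step (plain Python; in practice a library vectorizer would be used, and the helper name is ours):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute simple TF-IDF vectors for a list of tokenized documents.
    TF is the term's relative frequency within a document; IDF is
    log(N / document frequency) over the whole collection."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))        # document frequency counts each doc once
    n = len(docs)
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[t] / len(doc) * math.log(n / df[t]) for t in vocab]
        vectors.append(vec)
    return vocab, vectors
```

The resulting sparse-style vectors are what the SVM is trained on.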

Finally, at the level of language detection, given a new document (x), the fusion model estimates the probability P(y | x) that the RoI or the input document x belongs to category y (for all y ∈ Ƚ1). Figure 9 shows such a scenario for input x, computing the probabilities {P(y1i | x1i): y ∈ Ƚ1} to estimate the language category of x.
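The fusion step can be sketched as a simple weighted combination of the two classifiers' probability estimates (the `fuse` helper and the equal weighting are illustrative assumptions, not the article's exact fusion rule):

```python
def fuse(p_image, p_text, weight=0.5):
    """Late fusion of two classifiers' probability dicts over the same
    label set; returns normalized fused probabilities P(y | x)."""
    fused = {y: weight * p_image[y] + (1 - weight) * p_text[y]
             for y in p_image}
    total = sum(fused.values())
    return {y: p / total for y, p in fused.items()}

# The predicted category is the argmax of the fused distribution.
fused = fuse({"ar": 0.8, "en": 0.2}, {"ar": 0.6, "en": 0.4})
```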

Fig. 9.

Fig. 9. Estimation of language, document category, RoI types using fusion classifiers.

Algorithm 5 describes the document category type (Early printed, Printed, Thesis, Calligraphy, or Handwritten) or RoI type (Title, Text, Table, Figure, List).

An Attention-based Encoder Decoder (AED) architecture is used to implement the proposed work (Figure 9). The encoder extracts the visual features of the input images into RoI structures. The decoder recognizes the RoI contents by extracting feature sequences. The data analysis includes three main problems: (1) language detection, (2) manuscript categories, and (3) RoI content recognition. The ADLA provides an Arabic dataset available in PNG or JPG and XML format. The approach presented in this article is based on the generation of Arabic documents automatically annotated with the locations of layout RoIs using ADLANet, in addition to existing state-of-the-art object detection using computer vision, which successfully reproduces the annotated set. The five category titles are challenging due to the heterogeneity in the ways that different domains' titles appear in the annotated layout analysis.

To create a multi-class Arabic manuscript detection model based on Keras and TensorFlow, the Visual Geometry Group 16-layer network (VGG-16) is modified. The network head of the new architecture is updated by removing the fully connected layer head and constructing a new fully connected head with two branches.

The first branch is responsible for the RoI predictions. It consists of a series of fully connected layers ending in four neurons corresponding to the bounding box coordinates of the RoI. These coordinates include the top-left (x, y) and bottom-right (x, y) coordinates, with a sigmoid activation for each of the four neurons.

The second branch is responsible for the Arabic manuscript class label predictions. This branch ends with a soft-max classifier.

Finally, the new architecture is fine-tuned using our custom ADLA dataset for Arabic manuscript RoI detection (five categories). Accordingly, we have two decisions to make: first, to predict the manuscript category type; second, to predict the RoI class type (title, text, table, figure, and list) [26] within each image. The proposed model therefore detects where the RoIs are in an input manuscript and predicts what the different RoI types are. In our case, we use a subset of the Arabic manuscripts dataset, which can be used to train manuscript and RoI type detection models. Specifically, the dataset contains the following classes: early printed, printed, thesis, news, and handwritten.

To implement this structure, the Arabic manuscript ADLA dataset has two subdirectories: the annotation directory contains five CSV files (one for each category), plus our RoI locations (bounding box coordinates). The structure includes the image name, x-start and y-start coordinates, x-end and y-end coordinates, and the category label.
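Reading that annotation structure can be sketched with the standard csv module (the file name and row values below are made up for illustration):

```python
import csv
import io

def load_annotations(csv_text):
    """Parse rows of the form described above:
    image-name, x-start, y-start, x-end, y-end, category label."""
    rows = []
    for name, x0, y0, x1, y1, label in csv.reader(io.StringIO(csv_text)):
        rows.append({"image": name,
                     "box": (int(x0), int(y0), int(x1), int(y1)),
                     "label": label})
    return rows

sample = "page_001.png,12,40,580,120,title\npage_001.png,12,140,580,700,text"
annots = load_annotations(sample)
```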

4.3 RoIs Architectures

The remainder of this section is organized as follows for OCRing table recognition. The next subsection describes the architecture of the Arabic table recognition and the output of aligned data elements. Performance evaluation is presented at the end of this section.

The main objectives of this work are detecting the RoI table of text in the scanned image, discovering data cell elements of multi-column and multi-row RoIs, OCRing the extracted elements of such RoIs, and finally assembling the recognized elements into a data-frame table. The Arabic dataset of printed images was collected by scanning pages of FCIT journal research papers published between January 1, 2001 and December 31, 2020. Each image in the collected data includes a table with multiple columns and rows. To start, the image is fed into the processing phase of the proposed model (Figure 10(a)). We then extract the RoI of the table itself (Figure 10(b)). Once we have the segmented table RoI, text localization for each element in the table is applied to generate the text-element bounding boxes with their coordinates.

Fig. 10.

Fig. 10. The largest foreground region is the "Omission and arbitrary failures" table we are interested in and want to OCR.

4.3.1 Table Architecture and Detection Method.

Hierarchical Agglomerative Clustering (HAC) is used to aggregate the data elements according to the distance between them [38]. A suitable distance can be determined according to the most similar data elements. For example, if the RoI is predicted to be a table, then this idea is organized in the following two algorithms.

The following algorithm includes the main steps of the proposed work:

4.3.2 Table Clustering HAC Method.

At the output level, a closing morphological operation is applied to detect large blocks of text. The largest foreground region is the "Omission and arbitrary failures" table we are interested in and want to OCR.

The HAC is used to cluster the element RoIs in the table. The initial cluster is determined by the x-coordinate of the RoI element's bounding box. The model treats each RoI element as a single cluster and then computes the distance between clusters. A threshold is defined first and then used to compare the distance between RoI elements. If the distance is less than the threshold (similar clusters), the two RoI elements are successively merged.

The proposed multi-column work accepts an Arabic input image (Figure 10(a)), detects the RoI table region data (Figure 10(b)), extracts it using a threshold value (Figure 10(c)), and then OCRs it, taking row and column ordering into account. The output is displayed in a nicely formatted table using the tabulate library [39].

The proposed model applies HAC to the x-coordinate values, grouping those that are similar (identical or near-identical x-values). We assume that text belonging to the same column has similar x-coordinates; such columns are therefore combined. If the RoI tabular data has a large amount of white space between rows, the default threshold is increased; if there is less white space between rows, the threshold is decreased accordingly.
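For the one-dimensional x-coordinate case, single-linkage agglomerative clustering with a distance threshold reduces to sorting the coordinates and merging neighbours whose gap is below the threshold. A minimal sketch (the helper name and threshold value are illustrative):

```python
def cluster_columns(x_coords, threshold=20):
    """Single-linkage HAC on 1-D x-coordinates: after sorting, two
    neighbouring values belong to the same cluster (column) when the
    gap between them is below the threshold."""
    values = sorted(x_coords)
    merged = [[values[0]]]
    for x in values[1:]:
        if x - merged[-1][-1] < threshold:
            merged[-1].append(x)   # same column
        else:
            merged.append([x])     # start a new column
    return merged
```

Each resulting group corresponds to one table column; the threshold plays the role of the white-space-dependent parameter described above.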

The most significant issues during layout analysis can be observed in the overall selection of the 14 PSM parameters. Figure 11 shows the layout analysis of an Arabic manuscript that includes two regions (RoIs: table and text). At the table level, the analysis includes the RoI table detection results, in addition to the word-level RoIs sent to the OCRing recognition phase.

Fig. 11.

Fig. 11. The complete training process of combining normal images and the generated adversarial images together.

4.4 Classifying Handwritten Arabic Numerals

The experiments are implemented using the FGSM on a neural network. The FGSM calculates the gradient of the loss function with respect to the input scanned image and generates an adversarial image (nearly identical to the original) that maximizes the loss.

We express the FGSM using the following criterion: (1) \( \begin{equation} {X_{{\rm{Adversarial}}}} = X + \varepsilon \cdot {\rm{sign}}({\nabla _x}J(\theta ,X,Y)) \end{equation} \)where:

  • ɛ: a small value that the human eye cannot detect, yet large enough to fool the neural network,

  • ∇x: the gradient of the cost function with respect to X,

  • X: the original input image,

  • Y: the ground-truth label corresponding to the input image,

  • J: the loss function,

  • θ: the parameters of our neural network model.
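Equation (1) can be illustrated numerically. The sketch below applies the FGSM update to a toy logistic model whose input gradient has a closed form (the model and values are illustrative only, not the paper's CNN):

```python
import numpy as np

def fgsm_step(x, y, w, b, eps):
    """X_adv = X + eps * sign(grad_x J): for a logistic model
    p = sigmoid(w.x + b) with cross-entropy loss J, the input
    gradient is dJ/dx = (p - y) * w."""
    p = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)

x = np.zeros(4)
w = np.array([1.0, -1.0, 2.0, 0.0])
x_adv = fgsm_step(x, y=0, w=w, b=0.0, eps=0.1)
# each input component moves by at most eps, in the direction
# that increases the loss
```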

Therefore, the training procedure can be modified to incorporate adversarial examples through the following steps (see Figure 16).

These experiments are performed using Python 3.9 on Spyder 4. The performance test methodology is based on the Average Accuracy Rate (AAR), computed using Equation (2): (2) \( \begin{equation} {\rm{AAR}} = \frac{{\#\ of\ numeral\ samples\ correctly\ recognized}}{{Total\ number\ of\ numeral\ samples}}. \end{equation} \)

An attention-based encoder-decoder (AED) architecture is used to implement the proposed work (Figure 12). The model receives an input image and, in the encoder stage, extracts the related features and produces prediction features for each RoI of the image.

Fig. 12.

Fig. 12. AED architecture. The encoder is a CNN, which extracts a feature sequences for the ADLA input images. The decoder predicts a label for each RoI and recognizes the equivalent sequences.

The encoder extracts the visual features of the input images into RoI structures through convolutional networks. The decoder recognizes the RoI contents by mapping between these feature sequences and predicting the equivalent label sequences. The data analysis covers three main problems: (1) language detection, (2) manuscript categories, and (3) RoI content recognition.


5 EXPERIMENTAL RESULTS

As illustrated before, we aim to determine the range of the PSM and OEM decoding parameters needed to improve the OCRing accuracy. We evaluate them independently in terms of the manuscript language, the manuscript category, the RoI types, table detection and recognition, and RoI detection and recognition.

For the manuscript language, the OCRing is evaluated using two languages (Arabic and English images). For this parameter (language ID), we observed a strong correlation between the language ID and the RoI orientation (as shown in Figure 7). Therefore, any OCR needs both the manuscript language and the manuscript orientation. The experimental results show that the improvement in final accuracy, with the corresponding decrease in word error rate (WER), is due to the manuscript category and language parameters. To test the manuscript languages with the related orientations, we used a small set of images from two different reference books (English and Arabic). The obtained results show a strong correlation between the language parameter and the accuracy (WER) of the two categories. Values with lower WER yield better recognition rates for both Arabic and English OCRing (see Table 3).

Table 3.
Language | Testing Images   | Average Accuracy | Average WER
English  | Distributed Book | 99.65%           | 0.35%
English  | Silk Brocade     | 98.35%           | 1.35%
Arabic   | Distributed Book | 94.74%           | 5.26%
Arabic   | Silk Brocade     | 93.27%           | 6.73%

Table 3. Average Accuracy and WER for Both English and Arabic OCRing Results

The most important observation is that this parameter can be integrated with the other observed parameters for further improvement of the OCRing results.

To study the other parameters related to RoI types, we evaluate the different parameter methods used during the testing phase in terms of WER and hence accuracy. The results are illustrated in Table 4, which shows results very close to those of the language parameter, with some improvement according to the language selection.

Table 4.
RoI Category | Method      | Fast-RCNN | Mask-RCNN
Table        | Zero-Shot   | 48.76%    | 49.50%
Table        | Fine Tuning | 70.35%    | 72.46%

Table 4. Arabic OCRing Results Using Fast and Mask RCNN

However, when we applied the new table-handling algorithms (detect, analyze, then recognize) based on the HAC and the FGSM, we obtained a considerable improvement in segmentation and recognition rates. This improvement depends on the algorithms used with the suitable RoI parameters.

Looking at the integrated evaluation metrics, we can observe the impact of the learning rates and of varying the hidden layers with the Fast-RCNN and Mask-RCNN. The WER (losses) increases as the learning rate decreases. The Mask-RCNN results are also surprising in the accuracy they achieve. Table 4 shows that decreasing the learning rate reduces the recognition accuracy.

The testing environment used in this article is an Intel Xeon 2.6 GHz with 128 GB RAM and an Nvidia GPU, running the Windows 10 operating system. The experiments are conducted on the Arabic and Latin printed, handwritten, and table datasets. Each image is analyzed (layout analysis) and segmented into RoI regions, taking into consideration all 15 layout analysis parameters. Consequently, the proposed solution is tested by connecting the Tesseract Python wrapper maintained by Matthias Lee [38] to our proposed OCRing model.

5.1 Table Detection and Recognition

The overall results analysis is largely correct, with few mistakes (see Figure 13). We plan to spend additional time improving the OCRing accuracy based on the PSM parameters; the most accurate parameter settings are modes 4 and 11. The output of the OCRing process stores the recognition results in a simple CSV file that serializes the recognized table texts.
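The CSV serialization step can be sketched as follows (a minimal helper of our own; the actual script additionally carries the Tesseract page-segmentation configuration, e.g. --psm 4 or --psm 11, noted above):

```python
import csv
import io

def serialize_table(rows):
    """Store recognized table text as CSV, one CSV row per table row."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

csv_text = serialize_table([["Item", "Qty"], ["silk", "3"]])
```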

Fig. 13.

Fig. 13. Sample output of applying multi-column layout. Each column is detected, and each cell in a row is assigned to a particular column.

A small dataset of Arabic data, such as RoI tables, was extracted from a book translated from English to Arabic. We tested a Fast-RCNN and a Mask-RCNN on this data (zero-shot, without training). Then, the model was fine-tuned using the presented HAC method. The fine-tuned HAC method achieves state-of-the-art performance (see Table 4).

5.2 Digit and Text Detection and Recognition

The FGSM adversarial-attack method, alongside the CNN architecture, is used to classify and recognize the Arabic numeral images. The average accuracy rate is calculated for each Arabic digit (0–9). The results indicate that the highest accuracy is achieved for the Arabic digit “7” (100%), whereas the lowest accuracy is achieved for the Arabic digit “1” (94%).

We use the Arabic OCRing recognition process to verify the results and observe the variation in two respects: (1) losses during training and (2) OCRing accuracy. The optimizer is used with epsilon values (0.0001–0.001) and a range of learning rates. Table 5 shows the different scores of the OCRing recognition task under the considered variations.

Table 5.
Used Category      | Learning Rate | Losses | Accuracy
Arabic Numerals    | LR = 1e-2     | 0.0340 | 0.9891
                   | LR = 1e-3     | 0.0373 | 0.9874
                   | LR = 1e-4     | 0.0380 | 0.9871
Handwritten images | LR = 1e-2     | 0.0450 | 0.9889
                   | LR = 1e-3     | 0.0310 | 0.9909
                   | LR = 1e-4     | 0.1108 | 0.9651

Table 5. Arabic OCRing Results with Considered Variations

Six experiments are implemented to investigate: (1) the language of RoI detection, (2) the input Arabic image category detection, (3) table detection, (4) title text detection, (5) figure detection, and (6) paragraph, line, and word detection.

We trained a Fast-RCNN and a Mask-RCNN model on our proposed work using the detection implementation.

To evaluate the behavior of the Arabic detection, Figure 10 shows the rendering of the input table image. The table has three columns, a single header row, and seven body rows. The table header has a simple structure.

Our model can detect the table RoI and the text RoI, making no errors in layout analysis (for text, table, and figure detections). The seventh body row is missing three cell detections (first, fourth, and fifth). However, at the table header level, there is a title detection error (because the header is written in the same font and style as the other table elements).

5.3 RoI Layout Type (Sub-type categories)

The structure ground-truth of the dataset includes many different RoI categories. We collected the five categories shown in Table 6: Title, Text, Table, Figure, and List.

Table 6.
  | RoIs Type | RoI Description
1 | Title     | Article title, section title, table title, and figure title.
2 | Text      | Paragraph in main text, table caption, table footnote, author affiliation.
3 | Table     | Main body of table elements, composed from multiple rows and columns.
4 | Figure    | Main body of figure. The whole figure is annotated as a single object.
5 | List      | Nested lists (i.e., a child list under an item of a parent list).

Table 6. Categories of RoIs Layout Included

Figure 14 illustrates an example of the layout analysis of an Arabic image. This image is analyzed by the proposed work and segmented into four detected RoIs: (1) Title: a block of text (the first text line and its bounding box), (2) Paragraph text: a block of grouped text lines, comprising six text lines, (3) Image: an image associated with a bounding box, and (4) a block of text for the second paragraph, with five lines.

Fig. 14.

Fig. 14. The layout analysis using the proposed work.

The performance of the Fast-RCNN and Mask-RCNN models on our test set is depicted in Table 7. Measured by the intersection over union (IoU) of the RoI bounding boxes, the generated layout analysis achieves accuracy above 90% for most categories.

Table 7.
RoI Category | Fast-RCNN | Mask-RCNN
Title        | 84.4%     | 85.2%
Text         | 91.0%     | 91.6%
Table        | 95.2%     | 96.3%
Figure       | 93.7%     | 94.5%
List         | 88.0%     | 88.6%

Table 7. Average Accuracy of Layout Analysis

Table 7 illustrates the average accuracy of the layout analysis results for Arabic printed documents, using the Fast and Mask RCNN models. As the accuracy shows, the model can generate layout analysis with good results. We believe some of the noise in the Arabic manuscripts generates errors. We will continue working to improve the quality of the proposed work.

5.4 The Proposed Convolutional Networks

A systematic view of our model is depicted in Figure 14. It is composed of three main modules: a feature extraction module, a feature merging module, and an output layer module. The feature extraction module can be a convolutional network (such as VGG-16) pre-trained on the ImageNet dataset. In our experiments the VGG-16 model is adopted, and feature maps are extracted after the second pooling layer.

We used a CNN pre-trained for classification, specifically the Visual Geometry Group network (VGG-16), as shown in Figure 11. Deeper CNNs are now considered better in training. For our ADLA layout analysis classification, the input image is represented as a 224 × 224 × 3 tensor. We use five convolutional blocks, then flatten the output of the last CNN layer into a rank-1 tensor. The final block of the network uses fully connected layers. The first layer learns edges; to activate more RoI shapes, the second layer finds patterns in the RoI boundaries.

The convolutional features have much smaller spatial dimensions than the original document (width and height decrease because of the pooling mechanism) but greater depth (the depth increases with the number of filters).
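This dimension change can be checked with simple arithmetic. Assuming VGG-16-style blocks, each ending in 2 × 2 max pooling with filter counts from 64 to 512 (a sketch of the shape bookkeeping, not the trained network):

```python
def block_output_shape(h, w, n_filters, pool=2):
    """After conv layers with 'same' padding plus 2x2 max pooling:
    spatial dims halve, depth becomes the number of filters."""
    return h // pool, w // pool, n_filters

h, w, d = 224, 224, 3
for filters in (64, 128, 256, 512, 512):
    h, w, d = block_output_shape(h, w, filters)
# the 224 x 224 x 3 input ends as a 7 x 7 x 512 feature map
```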

5.5 Real-time RoI Detection

This work also applies deep learning-based object detection in real time, to video streams and video files. Therefore, deep learning, RoI detection for video streaming, and frame-per-second measurement are applied in this work. A webcam video stream is used to apply RoI detection efficiently to each frame.

Our next task is to take these RoIs (bounding boxes) and classify them into the five desired ADLA categories of the proposed ground-truth.

The simplest method would be to take each region, crop it, and pass it through the pre-trained base network model. The extracted features can then serve as input for an object classifier, reusing the existing convolutional feature map. This is done by extracting fixed-size features for each RoI in order to classify it into one of the five categories of the ADLA dataset.

The simplest approach that can be used in object detection [32] is to pool the convolutional features into a fixed size by resizing them with a kernel.
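A naive sketch of such fixed-size feature pooling (our own simplification of RoI pooling; real detectors use an optimized operator):

```python
import numpy as np

def roi_pool(feature_map, out_h, out_w):
    """Max-pool an H x W x C feature map down to out_h x out_w x C
    by taking the channel-wise max over a coarse grid of bins."""
    h, w, c = feature_map.shape
    out = np.zeros((out_h, out_w, c))
    for i in range(out_h):
        for j in range(out_w):
            # bin boundaries; each bin covers at least one cell
            y0, y1 = i * h // out_h, max((i + 1) * h // out_h, i * h // out_h + 1)
            x0, x1 = j * w // out_w, max((j + 1) * w // out_w, j * w // out_w + 1)
            out[i, j] = feature_map[y0:y1, x0:x1].max(axis=(0, 1))
    return out
```

Whatever the cropped RoI's size, the output grid is fixed, so the pooled features can feed a fully connected classifier.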

The manual technique uses a developed computer tool with which the user marks the words and their coordinates. These coordinates are defined as (xi, yi, xj, yj), where (xi, yi) are the coordinates of the upper-left corner of the bounding box of the marked word, and (xj, yj) are the coordinates of its lower-right corner. Figure 13 shows a screenshot of the proposed tool for creating ground truth.

The proposed tool makes the manual effort more efficient in time and accuracy than ordinary methods. The first training dataset uses 28 books, and the selected object words from these books total 84,000 words. The output of this step (for each document file) consists of two generated files: (1) an ASR file describing the automated segment regions, containing all keyword box locations (x, y, width, height), and (2) a TXT file containing the segmented text itself.
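The conversion from the tool's corner coordinates to the (x, y, width, height) records in the ASR file is straightforward (the helper name is ours):

```python
def corners_to_box(xi, yi, xj, yj):
    """Upper-left (xi, yi) and lower-right (xj, yj) corners
    to the (x, y, width, height) record stored in the ASR file."""
    return xi, yi, xj - xi, yj - yi
```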

The selected dataset is developed to be used in the evaluation of document analysis, binarization, writer identification, and text line segmentation. The dataset includes 150 annotated documents of varying layout complexity. Regardless of the writing in these documents, the ground truth is based on polygons (not straight lines). The ground truth of this dataset covers (1) the main text, (2) decorations, and (3) text comments.

The ADLA is organized in the proposed work as a tree. Inside the ADLA there are five classes corresponding to the five categories: Early printed, Printed, Thesis, Calligraphy, and Handwritten. Each category contains a large number of images pertaining to the respective class.

A full CNN training script using Keras and additional Python libraries is implemented for the training and testing phases (see Figure 15).

Fig. 15.

Fig. 15. The used VGG architecture [32].

Fig. 16.

Fig. 16. ADLA splitting data into training and testing sets.

Python code and TensorFlow are used to define and invoke the library functions for reading the image dataset. The ratio of the training set to the testing set is 80%:20%. Keras and TensorFlow are used with the OCR model. We obtained 100% accuracy on the training set and 96% accuracy on the testing set. The training history is illustrated in Figures 17 and 18.
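The 80%:20% split can be sketched as follows (a generic shuffle-and-cut; the actual script may use a library utility instead):

```python
import random

def split_dataset(items, train_ratio=0.8, seed=42):
    """Shuffle items reproducibly and split into train/test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]

train, test = split_dataset(range(300))  # e.g., 300 ADLA images
```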

Fig. 17.

Fig. 17. Our training history in the testing set. It shows few signs of overfitting, which means that Keras and TensorFlow do well for the OCR task.

Fig. 18.

Fig. 18. Our training history in the testing set. It shows few signs of overfitting, which means that Keras and TensorFlow do well for the OCR task.

5.6 Experimental Results

We have a simple dataset structure for this test, consisting of the following ADLA description:

(1)

ADLA categories cover all types of Arabic documents: Early printed, Printed, News, Thesis, Calligraphy, and Handwritten documents. All these categories are scanned in JPG or PNG format.

(2)

ADLA ground-truth includes a simple file structure of our templates using PNG format, in addition to the XML description and metadata description (if needed).

The collected dataset is composed of the five categories of Arabic documents; they are irregular and complex. Most of them have imbalanced illumination, and their colors are not uniform. The collected document images total 300, with each category including 50 document images. The sizes of the images are not uniform or equal in dimensions. Accordingly, precision and recall are combined to calculate the F-measure (F) using the following standard criteria: (3) \( \begin{equation} {\rm{Precision}} = \frac{{TP}}{{TP + FP}}, \end{equation} \) (4) \( \begin{equation} {\rm{Recall}} = \frac{{TP}}{{TP + FN}}, \end{equation} \) (5) \( \begin{equation} {\rm{F}} - {\rm{Measure}} = \frac{{\left( {1 + {I^2}} \right) \times {\rm{Precision}} \times {\rm{Recall}}}}{{{I^2} \times {\rm{Precision}} + {\rm{Recall}}}}, \end{equation} \)where:

  • TP is the number of extracted Arabic text regions segmented correctly (True Positives).

  • FP is the number of extracted Arabic text regions segmented incorrectly (False Positives).

  • TN is the number of extracted Arabic non-text regions segmented correctly (True Negatives).

  • FN is the number of extracted Arabic non-text regions segmented incorrectly (False Negatives), i.e., the number of undetected regions.

  • I is an impact (weighting) factor between precision and recall; by default it is set to 1. This factor varies according to the Arabic image categories and the illumination of the document.
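Equations (3)–(5) can be computed directly, with I as the weighting factor defaulting to 1:

```python
def f_measure(tp, fp, fn, i=1.0):
    """Precision, recall, and the I-weighted F-measure."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    i2 = i * i
    f = (1 + i2) * precision * recall / (i2 * precision + recall)
    return precision, recall, f

p, r, f = f_measure(tp=80, fp=20, fn=20)
# with equal precision and recall (0.8), F also equals 0.8
```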

Table 8 shows the experimental results. It is evident that decreases in precision are driven by the complexity of the language, the category type, the orientation (skew) of the layout, and the document image quality, whereas increases in the measured value relate to image quality, noise correction, and color stability.

Table 8.
Image Category | Document Script Evaluation
               | Language | Doc. Category | RoIs
Early printed  | 99       | 99            | 99
Printed        | 100      | 99            | 100
Thesis         | 100      | 98            | 100
Calligraphy    | 94       | 92            | 93
Handwritten    | 92       | 89            | 87

Table 8. Table Layout Analysis for Arabic Manuscripts

The test data used here are taken from several collected sources. The first comes from a book (Distributed Design, >1,200 pages) translated into Arabic at KAU university. The second comes from a book published in English and Arabic (The Book of Silk Brocade). The third is a private dataset used for manuscript category verification (see Figure 19).

Fig. 19.

Fig. 19. Our OCRing recognition of the testing phase, using English and Arabic documents (The Book of Silk Brocade).

We evaluated five fine-tuning approaches. Table 6 illustrates the comparative performance across the five categories. Fine-tuning the printed, thesis, and early-printed models outperforms the other fine-tuned models; the exceptions are the pre-trained calligraphy and handwritten models fine-tuned with the CNN network. In addition, the improvement in line and word detection using the RCNN model is relatively low for line and word segmentation in calligraphy documents. We attribute this to the smaller amount of ground-truth knowledge and the greater variability of these images.

Initially, we used a simple network consisting of a 3,072-unit input, two hidden layers of 1,024 and 512 units, and an output layer with five categories, built with Keras. The CNN architecture in Keras is defined with one input layer (for the simple network: 32 × 32 × 3 = 3,072), multiple hidden layers, and one output layer. The final output layer classifies the Arabic manuscript into one of the five categories. Categorical cross-entropy was used as the loss for all networks trained to perform classification. The deep learning process used a learning rate of 0.01 and 80 epochs, with batch size 32 and then 64 (for the simple network). We then trained a deep learning model using our training data and a compiled model, and used the testing data to predict the manuscript categories and generate a classification report.
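A minimal numpy sketch of this simple network's forward pass (the real model is built and trained in Keras; the weights here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [3072, 1024, 512, 5]  # 32*32*3 input, two hidden layers, 5 categories
weights = [rng.standard_normal((m, n)) * 0.01 for m, n in zip(sizes, sizes[1:])]

def forward(x):
    """ReLU hidden layers, softmax over the five manuscript categories."""
    for w in weights[:-1]:
        x = np.maximum(x @ w, 0.0)
    logits = x @ weights[-1]
    e = np.exp(logits - logits.max())
    return e / e.sum()

probs = forward(np.zeros(3072))  # a blank image scores 0.2 per category
```

Categorical cross-entropy is then the negative log of the probability assigned to the true category, which is what the Keras loss computes during training.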

It is now clear that our Arabic manuscript detector correctly classifies the different manuscript categories on the training and testing sets with 100% accuracy. The results illustrate the advantage of the printed, thesis, and early-printed categories in document layout analysis and OCRing recognition (Tables 9 and 10).

Table 9.
New Parameters         | Value         | CV-New | Fast-RCNN | Mask-RCNN
Language               | English       | 100    |           |
                       | Arabic        | 99.4   |           |
Skewing or Orientation | English       | 100    |           |
                       | Arabic        | 99.2   |           |
Arabic Manuscripts     | Early printed | 99     |           |
                       | Printed       | 100    |           |
                       | Thesis        | 100    |           |
                       | Calligraphy   | 93     |           |
                       | Handwritten   | 87     |           |
RoIs types             | Text          | 90.00  | 91.0%     | 95.6%
                       | Table         | 91.73  | 93.2%     | 95.3%
                       | Figure        | 90.65  | 93.7%     | 94.5%

Table 9. Fine Tuning Performance Analysis Between the Four Parameters of the OCRing Framework

Table 10.
RoIs types | CV-New WER | CV-New Accuracy | Fast-RCNN WER | Fast-RCNN Accuracy | Mask-RCNN WER | Mask-RCNN Accuracy
Text       | 10.00      | 90.00           | 9.0%          | 91.0%              | 4.2%          | 95.8%
Table      | 8.27       | 91.73           | 6.8%          | 93.2%              | 4.7%          | 95.3%
Figure     | 9.35       | 90.65           | 6.3%          | 93.7%              | 5.5%          | 94.5%

  • It should be noted that the Mask-RCNN achieves better accuracy for all types of the RoIs.

Table 10. Comparison of Our Proposed Models with the State-of-the-art



6 CONCLUSION

In this work, we proposed a powerful Arabic layout analysis to be used in an Arabic OCR recognition system. Fourteen Layout Analysis Modes (LAMs, as parameters) are presented using the Page Segmentation Methods (PSM) and used to improve the OCRing recognition, in addition to the OCR Engine Mode (OEM) with four parameters. We reviewed the recognition of the proposed Arabic documents based on the proposed layout analysis and a computer vision module, with high recognition accuracy. Using the computer vision approach, noise removal, skew detection, and de-skew correction were performed with high-quality results. Accordingly, the proposed solution overcame the presence of overlapping and connected text and irregular gaps. Recognition required extra padding in 25% of the skewed images. Arabic OCR accuracy was good for standard-size Arabic words but degraded for smaller Arabic words, owing to weaker similarity with the ground truth. In this respect, our Arabic OCR system remains far from perfect for small-size written manuscripts.

Many approaches are presented to enhance Arabic manuscript region detection and OCRing accuracy. A high-quality ground-truth dataset is recommended. In addition, existing state-of-the-art RoI detection algorithms such as HAC, FGSM, and computer vision methods can be merged with the RoI category at this stage. These RoI algorithms detect document layout, titles, texts, tables, and figures, and thereby enhance the OCRing accuracy. We investigated an extension of the Page Segmentation Method (PSM) to enhance the OCRing parameter modes and to enhance the learning system through a reinforcement strategy, which effectively improves OCRing accuracy. It achieved a significant improvement in OCRing results due to three parameters: language identification, document category, and RoI types (Table, Title, Paragraph, Figure, and List). On average it improved the accuracy from 89% to 96% for the whole document.

We implemented an Arabic OCR capable of locating, extracting, and recognizing Arabic/Latin text in Arabic documents with accuracy approaching 100%. In future work, we plan to extend the ADLA into a large dataset for other document analysis tasks with deep neural networks. Further analysis includes relations between the layout of document elements, which can be used to derive a logical structure for the documents.

In the future, we plan to extend our PSM framework to include more categories of manuscript objects. Besides, we will consider a finer-grained annotation format rather than rectangular bounding boxes.


REFERENCES

  1. [1] Adam K., Baig A., Al-Maadeed S., Bouridane A., and El-Menshawy S. 2018. KERTAS: A dataset for automatic dating of ancient Arabic manuscripts. Int. J. Doc. Anal. Recogn. 21 (2018), 283–290.
  2. [2] Johnston M. and Van Dussen M. 2015. The Medieval Manuscript Book: Cultural Approaches. Chapter 1, Introduction: Manuscripts and cultural history. Cambridge University Press.
  3. [3] Shi B., Bai X., and Yao C. 2016. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 39, 11 (2016), 2298–2304.
  4. [4] Rosebrock A. 2021. Optical Character Recognition with OpenCV, Tesseract, and Python: Introduction to OCR Bundle, 1st ed. PyImageSearch.
  5. [5] Rosebrock A. 2021. Optical Character Recognition with OpenCV, Tesseract, and Python: OCR Practitioner Bundle, 1st ed. PyImageSearch.
  6. [6] Yousfi S., Berrani S., and Garcia C. 2017. Contribution of recurrent connectionist language models in improving LSTM-based Arabic text recognition in videos. Pattern Recogn. 64 (2017), 245–254.
  7. [7] Doval Y. and Gómez-Rodríguez C. 2019. Comparing neural- and N-gram-based language models for word segmentation. J. Assoc. Inform. Sci. Technol. 70, 2 (2019), 187–197.
  8. [8] Yousfi S., Berrani S.-A., and Garcia C. 2015. Deep learning and recurrent connectionist-based approaches for Arabic text recognition in videos. In Proceedings of the International Conference on Document Analysis and Recognition. 1026103.
  9. [9] Karthikeyan S., Herrera A., Doctor F., and Mirza A. 2020. An OCR post-correction approach using deep learning for processing medical reports. IEEE Trans. Circ. Syst. Video Technol. Retrieved from https://ieeexplore.ieee.org/document/9448197.
  10. [10] Clausner C., Papadopoulos C., Pletschacher S., and Antonacopoulos A. 2015. The ENP image and ground truth dataset of historical newspapers. In Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR'15). IEEE, 931–935.
  11. [11] Clausner C., Antonacopoulos A., and Pletschacher S. 2017. ICDAR 2017 competition on recognition of documents with complex layouts. In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR'17). IEEE, 1404–1410.
  12. [12] Zhong X., Tang J., and Jimeno-Yepes A. 2019. PubLayNet: Largest dataset ever for document layout analysis. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR'19).
  13. [13] Tran D. N., Tran T. A., Oh A., Kim S. H., and Na I. S. 2015. Table detection from document image using vertical arrangement of text blocks. Int. J. Contents 11, 4 (2015), 77–85.
  14. [14] He D., Cohen S., Price B., Kifer D., and Giles C. L. 2017. Multi-scale multi-task FCN for semantic page segmentation and table detection. In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR'17). IEEE, 254–261.
  15. [15] Kavasidis I., Palazzo S., Spampinato C., Pino C., Giordano D., Giuffrida D., and Messina P. 2018. A saliency-based convolutional neural network for table and chart detection in digitized documents. arXiv:1804.06236.
  16. [16] Liang X., Cheddad A., and Hall J. 2021. Comparative study of layout analysis of tabulated historical documents. Big Data Res. 24 (2021).
  17. [17] Saberi A., Motamedi S., Shamshirband S., Kausel C., Petković D., Endut E., Ahmad S., Hashim R., and Roy C. 2016. Evaluating the legibility of decorative Arabic scripts for Sultan Alauddin mosque using an enhanced soft-computing hybrid algorithm. Comput. Hum. Behav. 55 (2016), 127–144.
  18. [18] Binmakhashen G. M. and Mahmoud S. A. 2020. Document layout analysis: A comprehensive survey. ACM Comput. Surveys 52, 6 (Jan. 2020), 1–36.
  19. [19] Al-Barhamtoshy H. M. and Rashwan M. A. 2014. Arabic OCR segmented-based system. Life Science J. 11, 10 (2014), 1273–1283.
  20. [20] Al-Barhamtoshy H. M., Jamb K., Ahmed H., Mohamed S., Abdou S., and Rashwan M. 2019. Arabic calligraphy typewritten and handwritten using OCR system. Biotech. Res. Commun. 12, 2 (2019), 283–296.
  21. [21] Hesham A. M., Abdou S., Badr A., and Al-Barhamtoshy H. M. 2017. Arabic document layout analysis. Pattern Analysis and Applications.
  22. [22] Al-Barhamtoshy H. M. 2016. Toward large-scale image similarity discovery model. In Proceedings of the 2nd International Conference on Advanced Technologies for Signal & Image Processing (ATSIP'16).
  23. [23] Al-Barhamtoshy H. M., Jambi K. M., Abdou S. M., and Rashwan M. A. 2021. Arabic documents information retrieval for printed, handwritten, and calligraphy image. IEEE Access. Retrieved from https://ieeexplore.ieee.org/abstract/document/9380437.
  24. [24] Rosebrock A. 2020. OCR a document, form, or invoice with Tesseract, OpenCV, and Python. Retrieved from https://www.pyimagesearch.com/2020/09/07/ocr-a-document-form-or-invoice-with-tesseract-opencv-and-python/.
  25. [25] Rosebrock A., Thanki A., Paul S., and Haase J. 2020. OCR with OpenCV, Tesseract, and Python. PyImageSearch. Retrieved from https://www.pyimagesearch.com/books-and-courses/.
  26. [26] Schreiber S., Agne S., Wolf I., Dengel A., and Ahmed S. 2017. DeepDeSRT: Deep learning for detection and structure recognition of tables in document images. In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR'17). IEEE, 1162–1167.
  27. [27] Staar P. W. J., Dolfi M., Auer C., and Bekas C. 2018. Corpus conversion service: A machine learning platform to ingest documents at scale. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'18). ACM, 774–782.
  28. [28] Clausner C., Antonacopoulos A., and Pletschacher S. 2019. ICDAR2019 competition on recognition of documents with complex layouts (RDCL2019). Retrieved from https://www.primaresearch.org/www/assets/papers/ICDAR2019_Clausner_RDCL2019.pdf.
  29. [29] Ren S., He K., Girshick R., and Sun J. 2016. Faster R-CNN: Towards real-time object detection with region proposal networks. Retrieved from https://arxiv.org/pdf/1506.01497.pdf.
  30. [30] Ancient Islamic Manuscripts. 2020. Retrieved from https://foliosltd.com/product/arabic-manuscript-work-on-philosophy-and-mysticism/.
  31. [31] PDFMiner. 2021. Retrieved from https://github.com/euske/pdfminer.
  32. [32] Astanin S. 2021. Tabulate, PyPI. Retrieved from https://pypi.org/project/tabulate/.
  33. [33] Mukherjee S., Oates T., DiMascio V., Jean H., Ares R., Widmark D., and Harder J. 2020. Immigration document classification and automated response generation. arXiv:2010.01997.


      • Published in

        ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 1 (January 2023), 340 pages.
        ISSN: 2375-4699; EISSN: 2375-4702; DOI: 10.1145/3572718

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher: Association for Computing Machinery, New York, NY, United States

        Publication History

        • Received: 8 September 2021
        • Revised: 6 February 2022
        • Accepted: 16 April 2022
        • Online AM: 29 April 2022
        • Published: 13 February 2023
