Book Recommendation System based on Course Descriptions using Cosine Similarity

Ensuring the retrieval of books that match users' preferences is of paramount importance. A significant challenge users encounter is uncertainty regarding their choice of search terms, often stemming from a limited understanding of the content or exposure to new concepts. Offering users results that closely resemble their query represents one potential solution. This research aims to suggest books relevant to students' course topics, utilizing cosine similarity to compute similarity values between the query and each document in the collection. Performance evaluation using a similarity threshold greater than 0.1 revealed that the retrieved book results achieved an average precision of 0.7 and a recall of 0.73, indicating substantial alignment with the search terms. The anticipated benefits of the recommendation system encompass the elimination of the need for manual book suggestions by staff, the provision of personalized book recommendations tailored to readers' preferences, a deeper understanding of library user behavior, and the effective promotion of new books that align with users' interests.


INTRODUCTION
The utilization of books within libraries should be adapted in accordance with technologies that suit the changing behaviors of students. Almost every library nowadays employs various digital resources, such as e-books, e-journals, e-theses, and various databases, offered in digital file formats. This enhances the convenience of accessing resources and meets the evolving needs of students. Additionally, many libraries incorporate other technological solutions to further enhance their services. A system for searching books related to specific subjects is a crucial service, as it allows users to search for information directly related to their subjects or aligned with what they are seeking, increasing the precision of results. Another challenge faced by researchers and learners is not knowing which keywords to use for information searches, which can lead to retrieval results that do not match what they are looking for. However, sufficiently advanced AI can provide accurate retrieval results, such as books, e-theses, or other materials related to a search term, delivered automatically through the library system for the reader to explore further. To address these problems, several works have been developed, such as the Developing Book Recommendation Service using Data Mining and Augmented Reality Technology [1], a Recommender System for E-library using Collaborative Filtering and User Profiles [2], and A Development of Computer Books Recommender System based on Semantic Content-based Filtering with Book Styles [3]. Text mining is used to analyze and categorize book-related data, allowing for the identification of meaningful patterns and relationships. By employing techniques such as cosine similarity [4], association rules, knowledge graphs, and ontology-based structuring, libraries can enhance their services and provide more accurate book recommendations to users. This not only aids users in finding relevant books but also improves the overall user experience through well-designed UI/UX interfaces [5]. Therefore, this research aims to develop a book recommendation system based on course descriptions using cosine similarity. The goal is to provide users with retrieval results that are closely related to their search terms, extract relevant meanings, and establish enhanced knowledge connections.

RELATED WORKS
An effective book recommendation system based on course descriptions using cosine similarity should be capable of providing tailored book suggestions aligned with users' preferences, while also demonstrating increased intelligence. This is achieved by suggesting content that aligns seamlessly with users' needs. To accomplish this, the system incorporates books' titles, tables of contents, and introductions, utilizing text mining techniques to manage and process keywords. Various keywords are then analyzed to uncover relationships, generating results for users through the application of cosine similarity.

Course Description
The courses developed in this instance are part of the undergraduate curriculum in Computer Technology in the Department of Computer Education at King Mongkut's University of Technology North Bangkok. The curriculum includes a total of 26 core courses, encompassing subjects such as Computer and Programming, Computer System Organization, Electronic Device and Instrument, and Basic Computer for Education. These courses have been selected to provide comprehensive education to students.

Token Filtering
Token filtering is the process of filtering out any tokens that are not useful for the application. The token filtering process eliminates digits, punctuation marks, stop-word tokens, and other unnecessary tokens in the text.

Lemmatization
Lemmatization is the process of resolving a term to its lemma, its base form according to its part of speech. Lemmatization transforms a word into its proper root form with the help of a part-of-speech tagger.

WordNet
WordNet is used to refine words so that they are contextually relevant. For instance, if adjectives or verbs need to be transformed into nouns, WordNet can be employed. This involves finding the related root words and modifying them according to the desired context.

Books
The categorization of books within the central library employs the Library of Congress Classification (LC) system. Bibliographic records are entered into the automated library system, adhering to the Anglo-American Cataloguing Rules (AACR2) and Machine-Readable Cataloging (MARC21) standards to ensure uniform cataloging practices. The books referred to in this context are Thai-language books stored in the library and available for sale in bookstores. This approach is taken because certain subjects, such as Machine Learning, Artificial Intelligence, and Big Data, lack a significant number of books. Instead of relying solely on e-books, physical books from the library collection are utilized for text mining purposes.
To supplement this, books used for text mining are identified based on their titles. Additional data is acquired by scanning the tables of contents and introductions, which are then transformed into text format to increase the volume of available information.

Text Mining
Text preprocessing is predominant in text data mining for dimensionality reduction. Once the text is available in the knowledge database, it should be preprocessed before implementing the machine learning model. It is essential to perform preprocessing with the essential steps, namely tokenization, lowercasing, filtering, and lemmatization. The documents available in the knowledge database are described as vectors in a multi-dimensional space, where every single word has a unique dimension, as discussed in Table 1 [6].

Feature Extraction
Text feature extraction involves building a dictionary of terms from the textual data and then converting the text into a feature set usable by the process, employing the techniques shown in the processing steps in Table 1.

DeepCut
DeepCut is a tool for word segmentation in the Thai language. It employs neural network techniques for processing and is part of the DeepNLP project, which focuses on natural language processing using deep learning technology.
Word segmentation in Thai is complex due to the absence of spaces between words, unlike in English, so accurate and meaningful segmentation is a challenging task. DeepCut utilizes neural networks to learn the patterns and characteristics of Thai words, enabling it to accurately segment words within sentences. It excels in handling complex cases, such as nouns related to adjectives, conjugated verbs, and words with specific meanings in Thai, resulting in precise word segmentation and word ordering within sentences [7].

The Vector Space Model
The Vector Space Model (VSM) is a widely used model in information retrieval systems due to its simplicity of processing and interpretation. The model represents each document as a vector of term weights in a multi-dimensional space. The derived words need to undergo normalization, which includes eliminating less significant and overly frequent words, along with assigning weights to the remaining words. This representation is referred to as the Vector Space Model, and it is sometimes known as a Bag of Words, based on the document format it presents [3,5,8]. In the Vector Space Model, the weight of each word in a document is calculated using the TF-IDF principle (Term Frequency-Inverse Document Frequency), introduced by Salton, Wong, and Yang in 1975. The weight is expressed as the product of the Term Frequency (TF) and the Inverse Document Frequency (IDF), as shown in Equation (1).
tf-idf(t_i, d) = tf(t_i, d) × log(N / df_i)    (1)

where d represents a document, tf(t_i, d) signifies the frequency of term t_i in document d, N stands for the total number of documents in the document corpus, and df_i represents the number of documents in which term t_i appears.
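As a minimal sketch, Equation (1) can be expressed directly in plain Python. The term counts below are hypothetical (not taken from the paper's corpus), and the logarithm base is an assumption, since TF-IDF is commonly computed with the natural log:

```python
import math

def tf_idf(tf, n_docs, df):
    """TF-IDF weight: term frequency times log of inverse document frequency."""
    return tf * math.log(n_docs / df)

# Hypothetical corpus of 751 documents (the paper's collection totals 751 books):
# a term occurring 3 times in a document but appearing in only 10 documents
# overall scores higher than one occurring 3 times but found in 700 documents.
rare = tf_idf(3, 751, 10)
common = tf_idf(3, 751, 700)
assert rare > common
# A term appearing in every document carries zero weight, since log(N/N) = 0.
assert tf_idf(5, 751, 751) == 0.0
```

This illustrates why TF-IDF down-weights ubiquitous terms: the IDF factor shrinks toward zero as a term appears in more documents.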

Cosine Similarity
Cosine similarity measures angular similarity by calculating the similarity value of terms within each document present in a document collection, using the formula shown in Equation (2) [3,4]. The cosine similarity cos(a, b) between two vectors a and b is:

cos(a, b) = (a · b) / (‖a‖ ‖b‖)    (2)

where a · b represents the dot product of vectors a and b, ‖a‖ represents the Euclidean norm (magnitude) of vector a, and ‖b‖ represents the Euclidean norm of vector b. In text analysis and information retrieval, this formula is used to measure the similarity between two vectors representing documents or terms, often to determine how similar their content or meanings are. Higher cosine similarity values indicate greater similarity between the vectors.
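Equation (2) can be sketched in plain Python as follows (a minimal illustration; production systems would typically use a vectorized library implementation):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors (Equation 2)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)

# Vectors pointing the same direction give 1.0; orthogonal vectors give 0.0.
assert abs(cosine_similarity([1, 2, 3], [2, 4, 6]) - 1.0) < 1e-9
assert cosine_similarity([1, 0], [0, 1]) == 0.0
```

Note that cosine similarity depends only on direction, not magnitude, which is why it is well suited to comparing documents of different lengths.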

Performance Evaluation
Precision and recall are two important metrics used to evaluate the performance of information retrieval systems.
• Precision is a measure of how many of the retrieved items were actually relevant to the user's query:

Precision = (number of relevant items retrieved) / (total number of items retrieved)

• Recall is a measure of how many of the relevant items in the collection were retrieved by the system:

Recall = (number of relevant items retrieved) / (total number of relevant items in the collection)
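Both metrics can be computed from the sets of retrieved and relevant items. The sketch below uses hypothetical book IDs, not the paper's actual evaluation data:

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a ground-truth relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: 10 books retrieved, 7 of them among the 10 relevant books.
p, r = precision_recall(retrieved=range(10), relevant=list(range(7)) + [20, 21, 22])
assert p == 0.7 and r == 0.7
```

Precision penalizes irrelevant retrievals, while recall penalizes relevant books the system missed; a threshold on the similarity score trades one against the other.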

Data Collection
Related books were collected by searching Thai-language websites (227 books) and from the university library by selecting the QA and TK categories (524 books), for a total of 751 books. The books were then linked to the matching subjects in the computer technology curriculum, 26 subjects in total, as shown in Table 2.
Book information includes the ISBN, title, author, and description; the introduction or explanation, to convey the importance and main content of the book; the table of contents, to indicate the content and topics covered; and the year it was written.
The data processing procedure began with scanning and using Optical Character Recognition (OCR) technology to convert book images into readable text.During this step, accuracy was verified to ensure the quality and reliability of the obtained information.

Text Mining Process
Text mining in the Thai language is a challenging and intricate process due to the unique characteristics of the Thai script. Unlike many other languages, Thai lacks clear spaces between words, posing a significant obstacle when processing textual data. The complex script structure of the Thai language connects characters without distinct spacing, demanding thorough consideration during text analysis.

Data Cleansing
The initial step in the text mining process involves data preprocessing, which encompasses data cleansing tasks such as removing HTML tags and special characters, as well as eliminating irrelevant or duplicated information. The textual data is refined to ensure consistent comparability, including converting text to lowercase for accurate comparisons.
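The cleansing step can be sketched with regular expressions. This is a minimal illustration; the character classes, including the Thai Unicode block U+0E00–U+0E7F, are assumptions rather than the paper's exact rules:

```python
import re

def clean_text(raw):
    """Remove HTML tags and special characters, collapse whitespace, lowercase."""
    text = re.sub(r"<[^>]+>", " ", raw)                        # strip HTML tags
    text = re.sub(r"[^0-9A-Za-z\u0E00-\u0E7F\s]", " ", text)   # keep Thai, Latin, digits
    return re.sub(r"\s+", " ", text).strip().lower()

assert clean_text("<p>Machine Learning!</p>") == "machine learning"
```

Lowercasing only affects the Latin-script portions of the text; Thai has no letter case, so it passes through unchanged.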

Tokenization
Next is the process of tokenization, where the text is segmented into smaller units known as tokens. Tokens can be words or subword units. Given the Thai language's lack of clear spacing, tokenization is challenging, so employing effective tokenization techniques is crucial. This research utilized the PyThaiNLP library, which is designed for natural language processing tasks in the Thai language.
Throughout this research, various tokenization methods were compared, and PyThaiNLP's DeepCut was found to provide satisfactory results aligned with the research requirements; its use of deep learning models significantly improves the accuracy of tokenization. This work is significant for enhancing the quality and efficiency of text mining in the Thai language, and the findings can be applied to analyze text data, producing high-quality, usable results for further research and development.

Removing Unnecessary Words
To eliminate unnecessary, non-contributing stop words, English stop-word data from the NLTK (Natural Language Toolkit) library were employed; this toolkit is designed to support natural language processing tasks, and its English stop-word list consists of 179 words. For the Thai language, stop-word data from the PyThaiNLP library were utilized, encompassing a total of 1,030 stop words. This research presents a collaborative approach combining tokenization and stop-word removal.
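A minimal sketch of the stop-word removal step follows. The tiny stop list below is illustrative only, standing in for NLTK's 179 English stop words and PyThaiNLP's 1,030 Thai stop words:

```python
# Illustrative miniature stop list (the real lists come from NLTK and PyThaiNLP).
STOPWORDS = {"the", "is", "a", "an", "of", "and", "to", "in"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop stop words from a token list, comparing case-insensitively."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ["The", "vector", "space", "model", "is", "a", "model", "of", "retrieval"]
assert remove_stopwords(tokens) == ["vector", "space", "model", "model", "retrieval"]
```

Removing stop words before vectorization keeps the TF-IDF dimensions focused on content-bearing terms.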

Lemmatization for Standardization
Lemmatization utilizes the call lemmatizer.lemmatize(word, get_wordnet_pos(word)), where the word and its part of speech are the arguments. After words are tokenized and stop words are filtered, the resulting words are passed to the get_wordnet_pos function. This function determines the part-of-speech category of each word using WordNet, a lexical database employed for managing words and their meanings in the language.
Once the word and its part of speech are identified for lemmatization, the lemmatize function transforms the word into its base form based on its identified part-of-speech category. In essence, when these words are input into the lemmatize function, it returns the lemmatized form of the word, completing the process for the program's use.
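The get_wordnet_pos step commonly maps Penn Treebank tags (as returned by a POS tagger such as nltk.pos_tag) onto WordNet's four part-of-speech codes. The mapping below is a common implementation pattern, not the paper's exact code:

```python
# WordNet part-of-speech codes as accepted by NLTK's WordNetLemmatizer.
WN_ADJ, WN_VERB, WN_NOUN, WN_ADV = "a", "v", "n", "r"

def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank POS tag to a WordNet part-of-speech code."""
    if treebank_tag.startswith("J"):
        return WN_ADJ   # JJ, JJR, JJS -> adjective
    if treebank_tag.startswith("V"):
        return WN_VERB  # VB, VBD, VBG, ... -> verb
    if treebank_tag.startswith("R"):
        return WN_ADV   # RB, RBR, RBS -> adverb
    return WN_NOUN      # default: WordNetLemmatizer assumes noun

assert get_wordnet_pos("VBD") == "v"
assert get_wordnet_pos("NNS") == "n"
```

With NLTK available, lemmatizer.lemmatize(word, get_wordnet_pos(tag)) would then return the base form, so that, for example, a verb tagged VBG such as "running" is reduced to "run".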

Content-Based Filtering Approach
Creating a book recommendation system using Content-Based Filtering is a step-by-step process that focuses on selecting and recommending books based on the characteristics and features of the books' content. The technique aims to create a list of recommendations that is as relevant to the interests and needs of the user as possible. The process has the following steps.

Merging Search Terms and Book Data
In this step, the search terms that have been processed through text mining are merged with the pre-processed tokens of the books. The search-term tokens and book tokens are the results of the previous steps. This merge is an important step in enabling comparison of the search terms against the features of the books.

Extracting Data Features
After merging the search terms and book data, the search-term tokens and the book tokens go through feature extraction for use in the analysis process. In this process, the TF-IDF vectorizer is used to create a vector of unique words in the text dataset.

Measuring Similarity Between Search Terms and Books Using Cosine Similarity
After creating the vectors of search terms and books with the TF-IDF vectorizer in the previous step, the similarity between search terms and books can be measured using the cosine similarity technique.
Cosine similarity measures the similarity between vectors by finding the cosine of the angle between them; in this case, the vector of the user's search terms is compared with the vector of each book. Once the similarity between the search terms and all the books has been calculated, a similarity value for each book is obtained. This value can be used in the book recommendation process by sorting the books by similarity from most to least, or by applying a threshold to filter the list down to the books most similar to the user's search terms.
This process gives users book recommendations that are relevant to their interests and needs, relying on the similarity of the books' content and features. Content-Based Filtering is suitable when recommendations should be driven primarily by content data, and it can be flexibly customized to meet the needs of the recommendation system.
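Putting the steps together, the Content-Based Filtering pipeline described above can be sketched end to end in plain Python. This is a simplified stand-in for a library TF-IDF vectorizer; the toy token lists are hypothetical, and a real system would feed in tokenized Thai text from the earlier steps:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors (term -> weight dicts) for a list of token lists."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                      # document frequency per term
    vecs = [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]
    return vecs, df, n

def query_vector(tokens, df, n):
    """Weight the query with the corpus statistics; unseen terms are skipped."""
    return {t: c * math.log(n / df[t])
            for t, c in Counter(tokens).items() if df[t] > 0}

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(query_tokens, book_tokens, threshold=0.1):
    """Rank books by similarity to the query, keeping scores above the threshold."""
    vecs, df, n = tfidf_vectors(book_tokens)
    q = query_vector(query_tokens, df, n)
    ranked = sorted(((i, cosine(q, v)) for i, v in enumerate(vecs)),
                    key=lambda s: s[1], reverse=True)
    return [(i, s) for i, s in ranked if s > threshold]

# Toy collection of three "books" represented as token lists.
books = [["machine", "learning", "algorithm"],
         ["thai", "cooking", "recipe"],
         ["deep", "learning", "network"]]
results = recommend(["machine", "learning"], books)
assert results and results[0][0] == 0   # the machine-learning book ranks first
```

With the default threshold of 0.1 (the value evaluated in the paper), only books sharing distinctive terms with the query survive the filter; lowering the threshold trades precision for recall, consistent with the evaluation reported below.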

Evaluation Results
To evaluate the book recommendation system, it is necessary to create an answer list of keywords for testing. In this experiment, 7 key terms and 3 subjects (10 keywords in total) were used to test the cosine filter. The top-20 results yield a precision of 0.53 and a recall of 0.82, whereas the cosine filter with a threshold > 0.1 yields a precision of 0.7 and a recall of 0.73, as shown in Table 3.
This testing process evaluates how well the system ranks recommended books by cosine similarity score against the user's keywords. The goal is to provide users with a list of books that align closely with their interests.

CONCLUSIONS
In this research, part of the book was collected from websites.another part comes from the library.Optical character recognition (OCR) is used to convert images to text.A total of 751 books were collected in the experiment.The process of text mining using cosine similarity is used to find the relationship of similar words between keyword and course descriptions, then evaluated by using precision and recall with equal 0.7 and 0.73 respectively.It can be concluded that, the system can be used to recommend book well using search terms.Therefore, it can be applied to a book recommendation system in the library.
The anticipated benefits from the recommendation system include eliminating the need for staff to manually suggest books, providing book recommendations tailored to the readers' preferences, understanding the behavior of library users, and effectively promoting new books that align with users' interests a cosine similarity was used to measure data similarity.

Table 2 :
Number of Books per Subject

Table 3 :
Similarity Scores for Sample Search Terms with Top 20 and Threshold > 0.1