A Study of Online Reviews for Homestays in Guangdong Based on Text Mining

Economic development has driven the growth of the domestic travel industry, and the homestay industry has surged in recent years. With advances in information technology and the spread of social platforms, online accommodation booking has become a mainstream method. Against this background, this research analyzes user reviews on homestay websites. First, we obtain review data for 10 homestays in Guangdong using web crawler technology. Next, the data is pre-processed through steps such as de-duplication and tokenization. Finally, the results are presented as tables and word clouds to identify the factors behind positive and negative comments and the specific aspects that most urgently need improvement after tourists' stays, which can serve as a reference for decision-making. The results show that each homestay has its own unique advantages, but that some problems also leave travelers dissatisfied. Through the analysis, we identify what makes travelers dissatisfied with homestays and propose corresponding solutions for homestay operators to consult.


INTRODUCTION
In recent years, travel has gradually become a major form of entertainment for the public, and residents' demand for travel accommodation is becoming increasingly segmented. Homestays, which are rich in human warmth and offer good prices, have gradually become a popular trend in travel accommodation and are favored by different types of consumers. Since July 2016, the number of searches for homestays has been growing, and it even surpassed the number of searches for "hotels" in August. This indicates that homestays integrated with the tourism industry have considerable potential [1]. According to market research, as many as 90% of homestays are booked through online travel websites. On many travel booking websites, such as Ctrip.com and Qunar.com, homestays have become a major part of the accommodation on offer. More than 50% of orders come from these two websites.
Relevant statistics indicate that the number of online travel bookings in China grew rapidly during the period 2003-2017, and it is expected to continue to grow. The volume of online travel bookings was $131.39 billion in 2016 and $172.97 billion in 2017 [1]. The 31.6% year-on-year growth shows that the domestic online travel booking market has great potential for development [2].
Booking accommodation online has become the dominant method, and this has led to an explosion in the number of online reviews, which contain much useful information that merchants can draw on to understand customer needs and satisfaction [3]. Such objective information can accurately reflect the level of customer satisfaction and can also provide important information and suggestions for homestay managers. Obtaining such information is therefore a major issue for the industry.
Therefore, this research analyzes the reviews of "Guangdong's most beautiful homestays" [4] on Ctrip.com by means of text mining technology [5]. One-way ANOVA and word frequency analysis are combined to ensure that the research is sound, and the ten most frequent terms are analyzed to produce results for each homestay. Our results can serve as a reference for the relevant departments and homestay operators when making decisions on business operations.

LITERATURE REVIEW
2.1 Text mining
Text mining is a research technique based on natural language processing that collects, processes, analyzes, and summarizes unstructured data in text format. It can capture the whole picture of the data and explore the characteristics of the data from various aspects [6,7]. One of its main strengths is its ability to process large amounts of textual content and dig deep for hidden information, which is especially important for studying consumer experience through online consumer reviews [8]. In addition to online reviews, text mining has been applied successfully to election polls, public opinion surveys, classification of medical diseases, analysis of criminal behavior, word-of-mouth reviews of stores, and decision-making assistance. These examples demonstrate the application value and potential of text mining.

Web crawler
A web crawler is a program that obtains information from a target website through a URL, parses the data, and then saves the target information in a database; the process can be automated by writing a Python program. Web page data is located by the path characteristics of HTML elements, so it is necessary to observe the regularities of each path node of the web page [9]. At its core, a crawler acquires page information from the Internet in batches, downloads it, and compiles extraction rules according to the user's needs. It can quickly and automatically scan and extract web page data, and it can automate data cleaning and storage.
Web crawling tools are of various types: general crawlers, thematic crawlers, Python-based crawlers, and the Collector Crawlers of Bazhuayu. The Collector Crawlers of Bazhuayu is a mature data collection and extraction tool that automatically searches relevant web pages based on the keywords entered by the user. Compared to manual searches, efficiency and accuracy are greatly improved, information is obtained at a reduced cost, and the software is easily accessible. Therefore, the Collector Crawlers of Bazhuayu is chosen as the data acquisition tool in this research to obtain the review data of the top ten homestays in Guangdong from the relevant online homestay platforms for analysis in the subsequent research.
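The path-based extraction described above can be illustrated with a minimal Python sketch. It is not the Bazhuayu tool used in this research; it only shows the general idea of locating review text by the position of an HTML element. The class name `ReviewExtractor`, the `review` CSS class, and the sample HTML are hypothetical and do not reflect the actual page structure of any booking website.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not change nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ReviewExtractor(HTMLParser):
    """Collect the text of every <div> carrying a given CSS class (assumed name)."""

    def __init__(self, css_class="review"):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # greater than 0 while inside a matching <div>
        self.reviews = []   # extracted review texts
        self._buffer = []   # text fragments of the current review

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1  # nested element inside a review
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.depth = 1
                self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Join fragments and normalize whitespace.
            self.reviews.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        if self.depth:
            self._buffer.append(data)


# Usage on an invented page fragment.
sample_html = (
    '<div class="review">Nice <b>room</b></div>'
    '<div class="ad">ignored</div>'
    '<div class="review item">Good <br>pool</div>'
)
extractor = ReviewExtractor()
extractor.feed(sample_html)
print(extractor.reviews)
```

In practice the matching rule would have to be derived by inspecting the path nodes of the real page, as noted in [9].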

DATA COLLATION AND PRE-PROCESSING
3.1 Data selection
In this research, the "Top Ten Most Beautiful Homestays in Guangdong" are used as the sample. First, as the largest province in terms of economy, Guangdong has seen rising public enthusiasm for tourism and a revival of the cultural and tourism market after the liberalization of policies. Second, Guangdong Province has seized this opportunity and plans to advance from a province that is large in tourism consumption to one that is strong in tourism, which requires receiving a large number of tourists from China and abroad every year. Guangdong Province is also one of the regions in China where homestays are most concentrated. According to the latest report, the number of homestays in Guangdong exceeded 50,000 in 2022 [10]. Homestays in different regions have distinctive characteristics. Therefore, this research selects the homestays from the jointly organized top ten list of "2021 Guangdong's Most Beautiful Homestays" as the sample for study (code names are used to denote the homestays in this research).

Data collection
The data covers the period up to April 1, 2023, and all reviews of the top ten homestays are obtained for analysis. The Collector Crawlers of Bazhuayu is used to capture all reviews of the target homestays; the online reviews are then categorized into favorable, medium, and bad reviews, and a descriptive statistical analysis is carried out.

Data preparation and cleaning
Through the Collector Crawlers of Bazhuayu, text information is captured, organized and analyzed, and presented in a tabular format to make the information more understandable. Meanwhile, the resulting high-frequency vocabulary is presented through word clouds: the higher the frequency of a term, the larger its font size. However, because tourists differ in background and linguistic expression, simple pre-processing of the review data is needed in order to better extract and analyze the text. Two types of data are collected: reviews with positive or negative text comments, and reviews with ratings only and no textual description. The collected reviews are adjusted as follows.
(1) Delete irrelevant terms, such as emoticons, quantifiers, and quotations, and then categorize the data according to content, such as traveler's region, year of stay, customer rating, and review content, and place each item in the corresponding position.
(2) Standardize similar expressions, such as replacing "landlord, customer service" with "boss, housekeeper".
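The cleaning steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure actually run in this research: the function names are hypothetical, the pairing of "landlord" with "boss" and "customer service" with "housekeeper" is one reading of the replacement rule, and English strings stand in for the original reviews. Quantifier removal and tokenization are omitted because they would require a language-specific word segmenter.

```python
import re

# Assumed pairing of the standardized expressions (hypothetical reading).
SYNONYMS = {"landlord": "boss", "customer service": "housekeeper"}

# Emoji and pictograph ranges plus double and curly quotation marks.
NOISE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\"\u201c\u201d\u2018\u2019]")


def clean_review(text):
    """Strip emoticons and quotation marks, then standardize similar expressions."""
    text = NOISE.sub("", text)
    for old, new in SYNONYMS.items():
        text = text.replace(old, new)
    return " ".join(text.split())  # collapse leftover whitespace


def deduplicate(reviews):
    """Drop exact duplicate reviews while keeping the original order."""
    seen = set()
    unique = []
    for review in reviews:
        if review not in seen:
            seen.add(review)
            unique.append(review)
    return unique


# Usage on invented examples.
raw = ["The landlord was \u201cgreat\u201d \U0001F600", "The landlord was \u201cgreat\u201d \U0001F600", "Slow customer service"]
cleaned = [clean_review(r) for r in deduplicate(raw)]
print(cleaned)
```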

Descriptive statistics of homestay review types
By categorizing and organizing the data, a statistical analysis of review types was conducted for the ten homestays. First, for favorable reviews, the average rating is 4.94 for E Homestay, 4.92 for D Homestay, and 4.83 for A Homestay, while F Homestay receives the lowest average rating, at 4.84.
Second, in terms of the average of medium ratings, B Homestay receives the highest value at 3.41, followed by F Homestay and G Homestay at 3.38. The second lowest value is 3.18 for H Homestay, and the lowest is 3.16 for E Homestay.
As for the average ratings of bad reviews, D Homestay is the highest at 2.60, A Homestay is the second highest at 2.50, J Homestay is the second lowest at 1.63, and B Homestay is the lowest at 1.50.
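Averages of this kind can be computed with a short sketch such as the one below. The score thresholds that separate favorable, medium, and bad reviews are not stated in this paper, so the cut-offs of 4.0 and 3.0 are assumptions chosen only for illustration, and the sample scores are invented.

```python
from statistics import mean


def review_type(score, good_min=4.0, medium_min=3.0):
    """Assign a review type from a rating; both thresholds are assumed values."""
    if score >= good_min:
        return "favorable"
    if score >= medium_min:
        return "medium"
    return "bad"


def mean_by_type(scores):
    """Return the average rating of each review type, rounded to two decimals."""
    groups = {}
    for score in scores:
        groups.setdefault(review_type(score), []).append(score)
    return {kind: round(mean(values), 2) for kind, values in groups.items()}


# Usage on invented ratings for a single homestay.
print(mean_by_type([5, 5, 4.5, 3.5, 3, 2, 1]))
```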

One-way ANOVA
In this research, one-way ANOVA is used to compare the differences in rating scores across homestays. The results show that the F-value is 10.490 and the p-value is smaller than 0.001. Therefore, there exist significant differences in rating scores across homestays. Specifically, we found that the score of A Homestay is significantly higher than that of I Homestay, while the scores of B Homestay, F Homestay, G Homestay, H Homestay, and J Homestay are significantly higher than those of C Homestay, E Homestay, and J Homestay.
In addition, the chi-square test can further compare the differences between the review types of homestays. Our results indicate that the chi-square value is 98.303 and the p-value is smaller than 0.001. Therefore, there exists a significant difference between the review types of the homestays. Specifically, for bad reviews, the proportion for C Homestay is significantly higher than that for J Homestay, and the proportion for I Homestay is significantly higher than those for F Homestay, G Homestay, and J Homestay. Next, for medium reviews, the proportions for E Homestay, F Homestay, and I Homestay are significantly higher than that for J Homestay. Finally, for favorable reviews, J Homestay has a significantly higher proportion than C Homestay, E Homestay, F Homestay, and I Homestay.
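For readers who wish to see how the two test statistics are formed, the following minimal sketch computes the one-way ANOVA F-value and the Pearson chi-square value from first principles. It is not the software used in this research, the function names are hypothetical, and the numbers in the usage example are invented toy data rather than the study's ratings. Obtaining p-values additionally requires the F and chi-square distributions, for which a statistics package such as SciPy (`scipy.stats.f_oneway`, `scipy.stats.chi2_contingency`) could be used.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of rating samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


def chi_square(table):
    """Return (chi2, dof) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, dof


# Toy data: three homestays' ratings, and review-type counts for two homestays.
print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
print(chi_square([[10, 20], [20, 10]]))
```

In the contingency table, each row would correspond to a review type (favorable, medium, bad) and each column to a homestay.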

The analysis for word frequency and word cloud
Not all of the reviews on online platforms are text reviews; some reviews consist of a rating only, without positive or negative text comments. The number of reviews with text comments also varies across homestays. Among the ten homestays, I Homestay has the most reviews, with a total of 700, of which 699 contain comments. D Homestay has the fewest, with only 23 reviews in total. Across the ten homestays, the difference between the number of reviews with text and the number without text is relatively large, and the proportion of reviews without text is much smaller than that of reviews with text. This research focuses on the word frequency analysis of the favorable and bad reviews of the ten homestays on Ctrip.com, an online booking platform, and the analyzed nouns and adjectives are ranked according to their frequency. The results show that for the top ten terms in the favorable reviews of I Homestay, the total frequency is 1407, accounting for 15.4% of the original data. For the top ten terms in the bad reviews, the total frequency is 362, accounting for 32.8% of the original data.
As shown in Figure 1, the most frequent words in the favorable reviews of I Homestay are "service", "hotel" and "housekeeper". The first ranked term is "service", which indicates that the majority of consumers have a favorable evaluation of the services provided by I Homestay. The second ranked term is "hotel", which is also a frequent positively rated term and indicates that consumers generally have a favorable perception of the property. In addition, attentive service staff have become a characteristic of I Homestay. Travelers mentioned both service and the housekeeper in the favorable reviews, which is why "housekeeper" is ranked third. In the analysis of the negative comments, "room" is ranked first and has a higher frequency than other words. This shows that some consumers are not very satisfied with the rooms.
Next, using the same method, the results for the favorable and bad reviews of the other homestays are shown in Figures 2 to 10.
A horizontal comparison of the top three terms in the positive comments across the ten homestays reveals several features. First, the word "homestay" appears in the top three rankings of nearly all ten homestays, which suggests that most of the travelers who left reviews were highly positive about many aspects of the homestays. Second, the word "service" appears in the top three rankings for at least four homestays. This indicates that most travelers have a favorable opinion of the service aspect of homestays. Although the term is not among the top three terms for the other homestays, it still appears among their top ten terms. This means that some travelers are still positive about the services of these homestays, but less so than about other aspects, such as the environment or the rooms. Third, the word "room" appears in the top three rankings for five homestays. This shows that most of the travelers who left positive reviews are satisfied with the facilities and services and had a comfortable stay. Finally, C Homestay, E Homestay, and H Homestay have swimming pool facilities, and the term "swimming pool" appears in the top three rankings for these three homestays. This indicates that this featured facility is indeed an important need for travelers and results in many favorable evaluations.
On the other hand, the top three terms in the negative reviews are also compared horizontally. First, although the word "room" appears quite frequently in the favorable reviews, it also ranks in the top three in the negative reviews for seven homestays, including B Homestay, C Homestay, and I Homestay. It can be seen that although most of the travelers who left reviews are positive, some travelers had a negative impression of some aspects of the room. Second, the word "breakfast" is found in the top three terms of the negative reviews for D Homestay and H Homestay. This shows that breakfast with little variety and poor taste is a problem for some travelers. Therefore, these two homestays need to address their breakfast-related problems. Third, the word "no" appears in the top three rankings for B Homestay, G Homestay, and I Homestay. This means that the facilities and services provided by these homestays are not complete enough to meet the needs of travelers, which in turn leads to negative comments from some travelers. Therefore, the homestays mentioned above need to address the relevant issues in order to satisfy travelers.
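The ranking of high-frequency terms can be sketched as follows. This is a simplified illustration and not the tool used in this research: it assumes the reviews have already been segmented into words and filtered to nouns and adjectives, the function name `top_terms` is hypothetical, and the share it returns (top-term frequency divided by all counted tokens) is only one possible reading of "accounting for ... of the original data".

```python
from collections import Counter


def top_terms(tokenized_reviews, n=10):
    """Return the n most frequent terms and their share of all counted tokens."""
    counts = Counter(token for review in tokenized_reviews for token in review)
    top = counts.most_common(n)
    total = sum(counts.values())
    share = sum(count for _, count in top) / total if total else 0.0
    return top, share


# Usage on invented, pre-tokenized favorable reviews.
reviews = [
    ["room", "clean", "service"],
    ["service", "good"],
    ["service", "room"],
]
terms, share = top_terms(reviews, n=2)
print(terms, round(share, 3))
```

The same counts can drive a word cloud, where each term's font size grows with its frequency.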

CONCLUSION AND SUGGESTION
5.1 Conclusion
This research analyzes the online reviews of the top ten most beautiful homestays in Guangdong, China, and finds that tourists are generally satisfied with local homestays, but not yet highly satisfied. Therefore, there is still room for improvement. Our study is mainly based on tourists' comments on homestays on a well-known online booking platform. The ten most frequent terms for each homestay are counted, and the data are analyzed by descriptive statistics and one-way ANOVA. The results are presented visually through word clouds. According to the data analysis, the following results are obtained: First, there are significant differences in ratings and types of reviews across homestays. The online reviews are categorized into two types, rating type and comment type, for descriptive statistical analysis, and the mean and standard deviation for the ten homestays are obtained. It can be seen that there is a significant difference in ratings among homestays. This indicates that operators need to learn from the experience of other excellent homestays and adjust their business strategies to provide more satisfying services to tourists.
Second, homestays generally receive favorable ratings for "homestay" and "room", and bad ratings for "room" and "breakfast". By looking at the favorable and negative reviews of the ten homestays and analyzing the top ten high-frequency words for each homestay, we find that travelers are more favorable toward "homestay" and "room", which are also the aspects they are most concerned about. However, the word "room" also appears in the high-frequency vocabulary of negative reviews, indicating that some travelers are dissatisfied with the rooms, so operators need to improve the room facilities.
Third, the existence of certain special facilities in a homestay attracts travelers. The descriptive statistical analysis by review type shows that, among the homestays with high average ratings, the term "swimming pool" is mentioned for E Homestay, C Homestay, and H Homestay. It is also a positively rated term in the high-frequency vocabulary. From this, it is clear that homestays with their own special services or facilities can better attract travelers. Therefore, operators need to create their own unique advantages to attract more travelers and enhance the popularity of their homestays.

Suggestion
Although we obtain several interesting and valuable findings in this study, there are still some shortcomings, such as an insufficient sample. Data collection was mainly based on online review data, and the sample for some homestays is too small. Therefore, an in-depth analysis of homestay travelers' satisfaction cannot yet yield more reliable results. Future research should combine traditional and online questionnaires so that a larger sample size can be obtained. In addition, it is necessary to analyze whether there are significant differences in the service characteristics and service quality levels of different homestays.
As different homestays offer different experiences to consumers, they generate different customer perceptions; these perceptions will also influence repeat consumption behavior, which in turn will promote the homestay industry. Therefore, homestay operators can start to adjust their management strategies based on negative customer reviews and improve their facilities or services accordingly. In addition, it is suggested that operators refer to excellent homestay models, learn from good practices, and create their own unique homestays according to local conditions. It is hoped that the results of this study can offer some valuable insights to homestay operators.

Figure 1: Word clouds of favorable and bad reviews for I Homestay.

Figure 2: Word clouds of favorable and bad reviews for A Homestay.

Figure 3: Word clouds of favorable and bad reviews for B Homestay.

Figure 4: Word clouds of favorable and bad reviews for C Homestay.

Figure 5: Word clouds of favorable and bad reviews for D Homestay.

Figure 6: Word clouds of favorable and bad reviews for E Homestay.

Figure 7: Word clouds of favorable and bad reviews for F Homestay.

Figure 8: Word clouds of favorable and bad reviews for G Homestay.

Figure 9: Word clouds of favorable and bad reviews for H Homestay.

Figure 10: Word clouds of favorable and bad reviews for J Homestay.

Table 1: The descriptive statistics for scoring

Table 1
for A Homestay, 4.496 for E Homestay, and 4.568 for D Homestay. As for the homestays below 4.45, the two worst homestays, namely C Homestay and I Homestay, have values of 4.451 and 4.419, respectively.