Satisfaction Evaluation System of Macau Attractions Based on Online Evaluation Data

Online reviews are a valuable source for understanding tourist satisfaction and their emotional tendencies towards attractions. However, there is a need to improve the quality screening of reviews before conducting sentiment analysis. This paper focuses on the important attractions in Macao and utilizes data collected from Ctrip.com to establish an evaluation system. The system scores and ranks the validity of online reviews, and various machine learning methods are examined to automate the screening process. The study finds that XGBoost + stacking is the most effective method, with the upper quartile serving as the threshold for review selection. By employing text sentiment analysis technology, the study evaluates visitor satisfaction for each attraction using quality reviews. This approach contributes to the development of a more comprehensive satisfaction evaluation system and offers a fresh analytical perspective for studying Macau tourism.


INTRODUCTION
In recent years, the tourism industry and the Internet have become closely integrated, and tourism has flourished worldwide. Following the easing of pandemic restrictions, the number of tourists visiting Macau has gradually returned to pre-epidemic levels. As a special administrative region of China, Macau boasts abundant tourism resources and distinctive cultural charm. With its globally renowned gaming industry, a rich historical heritage that blends Chinese and Western cultures, and a unique architectural style, Macau attracts a multitude of tourists. Ensuring a high level of satisfaction at tourist attractions relies not only on the inherent appeal of the destination but also on understanding the thoughts and emotions of visitors, so it is crucial to focus on tourists' experiences and emotional feedback. In light of existing theories on tourist satisfaction at scenic spots [1], online reviews serve as a valuable medium for capturing the emotional inclinations of tourists from across the country. These reviews hold significant value for both tourists and professionals in the tourism industry. Hence, this study analyzes visitor satisfaction at each significant attraction in Macau using online review data, offering insights for Macau's tourism sector.
Online reviews possess several key characteristics, including their vast quantity, complex context, and rich semantics. Many scholars, both domestically and internationally, have analyzed user evaluation texts, extracted emotional tendencies, and studied customer satisfaction in order to provide insights to industry practitioners. For instance, Mauri and Minazzi [2] demonstrated that online reviews can influence potential tourists' consumption inclinations to a certain extent. Fan Ning [3] investigated customer satisfaction with B&Bs based on reviews from online travel websites. Lopamudra et al. [4] applied sentiment analysis in the R language to travelers' online posts to elucidate the travel motivations of Indian travelers during the coronavirus pandemic. Similarly, Wu Kaijun et al. [5] selected 112 travelogue texts published by tourists after travelling to Macau and used content analysis to assess satisfaction with different tourism elements and the overall tourism perception; the vast majority of tourists rated attractions highly among the tourism elements. Therefore, studying the extensive comments left by tourists on online travel websites, identifying descriptive factors within these comments, and establishing the emotional connections between them are of significant importance in evaluating satisfaction with Macau's attractions.
However, online reviews often suffer from issues such as irrelevance, plagiarism, and lack of valid content, which poses a challenge for visitors seeking valuable information. Existing research on online review data primarily focuses on removing blank comments and advertisements during data preprocessing and assigning topic relevance as a weight when calculating sentiment scores. However, these approaches do not involve a thorough screening process to identify high-quality reviews. The presence of low-quality comments can significantly impact the accuracy of the analysis. Therefore, it is crucial to assess the quality of comments to ensure the reliability and accuracy of the analysis. The research process of this paper is illustrated in Figure 1. It involves analyzing the effectiveness of online comments, identifying high-quality comments through screening, and conducting sentiment analysis to enhance the effectiveness and accuracy of the analysis results. This study builds upon the previous research conducted by scholars and addresses the limitations in text selection. It goes beyond simple data screening and cleaning, as well as manual annotation, by developing an automated method to identify high-quality comments. This approach leads to a more comprehensive analysis of the attractions in Macau, enabling potential tourists to access valuable content more effectively and efficiently [6]. Additionally, it facilitates the extraction of suggestions for improving the services provided at the attractions by utilizing valuable review information, benefiting tourism practitioners and relevant stakeholders [7].

DATA ACQUISITION AND PREPROCESSING
2.1 Data Selection
Since this paper focuses on studying online reviews of various attractions in Macau, the main objective is to identify the impact of different attractions on tourists' affective tendencies. It is believed that simply changing online platforms or increasing the number of platforms would have minimal effect. To this end, the industry website ranking by Webmaster Home (https://www.chinaz.com/) was chosen, based on factors like Baidu weight, Alexa ranking, and PR value (Table 1).
After comparing the data in Table 1, Ctrip.com emerged as the primary data source. Among Macau attractions on Ctrip.com, the ten with the highest review counts were chosen for the study. The selected attractions are as follows: Macau Tower, The Venetian, St Paul's, City of Dreams Resort, Rua da Cunha, Fortress, Macau Science Centre, Macau Parisian Tower, Senado Square, and Macau Fisherman's Wharf. The attractions cover the Macao Peninsula and the Taipa/Coloane islands, encompassing parks, forts, museums, exhibition halls, and World Heritage sites. This diverse selection facilitates further research experimentation.

Data Collection
Ctrip attractions provide extensive web review data (Figure 2), with each review containing user details, ratings, text, images, timestamps, IP location, likes, and more.
In this paper, we use crawling techniques [8] to gather the online review data needed for the study. Considering Python's concise design and its extensive crawler libraries and frameworks, we chose Python as the implementation language for data collection. Specifically, we focus on the attractions section of Ctrip.com as the platform and crawl the first 60 pages of online review data for the top ten attractions listed on the website. To achieve this, we use Python's crawler libraries, namely Requests and BeautifulSoup, to retrieve the HTML content of the web pages. Subsequently, we parse the HTML content to extract the relevant review information. It is important to note that the Ctrip website may use different HTML structures or CSS styles to render its pages. Therefore, when parsing the HTML, comments with images and plain-text comments need to be extracted separately. After performing simple deduplication operations, such as removing comments with identical IDs, we obtained a total of 5,765 raw data entries.

Data Preprocessing
Based on the 5,765 raw data entries obtained thus far, further processing is required for the indicators, except for the number of likes, which can be used directly after merging into one column. The remaining indicators are processed as follows:
(1) Time Difference: To begin with, the comment data needs to be consolidated: comments with images and plain-text comments are extracted in different formats, so the two resulting columns should be merged into one. The next step is to calculate the time difference in days between each comment's timestamp and the experiment's deadline date (8th November 2023). This time difference provides information on the temporal proximity of each comment to the experiment's deadline. According to the theory of information attenuation [9], the relevance and impact of comments may weaken as time passes. This reduction in timeliness could potentially diminish their value to readers. Therefore, the time difference serves as a feature indicator.
(2) Text Length: Many comments include emoticons, special symbols, and other elements that can potentially interfere with the validity assessment of comments and the study of text sentiment analysis. To mitigate this interference and enhance the reliability and accuracy of the study, further processing of the comment text is necessary. Regular expression matching helps identify and filter out invalid links, spaces, symbols, line breaks, and other undesired elements. Additionally, the len() function can be used to obtain the character count of each comment text, serving as a feature indicator.
(3) Number of Images: By leveraging web crawler technology, we are able to extract the image links present in each comment. Through a loop traversal in Python, we can calculate the length of the image content using the len() function. This gives the number of images in each comment, serving as a feature indicator.
(4) Rating: The ratings obtained through crawling include both numeric values and accompanying text, such as "Five points, excellent (5分 超棒)". This semantic repetition is cumbersome and not conducive to subsequent computation. To address this, we can use the itertools.groupby() function from Python's standard library to separate the text and numeric components. By employing a list comprehension with a conditional statement (if is_digit), we can extract and retain only the numerical part of the rating as a feature indicator.
(5) Topic Relevance: For this study, the topic relevance of each comment is calculated using the LDA (Latent Dirichlet Allocation) model [10,11]. To determine the appropriate number of topics for the model, the perplexity-consistency method [12] is employed. This method combines perplexity and topic consistency, with lower perplexity and higher consistency indicating a better model. The Jieba library is used to segment the Chinese comment text. Parameters such as the number of topic words to extract and the stop words are set accordingly. Subsequently, the LDA model is trained on the comment dataset with varying numbers of topics. The results are presented in Figure 3 and Figure 4. It can be observed that when the number of topics is 28, the perplexity is minimized, the topic consistency is relatively high, and the overall model quality is deemed good.
Based on the findings from Figure 3 and Figure 4, the number of topics is ultimately set to 28. For each topic, the top 10 keywords that best represent its content are selected. Topic relevance is then extracted from the LDA model. As a result, two matrices are obtained: the topic-keyword matrix and the topic probability distribution matrix. Table 2 presents a selection of topics along with the corresponding keywords. It can be observed that topic 1 is associated with tourist activities, topic 2 revolves around plaza scenery, and topic 3 is centered on Portuguese cuisine.
Table 3 illustrates the topic probability distribution associated with a selection of comments. This distribution indicates the relative importance of each topic within the comments. The topic with the highest probability value in the topic probability distribution tends to be the most relevant topic for the comment's content. In this study, the maximum probability value for a given comment is recorded as its topic relevance score. The topic relevance score is then used as a feature indicator.
The steps above complete the preprocessing of the required feature metrics, which were rearranged and aggregated to obtain the final dataset for the study. Each web comment record contains the fields shown in Table 4, covering these six feature attributes.
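The feature extraction steps (1), (2), (4) and (5) above can be sketched with the Python standard library alone. This is a minimal illustration, not the authors' code: the function names, the cleaning patterns, and the sample inputs are assumptions made for the example, and only the deadline date, the use of len(), and the use of itertools.groupby() come from the text.

```python
import re
from datetime import date
from itertools import groupby

DEADLINE = date(2023, 11, 8)  # experiment deadline stated in the text

def time_difference_days(posted: date, deadline: date = DEADLINE) -> int:
    """Days elapsed between the comment date and the deadline."""
    return (deadline - posted).days

# Hypothetical cleaning patterns (assumption): drop links first, then every
# run of non-word characters (spaces, symbols, emoji, line breaks).
_URL = re.compile(r"https?://\S+")
_NOISE = re.compile(r"[\W_]+")

def clean_text(text: str) -> str:
    """Remove links and non-word characters from a comment."""
    return _NOISE.sub("", _URL.sub("", text))

def text_length(text: str) -> int:
    """Character count of the cleaned comment, via len()."""
    return len(clean_text(text))

def extract_rating(raw: str) -> int:
    """Keep only the first numeric run of a rating such as '5分 超棒'."""
    digit_runs = [
        "".join(chars)
        for is_digit, chars in groupby(raw, key=str.isdigit)
        if is_digit
    ]
    return int(digit_runs[0])

def topic_relevance(topic_probs):
    """Topic relevance = largest probability in the comment's topic distribution."""
    return max(topic_probs)
```

For example, a review posted on 26 September 2023 yields a time difference of 43 days, and the rating string "5分 超棒" yields 5.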

Data Standardization
Table 5 provides a descriptive analysis of the data. It is observed that some data points exhibit a significant degree of dispersion. Hence, data standardization plays a crucial role in ensuring that all indicators are transformed to the same order of magnitude. This step enables more accurate and reliable analysis and processing of the data.
Considering the diverse nature of the indicators, it is necessary to normalize them to a common range for effective comparison. Additionally, the six indicators encompass both positive and negative attributes, requiring positive normalization. To accomplish this, a Min-Max standardization method was selected, incorporating both scaling and standardization techniques. In this study, there are a total of $m$ objects to be evaluated and $n$ evaluation indices, forming the data matrix $X = (x_{ij})_{m \times n}$. Specifically, there are 5776 comments and 6 evaluation indices, so $m = 5776$ and $n = 6$. Let $x'_{ij}$ denote the element of the data matrix after indicator normalization. If $x_j$ is a negative indicator, meaning that a smaller value is considered better, as is the case with the "time difference" in the table, we employ the following formula:

$$x'_{ij} = \frac{\max(x_j) - x_{ij}}{\max(x_j) - \min(x_j)} \tag{1}$$

If $x_j$ is a positive indicator, meaning that a larger value is considered better, as is the case with all the other indicators in the table, we use the following formula:

$$x'_{ij} = \frac{x_{ij} - \min(x_j)}{\max(x_j) - \min(x_j)} \tag{2}$$

Through the combination of positive normalization and Min-Max standardization, all indicators are mapped to the range 0 to 1. This transformation makes the indicators comparable and interpretable.
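As an illustration of the Min-Max step with positive normalization, the following sketch normalizes one indicator column at a time. The function name and the handling of a constant column are assumptions made for the example; the paper does not say how a zero range is treated.

```python
def min_max_normalize(values, positive=True):
    """Map one indicator column to [0, 1].

    positive=True  : larger raw values are better (e.g. likes, text length).
    positive=False : smaller raw values are better (e.g. time difference),
                     so the scale is reversed.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information; map it to 0.
        return [0.0 for _ in values]
    if positive:
        return [(v - lo) / (hi - lo) for v in values]
    return [(hi - v) / (hi - lo) for v in values]
```

With this convention, the most recent comment (smallest time difference) receives a normalized value of 1 and the oldest receives 0.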

COMMENT VALIDITY ANALYSIS
3.1 CRITIC Weighting Method for Indicator Weights Determination
In this study, the CRITIC weighting method is employed to assess and analyze the features of online comments. CRITIC is an objective weighting method and is used here to determine the weights of the six features. Compared with other methods such as entropy weighting and standard deviation weighting, the CRITIC method is regarded as a superior weighting method. It was originally proposed by Diakoulaki (1995) [14] for evaluating the relative importance of indicators in overall evaluations.
The CRITIC weighting method considers two crucial aspects, namely contrast and contradiction, when calculating indicator weights [14]. It takes into account the information carrying capacity of indicator $j$, denoted as $C_j$. The calculation formula is as follows:

$$C_j = \sigma_j \sum_{i=1}^{n} (1 - r_{ij}) \tag{3}$$

In this formula, $\sigma_j$ is the standard deviation of the $j$th indicator and represents its contrast, while $\sum_{i=1}^{n} (1 - r_{ij})$ represents the degree of contradiction between indicator $j$ and the remaining indicators, where $r_{ij}$ is the correlation coefficient between indicators $i$ and $j$.
The weight $W_j$ is determined from the information carrying capacity $C_j$. A larger value of $C_j$ indicates a higher weight for the indicator. The formula for calculating the weight is as follows:

$$W_j = \frac{C_j}{\sum_{j=1}^{n} C_j} \tag{4}$$

The final calculation results are presented in Table 6. The table reveals that among the six feature indicators of online comments, topic relevance carries a higher weight, while the number of likes bears a relatively lower weight. These findings indicate that topic relevance holds significant importance when evaluating and analyzing online comments.
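A minimal sketch of the CRITIC computation follows, assuming the standard formulation in which contrast is the standard deviation of each normalized indicator and contradiction is the sum of one minus the Pearson correlation with every indicator. The helper names and the use of the sample standard deviation are assumptions for the example, and indicator columns are assumed to be non-constant.

```python
from math import sqrt

def _mean(xs):
    return sum(xs) / len(xs)

def _std(xs):
    """Sample standard deviation (assumption: n - 1 in the denominator)."""
    m = _mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

def _pearson(xs, ys):
    """Pearson correlation coefficient between two columns."""
    mx, my = _mean(xs), _mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / den

def critic_weights(columns):
    """CRITIC weights for a list of normalized indicator columns."""
    n = len(columns)
    info = []
    for j in range(n):
        # Contradiction: the i == j term contributes 1 - 1 = 0.
        conflict = sum(1 - _pearson(columns[i], columns[j]) for i in range(n))
        # Information carrying capacity = contrast * contradiction.
        info.append(_std(columns[j]) * conflict)
    total = sum(info)
    return [c / total for c in info]
```

The returned weights sum to 1, so they can be used directly as coefficients of a weighted score.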

Calculation of the Composite Score of Validity
The composite validity score of each comment is computed from the weight coefficients assigned to each indicator and the standardized indicator data:

$$V_i = \sum_{j=1}^{n} W_j x'_{ij} \tag{5}$$

where $V_i$ is the composite validity score of the $i$th comment, $W_j$ is the weight of indicator $j$, and $x'_{ij}$ is the standardized value of indicator $j$ for comment $i$. Part of the output is presented in Table 7.
Figure 5 illustrates the data distribution of the composite score of comment validity.The calculated kurtosis is 0.061657704, and the skewness is 0.065471.These statistical indicators provide insights into the data's distribution pattern.With a kurtosis close to 0 and a skewness also close to 0, it can be deduced that the data distribution exhibits relative symmetry without notable spikes or skewness.This suggests a well-balanced distribution of composite scores, with a relatively even occurrence of high and low values and minimal deviations.
Consequently, the upper quartile is a reasonable choice of threshold in this scenario. It filters the data at a relatively high level, which suits the objective of isolating high-quality, high-value reviews. The upper quartile is also robust: as an order statistic, it is far less affected by extreme values or outliers than the mean, and it sets a stricter cut-off than the median. In this study, therefore, the upper quartile of the composite validity scores of all reviews is chosen as the threshold, and reviews that surpass it are considered quality reviews.
The upper quartile $Q_3$ is the 75th percentile of the composite validity scores, that is, the value below which three quarters of the scores fall. The calculation gives Q3 = 0.5379056130923262. Using this value as a threshold, reviews with a validity score exceeding the upper quartile are selected as quality reviews. We believe that reviews surpassing this threshold possess a higher level of validity: they are more robust and reliable and hold greater research value. Such reviews are particularly valuable for subsequent studies, including visitor sentiment analysis.
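The scoring and thresholding can be sketched as follows. This is an illustration under assumptions: the composite score is taken to be a weighted sum of the normalized indicators, and the quartile uses linear interpolation between order statistics (`method="inclusive"`), since the paper does not state which quartile convention was used. The function names are hypothetical.

```python
from statistics import quantiles

def composite_scores(rows, weights):
    """Weighted sum of normalized indicator values, one score per comment."""
    return [sum(w * x for w, x in zip(weights, row)) for row in rows]

def label_quality(scores):
    """Return the upper quartile and a 1/0 quality label for each score."""
    # quantiles(..., n=4) returns [Q1, Q2, Q3]; index 2 is the upper quartile.
    q3 = quantiles(scores, n=4, method="inclusive")[2]
    labels = [1 if s > q3 else 0 for s in scores]
    return q3, labels
```

Because the comparison is strict, a review whose score equals the threshold is not labeled as a quality review.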
Figure 6 depicts the distribution of quality reviews across ten scenic spots in Macau, highlighting variations in the number of quality reviews among different locations.Notably, Macau Tower stands out with the highest count of quality comments, whereas the Fortress receives relatively fewer quality comments.
It is important to acknowledge that factors beyond the number of quality comments can influence this distribution.Relying solely on the quantity of quality comments is insufficient for assessing visitor impressions of a scenic spot.It is crucial to consider additional factors, such as tourists' personal preferences and opinions.
However, highly valid comments typically represent the primary opinions and sentiment trends of users.By filtering and selecting these quality comments, we can enhance the accuracy of subsequent sentiment analysis.It is important to integrate these valuable insights to gain a more comprehensive understanding.

Classification Prediction of Quality Comments Based on Machine Learning
In the previous section, we developed the quality comment evaluation system.We categorized comments with scores exceeding the threshold as quality comments and labeled them as 1, while the remaining comments were labeled as 0. This process assigned a quality label to each comment, which, along with the six features, transformed the unstructured data into structured data.
With the obtained training dataset, containing input features and corresponding labels, we can utilize supervised machine learning models to train the data.Considering the data's characteristics, we selected the support vector machine (SVM) algorithm and the XGBoost algorithm, known for their superior classification results, for this study.
To ensure the robustness of the models, the dataset was divided into a 7:3 ratio for training and testing purposes.The experiments were conducted using the Python programming language, and parameter tuning was performed using grid search.The flowchart illustrating the process is presented in Figure 7.
In this paper, we selected Accuracy, Precision, Recall, and F1-score as the evaluation metrics to assess the performance and quality of the results produced by the machine learning models. These metrics provide valuable insights into the models' effectiveness.
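For reference, the four metrics can be computed from the binary quality labels as in the sketch below. The function name is an assumption, and returning 0.0 when a denominator is zero is a convention chosen for the example rather than something stated in the paper.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = quality review)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Since only about a quarter of the reviews are labeled 1 under the upper-quartile threshold, precision, recall and F1 are more informative here than accuracy alone.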
Table 8 presents the final experimental results of the two machine learning models.It showcases the performance of the models based on the evaluation metrics mentioned earlier.
Based on the experimental results, XGBoost outperforms SVM in the classification of quality comments, achieving higher overall accuracy in identifying them.
To further enhance prediction performance and the model's generalization ability, we propose applying stacking to the XGBoost model. Stacking, an effective ensemble method [15], can augment the predictive power of XGBoost by combining it with other models. Moreover, by combining the predictions of multiple models, stacking can mitigate overfitting and improve the model's generalization. The stacking process involves two layers. In the first layer, features are weighted using stacking, while in the second layer, XGBoost is used for single-model training.
Comparing the performance of the stacked model with the single base model, the results in Table 9 demonstrate overall improvement in all evaluation metrics.This confirms that applying stacking to the XGBoost model yields better results.

SATISFACTION ANALYSIS
4.1 Acquisition of Factors Influencing Satisfaction and Calculation of Their Weights
(1) TF-IDF for acquiring secondary features: TF-IDF [16] is a widely used text mining method that evaluates the significance of a word in a given text. The importance of a word within a document is determined by its relevance to the document as a whole. The calculation of word importance within a document can be represented by the following equation:

$$\mathrm{TF\text{-}IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D) \tag{7}$$

where TF(t, d) represents the word frequency of term t in document d, and IDF(t, D) represents the inverse document frequency of term t in the corpus D. By calculating the TF-IDF values, we extract the 30 words with the highest TF-IDF scores as the secondary feature indicators for measuring tourist satisfaction.
(2) Word Embedding with Word2Vec: The secondary feature words obtained previously were vectorized using Word2Vec [17] to acquire a numerical representation for each feature word. In this investigation, the gensim module in Python was employed for implementation, with the Skip-Gram model chosen for training and generating word vectors. Each feature word was thereby transformed into a numerical vector of pre-defined length.
(3) K-Means Clustering for First-Level Features: Clustering algorithms are crucial for categorizing unlabeled vector data. In this study, the classical K-means [18] algorithm was chosen for this purpose. The KMeans module from sklearn was used for clustering analysis. Through multiple iterations adjusting the K value (number of categories), it was observed that optimal results were achieved when K equals 3. This led to a more uniform data distribution and enhanced interpretability. The results were visualized with Matplotlib in Figure 8, which shows improved data uniformity across classes.
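Step (1) can be illustrated with a small, dependency-free sketch. Several choices here are assumptions, because the paper does not specify them: term frequency is the count divided by document length, IDF is the unsmoothed logarithm of the number of documents over the document frequency, and a word's overall score is the sum of its TF-IDF values across documents. Documents are assumed to be non-empty lists of already segmented tokens.

```python
from collections import Counter
from math import log

def tf_idf_scores(docs):
    """Aggregate TF-IDF score per term over a list of tokenized documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    scores = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            tf = count / len(doc)                # term frequency in this document
            idf = log(n_docs / doc_freq[term])   # unsmoothed IDF (assumption)
            scores[term] += tf * idf
    return scores

def top_terms(docs, k=30):
    """The k terms with the highest aggregate TF-IDF score."""
    return [term for term, _ in tf_idf_scores(docs).most_common(k)]
```

A term that appears in every document gets an IDF of zero under this variant, so generic words drop out of the top-30 list.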
Drawing upon the outcomes of textual clustering analysis and integrating insights from consumer satisfaction theory [19], we categorize the influencing factors of attraction satisfaction into three primary dimensions at the first level, namely: scenery, experience, and fun. Subsequently, we identify 30 second-level influencing factors under these primary categories. A subset of the findings is presented in Table 10. At this juncture, we have concluded the selection of satisfaction-influencing factors.

Calculation of Sentiment Scores using SnowNLP
This study employs the SnowNLP algorithm to assess visitor satisfaction, relying on sentiment analysis principles. The SnowNLP sentiment analysis method utilizes a sentiment lexicon, classifying text into positive and negative categories and providing a probability value for the sentiment, i.e., a sentiment score within the range [0,1] [20]. A sentiment score nearing 1 indicates a more positive emotional expression, while a score approaching 0 signifies a more negative emotional tone. Following numerous iterations and observations, an adjustment threshold has been established to achieve effective categorization of affective tendencies. Specifically, in this study, the probability of expressing positive emotion in the outcome of emotion analysis is recorded as the final emotion score value.

Formulation of the Satisfaction Assessment Function
To gauge attraction satisfaction, the assessment function is introduced with the specific formula:

$$\text{Satisfaction} = \sum_{i=1}^{3} w_i S_i \tag{9}$$

Here, $w_i$ represents the weight assigned to the $i$-th first-level indicator, with $\sum_{i=1}^{3} w_i = 1$, and $S_i$ denotes the visitor satisfaction corresponding to that indicator. $S_i$ is obtained from the sentiment scores of the comments under the indicator, where $N$ denotes the total number of comments attributed to a specific first-level indicator, and $M_k$ represents the sentiment score value of the $k$-th comment associated with that particular first-level indicator. Specifically, $M_k$ signifies the cumulative probability of positive sentiment in the results of sentiment analysis under that first-level indicator.
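The assessment function can be sketched as below. Two points are assumptions rather than statements from the paper: the per-indicator satisfaction is taken to be the mean of the comment sentiment scores, and the conversion to a five-point rating is taken to be multiplication by 5. All names are hypothetical.

```python
def indicator_satisfaction(sentiment_scores):
    """Satisfaction for one first-level indicator.

    ASSUMPTION: the mean of the positive-sentiment probabilities of the
    comments under the indicator; the paper's exact aggregation is not shown.
    """
    return sum(sentiment_scores) / len(sentiment_scores)

def overall_satisfaction(weights, scores_by_indicator):
    """Weighted sum of per-indicator satisfaction; weights should sum to 1."""
    return sum(
        weights[name] * indicator_satisfaction(scores_by_indicator[name])
        for name in weights
    )

def to_five_point(score):
    """ASSUMPTION: linear mapping of a [0, 1] score onto a 0-5 scale."""
    return 5 * score
```

For instance, with weights 0.5, 0.25 and 0.25 for scenery, experience and fun, and mean sentiment scores of 0.75, 0.5 and 1.0, the overall satisfaction is 0.75, or 3.75 on the five-point scale under the assumed mapping.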
Following this evaluation function, the satisfaction score for each first-level indicator of each attraction, as well as the overall satisfaction for each attraction, is calculated in turn. To enhance readability, the obtained scores are converted into 5-point ratings using linear mapping, aligning with the five-point scoring system commonly employed by major tourism websites. The calculation results are presented in Table 12.

5 CONCLUSION
This paper introduces a comprehensive visitor satisfaction evaluation system for Macau attractions, leveraging online reviews. The methodology involves web crawling to acquire review data from a reputable platform, careful selection of factors influencing review validity, and a scoring and ranking mechanism for review validity. High-quality reviews are identified by applying an upper-quartile threshold to the validity score, and sentiment analysis of these reviews supports an evaluation function designed to assess tourist satisfaction with key attractions in Macau.
The reviews scrutinized in this study are notably more objective, equitable, and comprehensive compared to subjective or biased feedback often found in "water army reviews" or one-sided comments.The outcome of this research furnishes satisfaction ratings from diverse tourists for the same attractions.This information

Figure 1: Research flow of this paper

Figure 2: A comment from an attraction. The user in this image gave it a rating of 5, meaning superb. This review was posted on 26 September 2023 in Shanghai, China. In the lower right corner is the number of likes for it, which is 1. The comment reads "Macau Tower, located in Nam Van's reclaimed area, boasts a floor area of 13,000 sq m and a height of 338m, making it the world's 10th tallest observation tower. 58 floors offer breathtaking views of Macau, Zhuhai, and Hong Kong. Accessible via buses 9A, 18, 23, 32, tickets cost 230 patacas, discounted for seniors over 65. No free entry policy."

Figure 5: Distribution of comment validity scores

Figure 6: Distribution of quality reviews by attraction

4.1.1 Indicator Extraction. The process of indicator extraction consists of three steps:

4.1.2 Calculation of Weights. With $w_i$ denoting the weight assigned to the $i$-th first-level indicator feature, $w_i$ is defined by Equation (8):

$$w_i = \frac{\text{sum of TF-IDF values of all secondary features under the } i\text{th first-level feature indicator}}{\text{sum of TF-IDF values of all secondary features}} \tag{8}$$

Next, the sum of TF-IDF values for all secondary features under each of the three first-level feature indicators is computed, along with the corresponding weights $w_i$. The calculated results are presented in Table 11.
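Equation (8) amounts to normalizing per-indicator TF-IDF sums, as in the short sketch below. The function name and the example values are illustrative assumptions, not figures from Table 11.

```python
def first_level_weights(tfidf_by_indicator):
    """Weight of each first-level indicator from its secondary-feature TF-IDF values.

    tfidf_by_indicator maps an indicator name to the list of TF-IDF values
    of the secondary features grouped under it.
    """
    sums = {name: sum(vals) for name, vals in tfidf_by_indicator.items()}
    total = sum(sums.values())
    return {name: s / total for name, s in sums.items()}
```

By construction the three weights sum to 1, which matches the constraint placed on the weights in the satisfaction assessment function.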

$\sum_{i=1}^{3} w_i = 1$, and $S_i$ denotes the visitor satisfaction corresponding to the $i$-th first-level indicator. The components of the formula are further defined as follows:

Table 1: Ranking of Travel Websites

Table 2: Selected themes and their keywords

Table 3: Probability distribution of topics corresponding to some of the comments

Table 4: Information on Net Assessment Data Fields

Table 5: Descriptive analysis of data

Table 7: Composite scores for validity of comments

Table 8: Evaluation metrics for SVM and XGBoost

Table 9: Evaluation metrics for stacking and XGBoost

Table 11: Sum of Primary feature TF-IDF values and their weights

Table 12: Macao Attractions and their Satisfaction Ratings

valuable initial insights into Macau's attractions. Simultaneously, it serves as a crucial reference for tourism professionals and relevant authorities to comprehend user experiences and opinions across various attractions in Macau, contributing to informed decision-making and enhancing the overall tourism experience in the region.