Using Twitter Data to Evaluate Pangolin Conservation Awareness

Pangolins are the most heavily trafficked species in the world. Despite this, they remain relatively unknown compared to other species also at risk from over-exploitation. Public awareness of the species is essential for developing successful conservation initiatives, so understanding who discusses pangolins and how they feel about the species is crucial. This paper introduces a method for analyzing social media content on Twitter to determine sentiment and identify demographics of interest to conservationists. We used Latent Dirichlet allocation (LDA) for topic modeling, together with sentiment analysis, to measure awareness and further understand the discourse surrounding pangolins. Our results reveal only one significant group of individuals tweeting about pangolins: accounts that mention "animal", "love", and "conservation" in their bios were primarily the ones posting content about the species. When analyzing the sentiment of tweets mentioning pangolins, we found a positive relationship between certain outreach measures, such as nature documentaries and the advent of World Pangolin Day, and sentiment toward the species. The analysis also revealed the impact of COVID-19, which effectively erased the progress made in increasing positive sentiment toward pangolins over the seven years prior.


INTRODUCTION
Pangolins are critically endangered and threatened with extinction due to demand for their meat, scales, and perceived medicinal properties in traditional Eastern medicine [3]. Despite pangolins being the most trafficked species in the world, they are still relatively unknown compared to other heavily poached species, such as elephants and rhinos [17]. Because awareness has a direct relationship to the success of many conservation initiatives, government agencies, nonprofits, and private organizations invest resources into outreach in an attempt to increase awareness [4,11,13,19].
Social media is a powerful tool that conservation scientists have begun to utilize [5,18]. For example, images posted on social media are used to train computer vision algorithms for species identification, and textual data has been used to identify evidence of the illegal wildlife trade [18,21]. Given that one third of all people worldwide and two thirds of all internet users are active on social media, these platforms are instrumental not only for species monitoring but also for outreach [7,15]. Evaluating social media posts for information on who is talking about the pangolin and what they are saying can help develop impactful outreach initiatives and, ultimately, protect the pangolin.
This paper utilizes Latent Dirichlet allocation (LDA) and sentiment analysis on Twitter data to characterize the discourse surrounding pangolins online. We do this by identifying different groups of people tweeting about pangolins and the sentiment of their tweets over time. Not only is our study the first to explore the intersection of social media, natural language processing, and pangolin conservation, but it is also the first to look into users' "bio" descriptions to define different user groups.

BACKGROUND AND RELATED WORK

Role of Social Media in Conservation Science
Two of the most relevant prior works documented in our literature review are those of Toivonen et al. and Di Minin et al. The authors documented the types of data available on several social media platforms, including Twitter, and quantified the value of each of those data points to conservation scientists. They particularly mention the properties of the user "bio" description and believe the field can provide relevant information on who posts what and when on social media and how opinions differ among different groups of people [5,18].

Topic Modeling and Sentiment Analysis on Twitter
Conducting LDA-based topic modeling is a well-documented method for determining trends across several domains, including conservation [8,9,14]. The most notable prior work is that of Ohtani, who conducted LDA topic modeling and sentiment analysis on tweets to measure general awareness and sentiment of biodiversity on Twitter [14]. The author's methodology can be applied to a more specific subset of Twitter data, such as content that references pangolins.

Data Gathering
We utilized Twitter over other social media platforms because Toivonen et al. note that Twitter is the most well-cited platform that still provides API access to textual data [18]. Utilizing the Twitter API, we gathered all tweets mentioning pangolins since Twitter's inception in 2007. Along with each tweet, we also gathered the tweet's author ID, the language of the tweet, and the time the tweet was posted. Using the author IDs from the previous step, we utilized the Twitter API once more to gather the account bios of the users who tweeted about pangolins. To protect the anonymity of the accounts, no usernames or locations were utilized in this study. In total, 2,674,651 tweets mentioning pangolins were gathered, with posting dates between January 2007 and November 2022. Of those, 1,695,541 were written in English. Those 1.7 million tweets were written by 682,522 distinct Twitter users. Of those users, 560,923 had a populated user bio that could be analyzed for topic modeling.

Pre-Processing
Several steps were taken to standardize the textual data before topic modeling was conducted. These steps are important to improve the efficiency and interpretability of the model output [6].
• Case Standardization: All letters are set to lowercase.
• Punctuation Removal: All punctuation and hyperlinks are removed.
• Tokenization: All bios are broken down from sentences into smaller units that a program can work with. These are referred to as tokens.
• Removal of Stopwords: Commonly used words (a, an, for, etc.) are removed. Twitter-specific stop words such as 'rt' and 'png' are also removed.
• Lemmatization: Lemmatization is the process of grouping together the inflected forms of a word so they can be analyzed as a single item (e.g., "run" and "running" are treated as the same word).
• Identification of Bigrams: Identify words that often show up together (e.g., "social media", where "social" and "media" are likely to appear side by side).
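The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the pipeline used in the study: the stop-word set, the `LEMMAS` lookup table (a toy stand-in for a real lemmatizer), the `min_count` threshold, and the function names `preprocess` and `merge_bigrams` are all assumptions made for the example.

```python
import re
from collections import Counter

# Illustrative stop-word list (assumption); a real run would use a full list.
STOPWORDS = {"a", "an", "the", "for", "and", "of", "to", "in", "i", "rt", "png"}

# Toy lemma lookup standing in for a real lemmatizer (assumption).
LEMMAS = {"running": "run", "runs": "run", "animals": "animal", "lovers": "lover"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")


def preprocess(bio):
    """Lowercase, strip links and punctuation, tokenize, drop stop words, lemmatize."""
    text = bio.lower()                  # case standardization
    text = URL_RE.sub(" ", text)        # hyperlink removal
    text = NON_WORD_RE.sub(" ", text)   # punctuation removal
    tokens = text.split()               # whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [LEMMAS.get(t, t) for t in tokens]


def merge_bigrams(docs, min_count=2):
    """Join adjacent token pairs that occur at least min_count times across docs."""
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(zip(doc, doc[1:]))
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2  # consume both halves of the bigram
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs
```

Counting raw pair frequency is the simplest possible bigram criterion; library implementations typically score pairs against the frequencies of their individual words instead, so the threshold here should be read as a placeholder.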

Topic Modeling
Topic modeling is the process of discovering themes in textual data by analyzing the words within texts [2]. We utilized LDA over other models as it is commonly used for topic modeling and has demonstrated success on large data sources gathered from Twitter [8,12,22]. In this study, we utilized Gensim, a popular Python library used for unsupervised topic modeling, to group Twitter users by account bios. Topic models are evaluated using coherence scores, which represent the average similarity between the most frequent words within a topic. We utilized the UMass coherence score as it is reported to have the fastest run times for large datasets [16]. We performed hyperparameter tuning by adjusting the corpus size, alpha, beta, and number of topic clusters. The model with the highest coherence score is considered optimal.
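To make the coherence score concrete, the sketch below computes a UMass-style score for one topic from document co-occurrence counts: for each pair of top words, it takes the log of the smoothed co-document frequency divided by the document frequency of the higher-ranked word, then averages over pairs. It is a standard-library illustration under stated assumptions, not Gensim's implementation; the function name `umass_coherence` and the choice to average over word pairs (rather than sum) are assumptions, and library implementations may differ in such details.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """Average UMass-style coherence of one topic.

    top_words: the topic's words, assumed ordered from most to least probable.
    docs: tokenized documents (lists of tokens).
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents containing every one of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    scores = []
    for l, m in combinations(range(len(top_words)), 2):
        w_l, w_m = top_words[l], top_words[m]  # w_l ranks above w_m
        d_l = doc_freq(w_l)
        if d_l == 0:
            continue  # skip words that never occur in the corpus
        # The +1 smoothing keeps the logarithm defined when the pair never co-occurs.
        scores.append(math.log((doc_freq(w_m, w_l) + 1) / d_l))
    return sum(scores) / len(scores) if scores else 0.0
```

Scores are usually zero or negative, with values closer to zero indicating that a topic's top words tend to appear in the same documents. Selecting the best model then amounts to computing this score for each candidate configuration and keeping the one with the highest value.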

Sentiment Analysis
Sentiment analysis is the process of extracting opinions, emotions, and moods from text [14]. In this study, we utilized TextBlob, a popular natural language processing toolkit that allows users to determine the polarity, or sentiment, of the tweets. Polarity is represented by a score between -1 and 1. We labeled a tweet as negative if its polarity fell in [-1, 0), neutral if its polarity was exactly 0, and positive if its polarity fell in (0, 1] [10,20].
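The three-way labeling rule can be written as a small function. This sketch takes the polarity score as an input and is independent of any library; with TextBlob, that score would typically be read from `TextBlob(text).sentiment.polarity`. The function name `label_polarity` is an assumption made for illustration.

```python
def label_polarity(polarity):
    """Map a polarity score in [-1, 1] to a sentiment label.

    negative: [-1, 0), neutral: exactly 0, positive: (0, 1]
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1, 1]")
    if polarity < 0:
        return "negative"
    if polarity > 0:
        return "positive"
    return "neutral"
```

Because the neutral class is defined as exactly zero, it collects tweets in which the scorer found no sentiment-bearing words at all, rather than tweets with mixed or mild sentiment.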

Topic Modeling
We conducted hyperparameter tuning on the pre-processed user bios and chose the best model by selecting the one with the highest coherence score. The best-performing model produced just one topic cluster. Fig. 1 showcases the common words appearing in the topic cluster, such as "animal", "love", and "wildlife". This indicates that, of the accounts that have posted about pangolins, the ones with populated user bios are very much centered around a love for wildlife and conservation.

Sentiment Analysis
Fig. 2 showcases the number of tweets written about pangolins per year by sentiment. There was a significant increase in the number of tweets written about pangolins with the rise of Twitter users through the 2010s. The peak in 2020 showcases the impact of the COVID-19 pandemic on discourse surrounding the pangolin. The species was briefly believed to be the source of the pandemic, which might be the cause of the flurry of content posted that year.
Fig. 3 showcases the percentage of tweets made each year with each sentiment. Using these results, we can evaluate the impact of particular outreach measures. In 2012, World Pangolin Day, a global movement aimed at raising awareness of the species, was created by Rhishja Cota [17]. Also in 2012, pangolins were mentioned in the UK's most watched nature documentary of the year as one of the top 10 species Sir David Attenborough wished to see saved from extinction [17]. In that year, the share of tweets with positive sentiment reached an all-time high, whereas the share with negative sentiment remained at its lowest level.
In 2014, the Prince of Wales announced a pangolin-themed video game by the creators of Angry Birds. The game, titled "Roll with the Pangolins", was available for one week in November 2014. This outreach initiative could have contributed to the increased positive sentiment and decreased negative sentiment compared to the prior year. However, it appears the temporary video game release did not have the same impact as the initiatives set forth in 2017.
In 2015 and 2017, Google featured pangolins as a part of their "Google Doodles", which are changes Google temporarily makes to their logo to commemorate holidays and other celebrations of individuals [17]. These changes are typically interactive: Google users can click on the logo to be taken to more detailed information on the theme of the Google Doodle [17]. Compared to outreach efforts from prior years, Google Doodles did not have as noticeable an impact on sentiment. In 2015, while negative sentiment decreased, positive sentiment remained nearly unchanged from 2014 levels. In 2017, negative sentiment actually increased while positive sentiment decreased from 2016 levels.
In 2016, CITES (Convention on International Trade in Endangered Species of Wild Fauna and Flora) hosted its 17th Conference of the Parties (CoP). This meeting is centered on increasing the awareness of poached species, which includes the pangolin. One outcome of this event was the decision to move pangolins to Appendix I, which is reserved for the world's most endangered plants and animals, such as tigers and gorillas. This decision was highly publicized and led to a significant increase in Google searches on the species [17]. This event also correlates with an increase in positive sentiment and a slight decrease in negative sentiment from 2015 levels.
In 2020, we can see the impact of the circulated theory that pangolins were the source of the COVID-19 pandemic. The positive sentiment toward the species that had been built up by the influx of awareness campaigns from 2012 to 2019 was wiped out. This correlates with a documented increase in the hunting and trading of pangolins in countries such as India [1]. However, the trends in sentiment appear to be returning to pre-pandemic levels.
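The yearly percentages discussed in this section could be derived from labeled tweets along the following lines. This is a minimal sketch, assuming each tweet has already been reduced to a (year, sentiment label) pair; the function name `yearly_sentiment_shares` and the record layout are assumptions for illustration, not the study's code.

```python
from collections import Counter, defaultdict


def yearly_sentiment_shares(records):
    """Return {year: {label: percent of that year's tweets}}.

    records: iterable of (year, label) pairs, one per tweet.
    """
    counts = defaultdict(Counter)
    for year, label in records:
        counts[year][label] += 1

    shares = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        # Normalizing within each year makes years with very different
        # tweet volumes comparable.
        shares[year] = {label: 100.0 * n / total for label, n in counter.items()}
    return shares
```

Normalizing within each year matters here because tweet volume grew sharply over the study period; raw counts alone would mostly reflect the growth of the platform and the 2020 spike rather than shifts in sentiment.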

FUTURE WORK
The results of this study unearthed many future avenues of investigation. The first is to analyze the sentiment of tweets by additional demographic stratifications. Understanding how sentiment varies by nationality, location, gender, or profession could further subdivide the population of individuals to determine audiences for targeted outreach approaches.
Another future avenue of study is analyzing the subjectivity, or the amount of personal opinion versus factual information, contained within tweets. Given the impact COVID-19 had on the content posted about pangolins, understanding the shift in amounts of factual vs. opinion-based content could help combat misinformation about an already troubled species.
Lastly, we plan on expanding this methodology to other heavily poached species, namely the rhino. Comparing sentiment and topic clusters between the pangolin and rhino could lend insights into how sentiment and awareness vary based on how identifiable the species is without global awareness campaigns.

CONCLUSIONS
In this study, we conducted topic modeling using Latent Dirichlet allocation (LDA) and sentiment analysis to evaluate pangolin awareness and sentiment on Twitter. Our results show that the Twitter users posting about pangolins are individuals who have a documented interest in animals, wildlife, and conservation. Additionally, sentiment toward pangolins is heavily influenced by outreach measures. TV documentaries and awareness campaigns such as World Pangolin Day have the largest impact on positive sentiment. However, the COVID-19 pandemic greatly increased negative sentiment toward the species. This underscores the importance of outreach measures focused on combating misinformation about the species' involvement in the start of the pandemic. Conservationists can utilize these data points to expand the number of individuals who are aware of the species in a positive way, with the goal of protecting the pangolin from extinction.

Figure 1: Common Words within Twitter User Bio

Figure 2: Number of Tweets Posted per Year by Sentiment

Figure 3: Percentage of Tweets Posted per Year by Sentiment