Decoding YouTube's Recommendation System: A Comparative Study of Metadata and GPT-4 Extracted Narratives

YouTube's recommendation system is integral to shaping user experiences by suggesting content based on past interactions using collaborative filtering techniques. Nonetheless, concerns about potential biases and homogeneity in these recommendations are prevalent, with the danger of leading users into filter bubbles and echo chambers that reinforce their pre-existing beliefs. Researchers have sought to understand and address these biases in recommendation systems. However, such research has traditionally relied primarily on metadata, such as video titles, which does not always encapsulate the full content or context of the videos. This reliance on metadata can overlook the nuances and substantive content of videos, potentially perpetuating the very biases and echo chambers that the research aims to expose. This study advances the examination of sentiment, toxicity, and emotion within YouTube content by conducting a comparative analysis across recommendation depths, comparing video titles with narratives extracted using GPT-4. Our analysis reveals a clear trend in sentiment, emotion, and toxicity levels as the depth of content analysis increases. Notably, there is a general shift from neutral to positive sentiments in both YouTube video titles and narratives. Emotion analysis indicates an increase in positive emotions, particularly joy, with a corresponding decrease in negative emotions such as anger and disgust in narratives, while video titles show a steady decrease in anger. Additionally, toxicity analysis presents a contrasting pattern: video titles display an upward trend in toxicity, peaking at the greatest depth analyzed, whereas narratives exhibit a high initial toxicity level that sharply decreases and stabilizes at lower depths. These findings suggest that the depth of engagement with video content significantly influences emotional and sentiment expressions.


INTRODUCTION
Have you ever contemplated the significant amount of time you invest in exploring YouTube, particularly when immersed in videos recommended to you? YouTube reports that a staggering 70% of users' time on the platform is dedicated to consuming suggested content. This prominence of recommendations is driven by a multibillion-dollar recommendation system, contributing to an average mobile viewing session lasting more than 40 minutes. The recommendation system operates as a continuous loop: as a user engages with a video, a dynamically curated selection of additional content is presented for the user to choose from. Upon making a selection, the process enters a second loop, introducing the user to a fresh set of recommendations, thereby creating an ongoing cycle of suggested content. With a colossal global user base of 2.7 billion [8], YouTube holds significant influence, with a quarter of its American users relying on the platform as their primary source of information [7]. This prompts a compelling inquiry: does YouTube's recommendation system play a pivotal role in shaping an individual's narrative?
Originally conceived as a platform for video sharing, YouTube's evolution has aimed at enhancing user engagement and attracting a massive audience. Features like "leanback" fueled its growth, while the integration of artificial intelligence, particularly Google Brain, marked a substantial leap forward in enriching the user experience. These algorithms are frequently linked to biases, including selection bias [20], position bias [1,18], and popularity bias [6,11]. Recommendation platforms have been implicated in exhibiting patterns that guide users towards highly uniform content, giving rise to phenomena such as filter bubbles and echo chambers. In these scenarios, users become secluded from diverse content and are instead presented with a more limited range of information. This situation poses the risk of reinforcing specific viewpoints [2,3,12].
Numerous studies have delved into YouTube's recommendation system, primarily focusing on metadata such as video titles and descriptions. While this approach provides a foundational understanding, it falls short in capturing the nuanced narratives embedded within videos. The constraints of video titles often lead creators to opt for concise highlights, potentially overlooking the full narrative. This inclination towards neutral or limited descriptions introduces complexity, as videos may delve into broader topics than initially implied by the title. Additionally, titles are often crafted to be sensational, enticing users to click on the video, as seen in Figure 1.
Figure 1: The video illustrates the discrepancy between titles and content, with a title suggesting preparations for "total war" between the US and China. However, it primarily explores the historical relationship between the two nations. This example highlights the importance of context over metadata and demonstrates how sensational titles can misrepresent the less controversial nature of the content.
Continuing this exploration, the paper delves into the intricate South China Sea dispute, interweaving diasporic concerns, foreign policy influence, and global economic impacts. The objective is to unravel nuanced factors within the actual narrative content, with a specific focus on toxicity and emotion. This analysis seeks to reveal distinctions that emerge when transcending the limitations of titles and descriptions, exploring how factors indicative of shifts in recommended content types evolve.
To extract narratives from YouTube videos, we leverage one of the largest language models available today, GPT-4. This empowers us to obtain abstractive summaries of the content, framed as narratives. The subsequent sections of the paper are organized as follows: Section 2 discusses related work, followed by Section 3 detailing the methodology used in conducting the study. Results and conclusions are presented in subsequent sections.

RELATED WORK
In this section, we discuss prior research pertaining to our study, encompassing examinations of toxicity assessment, emotion detection, and bias analysis within recommendation systems. Extensive research has been conducted on recommendation bias, particularly in realms such as radicalization and the dissemination of misinformation and disinformation [14]. Previous studies have explored the formation of homophilous communities within suggested content, like videos, and scrutinized the factors contributing to their emergence. Analyses have uncovered coordinated activity among YouTube commenters, potentially influencing engagement levels on specific videos [13,15]. Insights gleaned from these investigations have been instrumental in identifying patterns of homogeneity, the development of interconnected communities, and potential biases in recommendation systems.
The concept of "drift" is a methodology widely employed by researchers to study how content evolves over time. O'Hare et al. [10] analyzed a sentiment-annotated corpus to discern topic drift among textual documents. Liu et al. [17] developed an LDA (Latent Dirichlet Allocation)-based method for detecting topic drift in micro-blog posts. Another framework, by Akila et al. [16], focused on identifying the mood of India by analyzing real-time Twitter posts. The study visualized emotion trends across the country within a specified date range (4 April 2020 to 4 May 2020) using line graphs and radar maps, with the objective of understanding regional emotional shifts in relation to reported COVID-19 cases. In our research, we apply drift analysis techniques to evaluate both emotion and toxicity, aiming to discern patterns of bias in YouTube's recommendation algorithm. By integrating assessments of both emotion and toxicity, we adopt a comprehensive approach to grasp the nature and impact of video recommendations by YouTube's algorithm.
Beyond bias and drift, prior work has also examined how narratives can be extracted from content. Researchers have extensively explored Computational Narratology, which examines narratives from a computational and information-processing perspective. It emphasizes algorithmic processes in narrative creation and interpretation, involving formal and computable representations. The study of natural language text narratives typically involves six main stages: pre-processing, parsing, identification and extraction of narrative components, linking components, representation of narratives, and evaluation [4]. The advent of pretrained large language models, such as GPT-3, has transformed these processes, allowing models to discern key characteristics and perform various roles across domains without additional training data beyond a prompt. Innovative approaches, like trainable continuous prompt embeddings [9], have significantly enhanced the accuracy of models like GPT and BERT by 80%. Recent studies [19] have introduced methods for understanding figurative language in both discriminative and generative tasks, bridging the gap between model performance and human understanding.

METHODOLOGY

Data Collection
YouTube's video recommendations rely heavily on a user's watch history, resulting in a personalized algorithmic selection of suggested videos. To mitigate this personalization bias and ensure experimental control, we implemented several precautionary measures: (1) the video collection script prevented account login for each watch session; (2) a new browser instance was initiated for each recommendation depth; (3) cookies from previous recommendation depths were cleared to enable a fresh search for videos at the next depth.
The research data comprised videos from YouTube's 'watch-next' panel, collected following the procedure above. In the data collection process, we conducted workshops with subject matter experts to generate keywords related to the South China Sea Dispute. These keywords were used as search queries on YouTube's search engine, producing seed videos. Recommendations were gathered for each seed video across multiple depths, with videos at each depth serving as parents for the recommendations at the next. This process yielded a dataset comprising 9372 videos spanning three tiers of depth. We began with the top 75 most viewed videos, with each subsequent tier increasing by a factor of five.
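The depth-wise collection described above can be sketched as a breadth-first crawl. In this illustrative sketch, `get_watch_next` is a hypothetical injected function (not part of the study's actual scraper) that would, in the real pipeline, drive a fresh browser session with cleared cookies and return the 'watch-next' panel for one video:

```python
from typing import Callable, Dict, List

def crawl_recommendations(
    seeds: List[str],
    get_watch_next: Callable[[str], List[str]],
    max_depth: int = 3,
    per_video: int = 5,
) -> Dict[int, List[str]]:
    """Breadth-first crawl of the 'watch-next' panel.

    Videos collected at depth d become parents for depth d+1, so each
    tier grows by roughly a factor of `per_video`.
    """
    tiers: Dict[int, List[str]] = {0: list(seeds)}
    for depth in range(1, max_depth + 1):
        tier: List[str] = []
        for parent in tiers[depth - 1]:
            # In the real pipeline this fetch happens in a new browser
            # instance with cookies cleared, to avoid personalization.
            tier.extend(get_watch_next(parent)[:per_video])
        tiers[depth] = tier
    return tiers
```

With 75 seeds and five recommendations kept per video, each tier is five times the size of the previous one, matching the tiered growth described above.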

Transcript Generation
For transcription, we used Python's multiprocessing library to parallelize transcript collection from YouTube. The pipeline uses YouTube's Transcript API to extract YouTube-generated transcripts and, for videos lacking native YouTube transcriptions, falls back to OpenAI's Whisper model to generate them [5].
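The fallback logic can be sketched as follows. The two fetchers are injected as hypothetical callables (they are not the study's actual code): `fetch_native` would wrap the YouTube Transcript API and return `None` when no caption track exists, and `whisper_transcribe` would download the audio and run Whisper:

```python
from typing import Callable, Optional

def get_transcript(
    video_id: str,
    fetch_native: Callable[[str], Optional[str]],
    whisper_transcribe: Callable[[str], str],
) -> str:
    """Return a transcript, preferring YouTube's own captions.

    Falls back to a Whisper transcription only when no native
    caption track is available.
    """
    native = fetch_native(video_id)
    if native is not None:
        return native
    return whisper_transcribe(video_id)

# The real pipeline would map a worker like this over thousands of
# video IDs with multiprocessing.Pool to speed up collection.
```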

Narrative Extraction
While most researchers work on YouTube's metadata to mine the opinions inside it, we went a step further and extracted narratives from the YouTube video transcripts. Since YouTube videos range from brief clips to many hours of content, it is difficult to extract emotions directly from their transcripts. To recover the embedded stories behind these videos, we take advantage of large language models like GPT-4 ("gpt-4-0125-preview"), which can take up to 128k tokens, and incorporate prompts to extract narratives from our collected video transcripts. As part of the narrative extraction process, we set the temperature parameter to 0 so that the output generated by the model is deterministic. In addition, we keep the frequency and presence penalties at 0, which minimizes the production of repetitive token sequences in the generated narratives. Lastly, to get a concise yet informative narrative, we set the    parameter to 25.
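The sampling settings above can be sketched as a request payload for the chat-completions API. This is an illustrative helper, not the study's code, and the prompt wording is a placeholder rather than the exact prompt used:

```python
def narrative_request(transcript: str, model: str = "gpt-4-0125-preview") -> dict:
    """Build a chat-completion payload for narrative extraction.

    temperature=0 makes the output deterministic; zero frequency and
    presence penalties avoid discouraging or encouraging repetition
    beyond the model's defaults.
    """
    return {
        "model": model,
        "temperature": 0,
        "frequency_penalty": 0,
        "presence_penalty": 0,
        "messages": [
            {
                "role": "system",
                # Placeholder prompt; the study's actual prompt may differ.
                "content": "Extract the underlying narrative of this video "
                           "transcript as a concise abstractive summary.",
            },
            {"role": "user", "content": transcript},
        ],
    }
```

A payload like this would then be passed to the OpenAI client's chat-completion call.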

Sentiment Analysis
RoBERTa-based sentiment analysis builds on the RoBERTa architecture, a robustly optimized variant of BERT. The model is trained on large datasets of labeled text examples to recognize the sentiment expressed in an input. Because RoBERTa captures contextual information, it enables accurate sentiment classification into categories such as positive, negative, or neutral, and its effectiveness in capturing context has led to widespread adoption in applications such as social media monitoring.
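Once each title or narrative has a sentiment label, the drift across recommendation depths reduces to per-depth label proportions. A minimal sketch of that aggregation (the classifier itself is assumed to have already produced the labels):

```python
from collections import Counter
from typing import Dict, List

def sentiment_drift(
    labels_by_depth: Dict[int, List[str]],
) -> Dict[int, Dict[str, float]]:
    """Turn per-video sentiment labels into per-depth proportions.

    `labels_by_depth` maps a recommendation depth to the list of
    'positive' / 'neutral' / 'negative' labels assigned at that depth;
    plotting the proportions against depth gives the drift curves.
    """
    drift: Dict[int, Dict[str, float]] = {}
    for depth, labels in sorted(labels_by_depth.items()):
        counts = Counter(labels)
        total = len(labels)
        drift[depth] = {
            label: counts[label] / total
            for label in ("positive", "neutral", "negative")
        }
    return drift
```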

Toxicity Assessment
Detoxify, developed by Unitary AI and available at https://github.com/unitaryai/detoxify, employs transformer-based language models trained on large corpora of annotated comments to assess whether a given text might be perceived as "toxic" in a discussion. Given a text input, Detoxify produces probability scores ranging from 0 to 1; higher values indicate a greater likelihood of the text being labeled as "toxic," and a toxicity score of 0.5 or greater places the text in the "toxic" category. The model returns toxicity scores across seven categories: overall toxicity (1), severe toxicity (2), obscenity (3), threats (4), insults (5), identity attacks (6), and sexually explicit content (7). The rationale for utilizing Detoxify lies in its role as an open-source Python library designed to identify harmful and inappropriate online text.
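The 0.5 cut-off described above can be applied as a simple thresholding step over Detoxify's per-category scores. This sketch assumes the score dictionary has already been produced (e.g., by `Detoxify("original").predict(text)`):

```python
from typing import Dict

def label_toxicity(scores: Dict[str, float], threshold: float = 0.5) -> Dict[str, bool]:
    """Apply the 0.5 cut-off to per-category toxicity probabilities.

    A score at or above the threshold marks the text as falling
    under that category.
    """
    return {category: score >= threshold for category, score in scores.items()}
```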

Emotion Assessment
We examined the emotional content of the video text data, encompassing titles and narratives, focusing on seven emotions: anger, disgust, fear, joy, neutral, sadness, and surprise. Emotion drift was employed to discern emotional bias across different recommendation depths. The range of emotions in the content was visualized on a line graph, with each point on the depth axis indicating a traversed depth of video recommendations. To enhance result accuracy, we utilized a fine-tuned transformer model, Emotion English DistilRoBERTa-base, for the Natural Language Processing (NLP) task.
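The emotion-drift curves follow the same pattern as the sentiment aggregation: average the classifier's per-class probabilities at each depth. A minimal sketch, assuming the per-video score dictionaries have already been produced by the emotion classifier:

```python
from typing import Dict, List

EMOTIONS = ("anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise")

def emotion_drift(
    scores_by_depth: Dict[int, List[Dict[str, float]]],
) -> Dict[int, Dict[str, float]]:
    """Average per-video emotion scores at each recommendation depth.

    Each inner dict holds the seven class probabilities assigned to one
    title or narrative; the per-depth means are what the line graphs plot.
    """
    drift: Dict[int, Dict[str, float]] = {}
    for depth, score_dicts in sorted(scores_by_depth.items()):
        n = len(score_dicts)
        drift[depth] = {
            emo: sum(d[emo] for d in score_dicts) / n for emo in EMOTIONS
        }
    return drift
```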

RESULTS

Sentiment Analysis:
The sentiment analysis of YouTube video content indicates that titles tend to maintain a neutral tone, with a gradual shift toward the positive at greater recommendation depths. This may be deliberate, designed to attract a diverse audience while keeping the more emotionally charged narratives under wraps. In contrast, narratives show a significant progression from neutral to decidedly positive sentiment as depth increases, highlighting a level of emotional depth and complexity not evident in the titles alone.

Emotion analysis:
Figure 4 and Figure 5 show the emotion analysis of YouTube narratives and video titles, respectively, indicating a discernible escalation in the prevalence of joy as recommendation depth increases. This trend suggests an intentional design within the content to elicit progressively positive emotional responses from viewers. Conversely, there is a marked attenuation in the expression of negative emotions such as anger, disgust, and sadness, which could imply a strategic effort to foster and sustain positive engagement throughout the viewer's experience.
A comparative evaluation reveals a contrast in the emotional trajectories of video titles and narratives. Titles demonstrate initially higher levels of negative emotions, particularly disgust, which may reflect a tactical use of sensationalism to captivate potential viewers at the outset. As depth grows, both titles and narratives align in their emotional direction, veering towards increased positivity. This alignment underscores a deliberate emotional modulation in content strategy, with an observable transition from an initial negative skew towards a positive emotional tenor as the content narrative deepens.

Toxicity analysis:
As we delve deeper into the recommendation depths, video titles show fluctuating toxicity scores that ultimately reveal an upward trend. The generated narratives, by contrast, show a noticeable reduction in toxicity. This pattern, shown in Figure 6, suggests that YouTube's recommendation algorithm promotes progressively less toxic content as we move further down the depths.

CONCLUSION
In summary, the research findings emphasize a critical oversight in current analytical practices: relying solely on metadata, such as video titles, for sentiment and toxicity analysis may lead to incomplete or inaccurate assessments of YouTube content. The study's comparative analysis distinctly shows that while video titles may offer preliminary insight into the content's nature, they do not fully represent the emotional depth and toxicity patterns that narratives reveal.
The emotional trajectory within video narratives indicates a marked increase in positive sentiments, particularly joy, which correlates with a decrease in toxicity levels. This contrasts with the less consistent and less emotionally informative nature of video titles. Titles, though useful for capturing initial viewer interest, exhibit a weaker and more variable relationship with toxicity, often failing to reflect the deeper sentiment trends present in the narrative content.
These findings advocate for a shift in research methodologies, suggesting that narratives should be integrated into the analytical framework for a more nuanced and accurate understanding of video content. By doing so, researchers will likely capture a more comprehensive picture of the sentiment and toxicity landscape on YouTube, leading to insights that could better inform platform moderation and even algorithmic recommendation systems.

Figure 2: The figure showcases the sentiment trends for YouTube's video titles in different recommendation depths.

Figure 3: The figure showcases the sentiment trends for YouTube's video narratives in different recommendation depths.

Figure 4: The figure showcases the emotional trends for YouTube's video narratives in different recommendation depths.

Figure 5: The figure portrays the emotion drifts for YouTube's video titles over the depths.

Figure 6: The figure illustrates the drift of toxicity on both the YouTube video titles and narratives over the recommendation depths.