BroadcastSTAND: Clustering Multimedia Sources of News

News reaches a variety of audiences through any number of mediums, from traditional newspaper publications to social media posts to radio and TV broadcasts. These varied mediums present valuable research opportunities to gain insights into emerging trends. We present BroadcastSTAND, an extension framework for the NewsStand architecture, which traditionally focuses on online news articles and Twitter posts. Our objective is to seamlessly integrate a new genre of news data: radio and TV broadcasts. We show how broadcast transcripts fit into the existing clustering landscape of traditional news data and present key insights that will drive future research in aggregating various news sources to increase the dimensions across which analysis can be done. In particular, we demonstrate the value of clustering these various sources of news data and identify key pitfalls that must be addressed to obtain good clustering results.


INTRODUCTION
In today's media-rich landscape, news takes on various forms and spreads through a multitude of channels, from traditional newspapers to social media posts and the airwaves of radio and TV. The constant flow of news through these diverse mediums presents valuable research opportunities in our information-hungry society. These data sources can be leveraged to gain insights into emerging trends, public sentiment, and geographical variations in news coverage. We present an extension framework to the current NewsStand architecture, which has mainly focused on online news articles [10, 18–20, 24], tweets [3, 5, 6, 8, 21], photos [17], and music lyrics [9]. We seek to broaden the scope of this architecture by seamlessly incorporating a new genre of news data: radio and TV broadcasts. This expansion covers a more diverse spectrum of news content and bridges a gap in the existing architecture. To achieve this, we gathered an extensive corpus of broadcast news transcripts from PBS NewsHour, spanning a wide spectrum of topics and events, as a source of broadcast news information. Then, by harnessing the existing infrastructure of the NewsStand architecture, we extended the framework to include processing and indexing mechanisms specifically designed for broadcast news data. Our overarching objective was to assess the viability of incorporating broadcast news data within the NewsStand framework.
We conducted a preliminary clustering analysis on the transcribed content to identify coherent clusters representing news topics or events at a specific location. The results were unexpected and diverged markedly from our initial expectation of disparate clusters representing distinct news topics or events. Rather, the clustering algorithm consistently grouped all broadcast news transcripts into a single cluster. This result prompted us to critically reassess the effectiveness of our clustering approach when applied to broadcast news content within the NewsStand architecture. In response to these findings, we introduce BroadcastSTAND, an extension that integrates radio and television broadcast transcripts into the existing NewsStand landscape. In this demonstration, we show how these transcripts fit into the established clustering landscape of traditional news data and present key insights that will drive future research in aggregating various news sources to increase the dimensions across which analysis can be done. In particular, we demonstrate the value of clustering these various sources of news data and identify key pitfalls that must be addressed to obtain good clustering results.
In today's ever-changing media landscape, broadcast news data is becoming increasingly vital, particularly in light of the surging popularity of podcasts. Podcasts have emerged as a convenient means to stay well-informed, especially during daily commutes. Additionally, an increasing number of individuals are turning to alternative platforms like YouTube channels for their news updates. These YouTube channels often feature news content delivered in audiovisual formats with captions. This shift in news consumption highlights the need for NewsStand to seamlessly integrate content from both traditional radio and online broadcast platforms for the organization and delivery of news content. The integration of broadcasts and podcasts as primary information sources holds pivotal importance in today's dynamic media environment. The emergence of podcasts, accompanied by the growing availability of online news programs, has breathed new life into radio news data.
The inclusion of radio broadcast data within the NewsStand architecture broadens the spectrum of news sources available to users. This expansion not only diversifies the range of news perspectives but also enriches the depth of news coverage, ensuring a more comprehensive understanding of current events. Moreover, radio broadcasts offer a unique audio-based format that introduces a distinctive dimension to news consumption. By incorporating radio data into the NewsStand architecture, users can access news content in an auditory format, catering to those who prefer listening to news rather than reading. This inclusive approach ensures that the NewsStand system accommodates a broader range of user preferences, thereby enhancing the overall user experience. Furthermore, radio broadcasts frequently feature interviews, discussions, and expert analyses that offer valuable insights beyond the scope of written articles. Through the integration of radio broadcast information, the NewsStand architecture provides users with access to rich and in-depth commentary, thereby deepening their understanding and engagement with news topics. Lastly, the digital transformation of radio broadcasting has made it easier to access and analyze radio content. Digital archives and advancements in audio processing techniques allow for efficient extraction and integration of radio broadcast data within the NewsStand architecture. Leveraging existing content analysis and clustering methods developed for textual news, the system can adapt and incorporate broadcast data seamlessly.

TRADITIONAL NEWS PROCESSING
NewsStand [24] is a framework designed for news organization and retrieval. The key functionality of NewsStand is to provide users with a personalized and dynamic news browsing experience. The framework incorporates techniques such as content analysis, clustering, and recommendation algorithms to organize news articles based on their content and user preferences. The data processing pipeline of the NewsStand architecture involves collecting news articles from various sources, extracting key metadata, and performing content analysis to identify important entities, topics, and events. The clustering module groups related articles into coherent clusters, enabling users to explore different perspectives and in-depth coverage of specific news topics. As a whole, the NewsStand architecture presents an innovative approach to news organization and retrieval. It showcases the importance of personalized and contextually relevant news browsing experiences and illustrates how the framework addresses these requirements.

Preprocessing
The NewsStand dataset we use in this study is derived from a traditional news aggregation pipeline that is set up in stages, allowing articles to be processed and ingested in real time. The preprocessing of the articles, known as geotagging, proceeds in four stages: Entity Feature Extraction, Gazetteer Record Assignment, Geographic Name Disambiguation, and Geographic Focus Determination [24]. The first stage involves identifying important entities in the text and collecting them in an entity feature vector (EFV). This is accomplished using a combination of Part-Of-Speech (POS) tagging and statistical Named-Entity Recognition (NER) tagging [25]. The NER tagger is from the LingPipe toolkit, which was trained on the Brown corpus and additional news data. Once extracted, the EFV contains words belonging to proper noun classes like location, organization, and person. Since location entities are particularly relevant for geotagging, those are marked as geographic features in the EFV and then assigned a set of matching locations during the Gazetteer Record Assignment stage. The toponyms are resolved during the Geographic Name Disambiguation phase [11–13, 16, 22].

Clustering
Clustering techniques are crucial in grouping news articles by their thematic relevance and temporal aspects, ensuring relevant content is readily accessible and up-to-date. Numerous techniques tailored for news clustering have been introduced. These techniques include a variant of the k-means algorithm that leverages WordNet hypernyms [1], topic modeling [7, 26], and density-based clustering [4]. NewsStand, for instance, employs the vector space model [15]. Each article is transformed into a term feature vector based on its term frequencies [14]. Owing to NewsStand's online nature, each article's term feature vector is computed upon its ingestion into the system, and the clustering is also done in an online fashion using a variant of leader-follower clustering [2]. The articles undergo clustering based on both the term vector space and the temporal dimension. Each cluster possesses a term centroid and a time centroid, indicative of the average term feature vector and mean publication time of its articles, respectively. For each new article, the cosine similarity [23] is evaluated to determine if an existing cluster's centroid falls within a cutoff distance from the article's term and time attributes. If so, the article is assigned to the nearest cluster, and the centroids are updated. Otherwise, a new cluster is initiated for the article alone.
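The online assignment step described above can be illustrated with a minimal sketch. It is not NewsStand's implementation: the `Cluster` class, the `assign` function, and the cutoff values `min_sim` and `max_hours` are hypothetical names and settings chosen only for illustration, and term vectors are represented as plain dictionaries of term weights.

```python
import math
from dataclasses import dataclass, field


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class Cluster:
    term_sum: dict = field(default_factory=dict)  # running sum of term vectors
    time_sum: float = 0.0                         # running sum of timestamps (hours)
    size: int = 0

    def term_centroid(self):
        return {t: w / self.size for t, w in self.term_sum.items()}

    def time_centroid(self):
        return self.time_sum / self.size

    def add(self, vec, timestamp):
        # Updating the running sums is equivalent to updating both centroids.
        for t, w in vec.items():
            self.term_sum[t] = self.term_sum.get(t, 0.0) + w
        self.time_sum += timestamp
        self.size += 1


def assign(clusters, vec, timestamp, min_sim=0.5, max_hours=72.0):
    """Leader-follower step: join the nearest cluster within both cutoffs,
    otherwise start a new cluster. The cutoff values are assumptions."""
    best, best_sim = None, min_sim
    for cluster in clusters:
        if abs(timestamp - cluster.time_centroid()) > max_hours:
            continue  # too far away in time
        sim = cosine_similarity(vec, cluster.term_centroid())
        if sim >= best_sim:
            best, best_sim = cluster, sim
    if best is None:
        best = Cluster()
        clusters.append(best)
    best.add(vec, timestamp)
    return best
```

Because each article is compared only against the current centroids, the procedure needs a single pass over the stream, which is what makes it suitable for online ingestion. The trade-off is that results depend on arrival order and on the chosen cutoffs.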

BROADCASTSTAND
Analyzing data across diverse media forms, especially the constantly evolving and unstructured nature of news data, presents several significant challenges. Obtaining reliable transcripts for radio and TV broadcasts is a challenging task due to various factors. One of the primary difficulties lies in the process of obtaining the transcripts themselves. Unlike text-based news articles, broadcasts do not inherently come with accompanying transcripts. Radio and TV stations may not always provide official transcripts for their broadcasts, making it necessary to find alternative methods of acquiring the spoken content. One common approach is to use automated speech recognition (ASR) systems to generate transcripts automatically. However, auto-generated captions from ASR technology are often not entirely reliable. ASR systems can struggle with accurately recognizing and transcribing speech, especially in the context of broadcasts. The presence of multiple speakers, background noise, accents, and the informal nature of speech can introduce errors and inaccuracies into the transcribed text. Consequently, relying solely on ASR-generated transcripts may result in incomplete or unreliable representations of the original audio content. There are also inconsistencies in how different broadcast stations or websites embed their transcripts, if they are available at all. As a result, obtaining accurate and consistent transcripts often requires manual intervention and adaptation to specific website structures. For the purposes of this study, we leverage PBS NewsHour transcripts, which are readily available in a format that requires relatively little upfront processing to extract the main content. PBS NewsHour also stands out as a credible and accessible news source that is broadcast on over 350 PBS member stations and networks, making it a good choice for demonstration purposes.

METHOD AND RESULTS
We incorporated PBS NewsHour transcripts into the NewsStand processing pipeline and analyzed the resulting news clusters. We gathered transcripts of radio and TV broadcasts from PBS NewsHour and preprocessed the text data to reduce noise and standardize the text. Preprocessing steps included converting text to lowercase, removing punctuation, stop words, and special characters, and performing stemming. The preprocessed text data were then converted into TF-IDF (Term Frequency-Inverse Document Frequency) vectors, capturing the importance of each word in each document. Finally, the pairwise cosine similarity was computed between the TF-IDF vectors of the documents to measure document similarity, where higher values indicate greater similarity. Following this process, we observed three types of clusters from the NewsStand corpus, listed below.
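The preprocessing and similarity steps described above can be sketched as follows. This is an illustrative approximation rather than the pipeline used in the study: the stop-word list is a tiny hypothetical sample, `stem` is a toy suffix stripper standing in for a real stemmer, and the smoothed IDF weighting is one common variant chosen here as an assumption.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "on", "for"}


def stem(token):
    """Toy suffix stripper; a real pipeline would use a proper stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, drop punctuation and special characters, remove stop words, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per document."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Relative term frequency times a smoothed inverse document frequency.
            t: (c / len(tokens)) * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
            for t, c in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors; higher means more similar."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Given a list of transcripts `docs`, the pairwise similarities are obtained by calling `cosine_similarity` on each pair of vectors returned by `tfidf_vectors(docs)`. Two documents that share no terms after preprocessing score exactly 0.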
• Broadcast-Only Cluster: This category of cluster contains data points that exclusively belong to broadcast data.
• Mixed Cluster: This category of cluster contains data points from both broadcast and non-broadcast data sources.
• Non-Broadcast Cluster: This category of cluster contains no data points relating to broadcast data.
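The three categories above amount to a simple rule over the sources of a cluster's members. The sketch below assumes each cluster is given as a non-empty list of source labels and that broadcast items carry the label `"broadcast"`. Both the label and the function names are hypothetical.

```python
from collections import Counter

BROADCAST = "broadcast"  # assumed source label for broadcast transcripts


def cluster_type(sources):
    """Classify one cluster from the source labels of its members."""
    has_broadcast = any(s == BROADCAST for s in sources)
    has_other = any(s != BROADCAST for s in sources)
    if has_broadcast and has_other:
        return "mixed"
    return "broadcast-only" if has_broadcast else "non-broadcast"


def summarize(clusters):
    """Count clusters of each type; `clusters` maps cluster id -> source labels."""
    return Counter(cluster_type(sources) for sources in clusters.values())
```

For example, a cluster whose members are labeled `["broadcast", "article"]` is classified as mixed, while a singleton `["broadcast"]` is broadcast-only.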
Given the high dimensionality of the TF-IDF vectors, we first employed Principal Component Analysis (PCA), followed by t-Distributed Stochastic Neighbor Embedding (t-SNE), for visualization. As depicted in Figure 1, this visualization reveals a clear spatial separation among these clusters, with 2000 broadcast files and 2000 non-broadcast clusters randomly sampled from the NewsStand corpus. The broadcast-only clusters are prominently located within a specific region, while the non-broadcast clusters are noticeably distant from this region. Specifically, only 5 mixed clusters were observed, underscoring the difference between broadcast and print articles. Of the 2000 files that were assigned a cluster in NewsStand, 1995 were isolated in clusters of size 1, meaning they were not deemed similar enough to any previous article to represent the same news story. These distinctive clustering patterns show that the clustering of the broadcast data differs greatly from the clustering landscape in the NewsStand corpus.
We evaluated the quality of the clustering results using precision and recall metrics. We hand-labeled a sample of 300 radio transcripts prior to ingesting them into the NewsStand architecture. These labels were assigned by a single annotator using a series of keyword terms to represent the specific topic of the news event discussed in the transcript. Any number of topic tags were assigned to each transcript, depending on the length and content, as appropriate. Using these ground-truth labels, we then compared NewsStand's clustering of these data points to compute the precision and recall of the clustering algorithm in the noisy NewsStand environment. To compute precision, we examined NewsStand's broadcast-only and mixed clusters. Data points NewsStand grouped into the same cluster were considered "positives," and true positives (TP) were identified by finding the largest clusters with shared manual keywords. Clusters formed using manual annotations represent "all relevant instances" (True Positives + False Negatives). To compute recall, we determined the size of the largest NewsStand cluster for each hand-labeled keyword tag. For example, in the "King Charles" cluster, the maximum cluster size was 1, although 3 transcripts were manually tagged as being news stories about King Charles, resulting in a recall of 33.3%. Figure 2 shows the recall performance for NewsStand's clustering on the broadcast data. The average recall for all the keywords was only 28.46%, while the average precision for all clusters was 99.74%. Very high precision combined with low recall indicates that the clustering is overly cautious, grouping data points together only when they are quite similar and almost certainly refer to the same news event. This suggests that while the clustering method rarely groups unrelated articles together, it is prone to leaving potentially relevant articles out of the clusters where they belong.
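One way to read the evaluation procedure above is sketched below. It is an interpretation under stated assumptions, not the authors' evaluation code: `tags_by_doc` maps each transcript to its hand-assigned keyword tags, `cluster_by_doc` maps each transcript to its NewsStand cluster id, and both names are hypothetical. Recall for a tag is the size of the largest cluster among the transcripts carrying that tag, divided by the number of such transcripts. Precision for a cluster is the share of its members that carry its most common tag.

```python
from collections import Counter, defaultdict


def recall_per_tag(tags_by_doc, cluster_by_doc):
    """Recall per keyword tag: largest same-cluster group / all tagged docs."""
    docs_by_tag = defaultdict(list)
    for doc, tags in tags_by_doc.items():
        for tag in set(tags):
            docs_by_tag[tag].append(doc)
    recall = {}
    for tag, docs in docs_by_tag.items():
        cluster_sizes = Counter(cluster_by_doc[d] for d in docs)
        recall[tag] = max(cluster_sizes.values()) / len(docs)
    return recall


def precision_per_cluster(tags_by_doc, cluster_by_doc):
    """Precision per cluster: docs sharing the dominant tag / cluster size."""
    docs_by_cluster = defaultdict(list)
    for doc, cluster in cluster_by_doc.items():
        docs_by_cluster[cluster].append(doc)
    precision = {}
    for cluster, docs in docs_by_cluster.items():
        tag_counts = Counter(t for d in docs for t in set(tags_by_doc[d]))
        dominant = max(tag_counts.values(), default=0)
        precision[cluster] = dominant / len(docs)
    return precision


def mean(values):
    values = list(values)
    return sum(values) / len(values)
```

Under this reading, the King Charles example corresponds to three tagged transcripts that each land in a separate cluster, which gives a recall of 1/3. Each singleton cluster trivially has a precision of 1, which helps explain why a corpus dominated by clusters of size 1 yields very high precision alongside low recall.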

CONCLUSION AND FUTURE WORK
BroadcastSTAND underscores the limitations of applying NewsStand's standard text clustering algorithm to broadcast data, even when articles and transcripts share similar topics and key phrases. The noticeable difference in clustering precision and recall outcomes highlights potential challenges for effective clustering associated with broadcast data, e.g., the informal nature of spoken content. In future research, we aim to investigate a broader similarity definition for enhancing the identification of related content within both broadcast and news data.

Figure 1: Visual representation of the clusters in the NewsStand corpus, showcasing the distinct characteristics of broadcast, non-broadcast, and mixed clusters, with individual data points scaled to show the size of the cluster, in the number of documents.

Figure 2: NewsStand recall on broadcast transcripts per hand-labeled news topic.