Shaping the Future of Content-based News Recommenders: Insights from Evaluating Feature-Specific Similarity Metrics

In news media, recommender system technology faces several domain-specific challenges. The continuous stream of new content and users means that content-based recommendation strategies, based on similar-item retrieval, remain popular. However, a persistent challenge is how to select relevant features and corresponding similarity functions, and whether this depends on the specific context. We evaluated feature-specific similarity metrics using human similarity judgments across national and local news domains. We performed an online experiment (N = 141) in which we asked participants to judge the similarity between pairs of randomly sampled news articles. We make three contributions: (1) comparing novel metrics based on large language models to ones traditionally used in news recommendation, (2) exploring differences in similarity judgments across national and local news domains, and (3) examining which content-based strategies were perceived as appropriate in the news domain. Our results show that one of the novel large language model-based metrics (SBERT) was highly correlated with human judgments, while there were only small, mostly non-significant differences across national and local news domains. Finally, we found that while it may be possible to automatically recommend similar news using feature-specific metrics, their representativeness and appropriateness varied. We explain how our findings can guide the design of future content-based and hybrid recommender strategies in the news domain.


Motivation
The abundance of information in today's digital landscape, particularly in news dissemination, underscores the need for tools that can effectively sift through vast content repositories and guide users toward relevant and engaging materials.
To this end, recommender systems have emerged as crucial instruments, helping to streamline information discovery, optimize content delivery, and enhance the overall user experience [11].
The news domain faces several domain-specific challenges that make the introduction of common recommender system strategies difficult [7, 12]. Similar-item recommenders are able to circumvent many of these challenges [12].
While such recommenders are popular with news websites, there is limited knowledge about whether the recommendations they produce represent what users consider to be similar items [28]. While there are studies exploring this [28, 29], these studies generally rely on limited data, such as single outlets, a limited number of categories within an outlet, and/or a small number of news articles.
In this study, we explore these issues by investigating how feature-specific similarity metrics represent human similarity judgments in four Norwegian news outlets that span the local and national domains. The primary objective is to analyze how well feature-specific similarity metrics represent human similarity judgments across local and national Norwegian news outlets. Additionally, the study assesses the efficacy of a set of feature-specific similarity metrics, derived from recent advancements in language technologies, in comparison to traditional measures of similarity for news articles. Finally, we also evaluate how well similarity by itself represents the users' desired recommendations. This leads to the following research questions:

• RQ1: To what extent do feature-specific similarity metrics represent human similarity judgments in the Norwegian news domain?
• RQ2: To what extent does the correlational strength between human similarity judgments and feature-specific similarity functions differ across local and national news media outlets?
• RQ3: To what extent are human similarity judgments reflected in perceived recommendation appropriateness?

Contributions
The goal of this study is to explore and evaluate feature-specific similarity metrics and whether they represent human similarity judgments in the Norwegian news domain. In doing so, we make the following contributions:
• An extension of the metrics used in [27, 28, 31], examining to what extent current state-of-the-art NLP methods represent human similarity judgments.
• A novel comparison of metric performance across pairs of national and local news outlets.
• The inclusion of a user evaluation study, examining the appropriateness of different strategies.

News Recommender Systems
Many news recommender systems use 'more like this' recommendations. Such similar-item retrieval aims to provide an unseen or novel item that is similar to a specific reference item [28]. A key question is how to compute the similarity between the base item and the candidate items to be retrieved [20, 33].
Similar-item retrieval is typically performed through content-based recommendation (CB) methods [12]. While collaborative filtering (CF) and knowledge-based recommenders are common in other domains [11, 22], they are typically not used in the news domain. One of the main reasons is the permanent cold-start problem [12], which arises from the lack of historic information about users. In news, this is due to the large number of one-time and first-time users that do not log in. Further compounding the problem is the high frequency of novel items, along with the high volatility of a news article's relevance and contextual factors, such as the time of day and the user's location [12]. It seems that such issues are avoided by using CB algorithms: in their survey, Karimi et al. [12] show that 104 out of 112 reviewed articles on news recommenders use CB algorithms or hybrid algorithms with a CB component.
Similarity-based approaches can leverage feature-specific similarity metrics. Among news recommender features, these usually involve evaluating the article's text or title, while other features are ignored [12]. The assumption here is that users pay most attention to these features, which should therefore determine similarity scores; this assumption is, however, typically not validated [30, 35]. A traditional method to compute the similarity between text items is to derive vectors from the text [28]. Term Frequency-Inverse Document Frequency (TF-IDF) remains one of the most commonly used IR methods to create similarity vectors from text [2, 28].
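To make this concrete, the following is a minimal sketch, assuming a scikit-learn setup (not the implementation used in the studies cited here), of how TF-IDF vectors can be derived from article texts and compared with cosine similarity; the example texts are invented placeholders.

```python
# Minimal sketch: TF-IDF vectors and pairwise cosine similarity between
# article body texts. The texts below are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = [
    "The city council approved the new harbour development plan.",
    "Local politicians debated the harbour expansion for hours.",
    "The national team won its qualifying match on Saturday.",
]

vectorizer = TfidfVectorizer()               # default tokenization, no stemming
tfidf = vectorizer.fit_transform(articles)   # sparse (n_articles x n_terms) matrix

# Similarity between article 0 and all articles (including itself)
sims = cosine_similarity(tfidf[0], tfidf)[0]
print(sims)  # the two harbour stories should score higher than the sports story
```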
While TF-IDF is still popular, it has been outperformed by other metrics, such as BM25 [19, 28]. In recent years, approaches using transformer models and Word2Vec have also shown better performance than TF-IDF on text similarity tasks [4, 17]. Since the introduction of transformer models with the Bidirectional Encoder Representations from Transformers (BERT) model [32], such models have become immensely popular. In recommender systems, there are several approaches utilizing the embeddings provided by various transformer models [10, 13, 36], and combining transformer models with topic modeling techniques [18, 34, 37]. These have, however, not been used in recent studies on similar-item retrieval and feature-specific similarity [28].
Recommender systems are typically evaluated through offline experimentation and simulation based on historical data, through laboratory studies, or through A/B (field) tests on real-world websites [12]. In their survey, Karimi et al. [12] found that a large majority of studies relied on traditional IR measures like precision and recall, rank-based measures like Mean Reciprocal Rank or Normalized Discounted Cumulative Gain, or prediction measures like the Root Mean Square Error. These methods all rely on a dataset annotated based on the task the recommender is meant to solve. However, such datasets are not readily available in the news domain [12].
While only 19 of the 112 papers surveyed by Karimi et al. [12] utilize it, click-through rate (CTR) is a popular way to evaluate the performance of news recommenders [8]. However, CTR is not helpful in determining whether items are similar, as a user may click on an item for reasons other than similarity [23].

Related Work
In order to validate the performance of similar-item recommenders, human judgments are typically used [3].A critical question is to what degree similarity functions mirror a user's judgment of the similarity between pairs of items.
Problems could arise if a user undervalues or overemphasizes specific item features compared to the features used in the similarity calculation, and to how the similarity is being calculated [28, 33].
Yao and Harper [35] collected human similarity judgments using movie pairs from the MovieLens dataset.
As part of their study, users were asked to what extent the movies were similar, and whether they would recommend the second movie to someone who likes the first. Their goal was to explore whether CF or CB algorithms provide similar-item recommendations that are closer to human similarity judgments. Yao and Harper [35] suggest that CB algorithms perform better in matching human similarity judgments. Another key observation in their work is that similarity is not everything in a similar-item recommender: over 60% of the users in their survey chose a compromise over being recommended the most similar item.
Other studies in which human judgments have been collected to evaluate similar-item recommenders include Trattner and Jannach [31], Starke et al. [28], and Solberg [27]. This study builds directly on their work. The main methodology of calculating feature-specific similarity metrics and comparing them with human similarity judgments used in this study was introduced by Trattner and Jannach [31]. Starke et al. [28] then applied the same methodology to the news domain. Similar to Yao and Harper [35], Solberg [27] attempted to discover news recommender criteria, before using a methodology similar to that of Trattner and Jannach [31] and Starke et al. [28] to examine differences between categories in the news domain.
In the initial work by Trattner and Jannach [31], two main studies were performed across the movie and recipe domains.
The studies follow a novel approach in which the goal is not to evaluate existing algorithms, but to develop new similarity functions from human similarity judgments. The human similarity judgments are used as baselines for how similar the items are, and for what makes two items similar. Trattner and Jannach [31] also asked users which similarity cues they used while evaluating the similarity. These similarity cues represent the features that the feature-specific metrics are based on.
In Starke et al. [28], a similar approach to Trattner and Jannach [31] is employed, but in the news domain.
A total of 2,400 articles were included: 400 articles from the 'Politics' category were randomly sampled from each year between 2012 and 2017 of the TREC Washington Post dataset. Following the method put forward by Trattner and Jannach [31], a survey was conducted to collect human similarity judgments. The obtained similarity judgments exhibited low correlations with the metrics across all aspects, with an average Spearman correlation coefficient of 0.092.
Among the metrics, the highest correlating one was TF-IDF applied to the body text, demonstrating a correlation coefficient of 0.29. Several prediction models were then trained on the survey data to create a specific news recommender algorithm.
In his thesis, Solberg [27] builds upon this work by addressing two primary problems. The first focuses on defining the criteria for news recommendation, while the second aims to explore the differences between specific news categories, namely Sports and Recent Events. His thesis is divided into two separate studies, each addressing one of these questions. Similar to Yao and Harper [35], he shows that only 26 of the 45 participants in his study selected item similarity as a factor. While this was the most common response, it does show that similarity may not be the primary goal of a news recommender [27]. He then used insights from the pre-study, particularly regarding categories, to conduct a study similar to Starke et al. [28]. The study shows some minor differences in how feature-specific similarity metrics perform across categories.

Key Differences
The use of similar-item retrieval can overcome recommender problems in the news domain related to the cold-start problem and item volatility [12]. Past studies in this context have examined the use of feature-specific similarity functions on news articles from specific corpora in the USA, such as the Washington Post [28]. These employ the method of semantic similarity [30], where users are asked to judge the similarity between two items, which is then compared to a computational approach to similarity. Previous work faced a number of limitations. Beyond using a limited amount of news content, the metrics tend to be relatively simple (e.g., TF-IDF), not reflecting the state of the art. Moreover, there has been little attention to the context of news articles, such as whether they are part of a local or national outlet. For example, local news might be geared towards links with specific communities (cf. [26]), using various named entities to emphasize these links. Finally, although the method of semantic similarity is a form of 'user validation', previous studies have not evaluated recommendation appropriateness using quantitative methods [27, 28].
Uniquely, this study investigates feature-specific similarity functions using human judgments for Norwegian-language news, a first in this domain, where previous investigations have been conducted primarily for English-language news. This detailed analysis includes not only national-level news, as previous studies have done, but also local-level news, allowing for a more nuanced view of different outlet levels. In terms of metrics, this study applies recent developments in Natural Language Processing (NLP) to evaluate their effectiveness in representing human similarity judgments. This provides novel insights into the capabilities of current state-of-the-art NLP methods, an aspect overlooked in previous work.

Dataset
The dataset used for this study is a combination of data from four separate outlets belonging to two separate media organizations. The datasets were obtained through the MediaFutures research center and consist of outlets from two of the MediaFutures industry partners, Amedia and Schibsted. The datasets met the following criteria:
• Contain local and national news. The main research question of this study concerns differences in human similarity judgments between the national and local news domains. Available large-scale datasets were considered, but none were found to have sufficient geographical granularity to isolate a clear local news domain. Because of this, it was decided that a specific dataset would have to be obtained or created.
• Participant availability. One challenge identified early on was the potential difficulty of obtaining participants for the human similarity judgment survey. Considering that a local news domain would also require local participants for the survey, overly restricting the definition of local, or restricting it to an area where potential participants are difficult to contact, could create unwanted challenges. Because of this, the local domain was chosen to be the Bergen area. As a result, the national domain is Norway.
• Recency. In the news domain, time is a very important factor. The lifespan of breaking news is generally very short, down to a few hours [5, 6]. To avoid recency affecting the similarity ratings, we avoided recent news, but also avoided news older than one year. Because of this, we collected news articles from 2022.
• Comparable features. Since this study builds upon previous studies [27, 28, 31], we performed comparative analyses. The features selected are therefore either aligned with previous work or novel (cf. Section 3.2).
3.1.1 Outlets. The dataset includes articles from different Norwegian news sources. These stem from two different news organizations, from each of which we selected one local and one national newspaper. For Amedia, the dataset includes the outlets Bergensavisen (BA) and Nettavisen. BA is the most local newspaper in the dataset, with its main audience in Bergen and the surrounding areas. Nettavisen functions as the national newspaper in the Amedia context of the dataset.
Its audience is all of Norway, and it ranks 7th in daily online readership. Articles were filtered using available journalist tags, with manual review to ensure effectiveness. This approach also helped eliminate periodical articles and those with high similarity within certain tag groups. In addition, we removed incomplete articles, such as those without images or the key features listed in Table 2. Very short and very long articles were also omitted: those with body texts shorter than 1,000 characters or longer than 10,000 characters were excluded, amounting to the 3% shortest and longest articles in the dataset. Finally, within each outlet, articles with duplicate titles and texts were removed. Key figures of the datasets after cleaning can be seen in Table 1.

News Article Features
The selection of features was based on earlier work [27, 28, 31]; a list is presented in Table 2. A main difference with earlier work concerns the section feature. In Starke et al. [28], the category feature was used to represent a subcategory, while in Solberg [27] a feature named topic had similar properties. Where in both studies articles were limited to a single parent category, the current study included multiple categories across entire outlets. In the Schibsted datasets, this feature was called section, while Amedia utilized a feature named predicted category. The section feature in this study had a higher granularity than simple categories, which could usually be mapped to a parent category. Another difference is the tag feature, which in both the Amedia and Schibsted datasets was manually added by the newsrooms and represented the news content.

Metrics
As our work builds directly on top of the work done in [31], [28], and [27], several of the metrics used are shared with them. A full list of the similarity metrics and the features they are applied to can be seen in Table 3.
When calculating the Image metrics, we used a similar approach as [31], [28], and [27]. Similarity is compared based on brightness, sharpness, contrast, colorfulness, and entropy. To compute similarity, each individual low-level feature was calculated and then compared using Manhattan distance. As in [28, 31], the low-level image features were extracted using the OpenIMAJ library, as proposed by San Pedro and Siersdorfer [24, 31].
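As an illustration only (the feature extraction above uses the Java OpenIMAJ library, not the code below), the following Python sketch shows the general idea for a subset of the low-level features and their Manhattan-distance comparison; colorfulness is omitted for brevity, and the file names are placeholders.

```python
# Illustrative analogue of low-level image feature similarity (NumPy/Pillow).
import numpy as np
from PIL import Image

def low_level_features(path: str) -> np.ndarray:
    img = np.asarray(Image.open(path).convert("L"), dtype=np.float64) / 255.0
    brightness = img.mean()
    contrast = img.std()
    gy, gx = np.gradient(img)
    sharpness = np.hypot(gx, gy).mean()   # mean gradient magnitude as a proxy
    hist, _ = np.histogram(img, bins=256, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = -np.sum(p * np.log2(p))     # grey-level histogram entropy
    return np.array([brightness, contrast, sharpness, entropy])

# Manhattan (L1) distance between the feature vectors of two article images;
# a lower distance indicates more similar images.
dist = np.abs(low_level_features("img_a.jpg") - low_level_features("img_b.jpg")).sum()
```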
In addition to the low-level features, image embeddings were also extracted. Following the method proposed by [25], and also used in Trattner and Jannach [31] and Starke et al. [28], we used an embedding from the first fully-connected layer of a pre-trained (ImageNet) VGG-16 model.
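A minimal sketch of this step, assuming a PyTorch/torchvision setup (the exact tooling is not specified here), could look as follows; the file names are placeholders.

```python
# Embed an image with the first fully-connected layer of an ImageNet-pretrained
# VGG-16 and compare two embeddings with cosine similarity.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
# Keep the convolutional features, pooling, and only the first FC layer
embedder = torch.nn.Sequential(
    vgg.features, vgg.avgpool, torch.nn.Flatten(), vgg.classifier[0]
)

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path: str) -> torch.Tensor:
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return embedder(x).squeeze(0)   # 4096-dimensional embedding

e1, e2 = embed("img_a.jpg"), embed("img_b.jpg")
sim = torch.nn.functional.cosine_similarity(e1, e2, dim=0)
```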
Following the method used in Starke et al. [28], text similarity was calculated using two TF-IDF variants, as well as LDA topic modeling. In addition to the two TF-IDF algorithms used in [28], an algorithm utilizing lemmatized text was used (TF-IDF-L), based on findings in Balakrishnan and Ethel [1]. Three metrics utilizing pre-trained large language models were also used. Following findings in Solberg [27], named entities were extracted and a metric utilizing Jaccard similarity was devised (NENTS). In addition to LDA, topics were modeled using BERTopic [9]; the similarity metric for BERTopic compared vectors of topic predictions using cosine similarity. Finally, text embeddings were extracted using a pre-trained Sentence Transformer (SBERT) model [21] and compared using cosine similarity.
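A minimal sketch of the SBERT metric, assuming the sentence-transformers library, is shown below; the specific pre-trained model used in this study is referenced via a footnote, so the model name here is a placeholder, as are the example texts.

```python
# Embed two texts with a pre-trained Sentence Transformer and compare the
# embeddings with cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # placeholder

text1 = "Bystyret vedtok den nye havneplanen."      # example Norwegian texts
text2 = "Politikerne diskuterte havneutbyggingen."

emb = model.encode([text1, text2], convert_to_tensor=True)
similarity = util.cos_sim(emb[0], emb[1]).item()    # value in [-1, 1]
```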
Similar to Starke et al. [28], title similarity was evaluated using four edit-distance-based metrics, as well as LDA topic modeling and TF-IDF. In addition, we used the Sentence Transformer, BERTopic, and lemmatized TF-IDF metrics, which were also used on the main article text.
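For illustration, one edit-distance-based title similarity (a normalized Levenshtein similarity, implemented from scratch; the four metrics used here follow [28] and may differ) can be sketched as follows.

```python
# Normalized Levenshtein similarity between two titles.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def title_similarity(t1: str, t2: str) -> float:
    if not t1 and not t2:
        return 1.0
    # Scale the edit distance to [0, 1], where 1 means identical titles
    return 1 - levenshtein(t1, t2) / max(len(t1), len(t2))

print(title_similarity("Storm hits Bergen", "Storm hits Western Norway"))
```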
In line with Starke et al. [28], section similarity was calculated using Jaccard similarity. In addition, the similarity of the publication date was calculated by taking the difference in publication dates, divided by the total date range of the dataset. Finally, tag similarity was calculated using Jaccard similarity on the lists of tags for each article.
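A minimal sketch of the Jaccard and date metrics, with invented tag values and dates, could look as follows; note that the date computation below expresses the distance as a similarity (1 minus the normalized difference), which is one plausible reading of the description above.

```python
# Jaccard similarity on tag sets and a normalized publication-date similarity.
from datetime import date

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

tags1 = {"politikk", "bergen", "havn"}
tags2 = {"politikk", "samferdsel"}
tag_sim = jaccard(tags1, tags2)   # 1 shared tag of 4 total -> 0.25

# Date difference divided by the dataset's total date range, as a similarity
dataset_range = (date(2022, 12, 31) - date(2022, 1, 1)).days
d1, d2 = date(2022, 3, 14), date(2022, 3, 20)
date_sim = 1 - abs((d1 - d2).days) / dataset_range
```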

Experiment
3.4.1 Procedure. Users were invited to join a study on news recommendation and similarity. Upon starting the survey, they were randomly assigned to either the Amedia context or the Schibsted context. Once assigned, we semi-randomly formed 10 article pairs, which were presented to each user: 5 from the local media outlet and 5 from the national outlet. Each pair belonged to a specific sample bin, outlined in Section 3.4.2.
For each pair, users rated the similarity between the two news articles on a 5-point scale. As in [27, 28], the users were also asked about their familiarity with the presented articles and the confidence they had in their similarity ratings. To explore recommendation appropriateness, we also asked users to what extent they agreed with the statement that they would like to be recommended article 1 after seeing article 2, and vice versa. In addition, we also inquired about basic demographics and news use frequency.

Table 3. Full list of similarity metrics and the features they are applied to. Metrics not used in [31] or [28] are denoted by *.

3.4.2 Sampling Strategy. The pairs were formed using methods similar to Starke et al. [28]. As outlined in Section 3.1, the dataset was divided by outlet, and the 25 metrics (Table 3) were applied to each subset. This resulted in four similarity score matrices, one for each outlet's news article pairs, using equal-weight calculations. To avoid problems with low similarity strength, as observed in [27, 28], we used a strategy that placed news article pairs in similarity strength 'bins'. We computed the mean and standard deviation of the pairwise similarity scores and then divided pairs into the following sampling bins (a minimal code sketch follows the list):
(1) Pairs more than 2 standard deviations below the mean similarity strength.
(2) Pairs between 2 and 1 standard deviations below the mean similarity strength.
(3) Pairs between 1 standard deviation below the mean and 1 standard deviation above the mean similarity strength.
(4) Pairs between 1 and 2 standard deviations above the mean similarity strength.
(5) Pairs above 2 standard deviations above the mean similarity strength.
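The binning strategy can be sketched as follows, under stated assumptions (toy random scores stand in for the metric scores; the bin edges follow the definitions above).

```python
# Assign pairwise similarity scores to the five mean/standard-deviation bins
# and draw up to 1,000 random pairs per bin.
import numpy as np

def assign_bins(scores: np.ndarray) -> np.ndarray:
    mu, sd = scores.mean(), scores.std()
    edges = [mu - 2 * sd, mu - sd, mu + sd, mu + 2 * sd]
    # np.digitize returns 0..4; shift to bins 1..5 (1 = least similar)
    return np.digitize(scores, edges) + 1

rng = np.random.default_rng(0)
scores = rng.beta(2, 5, size=10_000)   # stand-in for one outlet's pair scores
bins = assign_bins(scores)
for b in range(1, 6):
    idx = np.flatnonzero(bins == b)
    sample = rng.choice(idx, size=min(1000, idx.size), replace=False)
```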
For each media outlet, we sampled one pair from each bin. The results of applying this strategy to the pairwise similarity scores can be seen in Table 4. Once the scores were divided into groups, 1,000 pairs were randomly sampled from each bin for each outlet and added to the survey database. This resulted in 5,000 pairs for each outlet and 20,000 pairs in total. 119 out of 141 participants, or 84.4%, passed the attention check. After accounting for the attention check, ratings for 1,071 news pairs (featuring 1,968 unique news articles) were available from users who passed it. The final figures for the segmentation of participants and pairs are described in Table 5. The results are calculated using only the participants and pairs that passed the attention check. In addition, the pairs that contained the attention check were removed, as the attention check interfered with the ratings given (it replaced the body text with a message instructing participants who read the text to give ratings of 3 on all parameters).
A total of 112 participants (79.4%) reported reading news approximately every day. This is higher than in previous work, and somewhat higher than expected. 81 participants were male, while 59 were female.

Comparing Metrics to Human Judgments (RQ1)
We examined the extent to which feature-specific similarity metrics relate to human similarity judgments. To compare the similarity metrics to the similarity judgments, Spearman correlations were computed between the metrics listed in Section 3.3 and the human similarity judgments collected through the survey. The results per metric are described in Table 6, which is also divided by local vs. national domain and by outlet (to address RQ2 later).
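For completeness, this analysis step amounts to the following sketch, assuming aligned arrays of metric scores and mean human ratings per article pair (the numbers below are toy data, not our results).

```python
# Spearman correlation between one metric's scores and human judgments.
import numpy as np
from scipy.stats import spearmanr

metric_scores = np.array([0.12, 0.55, 0.33, 0.80, 0.41])   # per-pair metric scores
human_judgments = np.array([1.0, 3.5, 2.0, 4.5, 3.0])      # mean 5-point ratings

rho, p = spearmanr(metric_scores, human_judgments)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```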
We discuss Table 6 from top to bottom. Among the image-based metrics, Image:EMB demonstrated the highest correlation with human similarity judgments, registering a correlation of 0.30. This correlation was especially high for some divisions, while its strength was the weakest when comparing the most local and most national outlets. The Section:JACC metric also showed high strength on some divisions, but it should be considered that this metric shows weaker correlation when considering the ratings for VG alone. Finally, BERTopic showed similar results across all divisions of the outlets. It also had the highest strength when evaluating VG and BA; however, it was only significantly higher when evaluating VG vs. BA. To investigate this further, we also evaluated the Title:BERTopic z-score between Nettavisen and VG, which returned a z-score of -1.788 with a p-value of 0.074.

RQ3: Recommender Appropriateness
We finally examined the users' perceived recommendation appropriateness in relation to the inter-article similarity.
This was based on whether users would like to be recommended one of the articles in a pair after seeing the other.
The results are described in Table 7. The overall Spearman correlation between similarity and recommendation appropriateness was 0.54, which suggests a moderate relation between similarity and appropriateness.
Most notably, the score for appropriateness increased per similarity strength bin, except between bins 1 and 2.
The final column of Table 7 describes the symmetry of the appropriateness rating, that is, whether the appropriateness rating for liking article 1 after 2 was similar to the rating for liking 2 after 1. We found this correlation to be relatively high: 0.84.

DISCUSSION & CONCLUSION
5.1 Representativeness of Feature-Specific Similarity Metrics (RQ1). We have examined to what extent different feature-specific similarity metrics represent human judgments of similarity.
The goal is to identify metrics that can be used in content-based recommenders that users like to use, because these metrics represent their judgments and preferences.
One of the primary findings is the effectiveness of the BERT-based metrics for news recommendation. In particular, SBERT, which has not been used often in this context [14], shows higher correlations than the other metrics on both of the features where it is used, as well as the highest correlation across all metrics when used on the body text of the article. This is surprising considering the basic implementation, including a limit to the first 512 words of the article, which is lower than the median number of words per article in the dataset. SBERT is primarily designed to create embeddings for sentences, which may explain its higher relative correlations on the title feature than on the text feature when compared to TF-IDF.
The BERTopic metrics also showed comparably high correlations, especially on the title feature, where BERTopic is the second-highest correlating metric after SBERT when considering all ratings. Considering the VG and BA news outlets, we see that the range of correlations is fairly high. When we also consider BT and Nettavisen, and the sizes of the various datasets, this may indicate that BERTopic's correlation decreases with the number of articles in the dataset. This is

Table 1. Statistics of the outlets in the dataset: Q4 2022 Norwegian readership ranks and daily readership for online versions, the raw and cleaned number of articles, the number of sections, the average number of tags, and the average number of tokens in the body texts and titles.

The Schibsted outlets included are Bergens Tidende (BT) and Verdens Gang (VG). BT is the largest newspaper of Western Norway, with its base in Bergen. Its audience is all of Vestland county. In the dataset, BT is the local newspaper for the Schibsted context. VG is Norway's largest online newspaper by readership, and its audience is all of Norway. It functions as the national newspaper in the Schibsted context. Figures for the outlets can be seen in Table 1.

Table 2. News article features used in the study.

3.1.2 Dataset Cleaning. The final dataset contained 36,768 articles, all published in 2022. The 'raw' dataset was larger (cf. Table 1); to increase the dataset's similar-pair diversity, we removed articles on dominant topics like Covid-19, the War in Ukraine, and the power crisis, based on insights from [27, 28].

Table 4. Number of pairs and percentages per sample bin. Bin 1 is the least similar and bin 5 the most similar.

Table 5. Segmentation of the participants and pairs used in the analysis. The pairs in the pass groups exclude the attention-check ratings. Participants are divided into Local and National groups depending on their reported place of residence; Bergen and the Bergen area are considered Local.

Participants were recruited by sharing the survey link across relevant social media channels. In total, 329 participants started the survey, with 143 completing it. Two participants were below 18 years old and were removed from the results, bringing the total number of participants to 141. 73 of the participants completed the Schibsted context, giving ratings to pairs from BT and VG, while 68 completed the Amedia context, giving ratings to pairs from BA and Nettavisen.

Table 6. Similarity metric correlation (Spearman) with human similarity judgments. Metrics are listed in the left column, with Spearman correlations for the various divisions of the datasets listed in the other columns. All combines the pair ratings of all outlets; National combines VG & Nettavisen; Local combines BT & BA. For features with several metrics, the metric with the highest correlation is shown in bold. * p < 0.05, ** p < 0.01, *** p < 0.001.

The Text:SBERT metric (0.60) presented the highest correlation across all divisions of the dataset. This suggests that SBERT on body text was most representative of human similarity judgments. It outperformed the Text:TF-IDF metric (0.47), which was the highest correlating metric in the studies of Starke et al. [28] (0.29) and Solberg [27] (0.53). The Text:TF-IDF-L metric showed correlations similar to the Text:TF-IDF metric. The Text:BERTopic metric (0.40)