A Clone-based Analysis of the Content-Agnostic Factors Driving News Article Popularity on Twitter

The significant impact of Twitter in news dissemination underscores the need to understand what drives tweet popularity. While the content of an article plays a role, several "content-agnostic" factors also influence tweet popularity. Previous studies have faced challenges in differentiating the effects of content-agnostic factors from content variations. To address this, the paper presents a comprehensive analysis of tweet popularity using a "clone-based" approach. The methodology involves identifying tweets linking the same or similar articles (clones) and studying the factors that make some tweets within clone sets more successful in attracting retweets. The analysis reveals insights into clone set characteristics, winners' success patterns, retweet dynamics over time, domain-based competition, and predictors of success. The findings shed light on the complex nature of popularity and success in social media, providing a deeper understanding of the content-agnostic factors that influence tweet popularity.


I. INTRODUCTION
With all major news outlets actively promoting their news on Twitter and the majority of Americans receiving their daily news from social media [1], Twitter has come to play an important role in the dissemination of news. Given the significant influence of social media, understanding the factors that make a tweet popular is increasingly important.
However, determining the factors that contribute to a news article's popularity on Twitter, and even more so determining the content-agnostic factors that impact the retweetability of a tweet linking to such an article, remains a complex task. While the article's content, such as its interest, relevance, and quality, is important [2], it is widely acknowledged that several "content-agnostic" factors also influence popularity. For example, in the case of news articles shared on Twitter, content-agnostic factors like the number of followers of the poster or the length of the tweet can impact how many times such a tweet is retweeted and therefore how frequently it is included in other users' personal feeds or search results.
Previous studies have explored tweet popularity, examining static and temporal properties of retweet counts. However, understanding how content-agnostic factors impact popularity has remained challenging. For example, news outlets with large social networks may appear more popular because they typically share links to more interesting content, not due to the direct influence of social network size on tweet popularity.
Existing studies with datasets containing tweets with links to news articles of diverse content struggle to rigorously differentiate the effects of content-agnostic factors from those arising from content variations. This is a significant shortcoming since not all news articles are the same, and the factors that impact the successful promotion of a news article on Twitter may be heavily influenced by the interest in the article itself.
To address the above shortcomings, in this paper, we present a comprehensive analysis of tweet popularity that accounts for the articles that are shared. In particular, we present a "clone-based" data collection and analysis (inspired by our prior work using clones to study YouTube popularity [3]), in which we first identify tweets linking the same article or a very similar article (which we call a "clone" of the original article), and then study what makes some of the tweets within such clone sets more successful in attracting retweets. Our clone-based methodology offers important insights into content-agnostic factors affecting tweet popularity. As concrete examples, we next list ten example findings: (1) Clone set sizes and the number of website domains responsible for publishing the articles show highly skewed distributions, with our results suggesting that a considerable portion of news stories are both replicated across numerous outlets and widely shared on Twitter. (2) The winners of big clone sets tend to receive more retweets, with the number of retweets following a power function. (3) Winners are predominantly posted early, but the first mover does not always obtain the most success. (4) Most clone sets link to a single domain, with the clone sets with most clones linking to NY Times, Forbes, and Bloomberg. (5) In clone sets with clones from different domains, Reuters was the most frequent winner, outperforming other domains. (6) The success of domains posting clones varies when competing against each other, with some domains frequently losing and others frequently winning. (7) The tweeter's characteristics play a significant role in the success of a clone, with winners and first posters typically having more followers and higher listing counts. (8) The length of the tweet text also influences success, with winners tending to use longer tweet texts. (9) Excluding public metrics such as likes, quotes, and replies (which also measure a tweet's popularity), the user follower count and the user verified status are the most important predictors of success, followed by tweet-related factors like the tweet count of the user and the tweet length. (10) Except for a smaller variation with Forbes, our domain-based analysis shows similar patterns across domains.
Overall, the study highlights the influence of various factors, such as clone set characteristics, winners' success patterns, retweet dynamics over time, domain-based competition, and predictors of success, shedding light on the complex nature of popularity and success in social media.
Outline: After describing our methodology for data collection and clone identification (§II), we present a high-level characterization (§III) of the winners and the relative success of different domains. We then study what factors most influence the success of a tweet (§IV, §V) before discussing related work (§VI) and presenting our conclusions (§VII).

II. DATA COLLECTION AND CLONE IDENTIFICATION
Clone Set Identification Framework: To collect clone sets, we developed a framework consisting of three main components: (1) a tweet retriever, (2) a text extractor, and (3) a clone finder. First, the tweet retriever retrieves tweets containing links to news articles using Twitter's API Academic Researcher product track. Here, we first obtained all tweets posted within an example timeline that contained a URL and then filtered the (resolved) URLs against the domains owned by a list of the most popular US news outlets. Second, the text extractor was used to extract the news article texts (but not figures, videos, etc.) from the URLs. Here, we used a combination of the open-source news-please Python module and a custom text extractor that we implemented for news websites with more complex structures (that news-please performed poorly on), as well as a per-domain specific crawler. Our custom crawler was built using the Beautiful Soup library. Finally, the clone finder module identifies potential clones by first grouping all tweets using the same URL and then applying a two-phase clone (or near-clone) identification approach on the extracted texts (when available).
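As a minimal sketch of the URL filtering and same-URL grouping steps described above, the following shows one way this could look. The domain set, the function names, and the example tweets are hypothetical placeholders for illustration; the actual outlet list and retrieval code are not given in the text.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical subset of outlet domains; the study's own list covers 69 outlets.
NEWS_DOMAINS = {"reuters.com", "nytimes.com", "forbes.com"}


def outlet_domain(url):
    """Return the matching outlet domain for a resolved URL, or None."""
    host = (urlparse(url).hostname or "").lower()
    for domain in NEWS_DOMAINS:
        # Accept the bare domain and any subdomain such as www.reuters.com.
        if host == domain or host.endswith("." + domain):
            return domain
    return None


def group_by_url(tweets):
    """Group tweet ids by resolved URL, keeping only news-outlet links."""
    groups = defaultdict(list)
    for tweet_id, url in tweets:
        if outlet_domain(url) is not None:
            groups[url].append(tweet_id)
    return dict(groups)


example_tweets = [
    (1, "https://www.reuters.com/article/a"),
    (2, "https://www.reuters.com/article/a"),
    (3, "https://example.org/blog"),
]
print(group_by_url(example_tweets))
```

Matching on the host suffix with a leading dot avoids accepting unrelated hosts that merely end in the same characters as an outlet domain.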
With the two-phase identification, we first use Simhash [4], a technique that generates 64-bit fingerprints for each text (64-bit fingerprints helped avoid collisions compared to 32-bit hashes), to check for similarity using a maximum Hamming distance of 6 (ensuring a high recall), followed by calculating the pairwise cosine similarity on the vectorized texts (created using TF-IDF) of all pairs within a candidate cluster, so as to further refine the clone sets (and improve the precision). Our manual inspection showed that using a similarity threshold of 0.9 and combining these two phases (i.e., Simhash for initial candidate clone identification followed by pairwise similarity tests within a cluster) reduces computational complexity (i.e., limits the required pairwise tests) and enhances accuracy (i.e., avoids unnecessary exclusion of potential clones, enabling more rigorous assessment in the subsequent cosine similarity).
News Outlet Selection and Data Preparation: To select news outlets (for URL filtering and text extraction), we used several independent rankings of US news outlets (e.g., Allsides, opensources.co, pewresearch, statista, feedspot, and yougov). The 69 selected news outlets represent a diverse range of topics, geographical locations, and audiences.
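The two-phase check above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the implementation used in the study: the tokenizer, the MD5-based token hash, and the smoothed IDF formula are choices made here, and for simplicity the sketch compares all pairs and computes TF-IDF over all texts rather than per candidate cluster. Only the 64-bit fingerprint size, the Hamming threshold of 6, and the cosine threshold of 0.9 come from the text.

```python
import hashlib
import math
import re
from collections import Counter


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def simhash(text, bits=64):
    """Simhash fingerprint: each token votes on every bit of the result."""
    votes = [0] * bits
    for token, weight in Counter(tokens(text)).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit token hash (assumed choice)
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming(a, b):
    return bin(a ^ b).count("1")


def tfidf_vectors(texts):
    """TF-IDF vectors as dicts, using a smoothed IDF (an assumed variant)."""
    docs = [Counter(tokens(t)) for t in texts]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def clone_pairs(texts, max_hamming=6, min_cosine=0.9):
    """Phase 1 keeps pairs with close fingerprints; phase 2 checks cosine."""
    prints = [simhash(t) for t in texts]
    vecs = tfidf_vectors(texts)
    pairs = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if hamming(prints[i], prints[j]) > max_hamming:
                continue  # rejected cheaply, no cosine computation needed
            if cosine(vecs[i], vecs[j]) >= min_cosine:
                pairs.append((i, j))
    return pairs
```

The fingerprint comparison is a cheap integer operation, so the more expensive vector comparison only runs on pairs that survive the first phase.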
Datasets: Four datasets were collected based on the age of each post at the time of data collection (one-year old, half-a-year old, one-month old, and one-week old) and for each dataset we collected two snapshots: one when the posts are of the above listed ages and one that was collected one week later. In both cases, we collected all possible statistics about the tweet (including retweet statistics) and the tweeter of the tweet. The different aged datasets allowed us to analyze the effect of age differences on retweet behavior, while the retweet recollection one week later allows us to evaluate and reflect on the stability of the retweet counts over time.
Table I summarizes the size of the datasets. All datasets were collected over the week of March 2-8, 2021. Combined, the four datasets consist of 4.5M unique tweets including links to one of our predetermined URLs. Of these, most tweets (3.9M) are part of one of the 274,510 identified clone sets.
Success metric: To measure the successful spread of a tweet, we primarily use the number of retweets. This choice is motivated by the high importance of recommendations by friends and family (e.g., 83% believe more in such trust-earned advertisements than regular advertisements [5]) and word-of-mouth advertisement in general. We also show results for other public interaction metrics such as likes, quotes, and replies.
Dynamics of Tweet Interactions: Fig. 2 shows the Complementary Cumulative Distribution Function (CCDF) of the number of retweets, likes, replies, and quotes for the original dataset ("Original") and the one-week gains between the two snapshots ("Difference"). Here, all curves are only slightly curved on log-log scale, suggesting heavy-tailed distributions from a power-law-like family. Furthermore, the relative increases are small (especially for the datasets with older tweets), capturing the ephemeral nature of news and suggesting that most of the user interactions with these tweets have already taken place at the initial data collection. For example, except for likes (95% unchanged), 99% of the tweets see no change even for the 1-week old dataset, and for our primary metric (i.e., retweets), only 0.00001 of the tweets saw more than one hundred retweets during the second week.
Limitations: We acknowledge several limitations with our methodology. The findings may not generalize to other social media platforms. We do not consider the effect of Twitter's internal algorithms. The study is limited to linked news from the selected news outlets, which is based on public rankings but may not capture the full range of news sites. The text extraction process is not perfect and sometimes struggles with pages that have complex structures or that otherwise prohibit access. The choice of thresholds and parameters in the clone detection process (e.g., Simhash Hamming distance, cosine similarity threshold) is determined through manual evaluation and may not be optimal in all cases. Nevertheless, our manual sanity checking suggests that the methodology is able to achieve high accuracy and deliver clear clone sets.
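For readers unfamiliar with the CCDF used above, the following minimal sketch computes an empirical CCDF from a list of counts. It assumes the convention P(X >= x); the text does not state whether a strict or non-strict inequality was used, and the example counts are made up.

```python
def ccdf(values):
    """Return sorted (x, P(X >= x)) points for the distinct observed values."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for i, x in enumerate(ordered):
        # The first occurrence of x at index i means n - i samples are >= x.
        if i == 0 or x != ordered[i - 1]:
            points.append((x, (n - i) / n))
    return points


# Made-up retweet counts for illustration only.
print(ccdf([1, 1, 2, 3]))
```

When plotting on log-log axes, zero counts have to be dropped or shifted, since the logarithm of zero is undefined.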

III. HIGH-LEVEL CHARACTERIZATION
Before analyzing the characteristics of a typical winner or determining the content-agnostic factors that most impact the success, we first provide a high-level characterization of the dataset and the relative success that different clones achieve.
Clone set sizes highly skewed: Fig. 3 shows a rank plot, the empirical cumulative distribution function (CDF), and the Complementary CDF (CCDF) of the clone set sizes and number of website domains responsible for publishing the articles. (While these stats are for the one-year old dataset, the relative shapes of the distributions for all datasets are similar.) From the rank plot, we note that the biggest clone set consists of 10,165 clones (top ranked entry in Fig. 3(a)) and from the CCDFs we note that 10% of clone sets consist of more than 30 clones (red line in Fig. 3(b)) and 3% link to articles published on at least 2 domains (red line in Fig. 3

A. Winner-based Analysis
To be called the "winner" of a clone set, we required the tweet to have more retweets than the 2nd-most retweeted clone. While this resulted in a slight reduction of the number of clone sets, it did not change our distribution statistics much (see Fig. 4).
Winner success follows power function of the clone-set size: As expected, the winners of big clone sets tend to be retweeted more. What is interesting is that the number of retweets that they obtain follows a power function of the clone-set size. This is seen by the straight-line characteristics of the green markers in Fig. 5, which shows the median number of retweets for the winners (1st rank) for different clone-set sizes. We also note that the majority of clone sets with fewer than 4 clones are not retweeted. These results show that there is a strong relationship between what is cloned and what is popular to retweet. We have also seen noticeable differences in the median number of retweets obtained by the winner (1st; green marker) and the 2nd-placed clone (blue marker). Here, it should be noted that both axes are on log scale.
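The strict-winner rule above can be sketched as follows. The function name and the decision to return None for tied or single-clone sets are assumptions made here for illustration.

```python
def find_winner(retweets):
    """Return the index of the strict winner of a clone set, or None on a tie."""
    if len(retweets) < 2:
        return None  # assumption: a clone set needs at least two clones
    ranked = sorted(range(len(retweets)), key=lambda i: retweets[i], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    # The top clone must strictly beat the 2nd-most retweeted clone.
    return best if retweets[best] > retweets[runner_up] else None


print(find_winner([3, 10, 2]))
print(find_winner([5, 5, 1]))
```

Clone sets where the top two clones tie (including sets with no retweets at all) have no winner under this rule, which is consistent with the slight reduction in the number of clone sets noted above.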
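A straight line on log-log axes corresponds to a power function y = a * x^b, so the exponent can be estimated with ordinary least squares on the logarithms. The sketch below illustrates this; the fitting procedure is an assumption (the text does not say how, or whether, a fit was computed), and the data points are synthetic.

```python
import math


def fit_power_function(sizes, medians):
    """Least-squares fit of log(y) = log(a) + b*log(x); returns (a, b)."""
    xs = [math.log(x) for x in sizes]
    ys = [math.log(y) for y in medians]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                          # slope on log-log axes = exponent
    a = math.exp(mean_y - b * mean_x)      # intercept on log-log axes = log(a)
    return a, b


# Synthetic example: medians generated from y = 3 * x^0.5.
sizes = [4, 16, 64, 256]
medians = [3 * s ** 0.5 for s in sizes]
a, b = fit_power_function(sizes, medians)
print(round(a, 6), round(b, 6))
```

Clone-set sizes whose median is zero must be excluded before fitting, because their logarithm is undefined.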
Retweets and winners over time: To illustrate how tweets and retweets associated with specific clone classes within a clone set are distributed over time, we split the active time period of a clone set into five equal time periods. We call these time periods bins and count the number of tweets posted in each time bin that are of one of two tweet classes (tweets or retweets) and that are of one of the clone classes ("winners" or "losers"). Fig. 6 shows the percentage of each of these four tweet subsets that fell into each of the five bins. We note that 71% of all retweets (across all clone sets) occur during the first time bin (20% of the lifetime of the clone sets) compared to only 54% of the original tweets. It is therefore not surprising that most winners (62%) are posted during the first time bin. It is however also clear from the plot that it is not always the first mover (i.e., the clone that is posted first, and hence always in the first bin) that obtains the most success.
Fig. 6: Timing of all tweets and retweets, as well as "winners" and "losers". Here, the time bins are split equally between the first and last tweet observed within a clone set.
The reason we see somewhat more winners (and losers) in the last bin (than in bins 2, 3, 4) is due to cases where we only had two clones in a clone set (each assigned to bins 1 and 5).
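The five-bin split described above can be sketched as a small helper. Timestamps are treated as plain numbers (for example, seconds); the handling of a zero-length interval is an assumption made here.

```python
def time_bin(timestamp, first, last, bins=5):
    """Map a timestamp to a bin index in 0..bins-1 over [first, last]."""
    if last == first:
        return 0  # assumption: a zero-length interval maps to the first bin
    position = (timestamp - first) / (last - first)
    # The last tweet sits exactly on the upper edge, so clamp it into the final bin.
    return min(int(position * bins), bins - 1)


# A clone set with only two clones: one lands in bin 1, the other in bin 5
# (indices 0 and 4 here).
print([time_bin(t, 10, 50) for t in (10, 50)])
```

This makes the two-clone effect explicit: with the interval defined by the first and last tweet, the two clones are necessarily assigned to the first and the last bin.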
The results above are relatively consistent across the datasets and filtering methods used. One reason for this is that the activity associated with most clone sets is short-lived. For example, in our one-year-old default dataset, the median and average activity interval (that we break into five bins) are 30 and 48 hours, respectively, whereas these times only reduce to 24 and 42 hours, respectively, for the one-week-old dataset. For the 1- and 6-month-old datasets, the medians (26/25 hours) and the averages (41/41 hours) are relatively similar.

B. Domain-based analysis
We next compare the relative success of different domains as seen (1) across all clone sets and (2) across only the clone sets that contain competing clones associated with different domains. Throughout the paper, we call the second group of clone sets "mixed" clone sets. For each of these two cases, Fig. 7 shows the percentage of times each domain was responsible for the clone that was the "winner" or "first poster". We also consider the case when the combined set of clones of a particular domain collectively garnered more retweets than that from any other domain ("most total").
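The three per-clone-set labels above ("winner", "first poster", and "most total") can be illustrated as follows. The tuple layout, the function name, and the example data are hypothetical, and ties are resolved arbitrarily in this sketch.

```python
from collections import defaultdict


def domain_labels(clones):
    """clones: list of (domain, post_time, retweets) for one clone set."""
    first_poster = min(clones, key=lambda c: c[1])[0]
    winner = max(clones, key=lambda c: c[2])[0]
    totals = defaultdict(int)
    for domain, _, retweets in clones:
        totals[domain] += retweets
    most_total = max(totals, key=totals.get)
    return {
        "winner": winner,
        "first_poster": first_poster,
        "most_total": most_total,
        "mixed": len(totals) > 1,  # clones from more than one domain
    }


# Made-up clone set: one domain posts first and collects the most retweets
# in total, while another domain owns the single most retweeted clone.
clones = [
    ("yahoo.com", 1, 5),
    ("reuters.com", 2, 40),
    ("yahoo.com", 3, 30),
    ("yahoo.com", 4, 20),
]
print(domain_labels(clones))
```

The example shows why the three metrics can disagree in mixed clone sets, while they coincide trivially when all clones link to a single domain.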
With 97% of the clone sets only linking to a single domain, it is not surprising that there are no major differences between the three metrics seen in the "all" case (i.e., Fig. 7(a)).The "all" figure instead captures the relative number of clone sets dominated by a domain, with the top-3 domains being NY Times, Forbes, and Bloomberg.
Winners when competing only against links to the same domain: Before looking at the mixed clone sets, let us first look closer at the "winners" and "first posts" of the top-7 domains in the non-mixed clone sets. For each of these domains (one domain per row), Fig. 8 shows CDFs for each of the following metrics: (1) the ratio of retweets of the "winner" and "first post" of each such clone set, (2) the ratio of retweets of the "winner" and the post that achieved the median number of retweets, (3) the ratio of retweets of the "winner" and the average number of retweets per post, (4) the ratio of retweets of the "winner" and the "loser", and (5) the relative fraction of retweets that the "winner" obtained. To aid comprehension, we provide a baseline (top row) that represents the results for all clone sets. The plots are color-coded to indicate their relative performance compared to this baseline. A green background signifies a larger median (y=0.5) value than the baseline, while a blue background indicates a smaller median. Median values and their relative differences are depicted with orange dotted lines and percentage values, respectively. Reuters exhibits the largest relative improvements compared to the baseline across three metrics ("winner/first", "winner/median", and "winner/loser"). This suggests that winning posts from Reuters are often reposts, surpassing the "median" and "loser" by a significantly greater margin compared to winners from other domains. Conversely, for Forbes, Buzzfeed, Bloomberg, and CNBC, the majority of winners are the "first posts", indicated by the vertical line in the first column of the plot, and their performance does not outshine the others to the same extent.
Mixed clone set analysis of competing domains: The relative rankings become more interesting when considering mixed clone sets (Fig. 7(b)) and how the ranking changes compared to the full dataset (Fig. 7(a)). Here, the most frequent "winner" (and "first poster") is Reuters, which goes from being ranked 6th to 1st and being responsible for almost 50% more winners than the 2nd-ranked domain (NY Times). We also see some big drops in the rankings (e.g., Forbes went from ranked 2nd to last of the ranked domains) and a few cases with noticeable differences between the metrics.
Perhaps most noticeably, considering the mixed clone sets, Yahoo is the 2nd most frequent "first poster" but only achieves the 7th best "winning" percentage. This highlights that it does not always pay off to be the first to post on a topic. One reason for this is that Yahoo clones tended to have relatively fewer retweets in general than the other top publishers. This is illustrated in Fig. 9, where we show the CDFs and CCDFs of the total number of retweets obtained across all clones linking the top-7 domains (blue curve) and their winning clone (red curve) together with the 75th-percentile values. We note that domains with more "first poster" instances than "winner" instances (Yahoo and Business Insider) obtained the fewest retweets, whereas the opposite was true for the two domains that obtained the most retweets (NBC News and ABC News).
Head-to-head competition: The difference in the domains' relative success when their associated clones are posted also became evident when comparing their winning percentages going head-to-head. Table II shows the fraction of times that a domain (row) won over another domain (column). We again use the top-7 domains in the mixed clone sets. Here, we use darker colors to indicate a bigger win fraction for the domain listed on the row over the domain listed on the column. Again, Yahoo and Business Insider are the most likely to lose (mostly yellow rows and red columns). Among the winners, Reuters and LA Times stand out, with the biggest portions of pairwise wins (mostly red rows and yellow columns).
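One possible way to compute a head-to-head matrix like the one described above is sketched below. The text does not spell out how a pairwise win is decided, so the rule used here (compare each domain's most retweeted clone within a shared clone set and skip ties) is an assumption, as are the function name and the example data.

```python
from collections import defaultdict
from itertools import combinations


def head_to_head(clone_sets):
    """Return {(row, col): fraction of decided contests that row won over col}."""
    wins = defaultdict(int)
    contests = defaultdict(int)
    for clones in clone_sets:
        best = {}
        for domain, retweets in clones:
            best[domain] = max(best.get(domain, 0), retweets)
        for a, b in combinations(sorted(best), 2):
            if best[a] == best[b]:
                continue  # assumption: ties are skipped
            winner, loser = (a, b) if best[a] > best[b] else (b, a)
            wins[(winner, loser)] += 1
            contests[(a, b)] += 1
            contests[(b, a)] += 1
    return {pair: wins[pair] / total for pair, total in contests.items()}


# Made-up mixed clone sets, each a list of (domain, retweets).
sets = [
    [("reuters.com", 10), ("yahoo.com", 2)],
    [("reuters.com", 1), ("yahoo.com", 3)],
    [("reuters.com", 7), ("yahoo.com", 0), ("yahoo.com", 4)],
]
print(head_to_head(sets))
```

Under this rule, the entries for (row, col) and (col, row) sum to one, since every decided contest has exactly one winner.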

IV. WHAT MAKES A WINNER?
A. Single-factor Analysis
We next discuss factors that were found to increase the success probability of a post. Here, Table III provides a summary of the percentage of clone sets where the "winner" had a value for the variable under consideration that was greater than or equal to that of the "first post", "loser", and "median" clone. Furthermore, we split variables into three categories based on whether the variable captures the characteristics of the tweeter ("User"), the tweet itself ("Tweet"), or one of four measures of success ("Success"). For simplicity, we do not include weaker predictors, and in the following, we discuss the variables for which the "winner" had at least the same value for ≥80% of the samples and the metric was available for ≥10K comparisons.
Tweeter of a clone: The person/account responsible for a clone is a major factor in its probability of success. While such correlations can be found in regular datasets, the role of the original tweeter is perhaps more clearly and fairly captured when working with clone sets, as they neutralize the effects of the shared content.
To illustrate the effects of the tweeter characteristics, Fig. 10 shows CDFs of the number of listings and followers associated with the tweeters of the different clone categories. The significant shift in the follower curves (note the log scale) for the accounts behind the "winners" (most retweeted clone), followed by the "first posters", relative to the other categories shows that the "winning" accounts have often already built part of their success before the time of posting the tweets. By having attracted followers that will see their tweets, they are more likely to succeed also with future tweets and benefit from rich-get-richer effects (from the perspective of the tweeters). What is perhaps a bit more surprising is that the "first poster" often is a user that has relatively more listings and followers. This finding suggests that initial posters often have a substantial following, indicating that users who consistently share original content are more likely to gain followers over time. It is important to note that some "first poster" clones were shared by prominent news accounts promoting their own articles.
Tweet lengths: Consider next the tweet itself. Here, the use of clones for head-to-head comparisons becomes even more important (in part because the tweet sizes are more uniform). However, when comparing clones, we have found that the "winners" tend to use somewhat longer tweet texts than the other tweet categories. This is shown by the shift in the CDF of the "winner" category relative to the other CDFs (Fig. 11(a)) and the fact that this category has a significantly larger fraction of tweets close to Twitter's 280-character limit (e.g., much sharper increase around this point). The success of longer tweets is also visible in the heat-map scatter plot (Fig. 11(b)), showing the number of tweets of different text lengths that obtained a certain number of retweets. Tweets exceeding the 280-character limit can be attributed to Twitter's inclusion of previous tweets in reply chains as mentions, which do not count towards the character cap [6]. Fig. 11(c) demonstrates how this feature may enhance the popularity of longer tweets.
Public success metrics: Finally, we compare how much more success the "winners" achieved than the other categories using each of the four success metrics: retweet count (our default success metric), likes, quotes, and replies. Fig. 12 shows both CDFs and CCDFs for each of these metrics.
As expected, the "winners" outperformed the other categories noticeably with regard to all four metrics, and the "first post" always achieved the 2nd best of the five categories.
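The kind of percentage summarized in Table III can be sketched as a simple pairwise comparison over clone sets. The function name, the handling of missing values, and the example numbers are assumptions made here for illustration.

```python
def share_winner_at_least(pairs):
    """pairs: (winner_value, other_value) per clone set; returns a percentage."""
    # Skip clone sets where the variable is missing for either clone.
    usable = [(w, o) for w, o in pairs if w is not None and o is not None]
    if not usable:
        return None
    hits = sum(1 for w, o in usable if w >= o)
    return 100.0 * hits / len(usable)


# Made-up follower counts: (winner, first post) for five clone sets,
# one of which lacks the value for the winner.
follower_pairs = [(100, 50), (10, 10), (5, 20), (8, 1), (None, 3)]
print(share_winner_at_least(follower_pairs))
```

The number of usable pairs is worth reporting alongside the percentage, since the discussion above is restricted to variables with at least 10K comparisons.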
V. MODELING SUCCESS
Building upon the empirical findings from the previous section, we next model tweet success using linear regression.
Pairwise correlations and multicollinearity: Before building regression models that take into account the clone sets, we first consider the correlations of the observed variables when ignoring clone set membership. Fig. 13 shows the pairwise Pearson correlation coefficients as a matrix, with the metrics sorted from the least to the most correlated with the number of retweets. As expected, the most highly correlated metrics are the three other success metrics: likes, quotes, and replies.
Preliminary test models for the full dataset then indicated the presence of multicollinearity. To quantify collinearity, we calculated the Variance Inflation Factor (VIF) for each variable within clone sets. Although mentions and hashtags showed a slightly higher correlation with tweet text length, we decided to include them in the final model, considering their different variations. To mitigate multicollinearity, we used adjusted R^2 as selection criterion and applied all possible subset regression.
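For reference, the two standard quantities named above can be written as follows. The notation is introduced here and is not taken from the paper: R_j^2 denotes the coefficient of determination obtained when regressing predictor j on the remaining predictors, R^2 is the coefficient of determination of a candidate model, n is the number of observations, and p is the number of predictors in that model.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
R^2_{\mathrm{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

A VIF of 1 indicates that a predictor is uncorrelated with the others, and larger values indicate stronger collinearity. Unlike R^2, the adjusted R^2 can decrease when a predictor that adds little explanatory power is included.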
Fig. 12: CDF and CCDF of success metrics for one-year-old dataset.
Best subset analysis: For each clone set, we conducted a best subset analysis, selecting models with the highest adjusted R^2 for each variable count. For the analysis presented here, we will exclude the three other success variables. Fig. 14 shows the best subset analysis for clone sets without the three public success metrics. Here, the user follower count emerged as the most important predictor, included in 89% of models, followed by "user verified" (75% inclusion). Both predictors exhibited more significance, with a higher proportion of low p-values. User tweet count (70% inclusion) and user following count (66% inclusion) were the third and fourth most frequent predictors, with 28% significance each. Tweet text length appeared in more models but lacked significance in most, likely due to its correlation with mentions and hashtags. Photo included had limited significance, and video included was rarely selected. Models with 12 predictors performed worse on average compared to those with fewer predictors.
Domain-based models: Table IV presents the per-domain results for the clean clone sets, excluding public metrics. Once again, the most significant predictor is the user follower count, followed by user verified. The inclusion of user tweet count and user following count is closely balanced. However, in the clean subset, we observe that the variable "user verified" for the domain "Forbes" has a significantly lower inclusion rate, appearing in only about 40% of the best models. This contrasts with other domains, where it has a higher inclusion rate. Interestingly, in the mixed subset (results omitted due to space), "Forbes" exhibits a substantially higher value for "user verified", surpassing other domains. This suggests that the identity of the "Forbes" domain has a significant impact on the "user verified" variable, as the clean subset shows a notably lower inclusion rate compared to the mixed subset. Except for the user verified variable, the results do not significantly differ between the domains or between the mixed and clean analysis results, suggesting that there is no substantial difference in how tweet variables relate to the URL domain identity mentioned in the tweet. Finally, we note that "video included" appears to be the weakest indicator of success, consistently performing poorly. This finding aligns with the results shown in Fig. 14.
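The best subset selection used in this section can be illustrated with a small, dependency-free sketch: fit an ordinary least-squares model for every subset of predictors and keep, for each subset size, the one with the highest adjusted R^2. This is not the authors' code; the solver, the variable names, and the data are assumptions made for illustration, and the sketch requires more observations than predictors plus one and predictors that are not perfectly collinear.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def adjusted_r2(rows, y, cols):
    """Fit OLS with an intercept on the chosen columns; return adjusted R^2."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    fitted = [sum(b * v for b, v in zip(beta, d)) for d in design]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    n, p = len(y), len(cols)
    return 1.0 - (ss_res / ss_tot) * (n - 1) / (n - p - 1)


def best_subsets(rows, y, names):
    """For each predictor count, return the subset with the highest adjusted R^2."""
    best = {}
    for size in range(1, len(names) + 1):
        scored = [(adjusted_r2(rows, y, cols), cols)
                  for cols in combinations(range(len(names)), size)]
        score, cols = max(scored)
        best[size] = ([names[c] for c in cols], score)
    return best


# Synthetic clone set: retweets depend almost linearly on the first column.
rows = [[1, 5], [2, 3], [3, 8], [4, 1], [5, 7], [6, 2]]
retweets = [3.1, 4.9, 7.05, 8.95, 11.1, 12.9]
print(best_subsets(rows, retweets, ["followers", "tweet_length"]))
```

Exhaustive enumeration grows as 2^p in the number of predictors p, which remains tractable for the roughly dozen variables considered here but not for much larger variable sets.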

VI. RELATED WORK
This work aligns with research exploring factors influencing post popularity on different social media platforms. Studies have investigated how content impacts engagement and diffusion. For example, [7] demonstrated the effect of emojis on user engagement while [3] revealed YouTube popularity factors when controlling for near-identical videos. A survey of this research line is available in [8].
Focusing on Twitter, studied here, various works have examined the impact of different factors. [2] highlighted content as an important factor for gaining retweets. [9] found that URLs and hashtags in the content have strong relationships with retweetability, and [10] showed that the subject of a tweet is the most informative content-based feature.
Contextual features also play a significant role. The number of followers and followees, as well as the account's age, seem to affect retweetability [10]. In a related study, [11] explored tweet features, including content and contextual factors, to

Fig. 1: Example clone set with posts linking news article clones posted by three news outlets.

Fig. 1 illustrates the concept of a clone set. Here, distinct colors are used to denote tweets linking cloned versions of a news article published by different news outlets, while the order and height of the bars illustrate the relative timing and quantity of retweets for each such tweet, respectively. Using the clone concept, we can then control for the content and study which tweets are most successful and what content-agnostic factors most influence a tweet's future success in generating retweets. For example, what factors most influence which tweet in a clone set will be the winner, and to what degree do we observe a pronounced first-poster advantage?

Fig. 3: Distribution of clone set sizes and domains for the one-year-old dataset.

Fig. 5: Median retweets for winners (1st) and second-placed clones (2nd) for the one-year-old dataset. The size of each marker represents the number of clone sets of that size.

Fig. 7: Frequency that a domain had the most retweets in a clone set.

Fig. 8: CDFs of relative "winner" comparisons for the top-7 domains. Green/blue indicate higher/lower median than baseline (top row) and percentage values show the relative median differences compared to the median baseline.

Fig. 10: CDFs of the two most impactful tweeter characteristics for different subsets of clones.

Fig. 14: Percentage of model occurrences derived from all clone sets through best subset selection (without public success metrics). Full bars show all occurrences, and the light colored parts indicate those where the predictor had p-value < 0.005.

TABLE II: Head-to-head wins/competitions for one-year-old dataset. Darker cells indicate bigger win percentage for the domain listed for that row when competing with the domain listed for that column.