Simplistic Collection and Labeling Practices Limit the Utility of Benchmark Datasets for Twitter Bot Detection

Accurate bot detection is necessary for the safety and integrity of online platforms. It is also crucial for research on the influence of bots in elections, the spread of misinformation, and financial market manipulation. Platforms deploy infrastructure to flag or remove automated accounts, but their tools and data are not publicly available. Thus, the public must rely on third-party bot detection. These tools employ machine learning and often achieve near perfect performance for classification on existing datasets, suggesting bot detection is accurate, reliable and fit for use in downstream applications. We provide evidence that this is not the case and show that high performance is attributable to limitations in dataset collection and labeling rather than sophistication of the tools. Specifically, we show that simple decision rules -- shallow decision trees trained on a small number of features -- achieve near-state-of-the-art performance on most available datasets and that bot detection datasets, even when combined together, do not generalize well to out-of-sample datasets. Our findings reveal that predictions are highly dependent on each dataset's collection and labeling procedures rather than fundamental differences between bots and humans. These results have important implications for both transparency in sampling and labeling procedures and potential biases in research using existing bot detection tools for pre-processing.


INTRODUCTION
With the rise of online social media as an important means for connecting with others and sharing information, the influence of bots, or automated accounts, has become a topic of vital societal concern. Some bots are benign and serve content which is entertaining or directly enhances the accessibility of the site (e.g., by providing captions for otherwise uncaptioned videos on a platform), but many others engage in influence operations, the spread of misinformation, and harassment: fake followers boost some users' perceived popularity; spammers flood the site with advertisements for a political candidate or product; malicious automated accounts undermine the credibility of elections or inflame polarization. Bots have reportedly influenced the 2016 US Presidential Election [5,38], the Brexit vote in the UK [4,38], the spread of misinformation about COVID-19 [27] and financial markets [13,54]. The ability (or inability) to accurately label such accounts could have a very real impact on elections and public health as well as public trust in institutions.
Platforms remove large numbers of accounts that they deem inauthentic, but they keep these removal systems secret and may be incentivized to misrepresent the influence or prevalence of bots. Indeed, bot detection was at the center of Elon Musk's negotiations to buy Twitter: Twitter claimed that less than five percent of its monetizable users are bots [68] while Musk claimed the number is much higher [53]. Because internal bot detection techniques are in general not made public, researchers, journalists, and the public at large rely on researcher-developed tools to separate bots from genuine human users and understand the impact of bots on social phenomena.
Developing tools for bot detection on Twitter and other online social media platforms is an active area of research. Over the last decade, an abundance of user datasets have been collected for the purpose of enabling third-party bot detection. Tools trained on these datasets achieve high (sometimes nearly perfect) performance using expressive machine learning techniques such as ensembles of random forests and deep neural networks, and hundreds or thousands of features such as profile metadata, engagement patterns, network characteristics, and tweet content and sentiment.
Crucially, researchers frequently use bot detection as a preprocessing step when studying social phenomena, separating human users from bots in order to study one or both groups. This includes topic areas such as the spread of mis- or disinformation [7,42,55,63-65,69], elections [3,5,26,43,56,66] and echo chambers [8], and such studies have been published in premier venues for scientific research including Science [69], Nature [55] and PNAS [66]. For example, Broniatowski et al. [7] observed that bots eroded the population's trust in vaccinations, González-Bailón et al. [37] conclude that bots share a disproportionate amount of content during political protests, and Vosoughi et al. [69] conclude that humans and bots spread fake news in different ways. The robustness and validity of these results depend on accurate and reliable bot detection.
Third-party bot detection tools are also easily accessible to and widely used by the public: the most recent version of Botometer [62] reportedly receives hundreds of thousands of daily queries to its public API [76] and BotSentinel [6] provides a browser extension and ways to conveniently block accounts classified as bots.
Is bot detection fit for downstream use? To an outsider (someone who might want to use bot detection but has not done research on the topic), bot detection might seem like a case study in the successful application of machine learning to an important problem: researchers have collected a variety of datasets for a well-defined classification task and expressive machine learning models like random forests and neural networks attain near-perfect performance on the data. Moreover, these methods have been widely adopted in both the academic literature and in public use. Bot detection tools are frequently trained on a combination of datasets, and researchers have argued that the existing approach can adapt to the shortcomings of existing classifiers or the evolution of more human-like bots by adding more datasets [62] or using even more complex techniques, like generative adversarial networks [10].
Even so, there are signs that bot detection tools are far from perfect. They may disagree with one another [49], prove unreliable over time [58], and rely on dubious labels [28,29]. Moreover, bot detection researchers certainly do not consider the problem solved and have, in previous research, observed that bot classifiers can fail to generalize [20] and surfaced concerns that more sophisticated bots may go undetected [10]. Here, we attempt to reconcile and systematically explain the apparent achievements of Twitter bot detection with what seem to be significant challenges and limitations.
Evaluating third-party bot detection datasets and tools is inherently challenging: the "ground truth" is unknown or inaccessible to the public, and the only window of insight we have into bots on Twitter is through the datasets themselves. However, this does not make evaluation impossible. We can still gain a better understanding of what these datasets tell us by closely analyzing the datasets and how they relate to one another.
Consider, for example, the dataset released by Cresci et al. [11] (cresci-2017), one of the most widely used in the academic literature. This dataset consists of a pool of genuine human users, collections of fake followers, and several types of 'spam bots': a diverse collection of accounts in this domain. The state-of-the-art model is a deep neural network that uses text data and achieves essentially perfect performance on this dataset [45]. However, a closer inspection revealed something surprising: we can achieve near-state-of-the-art performance using a classifier that asks a single yes/no question of the data. In fact, there are at least two different yes/no questions which nearly separate the human and bot classes. These classifiers are shown in the left and middle decision trees in Figure 1. The left tree is an artifact of convenience sampling from Avvenuti et al. [2], which concerns social sensing of natural disasters using Twitter.¹ On the right of Figure 1 we show another high-performing classifier for another popular dataset: caverlee-2011, published in [46]. Again, a small number of yes/no questions distinguishes humans from bots with high accuracy. These examples are not exceptional cases. As we will show, almost all of the other benchmark datasets we analyze admit high performance using very simple classifiers. How should we reconcile these results with our intuition that bot detection is a difficult problem? On the one hand, it is possible that bot detection is inherently simpler than expected, and simple decision rules suffice. On the other, perhaps the datasets themselves fail to capture anywhere near the true complexity of bot detection. If this is the case, then while simple decision rules perform well in-sample, their performance will be significantly worse when deployed. We provide evidence to support the latter hypothesis across a wide range of Twitter bot detection datasets.
Our contributions. In this work, we carefully examine widely used datasets for Twitter bot detection and explore their limitations. First, we demonstrate that simple decision rules perform nearly as well as the state-of-the-art models on benchmark datasets. Thus, each dataset only provides predictive signals of limited complexity. Because our simple decision rules allow us to transparently inspect the reasons for our classifiers' high performance, we find that predictive signals in the datasets likely reflect particular collection and labeling procedures, i.e., the processes for collecting accounts from Twitter and assigning a human or bot label to each account.
Next, we examine combinations of datasets. Many bot detection tools combine datasets (see, e.g., [19,39,77]) and argue implicitly or explicitly that it is possible to cover the distribution of bots that appear on Twitter by doing so. Building on prior work [20,62], we show that expressive machine learning models trained on one dataset do not perform well when tested on others and that models trained on all but one dataset perform poorly when evaluated on the held-out one. Information provided by a dataset does not generalize to others, suggesting that the datasets follow dissimilar distributions, which in turn indicates different sampling (i.e., collection and labeling) procedures.
Finally, we consider whether imposing structural assumptions about the data, namely that each dataset contains bots from one of a small number of types (e.g., spam bots or fake followers), can yield greater generalization as the approaches in Sayyadiharikandeh et al. [62] and Dimitriadis et al. [19] suggest. We find that simple decision rules can accurately differentiate bots of each type from humans. Thus, each sample of bots of one type is itself of low informational complexity. We additionally show that, within accounts of a particular bot type, simple decision rules can identify from which dataset a given bot originates. Thus, datasets of a given bot type are drawn from very different distributions, again indicating different data collection procedures. Taken together, these results suggest that each individual dataset contains little information, predictive signals in each dataset are not informative for prediction on the others, and this is true even within datasets representing a particular type of bots. Therefore, existing datasets are unlikely to provide a representative or comprehensive sample of bots and our analysis can explain why it is unlikely that classifiers trained on this data will perform well when deployed.
Beyond bot detection, our methodology (examining simple decision rules on datasets and measuring cross-dataset performance) may be useful for detecting simplistic data sampling and labeling processes in a range of machine learning applications. If datasets admit highly accurate simple decision rules, the datasets themselves have low informational complexity. If, additionally, expressive machine learning models trained on some datasets do not generalize to other datasets, the underlying system does not appear to be simple, and the datasets are unlikely to provide insight into the problem domain as a whole.
We also believe these findings have direct implications for future bot detection research both on Twitter and beyond: creators of bot detection datasets should transparently report and justify sampling and labeling procedures; researchers developing bot detection techniques should train and analyze simple, interpretable models alongside more expressive ones; and researchers using bot detection as a preprocessing step should consider how it may bias results.

BACKGROUND
Bot detection techniques. Researchers have applied a range of cutting-edge machine learning techniques to bot detection across diverse types of data in order to improve classification. One approach applies random forests [34,74] and ensembles of random forests that combine predictions from classifiers trained on subsets of data [19,62]. Another popular approach leverages text data to apply large pre-trained language models [40] or models trained by the researchers themselves [30,41,45,48,50]. A third approach uses network data to train graph neural networks [1,22,25] or tries to detect botnets from anomalous network structures [72]. Finally, a fourth approach seeks insight from other disciplines by using behavioral [32,36] or biology-inspired techniques [15-17,60]. In addition to novel predictive models, significant effort is spent deriving or exploring profile, text or network features that are likely to be informative for bot detection [41,51]. All of the papers cited above rely on the benchmark datasets analyzed in our work.
Limitations of bot detection tools. Several papers have explored the limitations of bot detection techniques, but few provide evidence to explain these limitations. To the best of our knowledge, our work is the first that traces the limitations of bot detection to simplistic sampling and labeling strategies. Martini et al. [49] compare three public tools for bot detection and find significant disagreement in predictions across tools. Relatedly, Rauchfleisch and Kaiser [58] find that a single tool may produce varying results over time as a result of variation in account activity, and Torusdağ et al. [67] created bots that can reliably evade existing bot detection frameworks. Elmas et al. [21] find that the qualitative observations of prior work, such as that bot accounts are typically recently created or are marked by a high volume of activity, do not hold on data collected for their paper and conclude that popular classifiers may not generalize. Gallwitz and Kreil [28,29] manually identify individual accounts that are incorrectly labeled as 'bots' in popular datasets, noting a high prevalence of false positives and arguing that the labels which are typically taken as ground truth may have errors. Researchers have also explored and tried to quantify the practical difficulties of bot detection. For example, Cresci [10] argues that bots may become more sophisticated over time to evade detection and Echeverria et al. [20] provide evidence to suggest that existing tools may not generalize to out-of-sample data. By contrast, our analysis in this work focuses on finding concrete, quantifiable explanations for these limitations, developing a framework for evaluating whether datasets are suitable to the prediction task at hand, and providing recommendations for future best practices in data collection and labeling.

DATA & METHODS
In this section, we discuss the datasets we analyze and our criteria for including each in our analysis. Most benchmark datasets in the literature are aggregations of data collected across various contexts. The benchmark datasets we study are summarized in Table 1.

Dataset collection.
To collect a list of benchmark datasets, we searched Google Scholar for peer-reviewed papers related to bot detection and followed the references of the papers we found. We found a total of 58 papers using at least one of the datasets we included in our analysis, of which 22 had at least 50 citations on Google Scholar at the time of writing (several had at least 500 citations) and 26 were published since 2020. In our analysis, we only include datasets that were used in multiple peer-reviewed bot detection papers reporting accuracy and F1 scores that we found in our search, although nearly all of the datasets were used in many more than two. Several of the datasets were accessed via the Botometer Bot Repository.² For the rest of the datasets, we reached out to the author(s) of the associated paper to request access to the original data (for twibot-2020 and yang-2013) or found public access to the data online (in the cases of caverlee-2011 and pan-2019).
We also received augmented data for gilani-2017, as was used in the original work [32-34], from the authors, though a reduced feature set is available on the Bot Repository. For gilani-2017 and caverlee-2011, the original data provided by the authors [34,46] contained at least 35% more users than are included in the Bot Repository; we use the larger, original datasets in our results. For the astroturf and varol-2017 datasets published on the Bot Repository, the data only came as a list of user identifiers. Due to the amount of time that has passed since their origination, we did not rehydrate that data or use it in our analysis.
Features. All of the datasets include profile characteristics, which typically include screen name, number of tweets, number of followers, number of following, number of favorites, language, location, timezone, number of Twitter lists on which the user is included, and others. Additionally, several datasets include a corpus of tweets by each of the users in the dataset. Network relations and associated following/followers behavior are occasionally recorded.
Annotation methods. Ascertaining 'ground truth' labels for bot detection is a challenging task. In most datasets, humans, either the authors of the paper or hired crowdworkers, assigned a 'bot' or 'human' label to each account manually. Previous work has found human annotators have a high level of agreement with each other [34], and accounts for which there is not enough agreement are sometimes excluded from the datasets [24]. Others used heuristics or relied on external sources (e.g., celebrity accounts [celebrity-2019] or accounts that tweeted links from public blacklists [yang-2013]) to assign them. The quality of the hand- and heuristic-labeled datasets depends crucially on the implicit assumption that humans are very good at the classification task, and neither the datasets themselves nor the broader literature provide robust evidence that this is the case. To the contrary, recent evidence suggests human annotators are systematically biased towards believing opinion-incongruent accounts are bots [71,73]. Similarly, there are accounts for which neither bot nor human labels may be appropriate, such as semi-automated accounts or accounts that represent an institutional entity like a corporation or a university [9]. Nevertheless, since other work assumes labels in the data are ground truth and since better annotation methods are not available, we make the same assumption.

Dataset descriptions.
The datasets which we considered fall into two categories: component datasets, which consist of a single class (human or bot) of accounts, and composite datasets, which consist of a combination of component datasets. Each of the 28 datasets is described briefly below. Unless otherwise specified, the authors of the associated paper hand-labeled the dataset.
social-spambots-1 [11] are spam accounts used during the 2014 Roman mayoral election to promote a particular candidate. social-spambots-2 [11] are spammers who promoted the Talnts app using the hashtag #TALNTS. social-spambots-3 [11] contains accounts which spammed links to products on Amazon, both genuine links to products as well as malicious URLs. traditional-spambots-yang [74] are accounts spamming known malicious links, collected by crawling the Twitter network. genuine-accounts-yang [74] are accounts which did not tweet a malicious link, taken from the same crawling process as traditional-spambots-yang. traditional-spambots-2 [11] includes accounts that share malicious URLs and accounts that repeatedly tag those sharing such content. traditional-spambots-3 [11] and traditional-spambots-4 [11] are accounts spamming job offers. pronbots-2019 [75] are Twitter bots infrequently tweeting links to pornographic sites. elezioni-2015 [14] are manually labeled Italian language accounts that used the hashtag #elezioni2013. political-bots-2019 [75] were collected and identified by Josh Russell (@josh_emerson) to be automated accounts run by a single individual to amplify right-wing influence in the U.S. midterm-2018 [77] includes accounts which used relevant hashtags such as #2018midterms during the 2018 US elections. stock-2018 [13] are accounts with high volumes of tweets that tagged stock microblogs with cashtags. genuine-accounts-cresci [11] is purportedly a random sample of human Twitter users, confirmed to be genuine by their response to a natural language question. These are the accounts that all tweeted "earthquake", using data from previous work on crowdsensing for natural disasters [2], mentioned in Section 1 and discussed in Section 4.
twibot-2020 [24] was collected by crawling the Twitter network using well-known users as seeds. The accounts were manually labeled by hired crowdworkers. gilani-2017 [34] contains accounts sampled from Twitter's streaming API.
rtbust-2019 [51] contains manually labeled accounts subsampled from all accounts which retweeted Italian tweets during the data collection period. fake-followers-2015 [11] and vendor-purchased-2019 [75] are fake follower accounts purchased from different Twitter online markets. caverlee-2011 [46] was collected via honeypot Twitter accounts, and the researchers used a human-in-the-loop automated process to label bot and human accounts. celebrity-2019 [75] are manually collected verified celebrity accounts. the-fake-project-2015 [14] consists of accounts which followed @TheFakeProject and successfully completed a CAPTCHA. botwiki-2019 [77] is a list of self-identified benign Twitter bots, for example automated accounts that post generative art or tweet world holidays. feedback-2019 [75] is a collection of about 500 accounts which users of Botometer flagged as being incorrectly labeled by that tool.
² https://botometer.osome.iu.edu/bot-repository/datasets.html
Simple decision rules. While sophisticated machine learning models are able to learn complicated relationships between patterns in input data and their labels, their flexibility generally comes at the cost of transparency and interpretability. We choose to instantiate 'simple decision rules' as shallow decision trees because their transparency allows us to easily examine exactly why each data point is assigned a label. Similar analyses are significantly more difficult or infeasible with the complex and opaque models predominantly used in bot detection. Researchers have applied now-standard explainable machine learning tools like LIME [59] and SHAP [47] to bot detection models [14,44,77]. However, none of these can demonstrate, as we do, that the underlying datasets admit simple, high-performing classifiers that rely on a small number of features.
Other simple machine learning models like linear regression, means, or a nearest neighbors classifier may be able to provide similar interpretability to shallow decision trees, but the choice of the exact method is not important for our analysis. We use scikit-learn's implementation of binary decision trees,³ which are trained recursively on numerical data by choosing a feature-threshold pair (represented by a node) that best splits the data into two groups by class and then learning a decision tree on each group separately. When we incorporated text data (as in the left-most tree in Fig. 1), we used one-hot encoding for each word in the corpus. In our case, after a fixed recursive depth (corresponding to tree depth), the classifier outputs a label corresponding to that of the majority of examples in the group; these are leaves of the tree. We only consider trees of depth four or less to ensure the trees can be readily inspected and to avoid overfitting. See Figure 1 for several examples of shallow decision trees trained on benchmark datasets.
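The recursive procedure described above can be sketched from scratch in a few lines. This is an illustration only, not scikit-learn's implementation: the Gini impurity criterion, the function names (`gini`, `best_split`, `build_tree`, `predict`), and the toy accounts (a one-hot "contains the word earthquake" indicator plus a follower count) are all assumptions made for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the group is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature, threshold) pair that most reduces weighted impurity."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue  # a split must send examples to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best  # None when no split improves on the current group


def build_tree(rows, labels, max_depth):
    """Recursively split; at the depth limit, output the majority label (a leaf)."""
    majority = Counter(labels).most_common(1)[0][0]
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None:
        return majority
    f, t = split
    left = [i for i, row in enumerate(rows) if row[f] <= t]
    right = [i for i, row in enumerate(rows) if row[f] > t]
    return (
        f,
        t,
        build_tree([rows[i] for i in left], [labels[i] for i in left], max_depth - 1),
        build_tree([rows[i] for i in right], [labels[i] for i in right], max_depth - 1),
    )


def predict(tree, row):
    """Follow yes/no questions from the root until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


# Hypothetical toy data: [tweeted_earthquake (one-hot), follower_count]
rows = [[1, 120], [1, 40], [1, 300], [1, 15], [0, 90], [0, 35], [0, 500], [0, 10]]
labels = ["human"] * 4 + ["bot"] * 4
tree = build_tree(rows, labels, max_depth=1)
```

On this toy data a depth-one tree asks a single question about the one-hot word feature, which is the kind of rule that can be read off directly and traced back to how the accounts were sampled.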
Performance metrics. The most commonly reported metrics used in the literature are accuracy and the F1 score. Accuracy is defined as the fraction of examples labeled correctly. When a dataset is not balanced between classes, the accuracy may be misleading since a naive model can achieve a high accuracy by always predicting the majority class. The F1 score in binary classification is the harmonic mean of the model's precision and recall. In our context, a low F1 score indicates a classifier that either does not detect a high proportion of the bots or incorrectly labels a large fraction of the humans. The F1 score does not incorporate the number of true negatives, i.e., humans correctly labeled as humans, which is potentially misleading in contexts where bots outnumber humans.
Though the two metrics complement each other, both depend on the proportions of humans and bots in the data. For these reasons, it is hard to compare accuracy and F1 score results across models and datasets with different proportions of bots and humans. To provide additional clarity and comparability, we report the balanced accuracies (bal. acc.) of our classifiers, i.e., the arithmetic mean of the true positive rate and the true negative rate. Balanced accuracy is a less useful metric when one has prior knowledge about the relative proportions of bots and humans in the context where the classifier will be deployed.
³ https://scikit-learn.org/stable/modules/tree.html
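To make the relationship between the three metrics concrete, the following minimal sketch computes them from scratch, treating 'bot' as the positive class. The function name `metrics` and the 90/10 class split are hypothetical choices for illustration; the example shows the naive majority-class predictor discussed above.

```python
def metrics(y_true, y_pred, positive="bot"):
    """Accuracy, F1, and balanced accuracy for a binary bot/human task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # true positive rate
    tnr = tn / (tn + fp) if tn + fp else 0.0     # true negative rate
    # F1 is the harmonic mean of precision and recall; true negatives never enter it.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": accuracy,
        "f1": f1,
        "balanced_accuracy": (recall + tnr) / 2,
    }


# Hypothetical imbalanced dataset: 90 humans, 10 bots.
y_true = ["human"] * 90 + ["bot"] * 10
# A naive classifier that always predicts the majority class.
y_pred = ["human"] * 100
naive = metrics(y_true, y_pred)
```

Here the naive classifier reaches 90% accuracy while detecting no bots at all, so its F1 score is 0 and its balanced accuracy is 0.5, the same as random guessing.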

RESULTS
In this section we present and discuss the results of experiments run on the datasets identified in Section 3. In Section 4.1, we establish that simple decision rules, instantiated as shallow decision trees, yield near-state-of-the-art performance when trained and evaluated on these benchmark datasets and that the simple decision rules are suggestive of sampling and labeling procedures. In Section 4.2, we show that the information contained in one dataset is not informative for classification on other datasets; in other words, classifiers trained on one dataset do not generalize to other datasets. Building on prior work [20], we next establish that training a classifier on all of the datasets but one and testing on the held-out one yields performance not much better than random guessing. Predictably, both of these results are weaker in the cases where the held-out dataset shares some data with the training dataset(s).
In Section 4.3, we assess an assumption made among popular bot detection tools in the literature [19,20,62]: each dataset of bots represents one of a small number of types of bots, like spam bots or fake followers. This assumption underpins the approach of training a series of specialized classifiers to detect each bot type and then combining their outputs to provide an overall prediction. We find it is indeed possible to build simple classifiers that perform well on differentiating one type of bot from humans, consistent with prior work using more sophisticated models [62].
However, in Section 4.4, we provide evidence that datasets within a given bot type are not drawn from similar distributions; simple decision rules can also differentiate bots within the same type from different datasets. This implies that rather than each dataset being broadly representative of all or a part of the respective subspace containing that type of bot, these component datasets are drawn from narrow and easily separable regions of the sample space, meaning that the signals from each dataset are strongly influenced by sampling and labeling procedures. We conclude that we should not expect that the collection of more (even many more) datasets using similar simplistic sampling and labeling strategies will result in significantly more generalizable classifiers.

Decision trees on component datasets.
In Table 2, we summarize the performance of our decision trees against that of the state-of-the-art classifier on each dataset. For each dataset, we train a tree of each depth from one to four and report the accuracy and F1 score for the shallowest tree that achieves test accuracy and F1 score within 2.5 percent of the best-performing tree; in other words, we favor shallower trees when performance is similar across depths.
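The depth-selection rule can be written down compactly. The sketch below rests on two assumptions that the text does not settle: "within 2.5 percent" is read as 2.5 percentage points in absolute terms, and the best value of each metric is taken separately across depths. The function name `select_depth` and the scores are hypothetical.

```python
def select_depth(scores, tolerance=0.025):
    """Pick the shallowest depth whose accuracy and F1 are both near the best.

    scores maps tree depth -> (test accuracy, test F1), each in [0, 1].
    tolerance is interpreted as an absolute gap (an assumption).
    """
    best_acc = max(acc for acc, _ in scores.values())
    best_f1 = max(f1 for _, f1 in scores.values())
    for depth in sorted(scores):
        acc, f1 = scores[depth]
        if acc >= best_acc - tolerance and f1 >= best_f1 - tolerance:
            return depth
    # Fallback if no single depth is close on both metrics: best accuracy wins.
    return max(scores, key=lambda d: scores[d][0])


# Hypothetical cross-validated scores for depths one to four.
example_scores = {1: (0.90, 0.88), 2: (0.96, 0.95), 3: (0.97, 0.96), 4: (0.975, 0.97)}
chosen = select_depth(example_scores)
```

With these made-up scores the rule selects depth 2: depth 1 falls more than 2.5 points short of the best accuracy, while depths 3 and 4 add too little to justify a deeper tree.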
In training and testing our models, we use five-fold cross-validation and report the results accordingly. However, twibot-2020 comes with a train/test split, which we use instead for comparability with prior work. We searched the literature to find the state-of-the-art performance on each dataset. In the cases of midterm-2018 and stock-2018, we could not find papers reporting results on these datasets alone, so we omit entries for the state-of-the-art from the table. (In the literature, e.g., [19,39,52,62,70], these datasets are frequently used in combination with other ones.) Where a paper reported the results from multiple models or from different test sets, we recorded the maximum score achieved for each metric to make our analysis as conservative as possible.
We only train our shallow decision trees on the types of features used in the state-of-the-art models in order to make our analysis as conservative as possible. Thus, if a classifier from the literature is trained on profile features like the number of followers or number of tweets, but not text features, we use only profile features in our model. In many cases, the full set of features that were used to develop state-of-the-art models were not publicly available. Despite this, for almost all of the datasets, our simple decision rules perform nearly as well as the state-of-the-art. For all datasets except rtbust-2019, accuracy is within ten percentage points of the state of the art and all of those but gilani-2017 and caverlee-2011 are within five percentage points of the state of the art. For most datasets, the F1 score is similarly close to the state of the art.
For the "earthquake" classifier on cresci-2017 described in Section 1 and Fig. 1, an inspection of several of the human-labeled accounts on Twitter after we had trained the simple model revealed that they had tweeted the word "earthquake," after which an automated account replied, asking for more information about their situation. Subsequently, Cresci et al. [12] confirmed that the data originated in the authors' previous work on detecting natural disasters using social media [2].
Other datasets yield classifiers which are similarly suggestive of their sampling and labeling procedures. The third tree in Figure 1 shows that the account creation date is an important feature in distinguishing bots from humans in this dataset. This may result from the authors targeting the collection of spam accounts on Twitter for about eight months starting in December 2009. Active spam accounts may have high turnover on the platform if they are reported by users or the platform targets them for suspension, so very few spambots in the dataset were created more than a month before data collection began. The originators of the dataset make a similar observation [46]. For twibot-20, the depth-one decision tree reported in Table 2 checks only whether the user is verified, and achieves nearly state-of-the-art performance on just this one feature. This may be an artifact of both the sampling and labeling strategies. The authors collected accounts by starting from well-known (verified) seed users and collecting the network around those users using a breadth-first traversal [24], and we expect many verified users to follow each other. Accounts were labeled by crowdworkers and discarded if annotators did not sufficiently agree with each other, but verified accounts were automatically labeled human, so they were not at risk of being excluded. The other datasets we
The cases where our models underperform relative to the state of the art are informative. The state-of-the-art model for rtbust-2019 uses the time between a user's retweets, which we did not have access to. The bots in the dataset were identified by "suspicious" temporal retweeting patterns, so a simple decision rule with access to inter-tweet time may yield much higher performance. feedback-2019 is a small dataset collected from accounts reported to be misclassified by an earlier version of Botometer, a complex sampling mechanism that may be hard to capture with our simple decision rules. For yang-2013, the F1 score is dragged down by low recall (the percentage of bots that were classified as bots), which may be a result of the classifier biasing predictions toward the human label, since the dataset itself is more than 90% human. When a classifier is trained on a balanced subsample of this data, the F1 score of a depth-four decision tree is within 3% of the state of the art.
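The rebalancing step mentioned for yang-2013 can be illustrated with a minimal sketch. It assumes binary labels (1 for bot, 0 for human) and downsamples every class to the size of the rarest one. The helper name `balanced_subsample` and the toy labels are hypothetical, and the exact subsampling procedure used for the reported result is not specified here.

```python
import random
from collections import Counter


def balanced_subsample(labels, seed=0):
    """Return sorted indices of a class-balanced subsample.

    Every class is downsampled to the size of the rarest class.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(idxs) for idxs in by_class.values())
    keep = []
    for idxs in by_class.values():
        keep.extend(rng.sample(idxs, n_min))
    return sorted(keep)


# Toy labels: 90% human (0) and 10% bot (1), an assumed imbalance for illustration.
labels = [0] * 90 + [1] * 10
keep = balanced_subsample(labels)
counts = Counter(labels[i] for i in keep)
print(counts)
```

Training on the rows selected by `keep` removes the incentive for a classifier to default to the majority (human) label, which is one plausible reason recall improves after rebalancing.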
We stress that these results are not intended to suggest that simple decision rules make useful classifiers for bot detection, as they can be in other domains [61]; instead, they reveal that bot detection classifiers are limited by the simple sampling and labeling procedures used to construct datasets. If we believed that bot detection is a simple, low-dimensional classification task, we might accept that simple classifiers suffice in this domain, but intuitively we do not expect this to be the case. In what follows, we support this intuition by considering how classifiers generalize across datasets.

Cross-dataset generalization.
Heuristic data collection and labeling practices can be useful when the sample space is also easy to describe. If the sample space is simple, we should expect that classifiers trained on one dataset perform well on others. We present evidence here that this is not the case, by showing that classifiers that perform well on a given dataset typically do not significantly outperform random guessing when tested on each of the others, even when using more expressive models. Similarly, we find that classifiers trained on all but one dataset and tested on the held-out dataset do not perform significantly better than random guessing in most cases. From this, we conclude that the best predictors for separating human and bot users are not consistent across datasets.
Train on one, test on another. For each dataset, we train a random forest with 100 trees using scikit-learn's default parameter settings and evaluate test performance on each of the others, using an 80%-20% train-test split. We restrict the feature sets for all training and testing data to the counts of followers, accounts followed, tweets, and lists each user is on, since these features are common to nearly all datasets. This allows us to compare each classifier on a consistent feature set. yang-2013 and caverlee-2011 do not contain information on Twitter lists, so we did not include that feature in classifiers trained or tested on those datasets. When we instead used the pairwise intersection of the features available for each train-test combination, the results were qualitatively the same.
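The train-on-one, test-on-another protocol described above can be sketched as follows. This is an illustration under stated assumptions, not the code used for Table 3: the three toy datasets are synthetic, the helper `make_toy_dataset` is hypothetical, and evaluating every classifier on the held-out 20% of each dataset is one reading of the 80%-20% split.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split


def make_toy_dataset(rng, n, bot_shift):
    """Synthetic stand-in for one benchmark dataset (hypothetical data).

    Columns mimic follower, following, tweet and list counts.
    """
    y = rng.integers(0, 2, size=n)
    X = rng.lognormal(mean=3.0, sigma=1.0, size=(n, 4))
    X[y == 1] *= bot_shift  # bots differ from humans in a dataset-specific way
    return X, y


rng = np.random.default_rng(0)
datasets = {
    "toy-a": make_toy_dataset(rng, 400, np.array([10.0, 1.0, 1.0, 1.0])),
    "toy-b": make_toy_dataset(rng, 400, np.array([1.0, 1.0, 0.1, 1.0])),
    "toy-c": make_toy_dataset(rng, 400, np.array([0.1, 1.0, 1.0, 1.0])),
}

# Hold out 20% of each dataset so that diagonal entries are also out-of-sample.
splits = {
    name: train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
    for name, (X, y) in datasets.items()
}

names = list(datasets)
scores = np.zeros((len(names), len(names)))
for i, train_name in enumerate(names):
    X_train, _, y_train, _ = splits[train_name]
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X_train, y_train)
    for j, test_name in enumerate(names):
        _, X_test, _, y_test = splits[test_name]
        scores[i, j] = balanced_accuracy_score(y_test, forest.predict(X_test))

print(names)
print(scores.round(2))  # rows: training dataset, columns: test dataset
```

In this toy setup each dataset separates bots from humans along a different feature or in a different direction, so a forest that fits its own dataset has no reason to transfer to the others.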
For this experiment, we use random forests rather than shallow decision trees to limit the extent to which the poor performance of the models can be attributed to insufficient expressiveness, though we did find similar results when using shallow decision trees. Further, many papers in the literature use random forests to achieve state-of-the-art performance (see, e.g., [34,74]). Although it is possible that more expressive models, like neural networks, could achieve better cross-dataset performance than random forests for this experiment, we believe this to be implausible, since we can already capture much of the predictive signal in these datasets with simple decision rules. The results are summarized in Table 3, where each row corresponds to the dataset used for training and each column to the dataset used for testing. We see qualitatively similar results for accuracy and F1 scores, but omit those tables for brevity.
In Table 3, the diagonal entries largely reproduce the experiment from Section 4.1 using random forests, showing the unsurprising fact that these can fit each dataset well. Because we use a restricted feature set, the performance of on-diagonal entries is sometimes lower in Table 3 than the performance reported in Table 2. Most of the entries off of the main diagonal show model performance no better than (weighted) random guessing or assigning all examples to a single class.
In a few cases, we see balanced accuracies significantly higher than 0.5. This is the case for classifiers tested on cresci-2017 and cresci-2015 as well as, to a lesser extent, for midterm-2018 and yang-2013. These numbers may be explained by overlap between the datasets, as is the case between cresci-2017 and yang-2013 noted in Section 3, or similar collection and labeling strategies conducted by the same research group, such as is the case with midterm-2019, cresci-2017 and cresci-2015. Notably, training on yang-2013 yields poor performance when testing the classifier on other datasets, but other datasets yield classifiers that perform well on yang-2013, suggesting that it is a dataset that is in some sense easy to test on and not very useful for training classifiers. In some cases, we see balanced accuracy significantly worse than 0.5. This may arise from the distributions of users being very different across these datasets. A model trained on a dataset where human accounts have very low activity and bots have very high activity will perform very poorly when evaluated on a dataset where these patterns are reversed. The generally unimpressive performance of most of these classifiers suggests that the collections of accounts
Table 3: Balanced accuracy of random forests trained on the row-indexed dataset and tested on the column-indexed one.
Leave-one-dataset-out. Following [20,62], we also train random forest classifiers on all but one dataset and test on the held-out dataset. We report the results in Table 4. In most cases, the signal contained in all but one dataset is not sufficient to predict the labels on the held-out dataset. For the cases where this is not true, the datasets share some data in common or were collected and labeled using similar strategies. These results suggest that the best predictors in each dataset are different, and therefore that these benchmark datasets, even when combined, do not generalize to perform well on the others. In the rest of this section, we explore generalization across datasets further, examining whether making assumptions about the types of bots that exist in each dataset can be used to build generalizable predictors.
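A leave-one-dataset-out evaluation of the kind described above can be sketched with scikit-learn's `LeaveOneGroupOut` splitter, treating the dataset of origin as the group. The toy datasets below are synthetic and hypothetical, and the sketch is not the code behind Table 4.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
dataset_names = ["toy-a", "toy-b", "toy-c", "toy-d"]  # hypothetical datasets
n_per_dataset = 200

X_parts, y_parts, group_parts = [], [], []
for group_id, _ in enumerate(dataset_names):
    y_part = rng.integers(0, 2, size=n_per_dataset)
    X_part = rng.normal(size=(n_per_dataset, 4))
    # Each toy dataset separates bots from humans along a different feature,
    # mimicking dataset-specific collection and labeling artifacts.
    X_part[:, group_id] += 3.0 * y_part
    X_parts.append(X_part)
    y_parts.append(y_part)
    group_parts.append(np.full(n_per_dataset, group_id))

X = np.vstack(X_parts)
y = np.concatenate(y_parts)
groups = np.concatenate(group_parts)

# Each fold trains on all datasets but one and tests on the held-out dataset.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
held_out_scores = cross_val_score(
    forest, X, y, groups=groups, cv=LeaveOneGroupOut(), scoring="balanced_accuracy"
)
for name, score in zip(dataset_names, held_out_scores):
    print(f"held out {name}: balanced accuracy = {score:.2f}")
```

Because the informative feature differs in every toy dataset, pooling the remaining datasets gives the forest no usable signal for the held-out one, which is the failure mode the paragraph above describes.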

Generalization using a bot taxonomy.
We observe in the previous analysis that classifiers trained on one or more benchmark datasets do not give good performance on the others. However, previous work [19,62] has argued that different datasets may contain different types of bots (e.g., simple bots, spammers, fake followers, self-declared, political bots, and other bots, a catch-all category), which could account for poor out-of-sample performance in cases where the bot types are different. The need to combine data from different populations is common in prediction problems. For example, to make predictions on patient data from adult and pediatric hospitals, it is necessary to combine data across age groups in ways that account for population differences. Similarly, bot detection may require differentiating between and combining data from qualitatively different types of bots.
Following Botometer [62], we define combined bot-type datasets as follows. simple consists of the bot accounts from caverlee-2011; spammers includes the social or traditional spambots datasets and pronbots-2019; fake-followers consists of fake-followers-2015 and vendor-purchased-2019; other-bots consists of feedback-2019 and rtbust-2019; financial-bots consists of stock-2018. We do not have access to data for the roughly 500 accounts in astroturf-2020, included in political-bots in [62].
In Table 5, we show that these datasets admit accurate yet simple classifiers for detecting a bot type against a sample of human accounts, indicating as in Section 4.1 that the predictive signal contained in each bot type-human dataset is also simple. For each type of bot, we take equal-sized random samples of human accounts and accounts from the bot type and learn a shallow decision tree. We report accuracy and F1, along with the depth that achieves the stated test performance for an 80%-20% train/test split. Since we balance the datasets, accuracy and balanced accuracy are equivalent and we do not report the latter. As before, we favor shallower decision trees when performance is comparable across depths.
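The bot-type-versus-human procedure can be sketched as below: balance the classes, fit decision trees of increasing depth, and keep the shallowest tree whose accuracy is comparable to the best. The synthetic features, the depth range of 1 to 4, and the 0.01 tolerance used to define "comparable" are all assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_bots = 300

# Hypothetical account features; bots of this toy "type" follow many accounts.
bots = rng.lognormal(mean=[3.0, 6.0, 4.0], sigma=1.0, size=(n_bots, 3))
humans_pool = rng.lognormal(mean=[3.0, 4.0, 4.0], sigma=1.0, size=(2000, 3))

# Draw as many humans as there are bots so that the classes are balanced.
humans = humans_pool[rng.choice(len(humans_pool), size=n_bots, replace=False)]
X = np.vstack([bots, humans])
y = np.concatenate([np.ones(n_bots, dtype=int), np.zeros(n_bots, dtype=int)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

results = {}
for depth in range(1, 5):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    pred = tree.predict(X_test)
    results[depth] = (accuracy_score(y_test, pred), f1_score(y_test, pred))

# Prefer the shallowest tree whose accuracy is within a tolerance of the best.
tolerance = 0.01  # assumed threshold for "comparable" performance
best_accuracy = max(acc for acc, _ in results.values())
chosen_depth = min(
    d for d, (acc, _) in results.items() if acc >= best_accuracy - tolerance
)
print(chosen_depth, results[chosen_depth])
```

With balanced classes, a trivial classifier scores 0.5 accuracy, so any depth whose accuracy sits well above that baseline reflects signal a few thresholds can capture.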
Across the bot types, the simple decision rules achieve high accuracy against a baseline of 0.5 (random guessing or a naive classifier) on this task, summarized in Table 5. The performance on political-bots is perfect, perhaps as a result of the small size of this bot type in our data. We see only modest performance improvements by using a random forest rather than a simple decision rule and we omit the details of those results here.
We next learn classifiers that predict which dataset an account comes from within bot types in order to understand whether accounts from different datasets in the same category are substantively similar to one another.

Distinguishing datasets within bot types.
Continuing with the taxonomic viewpoint, here we address whether these datasets can be used to train models that generalize well only within a single class of bots. We provide evidence towards the negative by showing that it is possible to distinguish accounts from different datasets within a bot type. Even when restricted to the subspace of spam bots, for instance, the constituent datasets in that class are sampled in such distinct ways that a classifier can easily distinguish them. We also observe a similar phenomenon with the human accounts, indicating that the strategies used to gather and label human accounts also result in samples being drawn from different parts of that sample space. In Table 6, we report accuracy and balanced accuracy for the task of predicting which dataset an account comes from for accounts within a given type. Rather than a binary prediction task of bot or not, this is a multiclass prediction task where each account is labeled with the dataset it comes from, omitting the classes that consist of only one dataset. We train shallow decision trees to predict these labels, summarized in Table 6. Again we see minor performance improvements by using random forest classifiers, but prefer the decision trees for their interpretability.
For humans, we have access to six datasets, so a naive or random classifier should achieve a balanced accuracy of less than 0.17. However, we achieve accuracy greater than 0.65 and balanced accuracy of greater than 0.4, indicating that this classifier can identify which dataset a human account belongs to with probability significantly better than random chance. For spammers, we achieve balanced accuracy of 0.75 and nearly perfect accuracy on seven datasets, against a naive baseline of less than 0.15. The classifier here gains predictive power from artifacts of the sampling strategies used to collect the data. As examples, most of the accounts in pronbots-2019 liked several tweets whereas other spammers did not; accounts in social-spambots-1 sent many tweets in Italian whereas other spammers used English or only tweeted URLs; still other datasets can be separated by basic account activity features like the number of accounts followed.
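The naive baselines quoted above follow from the definition of balanced accuracy as the mean of per-class recall: a classifier that always predicts a single class scores 1/k on k classes (1/6 ≈ 0.167 for six datasets, 1/7 ≈ 0.143 for seven), however uneven the class sizes are. The short sketch below checks this on hypothetical dataset-of-origin labels; the class sizes are invented for illustration.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical dataset-of-origin labels with unequal class sizes.
y_true = ["d1"] * 50 + ["d2"] * 30 + ["d3"] * 10 + ["d4"] * 5 + ["d5"] * 3 + ["d6"] * 2
constant_guess = ["d1"] * len(y_true)  # always predict the largest dataset

naive_accuracy = sum(t == p for t, p in zip(y_true, constant_guess)) / len(y_true)
naive_balanced = balanced_accuracy(y_true, constant_guess)
print(naive_accuracy, naive_balanced)
```

Plain accuracy rewards guessing the largest class, while balanced accuracy does not, which is why both metrics are worth reporting for this multiclass task.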

CONCLUSION & DISCUSSION
In this paper, we provided evidence that Twitter bot detection datasets are limited by their simple collection strategies, explaining their poor generalization. These findings suggest that a limiting factor in advancing bot detection research is a lack of robust, high-quality data. If accounts are improperly discarded as likely bots because they have not liked enough tweets, because their account was created at the wrong time, or because they were accessed from a particular kind of device, errors are introduced into downstream analyses. If any of those features correlate with the topic of interest, these errors may bias the conclusions of those analyses.
This work also highlights the broader issue of how opaque machine learning techniques may obscure certain flaws in the underlying data. While we examine the Twitter bot detection ecosystem in this work, our methods and recommendations should apply broadly to the study of any online social media platform and to applied machine learning research wherever datasets are reused across contexts. We encourage researchers who originate and publish datasets to be explicit about their sampling and labeling procedures, perhaps using existing documentation tools [31]. Researchers and engineers building bot detection tools should carefully examine their training data, possibly via the use of simple models like shallow decision trees, to ensure the data captures the complexity of the space the more expressive model attempts to describe. Finally, those using bot detection as a preprocessing step should carefully consider how errors might propagate through this process and what kinds of biases might be present in their analysis as a result. Given how seriously the academic community and general public treat the problem of bot detection on social media, we also hope that the platforms themselves can facilitate work in this area by providing rich and robust data with high-quality ground truth labels.

Table 1: Publicly available benchmark datasets for bot detection.

Table 2: Performance of our shallow decision trees (SDT) versus state-of-the-art (SOTA) on benchmark datasets.

Table 5: Distinguishing each bot type from humans.

Table 6: Distinguishing an account's dataset by type.