The Art of Cybercrime Community Research

In the last decade, cybercrime has risen considerably. One key factor is the proliferation of online cybercrime communities, where actors trade products and services, and also learn from each other. Accordingly, understanding the operation and behavior of these communities is of great interest, and they have been explored across multiple disciplines with different, often quite novel, approaches. This survey explores the challenges inherent to the field and the methodological approaches researchers have used to understand this space. We note that, in many cases, cybercrime research is more of an art than a science. We highlight good practices and propose a list of recommendations for future cybercrime community scholars, including taking steps to verify and validate results, establishing privacy-aware and ethical research practices, and mitigating the challenge of obtaining ground truth data.


INTRODUCTION
In 2007, the seminal paper 'An inquiry into the nature and causes of the wealth of internet miscreants' [50] was published. This study analyzed a cybercrime market, using seven months of data from an IRC channel. Now, over a decade later, there is an established body of work analyzing, understanding, and exploring these online platforms used as a place to share knowledge, tools, and techniques, and to socialize. Cybercrime communities have evolved from IRC channels to forums and cryptomarkets, and are increasingly moving to mobile chat platforms. Many papers have since been published, with authors from a variety of academic disciplines providing their own unique insights into how these platforms are used for cybercrime.
The main goal of this paper is to provide researchers new to the field with an overview of what they would be building upon, and the scientific principles underpinning prior work. After analyzing 99 research papers on cybercrime communities, we conclude that in many cases, cybercrime community research represents more 'art' than 'science'. Research is often exploratory, depending on the dataset and methods available, in contrast to scientific hypothesis testing. Cybercrime communities are difficult to study, with some of the main shortcomings including: (1) Lack of available data: Collections of cybercrime community data have only recently become available, with early research constrained to those with the resources to collect their own datasets, or to small subsets of community posts. Data collection requires setup and ongoing maintenance, which can increase research time, and may not be comprehensive. More recently, shared datasets have become available to avoid this issue; however, they need to be kept up to date so as not to become stale. Indeed, whereas in other disciplines historic data may remain current and relevant, some types of cybercrime are volatile and unforeseen, e.g., COVID-related scams, or cryptomining. This requires agile (and often not too concise) methods for data collection and analysis. (2) Lack of validation and generalization: Due to the adversarial and hidden nature of cybercrime, ground truth can be particularly difficult to establish. Furthermore, findings from one cybercrime community, or for one type of crime, may not generalize to another. (3) Lack of reproducibility: This covers two concerns: first, providing sufficient detail (or sharing datasets and code) to enable other researchers to reproduce a study faithfully; and second, that research findings are strengthened if they are reproduced by others. Often, researchers focus on novel approaches and application areas, and reproduction of prior work is not incentivized. Most research is exploratory, with very little prior work testing formulated hypotheses. (4) Little consensus about the ethics of working with data, although some norms have emerged. This lack of consensus is particularly concerning with cybercrime data, since the processing of this data can cause additional harm to users if they are later prosecuted [69,70]. (5) Privacy and legal issues when working with leaked and scraped private data. In this case, the data might contain relevant cybercrime information (e.g., actual purchases of hacking material) intermixed with data from licit users (e.g., personal messages). Thus, processing requires careful protections to keep data private. We systematize the literature relating to cybercrime communities, from measuring the growth of multiple platforms over time, to estimating the proceeds from specific marketplace activities, informing new researchers to the field of the possibilities, as well as the common pitfalls, limitations, and issues that arise when doing research in this space. We start by providing a broad overview of the field (§3), describing what cybercrime communities are (§3.1), and the goals of cybercrime community research (§3.3). We then explore the methodological approaches used in underground community research (§3.4), and explore ways to classify text, images, and attachment data, including tools for categorizing threads or understanding the social networks based on posting activities. We also consider the different data sources being analyzed (§3.5), including how datasets are collected and shared. Next, we discuss the main challenges and limitations (§4), both inherent to the field and weaknesses of prior research, including limited ground-truth datasets, and ethical considerations of research projects. Finally, we provide an overview of recommendations for future researchers and present future challenges in the field (§5).

SCOPE AND DEFINITIONS
In the simplest terms, cybercrime consists of any criminal activity (such as fraud, theft, or distribution of child pornography) committed using a computer, especially to illegally access, transmit, or manipulate data. To understand cybercrime communities, we consider platforms hosted on both the dark web (the part of the web inaccessible by standard browsers and not indexed by common search engines) and the surface web, where discussions of cybercrime and advertisements for cybercrime products/services take place. We also include 'cryptomarkets' that are primarily used as drug markets but also host cybercrime activity. Note that we exclude from this scope hate and harassment platforms, so as not to duplicate existing work [124].
To find relevant research papers on cybercrime communities, we start by using keywords related to cybercrime forum and marketplace communities, namely 'cybercrime forums'; 'cybercrime communities'; 'cybercrime marketplaces'; and 'underground forum analysis'. We used these keywords on Google Scholar to find papers, and for each paper, we expanded our search using both the paper's reference list and the papers citing it. Using this snowballing approach, we filtered papers to only those that fit our scope of cybercrime communities, including cybercrime forums, marketplaces, and chat channels (e.g., IRC and chat platforms), resulting in a collection of 99 papers. We systematize the methodological approaches used in these papers, exploring what is being measured and how, from the micro to the macro.

OVERVIEW OF THE FIELD
Table 1 shows the results of our review and the papers that were included. First, we explore the sources of data used: whether these made use of leaked data sources, were obtained through sharing agreements, were scraped for the paper, or were reused. Next, we indicate if the authors used multiple sites for their analyses, or relied on a single cybercrime community. We further indicate if they used the entire site, or instead explored a subset, such as a bulletin board or posts within a limited time period. We explore whether the authors indicate they obtained approval from a Research Ethics Board (REB), such as an Institutional Review Board (IRB) in the US, or an ethics committee, or if they were considered exempt. We note that some papers may have REB approval, but did not disclose this in the paper. We highlight whether the authors used English language or non-English language communities. For validation and evaluation, we indicate how authors obtained or created ground truth data, whether they removed any outliers in their method, and whether they used external data sources for verification or enrichment.
The papers in the review use a diverse set of methods, with very few taking the same approach. At most, 5 papers fall into a single grouping within the table. We noticed that the majority of the papers focus on English forums (96.0%) and scrape data on their own (54.5%), and of the papers that use ground truth, the majority rely on manual annotations (40.6%). The majority of the papers do not remove outliers (87.9%) or consider external verification (76.8%) to validate their results. Most concerning is the fact that most papers do not mention REB approval or exemption (82.8%).
A minority of papers have used data from leaked datasets (18.8%), raising privacy concerns. These datasets may contain personal data in private messages, and account data such as email and IP addresses which could be used to identify users. In addition, data obtained from public sources, including scraped or shared datasets, may contain personal data in public posts. This could include personal information shared on a forum because an individual has been doxxed. Although this data is public, it does not exempt researchers from potential privacy violations.

Cybercrime Communities
A cybercrime community is an online platform (e.g., forum or marketplace) where members engage in cybercriminal activities. The research community has analyzed a variety of cybercriminal communities, with researchers often focusing on forums analyzed by prior work; as a result, a limited number of forums are heavily analyzed while many newer forums remain relatively unexplored. While some communities are invite-only to limit membership to trusted individuals, this necessarily restricts all users, not just researchers. Cybercrime communities have varying levels of criminality, with a mix of discussion on both crime and other topics [12], used to build community and trust.
Cybercrime communities are purpose-driven. Despite differences in online communities, there is some similarity in the way forums are commonly set up and structured [59,94]. Forums tend to be broken down into boards or subforums, each relating to a topic or theme of conversation which members can contribute to. Boards contain threads: an ordered set of posts around a specific theme or question set by the first post. While longer threads may vary from the original topic ('off-topic'), moderators on the website may choose to close threads which do so. The first post in a thread is significant in comparison to the replies: the user has chosen to start a new conversation topic, to propose a new idea or ask a question, and replies may contribute either back to the first post, or to later replies. Each thread will also have a title. Marketplaces are structured differently [30]. Used for advertising goods and services for sale rather than discussion, they are usually segmented by 'department' for types of goods. Listings can be sorted and filtered. Each listing has a title, and may also have a description of the item, price, seller information (including their username and current rating), and a section for feedback or reviews. Compared to forums, the domain of text appearing in posts is typically smaller: posts advertising items will often provide a title, a description of the product, and a price. While extracting information from these is not straightforward, it can be a simpler task than extracting useful information from the free-form text discussions that appear in forums, which require processing natural language.
Marketplaces and forums thus have some structure, provided by categories, boards, and other moderator-chosen types of structure, which is useful for analysis. This structure also differentiates them from other platforms such as Discord, Telegram, and IRC [38]. However, despite the structured nature of forums and marketplaces, extracting useful information from the threads and advertisements is not a trivial task [58]. We note that some communities might contain a mixture of the different types [127], e.g., a marketplace with a dedicated forum, or a forum with an embedded chat.
Also, forum datasets may contain additional information that is unique to the forum. One example is reputation voting data, where each user has a reputation score which may signal trust, and other users can send positive or negative reputation points to provide feedback [92].
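The board-thread-post hierarchy described above can be sketched as a minimal data model. This is a hypothetical illustration in Python; the class and field names are our own and do not correspond to any specific dataset schema:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Post:
    author: str
    body: str
    created: datetime

@dataclass
class Thread:
    title: str                              # each thread has a title
    posts: list = field(default_factory=list)  # ordered; posts[0] sets the topic

    @property
    def first_post(self):
        """The first post is significant: it starts the conversation."""
        return self.posts[0] if self.posts else None

@dataclass
class Board:
    name: str                               # topic or theme of the board
    threads: list = field(default_factory=list)

# Example: one board containing a single two-post thread
t = Thread(title="Selling accounts", posts=[
    Post("seller1", "Advert text...", datetime(2020, 1, 1)),
    Post("buyer9", "PM sent", datetime(2020, 1, 2)),
])
b = Board(name="Marketplace", threads=[t])
```

A marketplace listing could be modeled analogously, with price, seller rating, and feedback fields instead of an ordered post list.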

Typical Methodology of Cybercrime Community Research
Research can fall into one of two categories: problem driven, when a researcher acquires and analyzes data for a specific use-case, and data driven, when a researcher has an existing dataset and develops new analysis methods or advances prediction performance for a given task. Figure 1 shows a typical methodology used for cybercrime community research.
In the problem stage, use-cases typically vary depending on the purpose of the research, including law enforcement, cyber threat intelligence (CTI), or criminological research. The use-case is used to formulate a problem definition: setting out the goals, scope, methods, and legal and ethical aspects of the work. Data is then acquired by the researcher, pre-processed, and explored. This step can be iterative, as the problem definition may be further refined or changed. Then, researchers carry out the data driven part of the task. In the data driven stage, various methods and tools are used. These are selected according to the goals, and may require experimental work to choose the optimal or a new approach. This is followed by verification and validation. Again, the details of this step depend on the method selected, but could include checking for outliers, checking for anomalous results, and using external data sources to verify results. Finally, researchers report and discuss their results.

Goals of Cybercrime Community Research
We annotated each paper included in Table 1 according to its goals. We first developed the categories key actor detection, key actor analysis, longitudinal, economic, and cyber defence from our prior knowledge of the area. These were selected to categorize the goals within the diverse corpus of papers. We then extended these with three further categories: discourse, covering topics and concepts discussed in communities; subcultural, exploring aspects of community culture; and crime type, for research focusing on specific activities. We note that each paper may have multiple goals. These are displayed by year of publication in Figure 2, which shows that the rate of publication greatly increased around 2016.
As in every meritocratic system, online hacker forums have participants with different levels of knowledge and influence. An emerging area of research is to analyze underground communities to identify key cybercriminals (key actor detection) [46, 59, 65, 66, 92, 132-135]. Researchers predict whether actors belong to a subset of members, such as 'key actors', 'proficient cybercriminals', 'expert hackers' or 'key-hackers' [1,45,77,82,106], where individuals organically stand out for their high reputation when compared to the vast majority of forum members.
Rather than looking back over prior user behavior, some researchers aim to detect which new topics are the focus of discussion (discourse) [57,121]. The use case for such research is primarily to detect emerging threats [35,37,75,101,102,109,111]. Others aim to uncover the type of jargon used within underground communities [73,114,115,129]. Stylometry is also used to identify users who may be operating multiple accounts within and across forums. Afroz et al. [3] use lexical, syntactic, and jargon features to detect multiple accounts, and analyze their feature set by identifying the features with the greatest information gain.
Analyzing cybercrime subcultures is another key research objective. The subcultural category includes how trust is gained and lost within communities [39,54], performances of masculinity and perceptions of gender [14], and the use of aggressive language on forums [25]. Other researchers explore how underground markets are governed and organized [89]. A wide variety of malicious activities thrive in cybercrime communities, which are explored in the crime type category. Researchers have analyzed currency exchange activities [12], eWhoring (a type of fraud where intimate images, often stolen, are used to simulate sexual encounters for financial gain) [63,93], money laundering [84], malware [53,61,74,77], and credit card fraud [126]. Others have categorized illicit products and services across forums and marketplaces [12,40,43,44,67,81,100,122] and identified potential supply chains [19].
Most organizations do not have the resources to patch the increasing number of vulnerabilities disclosed every month, and many of these software flaws are never exploited in the wild [6,8]. The idea of predicting software vulnerability exploitation is to prioritize the remediation of vulnerabilities that are likely to be targeted by malicious actors (cyber defence). Cybersecurity researchers explore the opportunities for early exploit detection by mining information about software vulnerabilities shared in underground forums together with security advisories published in white hat communities [6,88,105,110].
A body of research attempts to predict future attacks. This includes correlating malicious activities in underground forums and marketplaces with cyber-incidents collected from the logs of real-world enterprises [7,79,80,112,113]. Another approach studies adoption behavior among community members to predict their future activities [76,78]. Values, ideas, and techniques are transmitted from one person to another, and this behavior is also observed for malicious actors [47,76,85,99,116].

Methods for Cybercrime Community Analysis
This section categorizes methods used for the analysis of cybercrime communities. We differentiate between two overarching methodological approaches: quantitative analysis methods, typically involving large-scale measurements and statistical models, and qualitative methods, involving human reviewers reading and analyzing a subset of forum data. We further break down quantitative approaches into Social Network Analysis (SNA); Natural Language Processing (NLP) approaches, including Text Analysis & Analytics and Topic Modeling; and Machine Learning (ML). For qualitative approaches, we explore content analysis, including its use for crime script analysis. For further reference, Appendix ?? shows the methods used by papers over year of publication (note that some papers may use several methods).
Social Network Analysis. Understanding the social interactions of cybercriminals is of great interest, since cybercrime is often fueled by a rich and active supply chain of products and services [19,119]. SNA techniques are used to investigate, describe, and predict the overall relational cybercrime community network structures, and to identify key-hackers [6,29,56,78,79,82,98,104,112]. Nodes, clusters, and relations can represent an approximation of the communication structure and position of individuals within communities.
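As a toy illustration of the SNA approach, the sketch below builds a directed 'reply' network from the posting order in threads and computes a weighted out-degree per user, a crude centrality proxy. This is a hypothetical simplification: the co-posting heuristic is only one possible approximation of interaction, and real studies typically use richer graph libraries and interaction definitions:

```python
from collections import defaultdict

def reply_network(threads):
    """Build a directed reply graph: an edge (a -> b) is added whenever
    user a posts in a thread after user b, a coarse approximation of
    interaction often used when explicit reply links are unavailable.
    `threads` is a list of ordered author lists; returns edge weights."""
    edges = defaultdict(int)
    for authors in threads:
        for i, a in enumerate(authors[1:], start=1):
            for b in authors[:i]:
                if a != b:
                    edges[(a, b)] += 1
    return edges

def degree_centrality(edges):
    """Weighted out-degree per user: how actively they reply to others."""
    deg = defaultdict(int)
    for (a, _b), w in edges.items():
        deg[a] += w
    return dict(deg)

# Two threads, given as ordered lists of post authors
threads = [["alice", "bob", "carol"], ["bob", "alice"]]
centrality = degree_centrality(reply_network(threads))
```

Here "carol" scores highest, having replied after two distinct earlier posters; on real data such scores are used to shortlist candidate key actors for closer inspection.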
Text Analysis & Analytics. Text analysis and analytics uses techniques from ML, NLP, and linguistics to extract measurements and insights from text data, such as stylometry analysis [3] and hacker term identification [81,82]. This includes using embedding models (e.g., word2vec or BERT) and text graph representations. In cybercrime communities, using these techniques is particularly challenging due to the high use of changing jargon [115,129].
Forum datasets may also include attachments, including tools for carrying out cybercrime, e.g., snippets of code [122]. ML and NLP approaches have been used to classify these attachments for potential threats [108] and into known categories [128].
Topic Modeling. Topic models are capable of discovering the semantic themes (i.e., topics) within discussions and advertisements, by capturing associations among keywords (e.g., slang, new terminology, or product names) from an unlabeled dataset. These methods are useful for summarizing large datasets containing slang, without needing to build a training dataset, and will continue to work with new, unseen slang. Topic models require validation of results to ensure these are not meaningless. This includes choosing a suitable number of topics, which can be estimated using coherence scores to identify cohesion, and can be manually checked using domain knowledge of the words in each topic, combined with reading sample posts in each topic. Also, researchers may need to clean text data, such as by removing URLs.
One commonly used topic model for analyzing cybercrime community datasets is Latent Dirichlet Allocation (LDA) [20]. LDA is used with a corpus of documents (e.g., posts in forums) to discover topics represented as distributions over words, and can additionally identify the topic distribution of each document. LDA has been applied to attachments and tutorials to discover topics in hacker assets and threats [37,107], and to identify topics discussed by key actors [72,92].
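Before fitting a topic model such as LDA, posts typically need the cleaning steps mentioned above. A minimal, hypothetical preprocessing sketch follows; the stopword list is an illustrative subset, and the resulting token lists would then be fed to an LDA implementation (e.g., gensim's LdaModel), with topic counts compared via coherence scores:

```python
import re

# Illustrative subset only; real pipelines use a full stopword list
STOPWORDS = {"the", "a", "to", "for", "and", "is", "of", "at"}

def clean_post(text):
    """Normalize a raw forum post for topic modeling: strip URLs,
    lowercase, keep alphabetic tokens, drop stopwords and very short
    tokens (note that short jargon terms can be lost this way)."""
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

doc = "Selling FUD crypter, demo at https://example.onion/demo for $20"
tokens = clean_post(doc)
```

Deliberately simple filters like these are one reason topic-model output must be manually validated: aggressive cleaning can drop exactly the slang terms of interest.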
Machine learning approaches. Existing research using ML techniques has primarily focused on detection and categorization tasks. Initially, categorization tasks were applied to hacker communities, leveraging off-the-shelf unsupervised learning techniques, including topic modeling and clustering [36,37,72,81,107,108]. These techniques enable further understanding of content and trends on these forums.
Due to the links of cybercrime communities with real-world attacks and frauds [69,92], researchers moved towards predictive techniques in an attempt to anticipate potential incidents, and also focus on the analysis of actors of interest (community members of interest). Applications include automatically classifying malicious hacker content into pre-defined exploit labels [9,128], predicting software vulnerability exploitation and enterprise cyberattacks [6,79,80,112], and developing adversarial learning and cross-lingual knowledge transfer techniques for detecting cyber threats [41,42]. ML predictive approaches typically require feature engineering. Features used can be split into either metadata-based or text-based. Metadata-based features are based upon data obtained directly from the dataset, such as times of posting activity, or members posting in the same thread to detect communities. For example, these have been used to predict where members may send a private message by training a model on sent public messages [90,119], for later use on datasets which may not contain private messages. These use a combination of features about the user's posting activity in addition to text-based features. Text-based features are commonly used for classification models, such as for detecting certain types of activities across a forum dataset.
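The metadata/text-based feature split described above can be illustrated with a small, hypothetical feature extractor. The feature names and the jargon keyword list are our own illustrative choices, not taken from any cited paper:

```python
import re
from datetime import datetime

JARGON = {"crypter", "fullz", "botnet"}  # hypothetical keyword list

def extract_features(user_posts):
    """Build one feature vector per user, mixing metadata-based features
    (activity volume, posting hours) with a simple text-based one
    (jargon frequency). Each post is a dict with 'time' and 'body'."""
    n_posts = len(user_posts)
    night = sum(1 for p in user_posts if p["time"].hour < 6)
    words = [w for p in user_posts
               for w in re.findall(r"[a-z]+", p["body"].lower())]
    jargon = sum(1 for w in words if w in JARGON)
    return {
        "n_posts": n_posts,                   # metadata-based
        "night_ratio": night / n_posts,       # metadata-based
        "jargon_per_post": jargon / n_posts,  # text-based
    }

posts = [
    {"time": datetime(2021, 3, 1, 2), "body": "Selling a crypter, cheap"},
    {"time": datetime(2021, 3, 2, 14), "body": "Anyone got fullz?"},
]
features = extract_features(posts)
```

Vectors like these would then feed a standard classifier; in practice the choice and validation of features matters more than the model itself.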
ML approaches have also been combined with other techniques, e.g., combining ML with NLP tags and SNA features to analyze key actors on forums [92], and combining topic models with SNA metrics to rank forum members [55].
Qualitative approaches. Qualitative research approaches are useful for exploratory research into problems where little is already known about the topic [60,63], such as cybercrime communities, which tend to include hidden and small populations. Many researchers also use a mixed methods approach, in which they combine the rich insights obtained from qualitative approaches with quantitative measurements [14,118]. Qualitative research on cybercrime forums tends to be passive content analysis, rather than interviews or ethnographic research. Qualitative research allows researchers to develop models, typologies, and theories to describe and explain issues. Theory that is built upon qualitative data in such an inductive approach is called grounded theory. Other research may take a deductive approach, where existing theory (for example, criminological theory about how criminal behavior is learnt [61]) is applied and tested.
An example of a deductive approach is crime script analysis. Crime script analyses are built upon the idea of 'schemata' from cognitive science, namely that people have a basic understanding of how to interact in various social settings [33]. Applying this understanding to criminal activities allows us to map out how specific types of crime are carried out, to gain insight into quite complicated crime types. In relation to cybercrime communities, crime script analysis has been applied to understand stolen data markets [62], credit card fraud [126], travel fraud [60], and eWhoring [63]. Qualitative research has also been used for examining the ecosystem around the Internet of Things [15] and social behavior inside marketplaces [103].
Researchers need to carefully consider how to present research findings, and skillful writing is required. Furthermore, qualitative research can be quite time consuming and resource intensive, which is often underestimated (and poorly understood by many researchers who are not familiar with the process).

Data Sources
A key issue in research on cybercrime communities is the acquisition of the data required for the analysis. Researchers either rely on their own collection technologies, i.e., web crawlers and scrapers [125], or use existing datasets available for research. Some researchers gain access to datasets collected by law enforcement or security companies and made available under an NDA (e.g., [61]), which may limit reproducibility. In this section, we describe current methods and goals, systematize the steps needed for data collection, and provide an overview of datasets of cybercrime communities that are available for academic research. We also discuss data enrichment by means of external sources.

Collection methods.
Depending on the research needs and the available resources, there are different methods for data collection. Our taxonomy of existing collection methods for forums and markets is based on the method used and the desired scope.
Method. Crawling large communities at scale typically requires the use of automated tools. However, these are not always needed or available. Three possible collection methods when sampling content from forums are:
• Manual collection requires investigators to manually visit the sites online, selecting and storing the required content locally (typically, in text format). The method is valid in cases where a small sample is needed, and the effort required for the development and use of automated tools is not worthwhile. Indeed, this is the only collection method available to researchers lacking the technical skills to develop or use automated techniques.
• Bulk crawling uses automated tools to fetch and store raw files for offline processing. Links (URLs) are obtained using regular expressions, and are visited indiscriminately. Crawlers can use allow- and deny-lists, e.g., to limit crawling to a single community, or to avoid link-traps that automatically close sessions or remove accounts [125]. Bulk crawlers require low engineering effort to be scalable, but can collect useless data and are sub-optimal, since they demand substantial resources (i.e., storage and network bandwidth).
• Targeted crawling uses custom scraping technology to adapt to the particular site being monitored. This way, while the content is automatically crawled, its information is processed online (scraped). The collection can therefore be focused on the desired pages, at the cost of requiring custom adaptations for each site being monitored.
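The crawling methods above can be sketched with a minimal, hypothetical crawler skeleton. The fetch function is injected so that session handling, proxies, or Tor stay pluggable, and the allow/deny substring filters stand in for the allow- and deny-lists mentioned earlier (e.g., to stay within one community and skip logout link-traps):

```python
import re
from collections import deque
from urllib.parse import urljoin

def crawl(seed, fetch, allow, deny=(), limit=100):
    """Breadth-first crawl from `seed`. `fetch(url)` returns raw HTML;
    only links containing `allow` and none of the `deny` substrings are
    followed; `limit` caps the number of pages stored."""
    seen, queue, pages = set(), deque([seed]), {}
    while queue and len(pages) < limit:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        pages[url] = fetch(url)              # store raw page for offline processing
        for link in re.findall(r'href="([^"]+)"', pages[url]):
            link = urljoin(url, link)
            if allow in link and not any(d in link for d in deny):
                queue.append(link)
    return pages

# Offline usage example with a stubbed, in-memory site (no network needed)
site = {
    "https://forum.test/": '<a href="/board1"></a><a href="/logout"></a>',
    "https://forum.test/board1": '<a href="/"></a>',
}
pages = crawl("https://forum.test/", site.get, allow="forum.test",
              deny=("logout",))
```

A bulk crawler would drop the filters and store everything; a targeted crawler would additionally parse each page into structured records (thread titles, authors, timestamps) during the crawl.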
Scope.The research goals of a project determine both the spatial and temporal dimension of the crawling, and also the type of content collected.
• Comprehensiveness. Cybercrime communities contain miscellaneous content, e.g., boards for discussing politics as well as trading cyber-weapons, or markets selling both drugs and virtual items. Depending on the research needs, the crawling can cover specific sections or the entire site. Collecting all data requires more effort than focusing on particular areas. However, having a complete picture allows for a broader analysis, e.g., to understand pathways into crime. Most cybercrime forums contain much licit content, where community building and trust building take place. The relevance of these sections depends on the research goal, and researchers with limited resources may choose not to collect such data.
• Content. Collection can be restricted to textual content only, or can also download media content and other artifacts (e.g., binaries or documents). The former is simpler, and can still include the URLs of the linked media or attachments for later collection (at the risk of these links expiring). Downloading non-textual content allows for a more complete analysis, e.g., of image banners advertising products or services. However, it also puts researchers at risk due to legal concerns (e.g., for downloading illegal images).
Table 2 summarizes the collection methods implemented, and the challenges discussed in §4, for the works that explicitly describe the data collection process. Based on these works, we systematize the data collection process into the following steps:
(1) Investigating and selecting the communities for the research, i.e., the sites to be crawled. Analyzing the access requirements and studying the Terms of Service. Consulting an IRB or equivalent for legal and ethical advice (see §4.1.1).
(2) Gaining initial access. If needed, registering accounts using disposable emails, or creating custom accounts.
(3) Conducting preliminary observation, including manual navigation and content inspection. Collecting and storing a small sample subset to design and test the crawler offline.
(4) Defining the scope, such as collecting a single snapshot or conducting periodic re-visits, classifying the content that needs to be extracted, and selecting the areas of higher interest or priority.
(5) Crawler design and implementation. Defining the database, scheduling incremental crawling (if needed), designing custom scrapers, managing accessibility (e.g., using session cookies), implementing anti-crawling bypass techniques, etc.
(6) Crawler deployment. Configuring the infrastructure, e.g., deploying the database, installing necessary software, and preparing connections through VPNs, proxies, or Tor circuits.
(7) Production. Tracking logs for status monitoring, maintenance, and error management.
(8) Post-crawling. If needed, conducting incremental crawling, re-designing scrapers for site updates, adding further anti-crawling bypass methods, etc.

Datasets.
To avoid the tedious task of web crawling and scraping, several works use existing datasets of forum and market data. These datasets come from leaked databases and public data repositories (containing scraped data, leaked data, or both). Data repositories allow researchers to quickly get started with data analysis without needing to set up infrastructure and collect data themselves. In addition, data repositories can hold data collected over a longer period of time, and support reproducible results. There are three main repositories for cybercrime forums and markets. The DarkNet Market Archives (DMA) is a repository originally collated by Branwen, with later contributions by others [22]. It contains scrapes from 89 DarkWeb markets and more than 37 forums, totaling 1.6TB of data. The data is now quite dated (covering 2011-2015, with partial scrapes from 2017), and therefore not representative of the current cybercrime landscape. Still, these datasets serve as a good benchmark and are still used in cybercrime research: this dataset has been used in more than 70 studies [22].
AZSecure is a repository providing selected hacker community content [13]. It contains various datasets publicly available to researchers, obtained by scraping forums, markets, IRC channels, and carding shops. It contains data from 51 forums, totaling more than 32m posts, and 12 markets, totaling 249k listings. It also includes a dataset of hacking assets, i.e., attachments and source code. It spans more than a decade, with the latest dataset being from 2019.
The CrimeBB dataset of cybercrime forums is available to academic researchers under a legal agreement with the Cambridge Cybercrime Centre to prevent misuse [94]. It contains over 100 million posts scraped from 34 forums, dating back more than 20 years, and is kept up to date through regular incremental crawls. The same repository also contains other assets, such as attachments obtained from game-cheating communities [67], or contract data of actual trades occurring in cybercrime marketplaces [127]. The dataset has been used in at least 65 studies. Its main strengths are ease of access and frequent updates. Until recently, it was only provided as SQL binary dumps, which presented technical challenges for non-technical researchers [95]. To address these challenges, a search interface was developed for interdisciplinary researchers [96].
Finally, some researchers also provide the datasets used for their studies, to allow for reproducibility and foster new research. For example, Portnoff et al. [100] and Durrett et al. [40] provide the leaked and partially scraped datasets used for the automated analysis of cybercriminal markets. Also, Yuan et al. [129] provide the processed data (originally obtained from the DMA repository) used for understanding and analyzing jargon in forums.

Data enrichment.
To address data quality and validation issues, some researchers are able to verify data using external sources ('data enrichment'), with 21.7% of papers in the review using this type of data. For example, Vu et al. [127] verified marketplace transactions using Bitcoin on the blockchain, and Pastrana et al. [92] used public reports of forum actors who had been arrested or prosecuted for cybercrime offenses as ground truth for actors of interest to law enforcement ('key actors'). Almukaynizi et al. [6,8] used anti-virus or intrusion detection system (IDS) attack signatures provided by security companies such as Symantec to validate predictions of which software vulnerabilities would be exploited in the future. Similarly, to validate the results produced by cyber-attack prediction models, previous works have used historical records of cyber incidents taken from the logs of real-world enterprises [7,79,80]. Other data sources used for data enrichment include Twitter [111], CVEs [5,105], ExploitDB, and OSINT reports [5,111].
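As a minimal sketch of this kind of enrichment, the snippet below extracts CVE identifiers mentioned in post text with a regular expression and cross-references them against an external vulnerability list (e.g., an NVD export or an IDS signature feed). The function names and example data are invented for illustration.

```python
import re

# Hypothetical sketch: enrich forum posts by cross-referencing CVE IDs
# mentioned in post text against an external vulnerability dataset.
CVE_RE = re.compile(r"CVE-\d{4}-\d{4,7}", re.IGNORECASE)

def enrich_posts(posts, known_cves):
    """Return (post_id, cve) pairs for CVE mentions that also appear
    in the external dataset."""
    matches = []
    for post_id, text in posts:
        for cve in CVE_RE.findall(text):
            cve = cve.upper()  # normalize case before lookup
            if cve in known_cves:
                matches.append((post_id, cve))
    return matches

posts = [(1, "selling exploit for cve-2017-0144"), (2, "no refs here")]
known = {"CVE-2017-0144": {"cvss": 8.1}}
print(enrich_posts(posts, known))  # [(1, 'CVE-2017-0144')]
```

The same join-by-identifier pattern applies to other enrichment sources mentioned above, such as ExploitDB entries or OSINT indicators.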

COMMON ISSUES AND LIMITATIONS
Research on cybercrime forums and marketplaces can be affected by data issues and method limitations. We systematize some of the approaches taken in existing studies, and the types of limitations they face. For this section we rely heavily on our review of the papers as detailed in Table 1. Common issues and limitations can be split into two parts: challenges inherent to the field, and limitations of prior work to be addressed by future research.

Challenges inherent to research on cybercrime communities
Data collection challenges. Crawling cybercrime communities poses several challenges, which increase the complexity of the process [125]. Whether these challenges need to be addressed depends on the particular communities being monitored. The most common challenges are: • Accessibility. Gaining initial access to communities might not be straightforward [5]. Some sites are open to everyone. Others require registration, often by means of a valid email address for account verification.
Registration is often free of charge, but in some cases a fee is payable (see §4.1.1).
• Connectivity. To remain active, crawlers need to connect through various sources, typically using web proxies or VPNs. This provides robustness against IP blocking, and also preserves the anonymity of researchers. If the site is hosted on a hidden service only reachable using Tor, the crawler must provide proper circuit management, offering various exit nodes in case some of the circuits fail [94].
• Bot-detection methods. Online sites might attempt to detect and ban bots. This relies on techniques such as detection of anomalous network patterns, or monitoring accesses to old content (which probably no one visits). To avoid detection, crawlers must mimic human behavior, for example by randomizing accesses to content, introducing delays, or following human navigation and connectivity patterns. These techniques slow the crawling operation [28].
• Anti-crawling methods. In addition to bot detection, crawlers face anti-crawling methods [27]. The most common is the use of CAPTCHA challenges. These can be technically bypassed using automatic solvers, but such services are questionable from an ethical (and legal) standpoint [86]. A second option is to solve the CAPTCHA manually, keeping and re-using session cookies for subsequent connections. Sites can also implement traps to deter bulk crawling, e.g., linking pages that lead to loops, or providing links that automatically delete the account.
• Maintenance. Online communities are dynamic. Beyond changes in content, which require implementing incremental crawling techniques (see above), the underlying HTML structure can change as well, e.g., as part of a site update. Thus, crawlers must be robust and modular enough to adapt to changes with low engineering effort. It is also desirable to put a logging system in place to facilitate status monitoring and error management [51].
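The bot-detection point above can be illustrated with a short sketch of human-like request pacing: delays are drawn from a jittered range rather than being fixed, so the crawler's timing pattern is less regular. The parameter values here are purely illustrative, not taken from any cited study.

```python
import random

# Sketch of human-like request pacing for a crawler. A fixed delay
# between requests is easy to fingerprint; drawing each delay from a
# jittered range produces a less regular timing pattern. The base and
# jitter values below are hypothetical.
def jittered_delays(n, base=5.0, jitter=3.0, seed=None):
    """Return n per-request delays in seconds, each in [base, base+jitter]."""
    rng = random.Random(seed)  # seedable for reproducible tests
    return [base + rng.uniform(0, jitter) for _ in range(n)]

delays = jittered_delays(100, base=5.0, jitter=3.0, seed=42)
assert all(5.0 <= d <= 8.0 for d in delays)
```

In a real crawler these delays would be combined with randomized page-visit order and session-cookie reuse, as discussed in the anti-crawling point above.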
Unstructured data. One limitation inherent to research on cybercrime communities is the lack of structured data, which hinders analysis at scale. Often, researchers rely on text analysis methods, or limit their research to metadata alone. Other researchers focus on the social network structures found in forums and communities, but as the network is implicit, researchers rely on assumptions when building a network from post data. Language used in cybercrime communities contains jargon and specific terms, which can be due to obfuscation of illegal activities, or simply the way members talk [129]. Research methods using text analysis with 'off-the-shelf' NLP models trained on standard language may not perform effectively in this field, and require further fine-tuning with manual annotations [26].
Use of leaked data. Leaked data provides researchers privileged access to non-public information, e.g., private messages or login details. These are useful for studies that aim to correlate public and private interactions [90] or analyze actual trading from private messages [87]. However, using this data has drawbacks. First, there are ethical concerns (see §4.1.1). Second, it suffers from the 'observer effect': once actors know about the database leakage, they may change their patterns or move to another forum. Third, these datasets provide a single snapshot, and rapidly become outdated.
Incomplete and volatile data. Collected datasets, whether obtained through scraping, leaks, or data sharing, may appear to be immediately useful. However, if researchers assume the data is correct and complete without further inspection, this can lead to incorrect results [34]. Beyond the lack of structured data, collected datasets may not be complete collections, and authors should take this into account when making claims about a whole forum. Incompleteness can arise from platforms preventing researchers from scraping them, using bot-detection techniques. Other adversarial techniques used by forums may slow down scrapers through rate limiting, or use adversarial ML techniques to make text analysis and classification difficult [125]. In addition, a dataset may be incomplete because older posts or boards can be deleted by users or moderators. In some cases, entire sections of a forum may be deleted by administrators, as happened in October 2016, when HackForums removed the 'Server Stress Testing' section and banned booter services (which offer denial-of-service attacks for a fee) from advertising [31]. In some of these cases, repositories have been able to reconstruct missing threads from existing datasets [94]. Some researchers choose to focus on a subset of communities, either due to limited resources (such as time) or to avoid working with large datasets that analysis tools may not handle. Moreover, unlike research on other communities, cybercrime community research requires agile, and often unvalidated, analysis methods to understand abrupt changes in activity.
Lack of ground truth. Usually, analysis is built on top of imperfect or missing labels. This presents issues with evaluation. Ideally, researchers have access to 'ground truth' labels, but these are typically manual annotations, and for large datasets this task is often outsourced to non-domain experts. Annotation is highly time consuming, and is likely to be constrained by the resources available to the researcher. This has implications for performance evaluations, as models may not necessarily be predicting what they were designed to predict. Also, data from cybercrime communities are particularly challenging to label because they often include non-conventional vocabulary, and users refer to their activities in different terms than one would expect in other communities (e.g., using jargon, or coded sentences to disguise illegal behavior).
Researchers have also used user-generated forum data to validate results, such as reputation voting data [82]. Researchers may need to consider whether these are a trustworthy signal, as votes can either be cast in good faith or be attempts to game the reputation system. Ground truth for validating key actor identification or community detection requires researchers to either manually review results, which can be subjective, or validate using a metric, which may not reflect the definition of key actor used by the researchers. External verification and validation is one of the biggest hurdles faced by researchers, particularly given the adversarial nature of the cybercrime population, where there may be incentives to present a different version of the truth, and the potential complexity of accessing external data from companies.
Ground truth for social network reconstruction is not possible to obtain from forum post data, as social connections between users are not explicitly available, unlike on social networks with friend or follower features. It is, however, possible to build a proxy social graph using signals (e.g., replies in a thread), but these may not be accurate. Care must be taken to ensure results do not over- or underestimate the scale of social networks. Alternatives to ground truth include unsupervised ML, which can overcome some of the difficulties of manual labeling. However, this requires researchers to interpret the results of algorithms, potentially introducing bias. In some cases, researchers may use annotated data to evaluate unsupervised approaches [57].
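A minimal sketch of such a proxy social graph follows, using the common (but assumption-laden) heuristic that each reply in a thread constitutes an interaction with the thread creator. The data structure and example data are invented for illustration; real studies use richer signals such as quotes and @-mentions.

```python
from collections import Counter

# Hypothetical sketch of a proxy social graph. Each reply is assumed
# to be directed at the thread creator; edge weights count repeated
# interactions. This is one heuristic among several, not "the" method.
def build_proxy_graph(threads):
    """threads: {thread_id: [author1, author2, ...]} where the first
    author is the thread creator. Returns weighted directed edges as
    a Counter keyed by (replier, creator)."""
    edges = Counter()
    for authors in threads.values():
        creator = authors[0]
        for replier in authors[1:]:
            if replier != creator:  # ignore self-replies
                edges[(replier, creator)] += 1
    return edges

threads = {"t1": ["alice", "bob", "carol", "bob"]}
g = build_proxy_graph(threads)
# g[("bob", "alice")] == 2, g[("carol", "alice")] == 1
```

As the text notes, different heuristics (e.g., replying to the immediately preceding post instead) yield different graphs, so the chosen assumption should be stated explicitly.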
Limitations of ML vs. other heuristic approaches. While ML can be used to automate some cybercrime community analysis tasks, simple classification heuristics can be quicker to apply than a complex ML model and still perform well, provided the heuristics are a suitable alternative. Heuristics can be chosen using domain knowledge gained from reading posts on forums, instead of using ML techniques [93]. A hybrid approach can also be used, such as using ML for classification of products and replies, with heuristics to build up supply chains [19].
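To make the contrast concrete, here is an illustrative keyword heuristic for flagging posts that advertise a product, of the kind that might be chosen from domain knowledge rather than trained. The marker list is hypothetical and not drawn from any cited study.

```python
# Illustrative keyword heuristic for flagging advert-like posts.
# The marker list is hypothetical; a real study would derive markers
# from domain knowledge of the specific forum being analyzed.
SELL_MARKERS = ("[selling]", "wts", "price:", "btc only")

def looks_like_advert(post_text):
    """Return True if the post contains any known selling marker."""
    text = post_text.lower()
    return any(marker in text for marker in SELL_MARKERS)

assert looks_like_advert("[SELLING] fresh accounts, price: 5 USD")
assert not looks_like_advert("how do I configure this tool?")
```

Such a heuristic is transparent and cheap to evaluate, but its recall on jargon-heavy posts is exactly the kind of property that needs validation against annotated data.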
Due to the use of jargon and frequent spelling errors, use of off-the-shelf language modeling tools is infeasible [26]. Creating new classification models for posts, however, requires a large amount of training data to perform well. It may be necessary to manually annotate the intent and function of text with a group of annotators to measure agreement [26]. Creating these gold-standard datasets can be resource intensive, in both time and the number of domain experts needed. However, they are important for validating and evaluating models correctly, and evaluation datasets and metrics need to be able to check that ML models predict what they are expected to predict.
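Inter-annotator agreement for such gold-standard datasets is commonly summarized with Cohen's kappa; a minimal standard-library sketch follows, with invented labels for illustration.

```python
from collections import Counter

# Sketch of inter-annotator agreement (Cohen's kappa) between two
# annotators, using only the standard library. Labels are invented.
def cohens_kappa(a, b):
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    labels = set(a) | set(b)
    # Expected chance agreement from each annotator's label frequencies.
    pe = sum((ca[l] / n) * (cb[l] / n) for l in labels)
    return (po - pe) / (1 - pe)

ann_a = ["spam", "spam", "trade", "trade"]
ann_b = ["spam", "trade", "trade", "trade"]
print(round(cohens_kappa(ann_a, ann_b), 2))  # 0.5
```

Low agreement on jargon-heavy posts is itself a useful signal that the annotation guidelines or label set need refinement before training a model.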
Models trained on one dataset may not transfer well to other datasets, and training a model on a new dataset can be time consuming, both to build the training set and to train. Therefore, researchers may choose to repurpose a model trained on one dataset for use on a different cybercrime community. Durrett et al. [40] looked at the problem of domain adaptation within the scope of identifying products on four cybercrime communities, using both named entity recognition and slot-filling. They found models trained on some forums have better generalizability than models trained on different types of forums. This could also depend on which parts of the forum were annotated. They suggest improvements are needed, including for out-of-vocabulary words, which could be jargon specific to one forum or words not present in the training data.

Ethical Considerations.
While the research community is yet to come to a consensus about the need for ethical review of passive cybercrime community data analysis, some norms have emerged. As shown in Table 1, some research is considered by IRBs (or equivalent), while other research is considered exempt. A minority of papers discussed ethics, with 13.0% of papers receiving an approval, and 8.7% receiving an exemption. There are also discipline-specific guidelines. For computer science, the Menlo Report [68] is an important guide for ethical security research, while criminology has established guidelines developed by academic societies. In this section, we outline some of the common ethical issues considered in prior research. These primarily arise in relation to data collection, analysis, and reporting findings. Ethical considerations are particularly important for cybercrime community research, as there is usually no informed consent, coupled with the potential to create direct harm to research subjects, such as arrest and prosecution.
Data collection. There may be differing privacy and ethical considerations depending on how forum data are collected. Issues that researchers may need to consider include making payments to obtain access, obtaining invitations to closed communities (sometimes by misrepresentation), and potentially breaking Terms of Service. Researchers will need to weigh such costs and risks, for example payments fueling the criminal endeavor, against the benefits of the research. There are also legal considerations. We note there is relevant case law from the US ruling that web scraping from public sites does not violate the Computer Fraud and Abuse Act. Researchers also need to comply with relevant privacy provisions when dealing with personal information that may be contained within forums and markets.
Leaked data may be particularly sensitive due to personal information (including victim data) being shared in private messages, and may also contain users' IP and email addresses. Many forums have a mechanism for users to send private messages to each other, such as for replying to product advertisements. While a web scraper will not have access to private messages, they may be included in leaked forum datasets. This not only creates opportunities for interesting research (e.g., [90,119]), but as the authors of these posts did not intend for them to be public, it introduces new ethical issues for researchers to consider [123].
When collecting data, researchers may wish to consider the resources being consumed during the process. Researchers can avoid duplicating data collection activities by using data repositories. Martin and Christin [83] suggest one way to compensate for consuming significant resources over the Tor network is to operate Tor nodes.
Some types of data may introduce additional risk to researchers, e.g., malware infection. Another concern is that some types of content raise legal issues, such as downloading child sexual abuse images. For these reasons, some researchers rely on text data only, without downloading attachments [63]. Pastrana et al. [93] downloaded packs of nude images used for eWhoring. They outline how they worked with their REB and the INHOPE hotline operator in their jurisdiction to implement guidelines to detect, report, and delete child exploitation material. While they had not anticipated finding such material, due to the precautions taken they were able to ascertain that such images were being shared on the forums.
Data analysis. One common ethical issue is informed consent. Often, the data are publicly available, in that there are no restrictions on who can register and open an account. Researchers note it is not feasible to obtain informed consent from forum users for passive studies, where data are scraped or leaked. Prior researchers have pointed to the British Society of Criminology's Statement of Ethics [12,15,57,59,63,92,97,127]. This justifies not obtaining informed consent if the data are collected from publicly available communities and used for research on collective behavior without aiming to identify individuals [23]. The Menlo Report similarly contains provisions for when obtaining informed consent is impractical [93]. In this situation, it is advised that researchers seek a waiver of informed consent from their REB [68].
Actors in cybercrime communities tend to use pseudonyms, rather than identifying themselves by their real names. Pastrana et al. [93] note that usernames would be difficult to remove, as they are used in the text of posts, and doing so would not reduce any risk to forum users. In some situations, the data being analyzed may be particularly distressing. Here, researchers may be able to take steps to reduce exposure to such content. For the work with nude images referred to above, Pastrana et al. [93] automated the analysis of images, to avoid manual review of pornographic content.
Reporting findings. Further precautions taken by researchers include not reporting findings that could potentially identify individuals (including not publishing usernames), and presenting results objectively. While some researchers decide to obfuscate the name of the forum they analyze, not all do. Vu et al. [127] justify this by pointing out that forum characteristics can make obfuscation infeasible. Another justification, on scientific principles, is replicability.

Limitations and weaknesses of previous research
Previously, we discussed inherent issues that researchers encounter in this field, which require mitigation and a careful evaluation of assumptions. Next, we highlight limitations of prior work that need to be addressed by future research.
Lack of generalizability. Some of the prior work on cybercrime communities has focused on prediction and analysis. Techniques have been developed for these tasks on datasets of single communities; however, these often do not work on other datasets. This limitation can also apply to other techniques and methods in cybercrime community research, as communities are not homogeneous. Results from cybercrime forums found on the 'surface' web can only reflect cybercrime communities with open membership, and generalizations cannot be made to closed communities.
It is important for researchers to rigorously validate the predictions and analysis made by models, to check they are predicting what researchers expect them to. This can be achieved using ground truth annotation data, or by combining the dataset with other sources. A test/train split of the dataset (with 'ground truth annotation data') can be used to check accuracy and F1 scores. It is also important to note that models trained on one forum may not generalize to others, depending on the task. Annotations could be created for a new forum as a validation step to check whether a model generalizes. This idea is also common in the general cybercrime literature: measurements of attacks on the internet are likely to pick up those occurring en masse, rather than specific complex and targeted attacks, and findings have limitations when generalizing to other populations or scenarios [97]. Also, while cybercrime forums were used to discuss configuration files for the Zeus banking trojan, it is likely such discussions exclude criminals with well-developed skills who practice good operational security. Instead, the data may be more reflective of recent entrants to the field, who may be more willing to post questions and share their experiences (although this does not make them any less interesting to research) [61].
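As a minimal sketch of the validation step described above, the snippet computes precision, recall, and F1 against held-out ground-truth labels using only the standard library; the labels are invented.

```python
# Sketch of validating predictions against held-out ground-truth
# annotations by computing F1. Labels here are invented; in practice
# y_true would come from the annotated test split.
def f1_score(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(round(f1_score(y_true, y_pred), 2))  # 0.67
```

Reporting F1 on a held-out split of the same forum does not, of course, demonstrate generalization to a different forum; that requires a separate annotated sample, as the text notes.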
Removal of outliers and validation of measurements. When validating and checking results from analysis and measurements, it is important to consider the removal of outliers. Some measurement studies have focused on the proceeds of crime. For example, previous research used prices on marketplace adverts to estimate proceeds and identify top earners [117]. However, this can affect findings and conclusions: measuring income based on advertised marketplace prices provides only a rough estimation of proceeds, and may be affected by outliers in the dataset. Good practices, such as identifying and manually checking outliers, are important for these tasks when using non-robust metrics, e.g., the mean. Otherwise, a single outlier can distort summary results, so care needs to be taken during measurement to provide useful and accurate results [48]. We found only a minority of papers in the review stated that they checked for outliers (15.9%); other papers may have also taken this step without reporting it.
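A small sketch of outlier-aware summaries for advertised prices follows: a single extreme value (such as a holding price) dominates the mean, while the median is robust, and IQR fencing flags candidates for manual review. The prices are invented and the 1.5×IQR fence is one conventional choice, not a prescription.

```python
import statistics

# Sketch of outlier-aware summary statistics for advertised prices.
# Prices are invented; the last entry mimics an extreme "holding price".
def summarize_prices(prices):
    q = statistics.quantiles(prices, n=4, method="inclusive")
    q1, q3 = q[0], q[2]
    fence = q3 + 1.5 * (q3 - q1)  # conventional upper IQR fence
    outliers = [p for p in prices if p > fence]
    return {"mean": statistics.mean(prices),
            "median": statistics.median(prices),
            "outliers": outliers}

prices = [10, 12, 11, 13, 12, 9999]
s = summarize_prices(prices)
# The mean (~1676) is dominated by the outlier; the median (12) is not,
# and the outlier is flagged for manual checking.
```

On small samples the quantile method itself matters; flagged values should be checked manually rather than dropped automatically, as the text recommends.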
Prices in advertisements may be higher than actual trade amounts, as members may negotiate, and items may not actually be sold at the listed price: members may increase an item's price beyond its value to keep the advert online when out of stock (a 'holding price') [117]. Including these holding prices in summary statistics of trading activity would considerably change results. Many papers do not explore distributions, and in some cases it is not specified whether outliers were removed, in which case we assume they were not. To attempt to mitigate this issue, Pastrana et al. [93] relied on screenshots of PayPal and Amazon dashboards posted by eWhoring scammers as 'proof of earnings'; however, this information is not complete and indeed could be modified.
Trust mechanisms on marketplaces have typically relied on reviews of merchants ('feedback'). Feedback on adverts may not specify the quantity purchased, leading to the assumption of one item per feedback; where feedback lacks precise timestamps, researchers may need to assume the advertised price was the same as when the feedback was left [117]. Vu et al. [127] carried out measurements of a contract system where only some contracts have public transaction information. To estimate trading activity, they assume that private transactions have at least the same value as public transactions. Cuevas et al. [34] explore this limitation further, by looking at the problem of carrying out measurements by proxy, where the ground truth dataset cannot be directly obtained. They build a model to measure the accuracy of measurements from scraped data, finding that marketplace measurements provide a lower bound using this method, and recommend a high frequency of scraping to avoid missing data points.
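The lower-bound idea can be sketched as follows: when only some contracts expose public transaction values, summing the observable values bounds total trading volume from below. The data structure and values are invented for illustration (integer cents, to avoid rounding issues).

```python
# Sketch of a lower-bound estimate of trading volume when only some
# contracts expose public values. Data are invented; values are in
# integer cents. Private contracts contribute nothing observable, so
# the sum over public values bounds the true total from below.
def lower_bound_volume(contracts):
    """contracts: list of (is_public, value_or_None) tuples."""
    return sum(v for public, v in contracts if public and v is not None)

contracts = [(True, 50), (False, None), (True, 120), (False, None)]
print(lower_bound_volume(contracts))  # 170
```

Stating explicitly that an estimate is a lower (or upper) bound, as recommended later in this section, makes such measurements far easier to interpret and compare.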
Some marketplaces have introduced new mechanisms for trust, including using Bitcoin transactions on the blockchain to log transactions between members, or contract systems to show that transactions have taken place. However, care must be taken with these 'proof' systems, as vendors may also trade with themselves under dummy accounts to gain reputation [24]. To validate these, Vu et al. [127] manually checked 163 high-value transactions, including looking up Bitcoin addresses to match transactions with contract timestamps. Whether there are 'ground truth' transaction logs or only marketplace advert prices to measure, it is important for researchers to note that marketplaces are an adversarial environment where reputation matters, and vendors are incentivized to over-report. Overblown claims are also seen across security research in general [10,11].
With text-based forums, measurements may exclude 'lurkers' (members who only read posts), and may focus on members who create a high number of posts. However, these may be low-quality contributions, such as spam or plagiarized posts. Also, while it is straightforward to measure metadata of forum datasets, measurements of topics and other derived datasets require classification and topic models. These are not perfect, and may add bias to results. Researchers should state their assumptions, including whether they aim for an upper- or lower-bound estimate for a measure of activity.
Focus on English-language forums. There is a considerable language bias in cybercrime community research. Most work in the field has focused on English-language forums, with 96.0% of the papers in the review doing so. There are considerably fewer studies of platforms using other languages (such as Spanish, Russian, and Arabic), which could potentially skew our perception of cybercrime as a result of where we are looking.
Stale data. Stale data can be an issue in the field. It is time-consuming for researchers to collect datasets themselves, so they may wish to use shared datasets. However, these can quickly become outdated if scrapers are not maintained and datasets are not regularly updated.
Lack of ethical review. We find the majority of papers do not discuss the ethics of research on cybercrime communities. Of the papers we reviewed, 11 had approval from a REB, and 6 were exempt. While we are unable to tell whether researchers considered research ethics during their work, this review highlights that researchers often neglect to discuss ethical considerations in their publications.

RECOMMENDATIONS & FUTURE WORK
There is a body of interesting, novel, and useful research on cybercrime communities, and many useful insights such research can provide. Researchers have had to battle many methodological issues, which we have outlined in this paper. From our findings of common issues and limitations in cybercrime community research, we recommend researchers: (1) note the assumptions made when imposing structure on unstructured datasets of cybercrime communities; (2) explore suitable approaches for using ground truth data, including the use of domain experts (if available) and external data sources, and state limitations; (3) acknowledge that collected datasets are unlikely to be complete, due to the evolving nature of forums and marketplaces, and may contain volatile data; (4) note that datasets can become stale, and where they are unable to keep data updated, state the limitations of using it for analysis; (5) be aware of underlying bias in datasets, such as a skewed perception of cybercrime from analyzing only English-language forums; (6) consider limitations of the generalizability of models to other cybercrime communities, including where NLP and ML models may overfit to bias; (7) take care to verify and validate results, including removing outliers from analysis and checking for anomalous results when measuring cybercrime communities; (8) consider and set out the ethical case across all cybercrime community research; and (9) take care when reporting the methods used, to enable other researchers to replicate the research, which may also involve sharing datasets (which may vary depending on when the data was collected) and code.
Further challenges that researchers are likely to face are explored below. These are likely to arise due to community fragmentation following disruption and displacement, a further lack of structure with the move to 'micro' communities, and adversarial attacks.
Displacement. Online forums and marketplaces have experienced displacement following law enforcement action. For example, Silk Road was once the most widely used cryptomarket. A police investigation resulted in the arrest and ultimate prosecution of the operator. Soska and Christin [117] analyzed what happened after the takedown. Within a month, Silk Road 2.0 was set up, operated by former administrators and vendors of the original Silk Road. Within a few months, numerous marketplaces followed the same model. They varied in levels of sophistication, durability, and specialization. Some marketplaces disappeared due to law enforcement action. Others disappeared voluntarily, including through 'exit scams', where operators ran off with the funds buyers had sent to the site administrators in escrow. They concluded that the Silk Road takedown resulted in significant evolution of the marketplace ecosystem, compared to when Silk Road was a monopoly.
A similar phenomenon occurred after the 'stresser' subforum on HackForums was removed. As a result, the community became increasingly decentralized, spread across multiple Discord servers and Telegram chats [32]. Some of these are also used by booter operators to provide support to their customers. The shift from a centralized community to many smaller communities has introduced further issues when trying to capture an overall view across the field, as researchers cannot be certain they have identified all popular communities. These 'micro' sources require more effort to scrape, as researchers need to know which are of interest, check whether they are still active, and look out for displacement of members to other servers and chat channels.
Further loss of structure. In addition, data from chat channels are less structured than those from forums and marketplaces. At most, they may be broken down into channels, similar to boards found on cybercrime forums. Within these, conversation threads are mixed together. Users may talk about different topics simultaneously, one user may switch to a different topic, and conversations may gradually drift between topics, where on cybercrime forums a new thread would have been created instead. Thread disambiguation is not a trivial task, and adds to the complexity of analyzing these data sources.
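One simple (and deliberately naive) disambiguation heuristic is to start a new conversation segment whenever the gap between consecutive messages exceeds a threshold. The sketch below illustrates this; real chat data would need richer signals such as mentions and topic shifts, and the timestamps and threshold are invented.

```python
# Illustrative heuristic for chat-thread disambiguation: split the
# message stream wherever the inter-message gap exceeds max_gap
# seconds. Timestamps and threshold are invented for illustration.
def segment_by_gap(timestamps, max_gap=300):
    """Group message timestamps (seconds) into conversation segments."""
    segments, current = [], [timestamps[0]]
    for prev, ts in zip(timestamps, timestamps[1:]):
        if ts - prev > max_gap:
            segments.append(current)
            current = []
        current.append(ts)
    segments.append(current)
    return segments

msgs = [0, 40, 90, 1000, 1020, 5000]
print(segment_by_gap(msgs))  # [[0, 40, 90], [1000, 1020], [5000]]
```

Interleaved conversations defeat a purely time-based split, which is precisely why the text describes thread disambiguation as non-trivial.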
Attacks on NLP and ML models. In the future, researchers may also need to pay attention to a new class of poisoning attacks against NLP models. Such attacks may require researchers to sanitize data to minimize the likelihood that forum users can successfully use imperceptible text-encoding attacks to disrupt ML models used for analysis [21].

CONCLUSION
This paper explored the goals and methodological approaches used in cybercrime community research. Prior research has been useful in addressing cybercrime problems, often with novel methodologies. We highlighted both challenges inherent to research on these communities, and limitations and weaknesses of prior research. We proposed a list of recommendations, which includes researchers setting out an ethical case across all cybercrime community research, raising awareness of method limitations for predictive tasks, and considering the steps needed to validate and verify results. Future work may be affected by a changing landscape, including displacement of communities to smaller platforms, a further loss of structure in datasets, and new attacks on the NLP and ML models in use.

Table 2 .
Summary of goals and ethical issues considered in previous works on the data collection process (✓ = documented, – = not documented, not specified, or not applicable; C = Complete, P = Partial)