A Tale of Two Communities: Exploring Academic References on Stack Overflow

Stack Overflow is widely recognized by software practitioners as the go-to resource for addressing technical issues and sharing practical solutions. While not typically seen as a scholarly forum, users on Stack Overflow commonly refer to academic sources in their discussions. Yet, little is known about these referenced academic works and how they intersect with the needs and interests of the Stack Overflow community. To bridge this gap, we conducted an exploratory large-scale study of the landscape of academic references on Stack Overflow. Our findings reveal that Stack Overflow communities with different domains of interest engage with academic literature at varying frequencies and speeds. These contrasting patterns suggest that some disciplines may have diverged in their interests and development trajectories from the corresponding practitioner community. Finally, we discuss the potential of Stack Overflow for gauging the real-world relevance of academic research.


INTRODUCTION
Stack Overflow (SO) is a popular Q&A platform among software practitioners and developers [1] for discussing technical issues [19]. These discussions often go beyond finding quick technical fixes, as SO users delve deeper into the problems at hand and share their own knowledge and expertise in the process [18]. Organically, SO users refer to a wide range of sources to support their claims [6], including academic articles published in conferences and journals. Such practices suggest that SO might be acting as a bridge between the two communities (developers and academic researchers) for knowledge dissemination.
However, we currently lack an understanding of these referenced articles and how the SO community engages with them. This limits our insight into SO's role in facilitating knowledge transfer and reflecting the practical relevance of academic research. Investigating this gap could uncover potential discrepancies between cutting-edge research and prevailing industry norms and interests. It could also guide both communities in identifying how academic advancements can meet the needs of the SO community and the broader developer community it serves. To the best of our knowledge, this is the first study to examine the recognition of academic research within SO, and to investigate how scientific knowledge is consumed or utilized for solving real-world software development challenges across different communities of interest on SO.
Prior works have studied academic references on Twitter [16] and Wikipedia [17], emphasizing the general visibility and popularity of scientific findings rather than their practical relevance. Following this trend, Altmetric.com [8] included SO citations as one of its alternative metrics for measuring academic impact. However, it only quantifies how frequently academic articles are mentioned and overlooks the context of these references and their connections to the original discourse. As a result, Altmetric suffers from a similar limitation, serving merely as an indicator of visibility.
In this paper, we explored the landscape of academic references on SO and took an initial step toward assessing their potential to reflect the practical relevance and utility of academic research. We sought to answer four research questions (RQs): what academic articles are cited on Stack Overflow (RQ1), which parts of Stack Overflow rely on academic sources (RQ2), how various Stack Overflow communities interact with academic research (RQ3), and how quickly Stack Overflow integrates academic research (RQ4). We sifted through 44 million URLs on SO and identified 15,009 references to 10,718 unique academic articles. Leveraging topic modeling and social network analysis techniques, we examined the patterns of interaction between SO communities with diverse technical interests and academic research from various fields. To support future research, we have made our dataset publicly available [4].

METHODOLOGY
Unlike Wikipedia or academic settings, where citations follow a standard format, users on Stack Overflow (SO) commonly use hyperlinks to reference academic sources. This practice makes it challenging to distinguish academic references from other types of links, as both appear as bare URLs without bibliographic information [1]. In this section, we describe our heuristic method for identifying links leading to academic articles and the data collection process.

i) Filtering for Potential Candidates. Instead of examining every external link on SO, we narrowed our search to those most likely to host academic documents. Prior efforts [8] have been limited to links containing recognized identifiers such as DOIs and ISBNs. We significantly expanded this scope by incorporating links originating from recognized academic repositories, e.g., the ACM Digital Library.
Links containing DOIs were identified using regular expressions. For the remaining links, we checked whether their web domains appeared on a list of domain names affiliated with academic repositories. In compiling this list, we considered various academic entities, including publishers (e.g., Springer), academic societies (e.g., ACM), and databases (e.g., ResearchGate). However, given the sheer number of these potential sources, it is infeasible to include them all. Instead, we opted for a best-effort approach, focusing on the most significant ones. To better align with Stack Overflow's emphasis on software and computing, we included all 31 publishers indexed by DBLP, a major Computer Science bibliography. With input from experts and authoritative sources [9], we further enriched the selection with 13 academic societies renowned for organizing prestigious conferences (e.g., CVPR) and ten well-known academic databases. One author then iteratively curated all relevant web domains belonging to these entities (e.g., aclanthology.org and aclweb.org of ACL).
Although our selection¹ is not exhaustive, it effectively covers a substantial part of the academic publishing landscape. The 31 included publishers issue over 16,000 journals in various fields [11], and the 13 academic societies host more than 500 conferences annually, not to mention the extensive reach of the included databases.
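As a concrete illustration of this filtering step, the sketch below shows one way the candidate screening could be implemented in Python; the DOI regex and the domain allowlist are illustrative assumptions rather than the exact patterns and repository list used in our pipeline.

```python
# Hedged sketch of candidate filtering: keep links that contain a DOI or point to a
# known academic repository. The allowlist below is a small illustrative subset.
import re
from urllib.parse import urlparse

DOI_RE = re.compile(r"10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+")  # common DOI pattern (assumption)

ACADEMIC_DOMAINS = {  # illustrative subset of curated repository domains
    "dl.acm.org", "ieeexplore.ieee.org", "link.springer.com",
    "arxiv.org", "aclanthology.org", "aclweb.org", "researchgate.net",
}

def is_candidate(url: str) -> bool:
    """Return True if the URL likely leads to an academic document."""
    if DOI_RE.search(url):
        return True
    host = urlparse(url).netloc.lower()
    host = host[4:] if host.startswith("www.") else host
    return host in ACADEMIC_DOMAINS
```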
ii) Retrieving and Validating Bibliographic Data. For every candidate link identified previously, we assumed its content to be academic and attempted to retrieve possible bibliographic information. For links directing to PDF files, we extracted titles and DOIs using Grobid, a popular tool for parsing scientific documents. If the link leads to a webpage, we assumed it to be the landing page of a research article and retrieved its potential titles from various HTML tags, such as <h1>, <meta property="og:title">, <title>, and <h2>. Additionally, we searched for DOIs within the HTML file using regular expressions. For non-functional links, we accessed their archived versions via the Internet Archive's Wayback Machine.
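The snippet below sketches the heuristic metadata harvesting for landing pages, assuming requests and BeautifulSoup are available; the tag priority mirrors the description above but is only an approximation of the actual extraction rules.

```python
# Hedged sketch: collect candidate titles from common HTML tags and scan the raw
# HTML for a DOI-like string. PDF links (handled with Grobid) are omitted here.
import re
import requests
from bs4 import BeautifulSoup

DOI_RE = re.compile(r"10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+")

def surmise_metadata(url: str) -> dict:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    titles = []
    og = soup.find("meta", property="og:title")
    if og and og.get("content"):
        titles.append(og["content"].strip())
    for tag in ("h1", "title", "h2"):
        element = soup.find(tag)
        if element and element.get_text(strip=True):
            titles.append(element.get_text(strip=True))

    doi = DOI_RE.search(html)
    return {"titles": titles, "doi": doi.group(0) if doi else None}
```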
In cases where a candidate link does not lead to academic content as we assumed, the surmised bibliographic data extracted heuristically from HTML or PDF files would be spurious. With this in mind, we filtered out ineligible candidates by cross-validating their surmised metadata against two major academic databases: Semantic Scholar and OpenAlex [12]. A matching record on the title or DOI confirms the academic nature of the candidate link. We then gathered detailed metadata for verified academic references from the two databases, including abstract, venue, citations, etc.

Data Collection. Based on the official Stack Overflow data dump released on December 8, 2023, we extracted 44 million URLs found in the edit histories of 59 million posts (totaling 160 million revisions). Similar to prior work [5], URLs embedded within code blocks were omitted, as they are often irrelevant to knowledge sharing. After this refinement, we obtained a dataset [4] of 30.9 million links, from which 15,009 references to 10,718 academic articles made by 12,963 posts were identified through the previously described process.

¹ Details on the included academic repositories and methodology are available at
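A minimal sketch of the cross-validation step follows; it queries the public OpenAlex API (the endpoint, filter syntax, and response fields are our assumptions about that API, and the matching rule is deliberately simpler than the study's actual procedure).

```python
# Hedged sketch: a candidate is kept only if its surmised DOI or title matches a
# record in an academic database (OpenAlex shown; Semantic Scholar used analogously).
import requests

def validate_candidate(surmised: dict) -> dict | None:
    """Return a matching OpenAlex record, or None if the candidate looks non-academic."""
    if surmised.get("doi"):
        resp = requests.get(f"https://api.openalex.org/works/doi:{surmised['doi']}")
        if resp.ok:
            return resp.json()
    for title in surmised.get("titles", []):
        resp = requests.get(
            "https://api.openalex.org/works",
            params={"filter": f"title.search:{title}", "per-page": 1},
        )
        results = resp.json().get("results", []) if resp.ok else []
        # Require an (approximately) exact title match to avoid false positives.
        if results and results[0].get("title", "").strip().lower() == title.strip().lower():
            return results[0]
    return None
```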

RQ 1: What academic articles are cited on Stack Overflow?
For a fine-grained characterization of the cited articles, we categorized each article into its corresponding Field of Research (FoR) using OpenAlex's concept tagging model. This model analyzes the abstract and title and generates an initial list of relevant research fields, along with a confidence score for each. The fields are organized in six levels, from the broadest to the most specific, following Wikidata's taxonomy. For each article, we selected the second-level field (e.g., World Wide Web) with the highest confidence score as its FoR.
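The sketch below illustrates this field assignment, assuming an OpenAlex work record exposes a "concepts" list with "level" and "score" attributes (our reading of the public API) and treating the second level as level 1 in OpenAlex's zero-indexed hierarchy; both are assumptions made for illustration.

```python
# Hedged sketch: pick the second-level concept with the highest confidence score
# as the article's Field of Research (FoR).
def field_of_research(work: dict) -> str | None:
    second_level = [c for c in work.get("concepts", []) if c.get("level") == 1]
    if not second_level:
        return None
    best = max(second_level, key=lambda c: c.get("score", 0.0))
    return best.get("display_name")  # e.g., "World Wide Web"
```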
The articles referenced on SO span 218 FoR across 19 disciplines. Table 1 presents key characteristics of the ten largest fields in terms of SO reference counts. Notably, Artificial Intelligence (AI) makes up over 30% of all references. These AI articles are impactful and recent, averaging 1,331 academic citations and an age of 3.75 years at the time of mention. In contrast, SO references from fields such as Programming Languages and Algorithms tend to be older on average. Such variations indicate the diverse patterns of interacting with academic research on SO, from embracing cutting-edge developments to relying on more established and foundational works.

Table notes: ♣ the duration between publication and the time of each reference; ♠ the average academic citation count of referenced articles; ★ h5-index [9] retrieved from Google Scholar (all venues rank within the top 20 of their respective fields, except arXiv).

Table 2 lists the 15 venues with the most SO references. Consistent with existing studies on Wikipedia citations [17], these venues are highly regarded in their respective fields (★), with the interesting exception of arXiv. We manually inspected a random sample of the referenced articles from arXiv and found that many of them were later published in peer-reviewed venues under slightly different titles. A possible explanation is that Stack Overflow users tend to (i) integrate academic insights at a fast pace, or (ii) prefer open-access content. We observe a similar dominance of AI-related research (e.g., ML, NLP, CV) among the venues, with 12 of the 15 venues and 51% of the articles on arXiv being related to AI.

RQ 2: Which parts of Stack Overflow rely on academic sources?
We discerned which technical domains and user communities were associated with the posts citing academic literature on Stack Overflow by analyzing the overarching themes of these discussions. Although using the user-generated tags attached to each SO post for topic modeling may seem intuitive, prior research suggests that such tags often fall short of reflecting the actual discussions accurately [15]. Moreover, SO has over 65,000 existing tags, which are too narrow to capture each post's broader themes and areas of interest. Following existing works [14], we utilized BERTopic [3] to categorize the discussions into coarser technical domains. Initially, the model was fine-tuned for better granularity and coherence, producing 109 preliminary topic clusters. Two authors then discussed and qualitatively merged related topics into broader domains. For example, clusters about Named Entity Recognition and Sentiment Analysis were grouped under the domain of NLP (D3). Eventually, we consolidated all SO topics into 16 technical domains that serve as focal points of interest for distinct communities on SO.
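For concreteness, the following sketch shows what the BERTopic step could look like; the constructor parameters and the load_citing_posts helper are illustrative assumptions, and the merge into 16 domains was performed qualitatively rather than in code.

```python
# Hedged sketch of the topic-modeling step with BERTopic.
from bertopic import BERTopic

posts = load_citing_posts()  # hypothetical helper returning the text of the citing posts

topic_model = BERTopic(min_topic_size=20, verbose=True)  # illustrative settings
topics, probs = topic_model.fit_transform(posts)

# Inspect the preliminary clusters (109 in our study) before the qualitative merge
# into broader technical domains such as NLP (D3).
print(topic_model.get_topic_info().head(20))
```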
The leftmost column in Figure 1(a) shows the number of posts containing academic references within each technical domain (Y-axis). Notably, the domains of D0 (machine learning), D1 (vision/graphics), and D2 (algorithms) most actively incorporate academic knowledge. In contrast, more traditional and application-oriented domains such as D15 (data communications) and D14 (computer architecture) exhibit less integration of academic research.

RQ 3: How do various Stack Overflow communities interact with academic research?
To examine how different SO communities interact with academic research, we mapped the citation flow between the 218 research fields (FoR) and the 16 technical domains. However, the resulting bipartite network is hard to interpret and visualize due to its high dimensionality. We reduced its complexity by aggregating the 218 research fields into seven broader Disciplines. The aggregation was carefully executed to preserve the underlying citation patterns, guided by two criteria: (i) the similarity in how different FoRs are referenced together across technical domains, measured by pairwise Spearman's correlation [7] (where high correlations suggest similar citation patterns), and (ii) their hierarchical relationships within an FoR ontology [13], ensuring that the aggregation also respects established disciplinary structures. For example, Computer Network, Distributed Computing, and Computer Security were grouped into the Syst/Netw/Sec category for their high inter-correlations in citation patterns (above 0.6) and shared academic lineage.
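Criterion (i) can be sketched as follows; the domain-by-field count matrix and its file name are assumptions made for illustration.

```python
# Hedged sketch: compare how similarly two research fields are cited across the
# 16 technical domains using Spearman's rank correlation.
import pandas as pd
from scipy.stats import spearmanr

# Rows = 16 technical domains, columns = 218 FoRs, values = reference counts (hypothetical file).
citations = pd.read_csv("domain_by_for_counts.csv", index_col=0)

def field_similarity(for_a: str, for_b: str) -> float:
    rho, _ = spearmanr(citations[for_a], citations[for_b])
    return rho

# Fields correlating above 0.6 (and sharing a parent in the FoR ontology) were merged,
# e.g., Computer Network, Distributed Computing, and Computer Security into Syst/Netw/Sec.
print(field_similarity("Computer network", "Distributed computing"))
```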
Figure 1(a) illustrates the relationships between the 16 technical domains and the seven research disciplines. Cell (x, y) denotes the percentage of papers referenced by domain x that originate from discipline y. For example, 40.8% of the articles referenced in domain D10 are from the Math/Theory discipline. The figure reveals that SO discussions typically rely on a single research discipline, with the notable exceptions of D7 (web scraping) and D11 (data visualization), where the distributions are relatively even. We analyzed sample posts from these two domains and discovered that posts within D7 often seek guidance on downloading research articles or scraping bibliographic data programmatically, while discussions in D11 typically revolve around replicating data visualizations from scientific articles. These activities (downloading and visualizing) are universal and not confined to any field, leading to the balanced distributions observed. In such cases, academic references primarily serve as illustrative examples or supplementary materials, rather than as integrated sources of knowledge for problem-solving.
We further explored which articles were referenced together on SO and mapped the structure of scientific knowledge through a co-citation analysis. Each node in the co-citation network represents an academic article and is connected to another if they were jointly referenced by the same SO post. The resulting network was highly fragmented and sparse, with 2,541 nodes (23.7% of all cited articles on SO) connected by merely 3,089 edges. This limited connectivity likely stems from the highly specialized nature of many SO discussions [20], which often demand niche, domain-specific knowledge, resulting in less overlap among the cited articles. Confirming this hypothesis, however, requires a more detailed contextual analysis of the isolated dyads and fragments within the network, a task we reserve for future studies.
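A compact sketch of the co-citation network construction is given below; the input mapping from posts to referenced articles is a hypothetical stand-in for our dataset.

```python
# Hedged sketch: build the co-citation graph (articles as nodes, an edge whenever two
# articles are referenced by the same post) and score nodes with PageRank.
import itertools
import networkx as nx

post_refs: dict[int, set[str]] = load_post_references()  # hypothetical input loader

G = nx.Graph()
for articles in post_refs.values():
    for a, b in itertools.combinations(sorted(articles), 2):
        G.add_edge(a, b)

components = sorted(nx.connected_components(G), key=len, reverse=True)
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges,",
      len(components[0]), "nodes in the largest component")

pagerank = nx.pagerank(G)  # node sizes in Figure 1(b) reflect these scores
```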
Meanwhile, our current study focused on the ten largest components in the co-citation network, comprising 381 nodes and 1,358 edges, as depicted in Figure 1(b). The size of a node reflects the article's PageRank score, denoting its importance and influence within the network. Notably, we observed that pivotal and pioneering works, those that lay new foundations and advance the field, are associated with the highest PageRank scores. Noteworthy examples include the papers that introduced the ① Transformer architecture, ② BERT language model, ③ Deep Residual Network, and ④ Generative Adversarial Network. Our qualitative analysis found that users often cite these articles alongside others in a post to provide essential background knowledge for understanding its content. This practice makes academic findings more accessible to a wider audience, indicating SO's role in bridging the knowledge gap between forefront academic research and software practitioners.

RQ 4: How quickly does Stack Overflow integrate academic research?
We analyzed the pace at which academic articles were referenced on Stack Overflow by tracking the "First-cite Interval", the time lag between a paper's publication date and its first mention on SO. This backward-tracing approach circumvents potential right-censoring bias [2]. On average, it takes an article 6.6 years to be recognized on SO, although this interval varies significantly across research fields. For instance, AI-related papers were typically referenced within 3.7 years, whilst PL papers have a longer latency of 9.3 years. Figure 1(c) further illustrates the trend in diffusion rates for three major research fields (see Table 1) over time. Prior to 2010, these fields exhibited similar first-cite intervals of around 6-10 years (note that SO was launched in late 2008). However, from 2010 to 2017, the rate at which AI articles were referenced accelerated rapidly. This timeframe coincides with significant breakthroughs in AI, such as the Adam optimizer, batch normalization, and the attention mechanism. Conversely, the First-cite Interval for Algorithms papers saw little change, while that for PL papers noticeably increased, implying that earlier foundational works in these areas might hold more relevance in Stack Overflow discussions.
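The First-cite Interval computation can be sketched as follows; the input table and its column names are illustrative assumptions.

```python
# Hedged sketch: for each article, take its earliest mention on SO and subtract the
# publication date to obtain the First-cite Interval in years.
import pandas as pd

refs = pd.read_csv(
    "so_academic_references.csv",               # hypothetical file, one row per (post, article)
    parse_dates=["mention_date", "publication_date"],
)

first_cite = refs.groupby("article_id").agg(
    first_mention=("mention_date", "min"),
    published=("publication_date", "first"),
)
first_cite["interval_years"] = (
    (first_cite["first_mention"] - first_cite["published"]).dt.days / 365.25
)

print(first_cite["interval_years"].mean())  # the study observes ~6.6 years on average
```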

DISCUSSION AND CONCLUSION
This study presents the first large-scale analysis of academic references on Stack Overflow, aiming to understand how scholarly knowledge appears in this practitioner-centric community. In this section, we discuss the implications of our results.
Divergent trajectories. Our analysis suggests that the research trajectories in certain fields do not always align with the practical discussions on Stack Overflow. For instance, discussions in D15 (data communications) referred mostly to computer network articles (as expected), but the number of references is low, and there is a noticeable preference for older publications. This hints at a potential disconnect from the latest developments in the field. Additionally, a significant number of the articles referenced in domains such as D13 (floating point), D14 (computer architecture), and D15 were in fact technical documents (with DOIs) such as IEEE Standard Specifications and IETF Requests for Comments (RFCs). These documents are meant to establish technical foundations and address engineering challenges [10], rather than to present new scientific findings. Considering that systems and networking are relatively mature fields with close ties to industry, it is reasonable that professionals in these domains find greater value in well-tested, experiential knowledge than in yet-to-be-proven academic innovations.
Feasibility as an Altmetric. Academic references on Stack Overflow offer a novel lens for observing how scholarly knowledge diffuses into developer communities, suggesting their potential as an alternative metric (altmetric) for gauging the industry impact and practicality of academic research.
We observed a tangible link between the volume of SO references and the impact of scholarly contributions in real-world settings. For example, articles that emerged as central nodes in the SO co-citation network are typically groundbreaking seminal works with extensive application in industry. Furthermore, the time interval between an article's publication and its acknowledgment on SO is comparatively shorter than that observed for other altmetrics, such as patent [2] and Wikipedia [17] citations, suggesting that SO references may offer a more immediate measure of impact.
However, there are challenges to consider. For instance, not all references signify an effective transfer of scholarly knowledge, as seen in D11 (data visualization) and D7 (web scraping), where academic references serve merely as contextual support. As such, to make SO a meaningful indicator of research impact in industry, future research needs to develop rigorous methods to evaluate the intention and contribution of these academic references, specifically, to contextually assess whether an SO reference genuinely facilitates the diffusion of knowledge and provides tangible solutions to the challenges faced by software practitioners and developers.

Figure 1: (a) Heatmap depicting the relationships between technical domains (Y-axis) and research disciplines (X-axis). The blue bar along the Y-axis shows the number of posts within each technical domain. Cell (x, y) denotes the percentage of papers referenced by domain x that originate from discipline y. (b) The ten largest components in the co-citation network of academic references on SO; ①-④ are key nodes with the highest PageRank scores. (c) The average "First-cite Interval" (Y-axis) for articles in three major fields (FoR) cited in SO posts each year (X-axis).

Table 1: Top 10 Fields of Research (FoR)