The Content Quality of Crowdsourced Knowledge on Stack Overflow: A Systematic Mapping Study

Community Question Answering (CQA) forums such as Stack Overflow (SO) are a form of crowdsourced knowledge for software engineers who seek solutions to development and programming challenges. While such a forum provides valuable support to engineers, it often contains low-quality content that impacts users' experience and the longevity of new users. Past research shows that most of the low-quality content comes from violating general Netiquette Rules (NRs). Several researchers have analysed the content of SO and suggested approaches to increase its quality. However, to the best of our knowledge, no previous work has reviewed the scale of scientific attention given to this cause and the recommendations that have been made. To help address this gap, we conducted a Systematic Mapping Study (SMS) using five relevant databases, reviewing 1,489 papers and selecting 18 that are relevant. We found that SO has attracted increasing research interest in reducing NRs violations to improve the quality of communication on the platform. Interestingly, 7 of the 18 papers relied on manual qualitative and quantitative analysis approaches to investigate this area. We also found that further research is required to identify more violation features and generalisable sources of data, and that computational analysis approaches are still needed in this area.


I. INTRODUCTION
Previous research has shown that social media has become unwelcoming [1] and that violating Netiquette Rules (NRs) and community norms drives users away from active engagement on platforms such as Twitter [2]-[5] and Facebook [6], and on Community Question Answering (CQA) forums such as Gitter [1], [7], GitHub, Slack [1] and the focus of this paper, Stack Overflow (SO) [1]. The concept of netiquette assumes that online communication has its own set of rules, such as Shea's NRs [8] and community-specific norms [9]. Pleasantries such as apologies, excessive politeness and gratitude are considered violations of SO-specific norms (SO-Norms) [9].
As contributor participation is key for these platforms, content quality concerns including NRs and SO-Norms violations are becoming increasingly important, yet they are difficult to detect and manage.

A. Community Question Answering in Stack Overflow
Crowdsourcing has become an effective way of solving problems, with individuals turning to the internet for help and solutions. Within computer science and software engineering, platforms including TopCoder [10], Bountify [11] and SO are available for practitioners to solve their domain-specific problems. However, SO tends to be the preferred choice for practitioners [12], with more than 100 million users and 23.8 million questions placed as of June 2023 [13].
In order to increase the content quality of SO, human moderators manually detect and delete posts contaminated with NRs and SO-Norms violations. As the human moderation process is laborious, SO introduced a Heat Detection (HD) bot in 2017. If a post violates the NRs and SO-Norms, the HD bot flags the post and notifies the moderators. Moderators in turn assess the post and, if the flagging appears to be legitimate, delete it. However, discussions between askers and answerers are written in natural language sentences which may contain hedges, sarcasm and emojis [14] that the HD bot may find difficult to spot. Therefore, writing high-quality posts free of NRs and SO-Norms violations is a responsibility mostly left to SO practitioners.
Research has shown that violating NRs within social media impacts users' engagement and longevity on these platforms [1]-[7]. There have therefore been many efforts to detect NRs violations and improve the quality of crowdsourced platforms, but limited attention has been paid to the scale and overall evaluation of this research area. The aim of our research is therefore to study the literature that has investigated violations in SO of the NRs set by Shea [8] and of the SO-Norms [9]. With this aim, we carried out a Systematic Mapping Study (SMS) following the Kitchenham method [15], finding 18 relevant research papers focused on the violation of one or more aspects of NRs and SO-Norms in SO. We believe that this SMS will help in providing an overview of existing tools and techniques used to detect NRs and SO-Norms violations on SO and hence provide a base for researchers to advance this field.

B. Research Questions
This study provides a summary of the level of academic interest that content quality concerns such as NRs and SO-Norms violations on SO have generated, the publication venues that are targeted, the approaches used and the types of contributions that are provided. We answered the following three research questions in this study:
RQ1: What are the types of NRs and SO-Norms violations on the SO platform that are identified or reported in the selected studies?
RQ2: What level of academic interest related to NRs and SO-Norms violations has the SO platform generated over time?
RQ3: What are the methods utilised to investigate, identify, moderate or remove the NRs and SO-Norms violations on the SO platform in the selected studies?
The remaining sections of this paper are organised as follows. In Section II, we provide details of the methodology used in conducting this SMS. In Section III, we present the results obtained in this research. In Section IV, we discuss the main findings of our study and their implications. In Section V, we provide potential future directions. Finally, in Section VI, we summarise our SMS.

II. METHOD
This study follows an SMS process, which "provides a structure of the type of research reports and results that have been published by categorizing them" [16], gives an overview of the research conducted, and helps identify research gaps.

A. Manual and Automatic Search
In order to identify and refine the relevant keywords and terms used in the SMS, we began by manually searching digital databases using the terms 'Stack Overflow' and 'quality concerns'. We then read the top ten papers and found that many of them concerned 'buffer overflow' rather than 'Stack Overflow'; consequently, we added the term 'Community Question Answering' to eliminate papers related to 'buffer overflow'. We also limited the search to 2009-2023, as SO was created at the end of 2008 [1]. Based on the results obtained and the research questions formulated, the terms 'Stack Overflow', 'Community Question Answering' and 'quality' were identified. We then used these terms to search the following databases: Google Scholar, ACM Digital Library, IEEE Xplore, Web of Science and ScienceDirect, which returned 1010, 80, 1262, 374 and 56 records respectively. After reading the full text, we selected the first five papers from each database as our test set. If the test set appeared in the final results returned by our automatic search string, we considered the string accurate.
To conduct our automatic search, we constructed the search string as suggested by Kitchenham [15]: ("Stack Overflow" OR "StackOverflow") AND ("Community Question Answering" OR "Community-based Question Answering" OR "Community based Question Answering" OR "Question and Answer") AND ("quality" OR "standard" OR "perform*" OR "measure" OR "benchmark"). We inspected the titles of the 1489 records and found that the test set appeared in our automatic search results. We searched the same databases used in our manual search above: Google Scholar, ACM Digital Library, IEEE Xplore, Web of Science and ScienceDirect. This selection follows that of other SMSs and systematic literature reviews in the software engineering discipline [17], [18].
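The way such a search string is assembled, and the test-set check described above, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names build_query and missing_from_results are hypothetical, the term groups are copied from the search string above, and each database's own query syntax (for example, how wildcards such as perform* are handled) may differ.

```python
# Minimal sketch (hypothetical helpers, not the authors' tooling).
# Synonyms are joined with OR inside a group; groups are joined with AND.

TERM_GROUPS = [
    ["Stack Overflow", "StackOverflow"],
    [
        "Community Question Answering",
        "Community-based Question Answering",
        "Community based Question Answering",
        "Question and Answer",
    ],
    ["quality", "standard", "perform*", "measure", "benchmark"],
]


def build_query(groups):
    """Return a boolean search string with one parenthesised clause per group."""
    clauses = []
    for terms in groups:
        quoted = ['"' + term + '"' for term in terms]
        clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)


def missing_from_results(test_titles, result_titles):
    """Return the test-set titles that the automatic search did not retrieve.

    Titles are compared case-insensitively after trimming whitespace.
    An empty list means the whole test set was found.
    """
    found = {title.strip().lower() for title in result_titles}
    return [t for t in test_titles if t.strip().lower() not in found]


query = build_query(TERM_GROUPS)
print(query)
```

An empty list returned by missing_from_results corresponds to the acceptance condition used above: every paper in the manually chosen test set also appears in the automatic search results.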

B. Data Extraction
All publications that could not be used to answer our research questions were eliminated during the review process, as suggested by Kitchenham [15]. Table I contains the inclusion and exclusion criteria which we applied to make these decisions more objective. From the 1489 results, we selected potentially relevant publications based on their metadata and full text. Once we removed the duplicates, we applied the inclusion and exclusion criteria to the remaining 1247 records, removing 156 records as shown in Figure 1. We then read the abstract and conclusion of each remaining paper, removing 850 records that were not relevant to our research focus. We then read the full text of the remaining 241 papers and found that only 40 were based on the content quality of SO. Finally, we applied the quality test suggested by Kitchenham [15], conducted by the second and third authors of this paper, and found a total of 18 publications ([1], [9], [19]-[34]) that are relevant to our research focus.
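The selection funnel can be checked with simple arithmetic. The sketch below uses only the counts reported in this section; the variable names are illustrative, and the number of duplicates is derived from the reported totals rather than stated in the text.

```python
# Minimal sketch: re-deriving the selection funnel from the reported counts.

RETRIEVED = 1489          # records returned by the automatic search
AFTER_DEDUPLICATION = 1247
FULL_TEXT_READ = 241      # papers read in full, as reported
CONTENT_QUALITY = 40      # papers on the content quality of SO
SELECTED = 18             # papers that passed the quality test

# Derived, not stated in the text: duplicates removed in the first step.
duplicates_removed = RETRIEVED - AFTER_DEDUPLICATION

# Screening steps applied after deduplication: (label, records removed).
screening_steps = [
    ("inclusion and exclusion criteria", 156),
    ("abstract and conclusion screening", 850),
]

remaining = AFTER_DEDUPLICATION
for label, removed in screening_steps:
    remaining -= removed
    print(f"after {label}: {remaining} records")

# The running total should match the number of papers read in full.
assert remaining == FULL_TEXT_READ

# Overall share of retrieved records that were finally selected.
selection_rate = round(100 * SELECTED / RETRIEVED, 2)
print(f"duplicates removed: {duplicates_removed}")
print(f"selection rate: {selection_rate}%")
```

The running total after the two screening steps matches the 241 papers read in full, so the reported counts are internally consistent.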

III. RESULTS
In the following subsections, we present our results according to the research questions described in Section I-B above.

A. Netiquette Rules and SO-Norms in Stack Overflow (RQ1)
The NRs and SO-Norms violations extracted from the selected papers fall into two types, depending on whether the practice involved is negative or positive. As shown in Table II, there are 13 NRs and 6 SO-Norms violations found in the selected papers.
The results in Table II show that even in SO, a technical discussion forum, people may violate the generic NRs by using abusive and offensive words very frequently, while words pertaining to racial abuse are not frequently discernible [9]. Furthermore, users on SO accuse one another of posting spam or of being sarcastic or rude, and moderators should take action in consequence [9]. The moderation process, on the other hand, is not instantaneous [9], as moderation requires multiple users to flag content with NRs violations (e.g., harassment words) as offensive.
Although the features under 'SO-Norms Violations' in Table II are considered positive or neutral in normal circumstances, pleasantries including apologies, welcomes, excessive politeness and gratitude are considered SO-Norms violations and must be flagged [9]. As a consequence, moderators delete pleasantries as 'no longer needed'. This indicates that in technical forums such as SO, the majority of users do not like formality and excessive politeness. This is not the case for female users, who prefer using pleasantries in their comments [31] in order to encourage answerers to generate higher quality content in the future and to increase users' engagement in SO [29].

B. Academic Interest in NRs and SO-Norms Violations (RQ2)
Research interest in NRs and SO-Norms violations started in 2014 and gradually increased, with most attention gained within the past three years. Research interest slowed down in 2017, when the 'Heat Detection' bot was embedded in SO [1]. As shown in Figure 2, however, interest in this topic has generally increased since 2014, and the evidence suggests that detecting and removing NRs and SO-Norms violations in SO is currently an active topic.

C. Approaches Used in the Selected Papers (RQ3)
To answer RQ3, and within the scope of this paper, we looked at the general approaches used to analyse NRs violations and found that 7 papers (38.89% of the papers in our SMS) used manual qualitative and quantitative analysis methods, as depicted in row one of Table III. The remaining 61.11% of the papers are based on automated machine learning approaches such as Neural Networks [24], [27], [30], Logistic Regression Analysis [20], [25], [26], Random Forest and Support Vector Machine [1], and analysis tools such as SentiStrength [19], [23], [27]. In [28], a machine learning tool (Scikit-learn) with clustering techniques is proposed for use in future work.
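The two percentages follow directly from the paper counts. A minimal sketch of the calculation is given below; the count of 11 automated papers is derived as the remainder of the 18 selected papers and is not stated explicitly in the text.

```python
# Minimal sketch: share of selected papers per analysis approach.

TOTAL_PAPERS = 18
MANUAL_PAPERS = 7
# Derived as the remainder of the selected papers (an assumption of this sketch).
AUTOMATED_PAPERS = TOTAL_PAPERS - MANUAL_PAPERS


def share(count, total):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


manual_share = share(MANUAL_PAPERS, TOTAL_PAPERS)
automated_share = share(AUTOMATED_PAPERS, TOTAL_PAPERS)
print(f"manual: {manual_share}%, automated: {automated_share}%")
```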

IV. DISCUSSION
The aim of this SMS was to collect, analyse and interpret existing evidence pertaining to NRs and SO-Norms violations in SO. To the best of our knowledge, no previous systematic review or mapping study has been performed in this area. Out of the 1489 papers found on the content quality of SO, only 1.21% were found to be related to NRs and SO-Norms violations in SO. Many interesting results are found in the selected papers, including the NRs and SO-Norms taxonomy shown in Table II and the approaches to improve content quality, and consequently users' experience and engagement in SO, shown in Table III.
Many researchers have investigated which violations of community norms, such as SO-Norms and NRs, affect the content quality of SO and how they do so [1], [9], [19]-[34]. Analysing the literature in this SMS, we found that there are two types of violations, as shown in Table II: generic NRs violations and SO-Norms violations. Nine potential reasons that affect the engagement of new users in SO [32] are mainly related to NRs violations. The results of our SMS show that users violate the generic NRs using unwelcoming or abusive words very frequently and racially abusive words less frequently [1]. Rude comments, for example, are flagged and deleted quickly by the moderators, but even in that situation users may end up reading the comments directed against them. This makes new users, who are not accustomed to the culture of SO, feel frustrated and unwelcome, and consequently leads them to leave the community [32]. On the other hand, deleting low-quality content can lead to negative behaviour, debates and frustration expressed explicitly in users' comments [30]. Eventually, such negative behaviour could hurt SO's users and make the SO community appear hostile and unsupportive to new users. Therefore, there is a need for a prevention strategy to detect and delete content with NRs violations from users' comments [30].
In addition, positive comments and affective lexicon including apologies, wishes, excessive politeness, jokes and gratitude are considered SO-Norms violations and must be flagged and deleted by moderators as 'no longer needed' [29]. Half (50%) of the comments containing gratitude expressions belong to new users of SO, owing to their unfamiliarity with SO's norms [25]. Calefato et al. found that, regardless of user reputation (e.g., up/down vote count), successful, answerable questions are those that have a neutral emotional style and do not contain abusive features such as unnecessary use of uppercase characters [26]. On the other hand, Jiarpakdee et al. found that community-based features (e.g., up/down vote count, offensive vote count, spam vote count) and affective lexicon play an important role in question quality identification and question answerability [23]. Therefore, questions that are clear, directly interrogative and neither overly polite nor rude are more answerable than questions with an overly polite tone [20]. This indicates that in technical forums such as SO, the majority of users, except for female users [31], do not like formality, gratitude and/or excessive politeness.
Despite the availability of the HD bot and community moderators in SO [1], much offensive content bypasses the moderators and the bot through obfuscated offensive words; consequently, offensive language still exists (0.14%) and 0.3% of the total comments are abusive. Therefore, many techniques and tools have been developed to detect and remove NRs and SO-Norms violations in SO. Novielli et al., for example, proposed emotional interface designs, based on emotional intelligence, that can be embedded in the SO platform [19]. Similarly, Maftouni et al. proposed a 'Support' button embedded in SO that minimises toxic behaviour [34]. The button allows new users to report rude comments and express their opinions without waiting for a moderator to approve their report. In addition, as women post more apologetic content in SO than men [31], the 'Support' button could help embrace women as well as users with more collectivist attitudes. Likewise, a 'Conflict Reduction System' is proposed to identify offensive words and suggest changes to minimise offensive comments [1].
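To illustrate why obfuscated words can slip past a simple word-list filter, the following sketch normalises common character substitutions before the lexicon lookup. It is a toy example under loud assumptions: the substitution map, the two-word placeholder lexicon and the function names are all hypothetical, and it does not describe how the HD bot or any tool from the selected papers works.

```python
import re

# Hypothetical substitution map for common character swaps (an assumption).
SUBSTITUTIONS = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s", "!": "i"}
)

# Toy placeholder lexicon; a real system would need a curated word list.
LEXICON = {"idiot", "stupid"}


def normalise(token):
    """Undo simple obfuscation: character swaps, separators, stretched letters."""
    token = token.lower().translate(SUBSTITUTIONS)
    token = re.sub(r"[^a-z]", "", token)        # drop separators such as '.' or '-'
    return re.sub(r"(.)\1{2,}", r"\1", token)   # collapse runs of 3+ identical letters


def flag_tokens(comment):
    """Return the original tokens whose normalised form is in the lexicon."""
    return [word for word in comment.split() if normalise(word) in LEXICON]


print(flag_tokens("You are an 1d10t"))
```

A lookup on the raw tokens would miss '1d10t', whereas the normalised form matches the lexicon. Context-dependent offence, hedging and sarcasm remain out of reach for this kind of surface normalisation.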

V. FUTURE WORK
The results shown in Table II are a list of certain NRs and SO-Norms violation features. For example, the affective states (e.g., emotions, moods, harsh comments, avoiding offensive behaviour, negative emotions) presented in [21] are based on textual cues as the potential factors of success for an answer. However, there may be communication features beyond the textual cues found by Calefato et al. that were not explored in that study and that could be investigated using datasets from SO as well as other CQA platforms [21]. In order to identify NRs and SO-Norms violation features, researchers could use emoticon [35] and emoji [36] features in addition to the textual cues. Moreover, to obtain robust identification results, researchers may fuse the results of the textual cue identification system with those of the emoji/emoticon identification system.
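One simple way to fuse the two systems is a weighted average of their scores (late fusion). The sketch below assumes that each system outputs a violation score between 0 and 1; the function names, the weight and the threshold are hypothetical choices for illustration, not values taken from the selected papers.

```python
# Minimal late-fusion sketch (assumed score range 0..1, hypothetical defaults).

def fuse_scores(text_score, emoji_score, text_weight=0.7):
    """Weighted average of the textual-cue score and the emoji/emoticon score."""
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie between 0 and 1")
    return text_weight * text_score + (1.0 - text_weight) * emoji_score


def is_violation(text_score, emoji_score, threshold=0.5, text_weight=0.7):
    """Flag a comment when the fused score reaches the decision threshold."""
    return fuse_scores(text_score, emoji_score, text_weight) >= threshold


# A comment whose text looks mild but whose emojis signal hostility.
print(is_violation(text_score=0.4, emoji_score=0.9, text_weight=0.5))
```

With equal weights, a comment that the textual system scores below the threshold can still be flagged when the emoji/emoticon system scores it highly, which is the kind of complementary evidence the fusion is meant to capture. Other fusion schemes, such as taking the maximum score or training a classifier on both scores, are equally plausible.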
Castelle, for example, found 23 comments that contain offensive language but did not receive an 'offensive' flag from users, human moderators or the HD bot in SO [27]. This could be because users in SO write comments with hedging or abbreviations. In addition, many comments are offensive in context but not when standing alone, and thus neural network methods perform poorly [27]. This indicates a research gap in the automatic detection of offensive words that are hedged within the positive words of posts in SO.
Another important result of our SMS is that the moderation process in SO is not instantaneous: to delete a post with NRs and SO-Norms violations, moderators required responses from multiple users flagging the post as offensive. Nine comments with personal harassment words were not removed from SO [9]. While users reacted to NRs and SO-Norms violations in less than 2 hours, 76.35% of posts containing NRs and SO-Norms violations did not get downvoted at all. On the other hand, 148 of the original posters reacted to concerns about potential NRs and SO-Norms violations by editing their posts. The slow process of flagging and moderation in SO inspired Chen et al. to adopt a collaborative editing mechanism that fixes posts with SO-Norms violations prior to posting them on the forum [24].

VI. SUMMARY
This study aimed to build a comprehensive view of what the literature has reported on the factors that affect the content quality of SO, users' engagement and inclusivity. Through a thorough selection process, a total of 18 out of 1489 papers were selected to investigate NRs and SO-Norms violations in SO. Our SMS indicated an increase in studies on NRs and SO-Norms violations in SO and their influential factors. In addition, large datasets are needed not only to validate the results found in the selected papers, but also to identify communication cues, including textual, emoji and emoticon cues, embedded in CQA platforms. Furthermore, automated approaches are needed to validate the results of papers that used manual approaches.

Fig. 2. Academic Interest related to NRs and SO-Norms Violations

TABLE II. TYPES OF NRS AND SO-NORMS VIOLATIONS