Do CONTRIBUTING Files Provide Information about OSS Newcomers’ Onboarding Barriers?

Effectively onboarding newcomers is essential for the success of open source projects. These projects often provide onboarding guidelines in their CONTRIBUTING files (e.g., CONTRIBUTING.md on GitHub). These files explain, for example, how to find open tasks, implement solutions, and submit code for review. However, these files often do not follow a standard structure, can be too large, and miss information about barriers commonly faced by newcomers. In this paper, we propose an automated approach to parse CONTRIBUTING files and assess how they address onboarding barriers. We manually classified a sample of files according to a model of onboarding barriers from the literature, trained a machine learning classifier that automatically predicts the categories of each paragraph (precision: 0.655, recall: 0.662), and surveyed developers to investigate their perspective on the adequacy of the predictions (75% of the predictions were considered adequate). We found that CONTRIBUTING files typically do not cover the barriers newcomers face (52% of the analyzed projects missed at least 3 of the 6 barriers faced by newcomers; 84% missed at least 2). Our analysis also revealed that information about choosing a task and talking with the community, two of the most recurrent barriers newcomers face, is neglected in more than 75% of the projects. We made our classifier available as an online service that analyzes the content of a given CONTRIBUTING file. Our approach may help community builders identify missing information in the project ecosystems they maintain and help newcomers understand what to expect in CONTRIBUTING files.


INTRODUCTION
Newcomers to Open Source Software (OSS) projects encounter several barriers to making their first contribution [73]. For example, an overly complex codebase or a workspace that is challenging to build saps newcomers' motivation to contribute [76]. Research shows that these barriers discourage newcomers, who often give up before completing a single contribution [48].
To onboard, newcomers usually consult the project's documentation or contact the project team [22,34,41]. Yet project members are busy making their own contributions, can only help a limited number of newcomers at a time, and may not be able to manage synchronous communication due to time zone differences [23,35]. For onboarding newcomers, appropriate documentation is more efficient and scalable [71,74].
Unfortunately, most OSS projects' existing documentation is either low quality or non-existent [1,41,76]. Some studies point to problems such as documentation files that are incorrect, incomplete, and outdated [1,48]. Other studies identified further documentation barriers for newcomers, including unclear and scattered documentation, with an overload of unimportant information sharply contrasting with missing necessary information [73]. These and other documentation deficiencies affect all contributors but weigh more heavily on newcomers, since they need to orient themselves in a new environment [34].
Previous work [74,78] showed that newcomers were better oriented and understood the process better when the right information was provided in an organized way. However, little work has specifically focused on enhancing newcomers' documentation and identifying what information is missing from contribution guidelines [74]. With the goal of improving this situation, we automatically analyze and classify the distinct information types contained in existing CONTRIBUTING documentation aimed at newcomers. In this study, we answer the following research questions: RQ1. How accurately can we automatically classify the content of CONTRIBUTING files in GitHub projects? RQ2. To what extent do OSS projects' CONTRIBUTING files cover content related to newcomers' contribution barriers?
To answer these questions, we created an oracle by manually annotating the CONTRIBUTING files from 500 software projects according to the newcomers' barriers model proposed by Steinmacher et al. [73]. Then, we trained a machine learning classification model that identifies Steinmacher et al.'s six categories of barriers (precision: 0.655, recall: 0.662). The results were further validated through a survey with experienced developers, who agreed that ≈75% of the predicted categories were adequate. Finally, we used our model across 2,274 publicly available projects to better understand to what extent contribution files cover information about the barriers faced by newcomers. We found that CONTRIBUTING files are woefully inadequate for supporting newcomers: most contain four or fewer of the six expected categories of onboarding barriers, and thousands of projects (≈65%) do not have a contribution file at all.
To help researchers and practitioners build upon our work and its categorization model, we developed an online service that automatically analyzes the content of a given CONTRIBUTING file (http://contributing.streamlit.app/). Our tool offers potential benefits for project maintainers and community managers by allowing them to evaluate and enhance their CONTRIBUTING files based on feedback from our classifier. This becomes especially significant in ecosystems comprising multiple projects, where community managers oversee various distinct projects. Achieving consistency across projects is crucial to reduce cognitive load and promote smooth transitions between projects within a software ecosystem. Maintaining CONTRIBUTING files is a recognized best practice for assisting newcomers onboarding to open source software projects [71].

RELATED WORK
In this section, we highlight related studies about documentation in OSS, the automatic categorization of software engineering artifacts, and barriers newcomers face in OSS projects.
Documentation issues in OSS repositories. Documentation plays a crucial role in software projects, and deficiencies in documentation files can hinder their utility for developers [39,49,68]. Lethbridge et al. [42] identify that documentation files contain excessive information, are hard to maintain, and make it challenging to locate helpful information. Such considerations are also present in the context of OSS communities [1,18,77]. According to Dias et al. [15], from the perspective of OSS developers and maintainers, OSS contributors need to ensure the quality and consistency of documentation files. Our study helps process existing documentation files and classify content relevant to newcomers, supporting maintainers in identifying missing information in their contribution guidelines and newcomers in locating relevant information.

Automatic classification of software engineering artifacts. Several studies have automated the categorization of artifacts in software engineering [27,43,62]. For example, Prana et al. [61] broke down the headers of README files in OSS repositories into eight categories of information. Based on the manual annotation of 4,226 README file sections, the authors implemented a classification model that automatically identifies the category of a section in a README file. They argue that labeling sections makes the knowledge discovery process easier for visitors. We followed a similar method and share their idea that categories may help navigate the information space, especially for outsiders. For a different type of documentation file, Robillard and Chhetri [65] categorized text fragments from API documentation based on their relevance to programmers. The authors proposed a coding guide and an automated technique to classify text fragments into three levels of relevance for programmers. A wide variety of studies explores the automatic categorization of information in software-related artifacts (e.g., [45,60,87]), but our study is among the first to automatically categorize information in contribution guidelines to address newcomers' contribution barriers.
Newcomers in OSS communities. Supporting and engaging newcomers increases the likelihood that they complete their contributions, which is essential for the long-term viability of OSS projects [25,57,78,81]. Without adequate retention, project development slows, jeopardizing the existence of such communities [80]. To study this issue, researchers identified different obstacles newcomers face in the onboarding process, focusing on the period between their initial contact with the OSS community and their first contribution [1, 76-78].
Steinmacher et al. [73] propose a taxonomy of 58 barriers newcomers face when joining OSS projects. Documentation issues appear as a central source of problems for newcomers, including the already mentioned challenges of information overload, scattered and outdated documentation, and lack of necessary project information. Some researchers investigated how existing approaches support newcomers' onboarding. More specifically, they focus on understanding labels that guide newcomers in choosing their tasks [80,81], exploring the role of Q&A websites in onboarding [84], and code visualization [57]. Other studies discuss how documentation can both help and cause problems, and how it may impact the newcomers' experience [47,55]. We believe our results inform OSS projects toward better supporting newcomers with the information they need when joining a project.
It is clear from the literature that documentation is critical for onboarding newcomers in OSS. Despite the efforts in categorizing artifacts related to project documentation, no body of knowledge exists about the appropriateness of contribution guidelines for onboarding newcomers. In this paper, we address this gap by analyzing the content of CONTRIBUTING files from OSS repositories in terms of the barriers newcomers face.

RESEARCH METHOD OVERVIEW
To answer our research questions, we manually analyzed CONTRIBUTING files from 500 projects and built a classifier to label information known to be relevant for newcomers. According to GitHub's guidelines [31], CONTRIBUTING files are where one should "create guidelines to communicate how people should contribute to your project." Additionally, the Open Source guide [53] reinforces that "a CONTRIBUTING file tells your audience how to participate in your project... [and] is an opportunity to communicate your expectations for contributions." Therefore, newcomers expect to find in them relevant information to avoid common onboarding barriers [73].

Figure 1: Research method followed in this research, from building the corpus to assessing the classifier.
The research method was conducted in six steps, as presented in Figure 1: (1) We extracted the CONTRIBUTING files from 2,913 OSS projects hosted on GitHub. (2) The paragraphs of a random sample of files were manually annotated. (3) The annotated paragraphs were pre-processed and then converted into statistical features (i.e., term frequency-inverse document frequency) and heuristic-based features (in which a rule-based approach was performed). (4) We trained five different classification models with these features and compared their performances. (5) We surveyed developers to assess the quality of the classifications. (6) Finally, we used our model to classify the content of 2,274 CONTRIBUTING files and to understand to what extent they cover the onboarding barriers.
All scripts, models, data, and results are available in our replication package [20]. In the following, we present more details of the method and the results of each step.

BUILDING THE CORPUS
To train our models we collected and manually categorized the content of CONTRIBUTING files from a set of OSS projects.

Categories Definition
We manually labeled each paragraph of the 500 CONTRIBUTING files according to the way Steinmacher et al. organized the categories of barriers on the FLOSScoach portal [78]. The portal was created based on a barriers model built from a systematic literature review, interviews with multiple stakeholders, and surveys within OSS communities, providing a comprehensive aggregation of the barriers newcomers face when joining OSS projects. In addition to the comprehensiveness of the model, we chose to follow these categories since the work by Steinmacher et al. [74,78] showed that organizing the information into these categories lowered the barriers related to orientation and the contribution process. These are the categories we used:

CF - Contribution flow: Derived from the "Newcomer Orientation" barrier category and mapped to the contribution flow shown under "How to Start" in FLOSScoach, this category defines the steps that a newcomer needs to follow to contribute to the project. It appears, for example, as an ordered list of steps to follow or as a set of paragraphs describing the current project workflow.

CT - Choose a task: Also derived from the "Newcomer Orientation" barrier and mapped from the "Choose a Task" menu item in FLOSScoach, this category explains how newcomers can find a task (or issue) to contribute to the project. It may also contain descriptions of different types of tasks appropriate for newcomers.

TC - Talk to the community: Related to the "Communication Issues" barriers, this category refers to information about how a newcomer can get in touch with community members and how to find a mentor. It includes, for example, links to communication channels, communication etiquette, community guidelines, and tutorials on how to start a conversation.

BW - Build local workspace: Mapped from the "Local Environment Setup Hurdles" category, this category covers the steps a newcomer needs to follow to build the local workspace. It may include instructions such as bash commands and changes to computer settings.

DC - Deal with the code: Derived from "Code/Architecture Hurdles," this category describes how newcomers should deal with the source code. It may contain code conventions, descriptions of the source code, and guidelines on how to write code for the project.

SC - Submit the changes: Directly mapped from "Change Request Hurdles," this category represents information about how newcomers should submit a contribution to the project.

Data Collection
4.2.1 Project Selection. We selected the most popular OSS repositories hosted on GitHub when we started the data collection (Aug 2020), written in at least one of the top 10 programming languages used on the platform. We selected projects based on their popularity and programming language to avoid repositories that were toy projects or unrelated to software development. The selection of projects by popularity was based on the study of Borges et al. [8], which discusses stars as a unit to measure the popularity of OSS projects on GitHub and shows that, in their population, "three out of four developers consider the number of stars before using or contributing to a GitHub project." In addition, this is a fairly common way to sample projects on GitHub [33,59,61,63,75].

Table 1: Number of projects removed per language and their respective reasons for exclusion. The "n" value represents the total number of projects collected for a language. CONTRIBUTING files may have been excluded for more than one reason.
To identify the top 10 most-used languages, we used the ranking provided by GitHub Octoverse [29], which showed, at that time: JavaScript, Python, Java, PHP, C#, C++, TypeScript, Shell, C, and Ruby. We aimed to get the first 1,000 projects per language ranked by stars. However, the GitHub API provides only a few pages containing the top projects, and we could not collect 1,000 projects for some languages. In total, we collected 9,514 repositories.
To ensure that all the selected repositories had a valid CONTRIBUTING file, we defined a set of filters to remove projects from our dataset. We removed from our sample projects whose CONTRIBUTING file was: i. missing - we focused only on projects that followed GitHub's guidelines to keep information about how to contribute in this specific file; ii. smaller than 0.5kB - to filter out files that redirect to guidelines not hosted on GitHub, as well as empty files; iii. written in a language other than English; iv. not in Markdown format - which was the most prevalent format in our sample (3,295 out of 3,459 projects that had a CONTRIBUTING file were in Markdown - 95.2%). Table 1 shows the number of projects per programming language removed from our dataset. The final set of repositories comprised 2,915 projects. After applying the filters, we kept a diverse set of projects in terms of the number of contributors, forks, pull requests, and stars (see Figure 2). The programming languages with the highest number of repositories included in the analysis were TypeScript, JavaScript, and Ruby.
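For illustration, the sketch below shows one way such a star-ranked collection could be performed through the GitHub search REST API. The token, the helper name, and the page counts are our assumptions, not the paper's code; the search endpoint returns at most 1,000 results per query, which matches the limitation described above.

```python
import requests

TOKEN = "<personal access token>"  # hypothetical credential
LANGUAGES = ["JavaScript", "Python", "Java", "PHP", "C#",
             "C++", "TypeScript", "Shell", "C", "Ruby"]

def top_repos(language, pages=10, per_page=100):
    """Fetch up to pages*per_page repositories for a language, ranked by stars."""
    repos = []
    for page in range(1, pages + 1):
        resp = requests.get(
            "https://api.github.com/search/repositories",
            headers={"Authorization": f"token {TOKEN}"},
            params={"q": f"language:{language}", "sort": "stars",
                    "order": "desc", "per_page": per_page, "page": page},
        )
        resp.raise_for_status()
        repos.extend(resp.json()["items"])
    return repos

# The search API caps results at 1,000 per query, so some languages yield fewer projects.
dataset = {lang: top_repos(lang) for lang in LANGUAGES}
```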

Documentation Formatting.
To prepare the projects for the qualitative analysis, we converted the contribution files into spreadsheets. Each spreadsheet maps to all paragraphs of one contributing file in our sample. The first column of each row contained, in plain-text format, one paragraph of the documentation file for the respective project. We followed the definition of a paragraph provided by the specification of GitHub Flavored Markdown [30], which specifies it as "a sequence of non-blank lines that cannot be interpreted as other kinds of blocks." To facilitate the work of the annotators, we created headers for six columns, each representing one of the six categories we aimed to identify during the qualitative analysis.
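As an illustration, the following sketch splits a Markdown file into paragraph-like blocks on blank lines. It only approximates the GFM definition quoted above (a full Markdown parser would also distinguish code fences, lists, and headings), and the file name is a placeholder.

```python
import re

def split_paragraphs(markdown_text):
    """Split on blank lines, approximating GFM's notion of a paragraph."""
    blocks = re.split(r"\n\s*\n", markdown_text)
    return [b.strip() for b in blocks if b.strip()]

with open("CONTRIBUTING.md", encoding="utf-8") as f:
    paragraphs = split_paragraphs(f.read())
```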

Data Annotation
After transforming the CONTRIBUTING files into spreadsheets, we conducted the annotation process, annotating a total of 500 spreadsheets (from 500 projects). In the first step, two annotators labeled 30 spreadsheets from a random subset of projects and discussed how the categories should be assigned to each paragraph. To measure the agreement between the annotators, they independently labeled the spreadsheets in three consecutive stages, consisting of 10 spreadsheets per stage. The annotation consisted of analyzing and labeling each paragraph according to the categories presented in Section 4.1. At the end of each stage, the reviewers compared their labels and discussed their differences to align their understanding of each category. We used Cohen's kappa coefficient to measure the agreement between the annotators [13]. After the first stage, the annotators reached an agreement of 73% and discussed the potential meaning of the categories. For the other two stages, the agreement was 85% and 79%, respectively. The overall agreement between the annotators was 79%, which was considered sufficient given the multi-class nature of the data.
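For reference, the agreement measure can be computed with scikit-learn, as in this minimal sketch with hypothetical labels for the same paragraphs:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from the two annotators for the same five paragraphs.
annotator_a = ["SC", "BW", "CT", "SC", "TC"]
annotator_b = ["SC", "BW", "TC", "SC", "TC"]
print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```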

Documentation Annotation.
After reaching a substantial agreement, the reviewers proceeded to analyze the remaining files, which were split between them. A total of 500 spreadsheets were annotated during the qualitative analysis, resulting in 20,733 paragraphs analyzed. We had to dismiss 66 files that did not present any information about the six categories of barriers; these were replaced by 66 other files from our dataset. After the replacement, we ended up with 19,961 paragraphs.

Corpus Characterization
In Figure 3, we present the distribution of paragraphs analyzed per file. The average number of paragraphs for our set of 500 projects was 41, and the median was 29. Two projects had only 2 paragraphs (minimum), and one had 422 paragraphs in a single file (maximum). Table 2 shows the distribution of categories in our sample. More than 6,000 paragraphs were categorized as "Submit the changes," and more than 2,000 as "Deal with the code" and "Contribution flow." On the other hand, "Choose a task" and "Talk to the community" appear in only 116 and 183 paragraphs, respectively. Still, 7,461 paragraphs could not be assigned to any category. We analyzed these paragraphs and found different types of content that did not belong to any category. The most recurring cases were: "thank you" messages; license statements or the complete license; links to other sites; instructions on how to open an issue; information in different languages; GitHub badges; and lists of contributors.

BUILDING AND EVALUATING THE CLASSIFIER
We trained machine learning models to classify information according to the six categories defined in Section 4.1. The 500 annotated spreadsheets were used to extract features for classification (Section 5.1). The data was prepared using text pre-processing techniques. The features created were divided into statistical features (i.e., extracted using statistical methods) and heuristic features (i.e., extracted by identifying linguistic patterns).
The features were then fed to supervised learning algorithms to find the best classification model for this problem (Section 5.2). The annotated dataset was split into two random subsets: a training set (80% of the dataset) used to compare the different classifiers, and a test set (20% of the dataset) reserved for testing the classification algorithm with the highest evaluation score. The algorithms were evaluated based on their classification scores, and a final model was trained using the best-performing classification algorithm (Section 5.5). Figure 4 provides an overview of the classification process, which is detailed in the following sections.

Feature Extraction
In the feature extraction process, the annotated paragraphs (Section 4.3) were converted into numerical data. We divided the feature extraction process into four stages: text pre-processing, definition of statistical features, definition of heuristic features, and feature selection.

Text Pre-processing.
Before creating any features for the classifier, three pre-processing techniques commonly used in text classification were applied to the paragraphs: lemmatization, stop-word removal, and punctuation removal. In the lemmatization process, the affixes of words in each paragraph were removed, turning the words back into their root form [37]. Words such as submits, submitted, and submitting, for example, were reduced to their root form submit. To reduce the number of ineffective words in the paragraphs' classification, we also removed stop words, excluding words commonly found in the English vocabulary (e.g., conjunctions and pronouns) [86]. For the same purpose, punctuation was also removed from the text. For both lemmatization and stop-word removal, we used the implementations provided by the NLTK library [44].
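A minimal sketch of this pipeline follows, assuming the NLTK resources are downloaded; the paper does not specify its exact tokenizer or lemmatizer configuration, so the choices below are illustrative.

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK resources.
for resource in ("punkt", "stopwords", "wordnet"):
    nltk.download(resource)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(paragraph):
    """Lowercase, tokenize, drop stop words and punctuation, then lemmatize."""
    tokens = word_tokenize(paragraph.lower())
    kept = [t for t in tokens if t not in stop_words and t not in string.punctuation]
    # pos="v" reduces verb forms (submits/submitted/submitting -> submit);
    # a fuller treatment would tag each token's part of speech first.
    return " ".join(lemmatizer.lemmatize(t, pos="v") for t in kept)
```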

Statistical Features.
We converted the annotated paragraphs into TF-IDF features using the TfidfVectorizer method of the scikit-learn library [6,58]. In this approach, we represented words as n-grams of sizes one and two [5]. The acronym TF-IDF refers to the product of two statistical measures used in text classification: term frequency (TF) and inverse document frequency (IDF) [38,82]. For term frequency, we measured how often words occur in a paragraph (number of occurrences of each word per paragraph, divided by the total number of words in that paragraph). For inverse document frequency, we counted how often words occur across the entire set of paragraphs. The product of both measures gives us statistical features that capture the relative importance of each word.
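A minimal sketch of this step on a toy corpus; in the study, the inputs are the pre-processed paragraphs rather than the two placeholder strings used here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the pre-processed paragraphs.
paragraphs = ["fork repository submit pull request",
              "run make build local workspace"]
vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X_tfidf = vectorizer.fit_transform(paragraphs)
print(X_tfidf.shape)  # (number of paragraphs, number of distinct n-grams)
```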

Heuristic Features.
The set of statistical features was combined with heuristics found through qualitative analysis to enrich the characteristics used in classification. We adopted a strategy used by previous work [56,61], in which features are generated by analyzing linguistic patterns in the annotated paragraphs. During the manual analysis, the annotators selected words that could characterize specific categories and be used as patterns for a heuristic-based classification. For example, the word "commit" was commonly found in paragraphs annotated as Submit the changes (see Table 3 for examples of other categories). Using the rule-based matching approach of the spaCy library [72], we assigned an equal set of heuristic features to each paragraph in the training process. Each feature represented a pattern; paragraphs were assigned the value 1 when they contained the respective words and 0 otherwise.
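The sketch below illustrates the rule-based matching idea with spaCy's Matcher. The keyword patterns shown are hypothetical stand-ins for those the annotators derived (see Table 3), and only two categories are wired up for brevity.

```python
import spacy
from spacy.matcher import Matcher

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Hypothetical keyword patterns; the paper's actual patterns come from the
# annotators' qualitative analysis.
matcher.add("SC", [[{"LEMMA": "commit"}], [{"LOWER": "pull"}, {"LOWER": "request"}]])
matcher.add("BW", [[{"LEMMA": "install"}], [{"LEMMA": "build"}]])

CATEGORIES = ("SC", "BW")

def heuristic_features(paragraph):
    """Return one binary feature per pattern set: 1 if it fires, 0 otherwise."""
    doc = nlp(paragraph)
    fired = {nlp.vocab.strings[match_id] for match_id, _, _ in matcher(doc)}
    return [int(name in fired) for name in CATEGORIES]

print(heuristic_features("Install the dependencies, then commit your changes."))
# -> [1, 1]
```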

Feature Selection.
To avoid using features that could be considered irrelevant to our classification, we removed the ones with the lowest scores. The SelectPercentile method [69] of the scikit-learn library was used with Chi-square as the score function. Features that fell below the 15th percentile were removed. We manually tested a set of percentiles (5th, 10th, 15th, 20th) based on the default value of scikit-learn [69], which is the 10th percentile. We chose the 15th percentile as it performed best, as is commonly done in the literature [4,14,70].
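A minimal sketch of this step, where X and y stand for the combined feature matrix and paragraph labels assumed from the previous steps:

```python
from sklearn.feature_selection import SelectPercentile, chi2

# chi2 requires non-negative features, which holds for TF-IDF values
# and the binary heuristic flags; percentile=15 keeps the top-scoring 15%.
selector = SelectPercentile(score_func=chi2, percentile=15)
X_selected = selector.fit_transform(X, y)
```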

Finding the Best Classifier
A set of classifiers was trained to find the best learning algorithm for our classification problem. To train the classifiers, we used two multi-class training strategies: one-vs-rest (OvR) and one-vs-one (OvO) [7]. In the OvR strategy, a binary classifier was trained for each category. The assignment of a category to a paragraph was then made by identifying the binary classifier that best represented the respective paragraph (i.e., the one with the best scores). In the OvO strategy, the samples of each category were grouped in pairs, and the comparison was made by a binary classifier for two categories at a time. To identify which category should be assigned to a paragraph, the predominance of a category among all the pairs was used as the decision method.
The following classification algorithms were trained during this step: RandomForestClassifier, KNeighborsClassifier, LinearSVC, MultinomialNB, and LogisticRegression. The selection was based on similar studies using text classification in software engineering [56,61]. As a baseline, we trained two dummy classifiers, one using the most frequent class label observed in the training set and one providing completely random predictions. As highlighted in Table 2, the number of instances per category was unbalanced in our dataset, so we used the SMOTE oversampling technique to achieve a better balance between the classes. The SMOTE algorithm was implemented using the imbalanced-learn library [40], a module designed for unbalanced datasets that is recommended by the scikit-learn community. Additionally, we used chatGPT (GPT-3.5 model) [54] with a few-shot learning approach [10] to compare our results with the performance of this LLM. For the few-shot learning, we randomly selected 12 instances of paragraphs from our training set for each category. Then, we prompted chatGPT to classify 200 instances randomly selected from our test set.
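The comparison could be set up as in the following sketch, where X and y are assumed from the feature extraction steps and the pipeline layout is ours. Placing SMOTE inside an imbalanced-learn Pipeline ensures oversampling is applied only to the training folds; the configuration that ultimately won (reported below) drops the SMOTE step.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

candidates = {
    "LinearSVC": LinearSVC(),
    "RandomForest": RandomForestClassifier(),
    "kNN": KNeighborsClassifier(),
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "dummy_most_frequent": DummyClassifier(strategy="most_frequent"),
    "dummy_random": DummyClassifier(strategy="uniform"),
}

for name, clf in candidates.items():
    # SMOTE inside the pipeline oversamples only the training folds.
    pipe = Pipeline([("smote", SMOTE()), ("ovr", OneVsRestClassifier(clf))])
    scores = cross_val_score(pipe, X, y, cv=10, scoring="f1_weighted")
    print(f"{name}: mean weighted F1 = {scores.mean():.3f}")
```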

Evaluation Metrics
To measure the overall performance of the classifiers, we used a combination of three evaluation metrics for data classification: precision, recall, and F1 score. Precision, also known as confidence, gives the proportion of samples predicted as positive that were correctly predicted [28,85]. Recall gives the fraction of positive samples correctly predicted by the classifier, and the F1 score is the harmonic mean of precision and recall [12,36]. Since this is a multi-class problem, the reported values are the weighted averages over all classes.
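A minimal sketch of how these weighted metrics are computed with scikit-learn, using hypothetical gold labels and predictions:

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = ["SC", "BW", "SC", "CT", "TC"]  # hypothetical gold labels
y_pred = ["SC", "BW", "SC", "SC", "TC"]  # hypothetical predictions
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```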

Cross-validation
We tested the performance of our classifiers using a nested ten-fold cross-validation strategy [17,52]. This algorithm divides the dataset of features and labels into ten parts, which are combined into ten different training and validation subsets, also known as folds. For each fold i (i = 1, ..., 10), the i-th part is used as the validation set, and the remaining parts are used to train each classification algorithm in our list. The average of the weighted F1 scores across the ten folds gives an overall performance estimate for each learning algorithm.
To increase the chance of selecting the best parameters for each algorithm, we applied the GridSearch method in the internal loop of the cross-validation. The values tested were based on the default values in the scikit-learn library. We selected the best configuration (i.e., classifier, parameters, and training strategy) to train a final classification model.
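A sketch of the nested scheme follows; the parameter grid is illustrative rather than the paper's exact search space, and X_train/y_train are the 80% split described earlier.

```python
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import LinearSVC

# Inner loop: GridSearchCV tunes hyper-parameters.
# Outer loop: cross_val_score estimates performance of the tuned model.
param_grid = {"C": [0.1, 1, 10], "tol": [1e-3, 1e-4], "max_iter": [1000]}
inner = GridSearchCV(LinearSVC(), param_grid, cv=10, scoring="f1_weighted")
outer_scores = cross_val_score(inner, X_train, y_train, cv=10, scoring="f1_weighted")
print(f"Nested 10-fold weighted F1: {outer_scores.mean():.3f}")
```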

Classifier Training
With the best learning algorithm selected, we trained a classifier using the complete dataset from the previous step (training set in Fig. 4) and used the held-out test set (Fig. 4) for testing. This was done to show the reliability of both the model and the results [56,61,69]. This enabled us to train on our complete annotated training set (80% of the sample) and test on the 20% of the original annotated dataset that had not been used previously.
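A sketch of this final step with the winning configuration (LinearSVC, OvR, no oversampling); the hyper-parameter values are those reported in the results below, and the split variables are assumed from the earlier steps.

```python
from sklearn.metrics import classification_report
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Train on the full 80% training split, evaluate once on the held-out 20%.
final_model = OneVsRestClassifier(LinearSVC(max_iter=1000, C=1, tol=0.001))
final_model.fit(X_train, y_train)
print(classification_report(y_test, final_model.predict(X_test)))
```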

Classifier Human Evaluation
To further evaluate our classification model, we surveyed 46 individuals using Amazon Mechanical Turk [2]. We invited only individuals with prior programming experience by specifying "Employment Industry - Software & IT Services" as a selection criterion on the Amazon platform. The survey was divided into training, evaluation, and demographics sections. In the training section, we introduced the survey and described the six categories of information used to classify the content. To guarantee that we would only consider participants who paid attention to our questionnaire, we asked them to match each category with its corresponding definition on a subsequent page. An attention check question was also included among the questions of our questionnaire. We dismissed the answers of 29 participants who selected the wrong definition for more than one category (28) or did not mark the correct answer on the attention check (6); 5 of them failed both checks. We ultimately considered 17 valid answers for analysis. Although the number of remaining participants is small due to the substantial number of discarded answers, this is a common limitation of crowdsourcing platforms, as suggested in recent studies [46,64,79]. The literature also mentions that cleaning the answers is necessary to guarantee data quality and consistency [64].
In the evaluation section, we asked participants to judge the quality of the predictions. We used paragraphs from the CONTRIBUTING files of 75 randomly selected projects that were not part of the training or test set, and thus were unknown to the classifier. From that set of OSS projects, we randomly selected ten paragraphs classified into each category to use in the questionnaire. Each participant was randomly assigned 18 paragraphs (3 per category). We gave each participant the same number of paragraphs per category (instead of a complete file) to ensure the number of paragraphs assessed per category was balanced. Moreover, we used an approach in which participants recognize whether an item belongs to a category (aided recall) instead of asking them to label the items themselves (unaided recall), because recognition items are easier to answer than unaided-recall items [9]. We thus provided annotated paragraphs and asked whether the assigned category was correct. We asked each participant to rate each prediction using a 4-point Likert scale (extremely adequate to extremely inadequate). We employed a 4-point scale to compel respondents to take a definitive stance, preventing the use of a neutral option [11].
In the demographics section, we asked participants to provide information about their experience with OSS projects. This section included two multiple-choice questions about years of experience in programming and in the maintenance of OSS projects, and two questions about their role as participants in OSS (coder or non-coder) and their contributions to documentation in OSS projects (Yes/No). In Table 4, we present the overall experience of our participants in programming and OSS projects. As expected, all of our participants had some experience in programming, with the majority of them (64%) having between 3 and 15 years of programming experience. In terms of experience as maintainers of OSS projects, 82% of our participants had at least some experience, and 41% had between 3 and 15 years of experience as software maintainers. All 14 participants with experience in OSS defined themselves as coders, and 10 worked with documentation in their projects.

The following sections detail the evaluation of the machine learning models.

Comparing Different Classifiers
To identify the best classifier for our problem, we compared the outputs of five machine learning algorithms and two dummy algorithms in a ten-fold cross-validation process, in addition to chatGPT with a few-shot learning approach [10]. Table 5 presents the F1 scores for each classifier. The best F1 score, 0.652, is from the LinearSVC classifier, without oversampling and using the OvR strategy.
The second-best score is from the same classifier configuration but using the OvO multi-class strategy. The performance of chatGPT with the few-shot approach reached an overall macro precision of 0.250, recall of 0.322, and F1 of 0.272. Ignoring the dummy classifiers and chatGPT, the classification model with the worst scores (0.516 and 0.530) was kNN. These results follow the similar performance found by Prana et al. [61], who categorized the content of README files. Because of its scores and similar performance in other studies, LinearSVC was chosen as the final machine learning algorithm. Based on the outputs of the GridSearch algorithm, we found that the best hyper-parameters for LinearSVC were 1,000 iterations (max_iter = 1000), regularization equal to one (C = 1), and tolerance equal to 0.001 (tol = 0.001). The LinearSVC algorithm was then retrained with this final configuration, without oversampling and using the OvR strategy, as this combination provided the best F1 score in our comparison of classifiers. Table 6 presents the training data and the performance per class on the test set.
In Table 6, we can see that the performance varies per category. The information about the Deal with the code (DC) and Build local workspace (BW) barriers is fairly well predicted (F1 of 0.711 and 0.716, respectively). On the other hand, Choose a task (CT) and Contribution flow (CF) had the lowest scores, 0.379 and 0.345, respectively. Some external factors may have influenced these performances. The number of instances per class, for example, might explain the low score of Choose a task (CT), which on average had only 1 paragraph per analyzed project (see Section 4.4). The fact that Contribution flow contained more generic information than other, more content-specific categories, such as Build local workspace, might also explain the difference in performance.
For the sake of comparison, we include the per-class results of the chatGPT few-shot learning approach in Table 7. As can be observed, the LinearSVC model outperforms chatGPT in this context in almost all metrics and categories. The exceptions are the recall values for the Talk to the community and Contribution flow categories. Interestingly, chatGPT could not correctly identify any paragraph belonging to the Choose a task category, although the test sample contained 5 such instances and it (incorrectly) predicted 3 paragraphs as belonging to it. Considering the overall metrics, the LinearSVC model is more than 2x better than chatGPT in terms of recall, precision, and F1.

Confusion between categories. In Figure 5, we present the confusion matrix produced by the final classification model. Using a confusion matrix, we can assess the similarity between the different categories of information and verify which labels contain false positives. The main diagonal represents the true positives for each class, and the upper and lower triangular submatrices represent the misclassifications.
In line with the previous results, Contribution flow (CF) and Choose a task (CT) are the categories of information with the highest amount of misclassification, with only 26% true positives. Contribution flow (CF) paragraphs were more often classified as Deal with the code (DC) than as their own category. These results reinforce the assumption that Contribution flow (CF) performed poorly because it contains a wide range of information, and Choose a task (CT) because it has just a few samples for training.
All other categories had more than 50% true positives. Build local workspace (BW) and Deal with the code (DC) had the lowest number of false positives (< 25%). Talk to the community (TC) also presented good performance, with less than 36% incorrect predictions. This may be because such categories contain more specific content and a good number of samples per class.

Observations from the Survey
In Figure 6, we present the participants' evaluation of the predictions made by our final model. For all the categories, at least 30% of the predictions were considered extremely adequate for their paragraphs, and at least 69% of the predicted categories were considered at least somewhat adequate. The best-evaluated category was Build local workspace (BW), with 47% of participants considering its predictions extremely adequate. When we aggregate extremely adequate and somewhat adequate, Deal with the code (DC) leads the adequacy board with 82% of predictions considered adequate. Contribution flow (CF) has the lowest estimates, with 31% of its predictions rated somewhat or extremely inadequate. The second-to-last places are held by Choose a task (CT) and Talk to the community (TC), with 29% of their predictions rated somewhat or extremely inadequate. These results align with the evaluation scores in Table 6, corroborating our predictions.
To further understand the disagreement between the classifier output and the crowd, we manually analyzed the 12 paragraphs for which 50% or more of the respondents disagreed with the predictions. In summary, we found that the prediction was incorrect in 9 cases. In the other 3 cases, the prediction was correct (2 related to Choose a task and one related to Build local workspace).
Answer to RQ1: After comparing five supervised learning algorithms, we were able to classify the content of CONTRIBUTING files, achieving an F-measure of 0.651, with precision = 0.655 and recall = 0.662. Although the categories of information differed in performance, in general at least 69% of the classifications in each category were considered appropriate by external reviewers.

RQ2. TO WHAT EXTENT DO CONTRIBUTING FILES COVER CONTENT RELATED TO CONTRIBUTION BARRIERS?
We used the classification model to classify the remaining 2,274 CONTRIBUTING files from our dataset, which had not been used in the previous steps. From the 9,514 collected repositories, we removed 6,599 because they did not meet the filtering criteria presented in Section 4.2.
Figure 7 shows the distribution of projects and the average number of paragraphs per category in the CONTRIBUTING files in which each category appeared at least once. A total of 2,265 (99.6%) projects had at least one paragraph that did not belong to any category, with an average of 15 unidentified paragraphs per file. Submit the changes was the category with the highest number of paragraphs per CONTRIBUTING file, appearing in 2,192 projects. The Deal with the code category had the second-highest average of paragraphs per CONTRIBUTING file and was second in the number of projects, being identified in 1,660 projects with an average of 4 paragraphs per file. Contribution flow was the category with the third-highest frequency, appearing in 1,648 projects, with an average of only two paragraphs per project. A similar pattern held for Build local workspace (1,162 projects; 2 paragraphs/project). Talk to the community (513 projects) and Choose a task (332 projects) were in the lowest positions.
Regarding the frequency of categories per project, not all categories are covered by the CONTRIBUTING files (see Figure 8). From our set of 2,274 OSS projects, we identified 729 with content related to four of the six categories (32%), 603 related to three (27%), and 411 containing only two categories (18%). For 287 projects, we identified information about five categories, and for only 65 projects (6%) the classifier identified information about all six categories. On the lower bound, only one category of information was identified for 165 projects. We also found 14 projects where no categories were identified. In a manual inspection of their CONTRIBUTING files, we detected that none of them presented any information that could be mapped to the six categories, validating the analysis made by the classifier. While some presented ways to report an issue, others contained links to contribution guidelines elsewhere (some on the GitHub wiki, others outside GitHub). The distribution of categories is in line with the distribution of the 500 projects manually annotated during the qualitative analysis, providing further evidence of the adequacy of the classifier. The only differences are that no manually analyzed project had zero categories of information and that the Contribution flow category had a slightly higher average of paragraphs per CONTRIBUTING file.
In summary, more than 50% of the CONTRIBUTING files present information pertaining to three or fewer categories of barriers faced by newcomers, while only 15% present information classified into 5 or 6 different categories. These results, in addition to the fact that more than 60% of the projects collected do not have a CONTRIBUTING file (Table 1), evidence that this highly relevant resource for new contributors is still inadequate for mitigating the barriers newcomers face. In particular, the lack of content about Choosing a task (CT) and Building the workspace (BW) is critical and may hinder onboarding and lead to dropouts [67,74].
Answer to RQ2: Most CONTRIBUTING files focus on the final stages of the contribution process. Categories containing information about how to submit the changes and deal with the code are the most frequent, while information about choosing a task and contacting the community is often missing.

DISCUSSION
Lack of essential information for newcomers. In our study, we noticed that many projects do not provide basic information that new contributors may need when attempting to contribute. This was highlighted in previous literature [71,74] and evidenced in our analysis of the number of information categories covered per project in Figure 8 and Table 2. Most projects covered at most 3 of the 6 categories in their CONTRIBUTING files. This suggests that OSS projects might not satisfy newcomers' documentation needs when considering the categories defined by the literature. Some of the most critical barriers faced by newcomers [74,78] are not covered by the CONTRIBUTING files: Table 2 shows that only 23% of the files analyzed had information about how to Choose a task, and only 28% presented some information about how to Build the local workspace. The "curse of expertise" [24], i.e., the inherent cognitive bias stemming from deep familiarity with the subject matter, may hamper project maintainers' ability to accurately evaluate the comprehensiveness and clarity of their documentation. Our results can shed light on the gaps in existing documentation from the perspective of barriers commonly faced by newcomers.
A more critical problem is also evidenced in Table 1, which shows that ≈65% of the projects in our sample (more than 6,000 in absolute numbers) do not have a CONTRIBUTING file in their repositories. Although some projects prefer other resources to explain their contribution process (e.g., Valhalla [83] uses a section in its README file), many popular repositories do not contain any orientation for newcomers, even though they are open to external submissions (e.g., Google Sanitizers [32], Microsoft PHPSQL [50], NVIDIA NCCL [51]).
Most files focus on the contribution process's final steps. In Figure 7, we show the average number of paragraphs identified per category for the analyzed projects. The results suggest that the category with the highest number of paragraphs is Submit the changes, followed by Contribution flow and Deal with the code. Although Contribution flow focuses on the more general steps of the process, Submitting the changes and Dealing with the code are intended to be relevant for newcomers in the later stages of their contribution, after they have selected a task, built their workspace, and established communication with the project's community. This result suggests that projects tend to focus more on the last stages of the contribution, assuming newcomers already know how to implement their contribution.
Implications for practice and research. As a result of this study, we implemented a web tool that provides feedback to project maintainers about their CONTRIBUTING files [19]. The maintainer only needs to input their project URL, and the tool reviews the project's CONTRIBUTING file using our classification model. It provides a chart showing the distribution of paragraphs per category of information, a discussion of the dominant categories (i.e., those with the highest number of paragraphs) and weak categories (i.e., those with the lowest number of paragraphs), and a comparison of the input project with other popular repositories on GitHub. In addition, the tool presents a clear description of each category in the report, highlighting why it is important, and suggests CONTRIBUTING files that maintainers could use as inspiration to enrich a specific weak category. We envision the proposed tool as a starting point to support better documentation files. It could be particularly useful to community builders and managers who oversee a non-trivial number of projects: those playing these roles need information about the content of CONTRIBUTING files across multiple projects in the ecosystem to take action, and the classifier may support their efforts by providing insights into the types of information available for each project.
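As an illustration of the tool's structure, a Streamlit report along these lines could look like the following sketch; fetch_and_split_contributing, model, and vectorizer are hypothetical placeholders rather than the tool's actual code, which is in our replication package [20].

```python
import pandas as pd
import streamlit as st

st.title("CONTRIBUTING file analyzer")
url = st.text_input("Project URL")
if url:
    # fetch_and_split_contributing is a hypothetical helper that downloads
    # the file and splits it into paragraphs; model and vectorizer stand in
    # for the trained classifier and TF-IDF vectorizer described earlier.
    paragraphs = fetch_and_split_contributing(url)
    labels = model.predict(vectorizer.transform(paragraphs))
    counts = pd.Series(labels).value_counts()
    st.bar_chart(counts)  # distribution of paragraphs per category
    st.write("Dominant category:", counts.idxmax())
    st.write("Weakest category:", counts.idxmin())
```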
The tool is also an important step toward implementing automated on-demand developer documentation, which automatically parses documentation and generates responses to user queries [66], and smart assistants [16]. These tools need to parse existing documentation and classify information in order to provide adequate assistance to newcomers.
From the research perspective, our study helps to understand how the current content of CONTRIBUTING files addresses newcomers' needs. Our work can be extended to evaluate the content quality of CONTRIBUTING files, which may help newcomers find appropriate documentation. Future work can also investigate the subcategories of Steinmacher et al.'s model [73].

LIMITATIONS AND DESIGN DECISIONS
In this section, we present our work's limitations and the trade-offs behind our research design decisions.
Using the most popular projects from GitHub. We focused our study on GitHub, and the results may not generalize to the whole OSS universe. Nevertheless, GitHub is arguably the most popular OSS hosting platform. Additionally, the selected projects may not be representative of GitHub as a whole, since they were selected based on programming language and popularity. Still, there may be projects in our sample that are not exactly software projects, like "algorithms" and "awesome lists"; in a manual analysis, these projects corresponded to ≈1% of our sample. We acknowledge that a more diverse set of projects would potentially bring more data points with different styles. However, focusing on more popular projects and on GitHub brings more confidence in the relevance of the CONTRIBUTING files analyzed. We opted to keep a more trustworthy set of projects rather than expanding the data points.
Unit of analysis. Our approaches to selecting, filtering, analyzing, and classifying documentation files were based on prior studies [8,26,61]. Still, our decision to choose paragraphs as the unit of analysis, instead of lines or larger chunks of text, could impact our results. We attempted to use lines as units of analysis, but they did not provide enough context to identify the categories during manual analysis, since the information in CONTRIBUTING files is highly contextual. Paragraphs provide enough context for identifying the categories, and Markdown provides a standard approach to splitting the content into paragraphs (i.e., blank lines).
Other approaches to determining the content of CONTRIBUTING files could also be used. For example, a classification model based on section names could be an alternative to our approach. We decided to base our classification on paragraphs and not on section names for the same reasons that we did not use lines as units of analysis. We also decided to keep duplicated paragraphs from distinct projects in our dataset, as CONTRIBUTING files from different projects may follow similar guidelines. We ran LinearSVC without the duplicated paragraphs, and the performance was similar to our final model (precision: 0.612, recall: 0.621).
It is also worth mentioning that we only analyzed the content available in the CONTRIBUTING files. We did not explore any external links from these files or resources they reference; we also did not check textual, HTML, .rst, or other types of files containing contribution guidelines. We analyzed 95 README files from (randomly selected) projects that we dismissed because of the absence of a CONTRIBUTING file; only three had links to external guidelines, and six had sections related to contribution. This could have limited the conclusions made for projects such as Apple Swift [3], and others highlighted in Section 7, whose contribution file only contained directions to other sources of documentation. Future studies are encouraged to (i) analyze one level of depth using the links available in the CONTRIBUTING files, and (ii) understand how to use the proposed approach to refactor contribution guidelines contained in README and other textual files into CONTRIBUTING files.
Representativeness of contributors' perspective. To assess the quality of our classification model, we invited participants with programming expertise to judge a set of predictions made by our classifier. Although we introduced a tutorial at the beginning of the questionnaire, we cannot guarantee that the answers given by the respondents represent the perspective of contributors or the correctness of the predicted categories. To mitigate this problem, we asked the participants to match the category definitions with their names in the early stages of the survey and also included an attention check question in our set of questions to ensure participants did not assign answers randomly. Once again, our choice was guided by the trustworthiness of the data points: we kept only a small set of answers, which can be considered more reliable than having more data points and losing reliability.
Coverage of the categories and information. We decided to use a pre-existing set of categories to label our dataset according to the barriers newcomers could face. We acknowledge that the categories analyzed may not cover all the information a newcomer may need when contributing to a project. However, the set of categories resulted from several studies investigating problems associated with documentation files in the context of OSS repositories [76,77]. We opted to classify our data using a validated set of categories rather than explore potential new categories.
Construction of the classifier. To build a classification model from scratch, a set of design decisions had to be made throughout the process. We understand that other strategies could have been adopted in building our model (e.g., the use of additional pre-trained models) and that the decisions made may have influenced the performance reported in this study. To mitigate this issue, we compared our classifier with chatGPT in Section 6.1 and trained a supervised model on the same dataset using FastText [21] (precision: 0.653, recall: 0.653). Both strategies presented similar or worse performance than our final classifier. We also undertook an ablation study to determine the impact of the heuristic and statistical features. Two models were constructed using the same configuration as our final classifier: one solely with statistical features (precision: 0.657, recall: 0.664) and the other with only heuristic features (precision: 0.414, recall: 0.493). Both models exhibited performance comparable to, or worse than, our final estimator.

CONCLUSION
A primary documentation resource for newcomers embarking on open source software projects is the CONTRIBUTING file. Located within repositories, these files outline the project's contribution guidelines. While many OSS communities utilize CONTRIBUTING files to orient newcomers, the comprehensiveness of their content was largely unexplored.
In this paper, we investigate the extent to which CONTRIBUTING files address the onboarding barriers newcomers face in OSS projects. Drawing upon a barrier model from the existing literature [74], we manually analyzed CONTRIBUTING files from 500 projects. Our findings indicate a notable lack of information: 90% of the projects lacked content in at least two of the six information categories, with 79% missing details in three or more categories. Notably, our manual review revealed that over 75% of the projects failed to include guidance on task selection and workspace setup, two key barriers for newcomers as highlighted by Steinmacher et al. [74].
We also built a machine learning model designed to automatically classify the information in CONTRIBUTING files from other projects and thereby help projects identify missing information in their files. Overall, the classifier performed well in this multi-class problem, with an overall precision of 0.655 and a recall of 0.662. The performance was good for four of the six categories of information (F1 ≥ 0.61): Build local workspace, Deal with the code, Talk to the community, and Submit the changes. The exceptions were Choose a task and Contribution flow, with low recall (< 0.3) and F1 scores of 0.379 and 0.345, respectively.
In summary, our findings indicate that many OSS projects need to improve the comprehensiveness of their CONTRIBUTING files to better cater to newcomers. Evaluating 2,274 projects using our machine learning model, our results echoed the findings from our qualitative assessment: 84% of the projects lacked content in at least two of the six information categories, and 52% were deficient in three or more categories. To assist with this issue, we developed an online tool designed to offer feedback to project maintainers about how their CONTRIBUTING files address onboarding challenges, ensuring that communities are better equipped to welcome and nurture their next generation of contributors.

DATA AVAILABILITY
The artifacts used in this paper are available on Zenodo [20].

Figure 2: Distribution of contributors, forks, pull requests, and stars per project considered valid.

Figure 3: Distribution of paragraphs per file.

Figure 5: Confusion matrix for LinearSVC. Legend: BW (Build local workspace), DC (Deal with the code), TC (Talk with the community), SC (Submit the changes), CT (Choose a task), CF (Contribution flow).

Figure 6: Survey: Participants' evaluation of predictions made by the final classification model. Legend: BW (Build local workspace), DC (Deal with the code), TC (Talk with the community), SC (Submit the changes), CT (Choose a task), CF (Contribution flow).

Figure 7: Average number of paragraphs per category in the predicted CONTRIBUTING files. Legend: BW (Build local workspace), DC (Deal with the code), TC (Talk with the community), SC (Submit the changes), CT (Choose a task), CF (Contribution flow), NC (No categories identified).

Figure 8: Distribution of categories per CONTRIBUTING file predicted. The percentages represent the proportion of each category in the respective subset of files.

Table 2: Characterization of the dataset considering the six categories of barriers [74].

Table 3: Examples of heuristic features per category.

Table 4: Experience of survey participants in programming and OSS project maintenance.

Table 5: F1 scores for classifiers tested in the ten-fold cross-validation process.

Table 6: Performance of the final model (LinearSVC).

Table 7: Performance of chatGPT with few-shot learning.