GitHubInclusifier: Finding and fixing non-inclusive language in GitHub Repositories

Non-inclusive language in software artefacts has been recognised as a serious problem. We describe a tool to find and fix non-inclusive language in a variety of GitHub repository artefacts. These include various README files, PDFs, code comments, and code. A wide variety of non-inclusive language, including racist, ageist, ableist and violent terms, is located and issues are created, tagging the artefacts for checking. Suggested fixes can be generated using third-party LLM APIs, and approved changes, including code refactorings, can be made to documents and committed to the repository. The tool and evaluation data are available from: https://github.com/LiamTodd/github-inclusifier. The demo video is available at: https://www.youtube.com/watch?v=1z1QKdQg-nM


INTRODUCTION
Concern has been growing about various biases and non-inclusive practices in software engineering and engineered software, relating to gender, race, age, neurodiversity and other characteristics [2,3,8,10,11,12,15]. Related to this, there has been increasing concern about non-inclusive language in software artefacts. Several guidelines have been developed to help developers address it [1,5,9], as have bots for 'tone policing' [6]. One approach is to apply style transfer to non-inclusive language to 'inclusify' it [13,14]. However, finding and fixing biased language in documentation, code and code comments in GitHub repositories is very challenging [4,7,16].
We present GitHubInclusifier, a prototype tool designed to discourage the use of non-inclusive language in software repositories, in both non-code and code artefacts, and to provide its users with means of replacing such language with more inclusive alternatives. GitHubInclusifier is aimed at software developers who work collaboratively on software projects using GitHub. When linked to a GitHub repository, GitHubInclusifier analyses the language used within it, in both non-code and code artefacts, and extracts the details of instances of non-inclusive language. Through a number of user-friendly interfaces, these instances are highlighted to the user, who is presented with tools for altering the language choices in the repository. GitHubInclusifier leverages third-party large language models (LLMs) to suggest alternative text, and implements a refactoring tool that allows users to efficiently alter non-inclusive language choices in code files and safely push these changes directly to the source repository.
Figure 1 shows examples of non-inclusive language in documentation and code files. Ideally, when files with non-inclusive language are committed, such instances could be detected and changes suggested and made. Making such inclusifying language style transfer changes can be challenging. In example (a), changing 'cripple' to e.g. 'severely impact' would remove ableist language, though care needs to be taken not to change the overall meaning. If the method comments in (b) and (c) are changed, care is needed that the meaning is not changed and also that, e.g. if a comment refers to a method or variable name, the reference is not broken. Changing e.g. the method name in (c) means code refactoring is needed not only in the target code file but across the whole application. Finally, if a non-inclusive term in the documentation refers to a code element, all instances need changing in other documentation and the code needs refactoring. The problem becomes even more challenging when composite words contain problematic language, e.g. 'abort_signal'.

MOTIVATION
We wanted to support GitHub repository users in inclusifying language in diverse repository artefacts, including README and related files; PDFs; code comments; code elements (class, method, variable, property and other names); and, in future, even images with non-inclusive terms and/or representations. We wanted to provide GitHub users with a user-friendly interface to locate non-inclusive language in a variety of artefacts; summarise the type of non-inclusive language; help them to modify the problematic language by suggesting changes; make the changes to impacted artefacts; and commit the changes to the repository.

OUR APPROACH
GitHubInclusifier is a web application designed to support the use of inclusive language in artefacts hosted in GitHub repositories. GitHubInclusifier's target users are software developers working collaboratively on software projects. It provides them with features for both recognising and rectifying usages of non-inclusive language within these projects.

Inclusification Process
The key features of GitHubInclusifier which enable the recognition of non-inclusive language usage within software projects include:
• Repository-wide search for non-inclusive language terms using whole word and substring pattern matching;
• Automated reporting of non-inclusive language usage as a GitHub issue;
• Repository explorer with non-inclusive language usages visibly indicated;
• Non-inclusive language report on a per-file basis;
• Code-specific non-inclusive language report for Python and Java source code files on a per-file basis, as well as a repository-wide basis;
• Export of non-inclusive language reports for findings by both word-boundary and substring pattern matching algorithms, on a repository-wide basis.
The key features of GitHubInclusifier which permit software developers to rectify the usage of non-inclusive language are:
• Non-inclusive to inclusive language update suggestion by the user's choice of LLM;
• Code refactoring feature for renaming variable, function, and class declarations and usages in Python source code on a repository-wide basis;
• Automated commit and pull request creation after refactoring.

GitHubInclusifier Implementation
GitHubInclusifier is implemented as a client-server architecture, consisting of a front-end built with React and a back-end built using Django, outlined in Figure 2 (i). The front-end provides an interactive interface for users to link a GitHub repository to GitHubInclusifier, explore the repository's file tree, view and export non-inclusive language reports, generate suggested changes, and efficiently refactor the code via a simple form. The back-end handles interaction with GitHub's API and third-party LLM APIs, performs non-inclusive language analyses, and carries out the refactors orchestrated by the user. For simplicity and low overhead, we chose to use two fast, light-weight algorithms developed in a previous large-scale GitHub analysis project. The first is a simple word-boundary pattern matching (WBPM) algorithm that searches for whole-word non-inclusive words and short phrases using a database of over 100 non-inclusive language phrases and words, e.g. 'cripple' in Figure 1. The second is a less precise sub-string pattern matching (SSPM) algorithm that detects potential non-inclusive sub-strings that are part of whole words, e.g. 'abort_signal' in Figure 1.
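The difference between the two matching strategies can be illustrated with a minimal sketch. This is not the tool's implementation: the three-entry `TERMS` dictionary and the function names `wbpm` and `sspm` are assumptions made for illustration, standing in for the database of over 100 terms.

```python
import re

# Hypothetical mini term list (assumption); the real database has 100+ entries.
TERMS = {"cripple": "ableist", "abort": "violent", "hang": "violent"}


def wbpm(text, terms=TERMS):
    """Word-boundary matching: report terms that occur as whole words."""
    hits = []
    for term, category in terms.items():
        pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
        hits += [(term, category, m.start()) for m in pattern.finditer(text)]
    return sorted(hits, key=lambda h: h[2])


def sspm(text, terms=TERMS):
    """Substring matching: report terms anywhere, even inside longer words."""
    hits = []
    lowered = text.lower()
    for term, category in terms.items():
        start = lowered.find(term)
        while start != -1:
            hits.append((term, category, start))
            start = lowered.find(term, start + 1)
    return sorted(hits, key=lambda h: h[2])


sample = "This will cripple performance. See abort_signal for changes."
print(wbpm(sample))  # only the whole word 'cripple'
print(sspm(sample))  # also 'abort' in abort_signal and 'hang' in changes
```

Because the underscore counts as a word character in Python regular expressions, the whole-word pattern does not fire on 'abort_signal'; the substring search does, but it also flags 'hang' inside 'changes', which is the kind of false positive discussed in the evaluation.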
Interaction with GitHub's API occurs via the PyGithub library. Upon linking a repository to GitHubInclusifier, the repository is cloned, and an issue is raised on the repository reporting the non-inclusive language usages. Following any refactors decided by the user, the back-end commits the changes to a new branch and raises a pull request which details the changes made. The back-end also handles interaction with third-party LLM APIs to generate suggestions for passages featuring non-inclusive language. As a proof of concept, a simple API was implemented to expose a relatively small LLM (alpaca-lora-7b) to generate suggestions. GitHubInclusifier has been built to be highly extensible, so that it can be used with any number of third-party LLMs offering web APIs.
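A rough sketch of the issue and pull request steps with PyGithub is given below. The function names, issue title, table layout and the `findings` tuple format are all assumptions for illustration and are not taken from the tool. The sketch also assumes the refactored branch has already been pushed and that a personal access token is available.

```python
def format_issue_body(findings):
    """Build a Markdown table from (path, term, category, count) tuples."""
    lines = [
        "| File | Term | Category | Occurrences |",
        "| --- | --- | --- | --- |",
    ]
    for path, term, category, count in findings:
        lines.append(f"| {path} | {term} | {category} | {count} |")
    return "\n".join(lines)


def _get_repo(token, repo_name):
    # PyGithub is imported lazily so the formatting helper works without it.
    from github import Auth, Github

    return Github(auth=Auth.Token(token)).get_repo(repo_name)


def raise_report_issue(token, repo_name, findings):
    """Raise an issue listing the findings (hypothetical title and layout)."""
    repo = _get_repo(token, repo_name)
    return repo.create_issue(
        title="Non-inclusive language report",
        body=format_issue_body(findings),
    )


def open_refactor_pull_request(token, repo_name, branch, summary, base="main"):
    """Open a pull request from an already-pushed branch holding the refactor."""
    repo = _get_repo(token, repo_name)
    return repo.create_pull(
        title="Replace non-inclusive identifiers",
        body=summary,
        head=branch,
        base=base,
    )
```

Keeping the report formatting separate from the network calls means the issue text can be previewed or unit-tested without contacting GitHub.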
The back-end implements the refactorings requested by the user, as outlined in Figure 2 (ii). This is done by generating an abstract syntax tree (AST) of the target Python file, selectively renaming nodes, unparsing the modified AST into a new string of code, and writing it back to the file. As unparsing the syntax tree strips the file of any formatting, the autopep8 formatter is used to reformat the code and check it for syntax errors. In some instances, syntax errors will occur due to the ambiguous nature of Python imports, whereby the type of an imported object is not known. In these cases, the file's content is left untouched, and a cautionary message is written in the pull request's description.
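The parse, rename and unparse cycle can be sketched with Python's standard `ast` module (Python 3.9 or later for `ast.unparse`). The `Renamer` class and `rename_identifier` function are hypothetical names, and the built-in `compile` call stands in for the autopep8 formatting and syntax check described above; this is an illustration of the general technique, not the tool's code.

```python
import ast


class Renamer(ast.NodeTransformer):
    """Rename declarations and usages of one identifier within a module."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def _fix(self, name):
        return self.new if name == self.old else name

    def visit_Name(self, node):  # variable reads and writes
        node.id = self._fix(node.id)
        return node

    def visit_arg(self, node):  # function parameters
        node.arg = self._fix(node.arg)
        self.generic_visit(node)
        return node

    def visit_FunctionDef(self, node):  # function and method declarations
        node.name = self._fix(node.name)
        self.generic_visit(node)
        return node

    def visit_ClassDef(self, node):  # class declarations
        node.name = self._fix(node.name)
        self.generic_visit(node)
        return node

    def visit_Attribute(self, node):  # obj.old becomes obj.new
        node.attr = self._fix(node.attr)
        self.generic_visit(node)
        return node


def rename_identifier(source, old, new):
    """Return source with every occurrence of the identifier renamed."""
    tree = Renamer(old, new).visit(ast.parse(source))
    new_source = ast.unparse(tree)  # comments and original layout are lost
    compile(new_source, "<refactored>", "exec")  # raises SyntaxError if invalid
    return new_source


code = "def kill_job(abort_signal):\n    return abort_signal + 1\n"
print(rename_identifier(code, "abort_signal", "stop_signal"))
```

Note that this sketch renames every attribute with a matching name regardless of the object's type, so it can over-rename; resolving which object an imported name refers to is exactly where the ambiguity mentioned above arises.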

USAGE EXAMPLE
Consider John, a conscientious software developer with a strong commitment to inclusivity and diversity. He understands that using inclusive language is a vital aspect of creating a welcoming and inclusive environment. John is currently collaborating on a software project with a diverse team of developers and contributors from different walks of life. He is motivated to make the artefacts in the GitHub repository more inclusive, as he is committed to creating a virtual workspace where everyone feels valued and respected.
As a first step, John intends to find all instances of non-inclusive language within the GitHub repository, a task that would be incredibly time-consuming and daunting if done manually. Sifting through every line of code and documentation, searching for non-inclusive language, and suggesting replacements is impractical, given that the repository contains over 10,000 lines of code and documentation. Thus, John decides to leverage GitHubInclusifier. Upon linking the repository, GitHubInclusifier carries out comprehensive checks for non-inclusive language on the repository's artefacts, including code files and documentation.
In addition to locating usages of non-inclusive language, GitHubInclusifier also automates the process of reporting these usages as a GitHub issue, as shown in Figure 3 (A). The issue is visible to all of the repository's collaborators as a set of actionable items. This streamlines communication within John's team and makes the necessary changes more visible and manageable. For each file in the repository containing non-inclusive language, GitHubInclusifier provides a detailed report, highlighting instances and details of such occurrences, as in Figure 3 (B). In the repository's README file, the word-boundary pattern matching (WBPM) algorithm finds the racially charged terms 'black-box' and 'native' twice each and the ableist term 'cripple' once. The violent term 'hang' was identified once by the sub-string pattern matching (SSPM) algorithm; however, this was a false positive, as it occurred in the word 'changes'. GitHubInclusifier goes beyond simply pointing out instances of non-inclusive language: it offers John suggestions of alternative wordings, using a large language model (LLM) of his choice, which is especially useful for altering documentation files. Using the llama-7b LLM, John obtains suggestions for how to reword the non-inclusive language instances found in the README file, as illustrated in Figure 3 (C). For code files, GitHubInclusifier offers a code-specific report, allowing John to pinpoint usages of non-inclusive language in a way that is tailored to the specific programming language. Within a number of code files, a function whose name contains the violent term 'terminate' is located, along with a variable named 'abort_signal', which also contains violent language. Using GitHubInclusifier's code refactoring feature, as shown in Figure 4 (A), John asks to automatically remove these non-inclusive language instances, without needing to manually read or edit the code himself. GitHubInclusifier uses a third-party Python refactoring tool to do this refactoring, and generates a commit and a pull request with a detailed report of the changes (see Figures 4 (B, C)). This level of automation ensures that the inclusive language improvements are integrated into the project, and that the problems identified within the repository can be rectified with minimal effort on John's part.

PRELIMINARY EVALUATION
GitHubInclusifier was evaluated by linking it to clones of four popular open source GitHub repositories (termux, bitcoin-wallet, brave and pixel dungeon), each with over three thousand stars, to perform a non-inclusive language analysis on the artefacts in each. The goal was to see how many instances of non-inclusive language GitHubInclusifier identified in the repositories, to determine whether using GitHubInclusifier could help make a marked difference to each one through its suggestion and refactoring features. Overall, GitHubInclusifier flagged 451 suspected occurrences of non-inclusive language using the WBPM algorithm and 3,283 using the SSPM algorithm across the four repositories. The type of non-inclusive language most commonly identified by WBPM was biased language (200 occurrences), followed by racially charged (114 occurrences), violent (94 occurrences), and ableist (41 occurrences) language (see Figure 5 (i)). The terms most commonly identified by WBPM were 'normal', 'disabled', 'master', 'special', and 'kill' (see Figure 5 (ii)). We manually checked these WBPM results and all were true positives. Figure 5 (iii) shows a breakdown of the non-inclusive terms found by the whole word matching approach in each of the four repositories. Figure ?? shows the number of non-inclusive words found by whole word matching per file (only the top numbers of occurrences are shown, for the WBPM technique). There is some variation in the artefacts with potential non-inclusive language, e.g. many .java files in termux and pixel dungeon, but README, CONTRIBUTING, CHANGES etc. in bitcoin wallet and brave.
The same categories of non-inclusive language were most often found by SSPM too, albeit in a different order (2,018, 636, 351, and 177 occurrences of violent, biased, racially charged, and ableist language, respectively). The terms 'hang', 'kill', 'normal', 'special', and 'hit' were most frequently found by the SSPM algorithm. However, the very high frequencies of some of these terms arise because they are substrings of other commonly used terms that are not non-inclusive, such as 'hang' being a substring of 'change'. Thus in practice the SSPM approach flags many false positives. However, it does pick up many examples of non-inclusive language in composite code names, such as those shown in Figure 1. Further refinement of our SSPM algorithm could reduce false positives, e.g. by looking only at capitalised words in class, method and variable names, or at underscore delimiters, at the risk of missing some true positive non-inclusive language.
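One way such a refinement might look is sketched below: identifiers are split on underscores and capitalisation boundaries, and a term is flagged only when it equals a whole component. The `TERMS` set and both function names are assumptions for illustration; this is not part of the evaluated tool.

```python
import re

TERMS = {"abort", "hang", "kill"}  # hypothetical subset of the term list


def split_identifier(name):
    """Split snake_case, camelCase and PascalCase names into lowercase parts."""
    parts = []
    for chunk in re.split(r"[_\W]+", name):
        # Acronym runs, capitalised or lowercase words, and digit runs.
        parts += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk)
    return [p.lower() for p in parts]


def flagged_terms(name, terms=TERMS):
    """Return the terms that equal a whole component of the identifier."""
    return [p for p in split_identifier(name) if p in terms]


for identifier in ("abort_signal", "killProcess", "changes"):
    print(identifier, flagged_terms(identifier))
```

Under this scheme 'abort_signal' and 'killProcess' are still flagged while 'changes' is not. A run-together name with no delimiter or capitalisation would no longer be caught, which is the trade-off against missed true positives noted above.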
We evaluated our Python refactoring tool on the above four Python programs hosted in GitHub repositories, as well as GitHubInclusifier's own repository. When the user accepted an offered code refactoring change to class, method and property names, GitHubInclusifier successfully applied it to all declaration and usage instances in the whole program and created a repository commit to reflect these changes. However, it currently requires a separate user request to update code comments and related documentation artefacts to reflect the code refactorings made.

SUMMARY
GitHubInclusifier is a proof-of-concept tool to aid in detecting and correcting the use of non-inclusive language in a range of GitHub repository artefacts, including code. The high incidence of non-inclusive language within the evaluated repositories implies that GitHubInclusifier could play a valuable role in making these virtual workplaces more welcoming, respectful and inclusive to individuals from different walks of life. Our future work includes more precise location of non-inclusive language, support for other language refactoring tools, making code and documentation updates in a single user request, refinement of LLM prompts to improve suggestions for replacing non-inclusive language, and support for analysing further repository artefacts.
Acknowledgements: Many thanks to Christian Marchetta and Mohak Malhotra, who implemented a preliminary inclusive language README file checking algorithm in their 2022 FIT4003 project. Grundy and Todd were supported by ARC Laureate Fellowship FL190100035.

Figure 1: (a) Non-inclusive ableist language in README.md file; (b) violent language method name and comment; (c) violent language method name, comment and code.

Figure 5: (a) # potential non-inclusive language found WBPM vs SSPM techniques; (b-e) most common non-inclusive words per repo (WBPM); (f) most common non-inclusive terms overall (WBPM); (g-j) # non-inclusive terms per file per repo (WBPM).