Refactoring in Computational Notebooks

Due to the exploratory nature of computational notebook development, a notebook can be extensively evolved even though it is small, potentially incurring substantial technical debt. Indeed, in interview studies notebook authors have attested to performing ongoing tidying and big cleanups. However, many notebook authors are not trained as software developers, and environments like JupyterLab possess few features to aid notebook maintenance. As software refactoring is traditionally a critical tool for reducing technical debt, we sought to better understand the unique and growing ecology of computational notebooks by investigating the refactoring of public Jupyter notebooks. We randomly selected 15,000 Jupyter notebooks hosted on GitHub and studied 200 with meaningful commit histories. We found that notebook authors do refactor, favoring a few basic classic refactorings as well as those involving the notebook cell construct. Those with a computing background refactored differently than others, but not more so. Exploration-focused notebooks had a unique refactoring profile compared to more exposition-focused notebooks. Authors more often refactored their code as they went along, rather than deferring maintenance to big cleanups. These findings point to refactoring being intrinsic to notebook development.

the Eclipse IDE, they drew a comprehensive picture of how software developers refactor. They found that traditional developers refactor frequently, favoring refactoring as they develop over concentrated overhauls. They also found that despite the availability of numerous refactoring operations in the IDE, developers overwhelmingly chose to refactor by hand, even experts who were well-versed in the refactoring tools. Finally, they found that Rename was by far the most applied refactoring, and that the vast majority of applications comprise just a few refactoring operations (Rename, Extract Local Variable, Inline Method, Extract Method, Move, and Change Method Signature), with a long tail of lightly used refactorings.
An important take-away from Murphy-Hill et al. is that, since traditional developers overwhelmingly choose to refactor by hand, the lack of refactoring tools in Jupyter is not necessarily a deterrent to refactoring by notebook authors.

End-user Programmers
Little is known about whether or how end-user programmers or exploratory programmers refactor. Stolee and Elbaum found that Yahoo! Pipes programs would benefit from refactoring and developed a tool to automatically detect and refactor "smelly" pipe programs [32]. In a lab study they found that pipe developers generally preferred refactored pipes, although there was a preference for seeing all the pipes at once rather than abstracting portions away into a separate pipe. Badame and Dig similarly found that spreadsheets would benefit from refactoring and developed a refactoring tool for Excel spreadsheets in which spreadsheet authors could select and apply refactorings to selected cells [1]. As in the pipes study, a lab study found that participants preferred the refactored spreadsheets, but also preferred seeing all the cells rather than abstracting away details. Refactorings with their tool were completed more quickly and accurately than by hand. These studies tell us that end-user programmers want a level of control over refactoring to achieve their preferred results. However, they do not tell us whether end-user programmers refactor in the wild.

Computational Notebook Authors
Rule et al. examined over 190,000 GitHub repositories containing Jupyter notebooks, finding that the average computational notebook was short (85 lines) and that 43.9% could not be run simply by running all the cells in the notebook top-to-bottom [31]. They also found that most notebooks do not stand alone, as their repositories contain other notebooks or instructions (e.g., a README file). Sampling 897 of these repositories (over 6,000 notebooks), Koenzen et al. found that code clones (duplicates) within and between notebooks of a GitHub repository are common, with an average of 7.6% of cells representing duplicates [20]. These results suggest, at least for notebooks on GitHub, that while most notebooks are small, they are not trivial computational artifacts.
Interview studies have shed additional light on the characteristics of the development of computational notebooks. Chattopadhyay et al. interviewed and surveyed data scientists regarding the challenges they encountered in notebook development [6]. Among many concerns, their respondents reported that refactoring was very important, yet poorly supported by their notebook environments. The study does not report on their actual coding practices. Because we sampled from public GitHub notebooks, our study reports on the evolution of notebooks not only from data scientists, who often have significant training in computer programming, but also from authors from a variety of other backgrounds. Rule et al. interviewed what they called notebook analysts (i.e., authors of Exploratory Analysis notebooks), many of whom said they need to share their results with others and sometimes wish to document the results of their work for their own use [31]. In this regard, notebooks are valuable in enabling a narrative presentation of code, text, and media such as graphs. However, notebooks are developed in an exploratory fashion, resulting in the introduction of numerous computational alternatives, often as distinct cells or the iterative tweaking of parameters. Many analysts attested that after their exploratory analysis is complete, their notebooks are often reorganized and annotated in a cleanup phase for their final intended use. Often a notebook is used as source material for a new, clean notebook or a PowerPoint presentation.
In Kery et al.'s interviews, analysts said that code and other elements may be introduced anywhere in a notebook as a matter of preference, but that exploratory alternatives are often introduced adjacent to the original computation [18]. The resulting "messy" notebook is "cleaned" up by cells being deleted, moved, combined, or split, or by defining new functions. Such changes often occur as incremental "tidying" during the exploration phase when the messiness is found to interfere with (slow down) ongoing exploration. When notebooks grow too large for Jupyter's simple IDE support, computations are sometimes moved off into new notebooks or libraries. Building on these findings, Kery et al. developed and evaluated tools for fine-grained tracking and management of code versions within a notebook [16,17].
These interview studies make it clear that notebook analysts feel maintenance is important, because exploratory development creates technical debt that both interferes with productivity and runs counter to the needs of sharing and presentation. Our prestudy identified four other notebook genres, raising the question of whether their notebooks generate similar needs.
As interview studies, however, they establish what a number of notebook authors think they do, serving as hypotheses about what their colleagues really do. It is also unclear whether incremental tidying and concentrated cleanup phases during notebook development are achieved by what would be recognized as refactoring, that is "changing the structure of a program without changing the way it behaves" [26]. If so, then there remain questions of which refactoring operations are favored and how those vary according to notebook genre or author background. The present study provides concrete evidence to evaluate the above hypotheses relating to notebook maintenance.

Refactoring Support for Jupyter
Refactoring support is minimal in the popular Jupyter Notebook and JupyterLab environments. Both support only basic cell-manipulation operations: Split Cell, Merge Cell Up/Down, and Move Cell Up/Down. The refactoring operations most frequently observed by Murphy-Hill et al. are absent: Rename, Extract Local Variable, Inline Method, and Extract Method [26]. As reported by Chattopadhyay et al. (and discussed in the previous subsection), data scientists, at least, found such omissions to be problematic [6].
Since 2020, after we sampled the notebooks for this study, it has been possible to add Rename to JupyterLab by installing the JupyterLab-LSP extension [22]. More recently, in 2021, JetBrains released a third-party front-end for Jupyter, called DataSpell, that provides many of the same refactoring operations as its PyCharm IDE for traditional Python development, including those cited above as missing in the Jupyter environments. It also provides the next five most frequently observed by Murphy-Hill et al.: Move, Change Method Signature, Convert Local to Field, Introduce Parameter, and Extract Constant. The present study provides insight on the possible usefulness of the refactoring support provided by these notebook environments.
Renaming is an interesting case. To rename a variable in a notebook by hand, one must find and replace each usage. In a complex notebook, it is easy to miss a usage of an identifier. In contrast, developers using a traditional IDE can expect instantaneous feedback when a variable is used without being defined. Even without IDE support, they will become aware of it at runtime. However, in the notebook-kernel execution model, this feedback can be even more delayed, because renaming an identifier amounts to introducing a new identifier. That is, after renaming, there are two potential problems: The old identifier (name) remains bound to its value in the runtime kernel, and the new identifier is uninitialized. Regarding the former, a cell mistakenly referring to the old identifier would access this old value, perhaps disguising the error. Regarding the latter, the notebook author must re-execute the cells upstream of the defining cell to complete the rename. If the notebook author were using the nbsafety custom kernel, then it would suggest what cells to re-execute (when a cell dependent on the rename is executed) [23]. The expectation, however, is that a notebook's refactoring operations would be behavior-preserving, and thus take the dynamic state of the runtime kernel into account.
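The two rename hazards described above can be illustrated with a minimal sketch. It does not use Jupyter itself: as an assumption for illustration, the kernel is simulated by a plain dictionary that serves as the shared namespace for `exec`, and the cell contents and identifier names (`readings`, `samples`) are hypothetical.

```python
# Simulate a notebook kernel: every "cell" runs against one shared namespace.
# (Assumption: a dict plus exec() stands in for a real Jupyter kernel.)
kernel = {}


def run_cell(source):
    """Execute one cell's source in the shared kernel namespace."""
    exec(source, kernel)


# Original notebook: cell 1 defines `readings`, cell 2 uses it.
run_cell("readings = [3, 1, 2]")
run_cell("total = sum(readings)")

# The author renames `readings` to `samples` in cell 1 and edits the data,
# but misses the usage in cell 2.
cell_1_renamed = "samples = [30, 10, 20]"
cell_2_missed = "total = sum(readings)"

# Hazard 1: the new identifier is unbound until cell 1 is re-executed.
new_name_bound_before = "samples" in kernel

run_cell(cell_1_renamed)

# Hazard 2: the old identifier is still bound in the kernel, so the missed
# usage runs without a NameError and silently reads the stale value.
run_cell(cell_2_missed)
stale_total = kernel["total"]

# In a fresh namespace (comparable to restarting the kernel and running all
# cells), the missed usage finally surfaces as an error.
fresh_kernel = {}
exec(cell_1_renamed, fresh_kernel)
try:
    exec(cell_2_missed, fresh_kernel)
    error_in_fresh_kernel = None
except NameError as err:
    error_in_fresh_kernel = err
```

In the long-running namespace the missed usage computes a total from the old data, whereas the fresh namespace raises a `NameError`. This is the delayed feedback that a behavior-preserving Rename operation for notebooks would need to account for.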
Head et al. developed a tool for gathering up and reorganizing "messy" notebooks [15]. The tool identifies all the cells involved in producing a chosen result, which can then be reordered into a linear execution order and moved to a new location in the notebook or a different (e.g., new) notebook. Participants in a user study attested that they preferred cleaning their notebooks by copying cells into a new notebook, and that the tool helped them do so, among several other tasks.

STUDY DESIGN
A challenge for our study was how to gain access to the histories of a sizable number of computational notebooks. As one example, data from an instrumented notebook development environment would be attractive, but Jupyter is not instrumented to capture the detailed edit histories from which manual refactorings could be detected, nor was it feasible for us to instrument Jupyter for distribution to a sizable number of notebook authors for an extended period of time. In this regard, the GitHub commit histories of public Jupyter notebooks provided an attractive opportunity. One limitation is that it is not possible to detect refactorings that occur in the same commit in which the refactored code is introduced. This underreporting pessimizes our results to a degree, and our results should be interpreted in this context. Limitations and threats related to the use of source code control and GitHub are detailed at the beginning of Section 6.
We employed a multi-stage process of automated extraction and analysis, as well as visual inspection, to achieve depth, breadth, and accuracy for the following research questions:
RQ-RF: How much and with what operations do computational notebook authors refactor?
RQ-GR: How does refactoring use vary by computational notebook genre?
RQ-BG: How does refactoring use vary according to computational notebook author background? In particular, do those with a CS background refactor differently than others?
RQ-CC: What is the extent of tidying vs. cleanups of code?

Determining a Catalog of Notebook Genres & Refactorings
To determine the viability of the study and fix a stable catalog of refactoring operations throughout a large-scale analysis, we performed a prestudy.
In June 2019, we downloaded Rule et al.'s dataset of 1,000 repositories [31], containing approximately 6,000 notebooks, and randomly sampled 1,000 notebooks. Using notebook metadata downloaded via the GitHub API, we filtered out inappropriate notebooks: those generated by checkpointing (automated backup), lacking 10 commits with content changes, and completed fill-in-the-blank homework assignments. A commit can have no notebook content changes when only notebook metadata has changed, e.g., a cell's execution count. Completed fill-in-the-blank homework assignments are uninteresting, because they were designed to not require evolution. From the remaining notebooks, we selected the 50 notebooks with the most evolutionary development in terms of the number and size of their commits.
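The two automated filters described above can be sketched as follows. This is a minimal illustration and not the scripts used in the study: the helper names (`is_checkpoint`, `content_signature`, `has_content_change`, `count_content_commits`) are hypothetical, and the sketch assumes that "content" means each cell's type and source text, with outputs, execution counts, and metadata ignored.

```python
import json


def is_checkpoint(path):
    """Jupyter writes automated backups under a `.ipynb_checkpoints` directory."""
    return ".ipynb_checkpoints" in path.split("/")


def content_signature(notebook_json):
    """Return (cell_type, source) pairs; outputs and metadata are ignored."""
    notebook = json.loads(notebook_json)
    signature = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        # The notebook format stores `source` as a string or a list of strings.
        if isinstance(source, list):
            source = "".join(source)
        signature.append((cell.get("cell_type"), source))
    return signature


def has_content_change(before_json, after_json):
    """True if two notebook versions differ in cell content, not just metadata."""
    return content_signature(before_json) != content_signature(after_json)


def count_content_commits(versions):
    """Count successive versions that change content relative to their predecessor."""
    return sum(has_content_change(a, b) for a, b in zip(versions, versions[1:]))


def make_notebook(execution_count, source_lines):
    """Build a tiny one-cell notebook (hypothetical test data)."""
    cell = {
        "cell_type": "code",
        "execution_count": execution_count,
        "metadata": {},
        "outputs": [],
        "source": source_lines,
    }
    return json.dumps({"cells": [cell], "metadata": {}, "nbformat": 4, "nbformat_minor": 2})


v1 = make_notebook(1, ["x = 1\n", "print(x)"])
v2 = make_notebook(7, ["x = 1\n", "print(x)"])   # only the execution count changed
v3 = make_notebook(8, ["x = 2\n", "print(x)"])   # the code itself changed
```

Under these assumptions, the commit from `v1` to `v2` would not count toward the 10-commit threshold, while the commit from `v2` to `v3` would.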
We next visually inspected every commit in the 50 notebooks. The visual inspection of a GitHub commit is normally enabled by the display of a "diff" computed by git diff, which provides a concise summary of what has been added, deleted, and changed since the previous commit. Because a Jupyter notebook file is stored in JSON format and includes program data and metadata, we instead used nbdime, a tool for diffing and merging Jupyter notebook files.
Using nbdime, we visually inspected each notebook, recording a commit number and the context for every change, except those that were merely adding code. Among the changes recorded were manipulations of code and cells and changes to function signatures and imports. In the end, 35 of the 50 notebooks contained such notable changes.
To extract a catalog of refactoring operations from these changes, we drew from existing classical refactorings, as well as defining notebook-specific refactorings where necessary. One author inspected all commit diffs and mapped them to refactorings as defined by Fowler [10,11] and refined by Murphy-Hill et al. [26]. The same author analyzed the remaining changes to define notebook-specific refactorings. The resulting mapping was brought to all authors, who evaluated each change's validity as a refactoring as defined by Fowler (a structural change lacking external behavioral effects) and finalized the classification according to structural distinctiveness. In all, we compiled 15 refactorings, as listed on the left of Table 4.
During the above visual inspection, we also determined the purpose, or genre, of each notebook by examining the entire notebook at its first and last commit. Through iterative refinement, we settled on five genres. The Exploratory Analysis genre is the classical use-case for a computational notebook, an attempt to understand a dataset through computational analysis, iteratively applied until an understanding is reached. Such a notebook might support the publication of a research article or making a policy decision. The other genres operate largely in service of this application. The Programming Assignment genre consists of notebooks developed to complete a programming assignment, but not the fill-in-the-blank type. Many are developed as a final project, and thus are relatively open-ended. However, the primary purpose of writing the notebook is to learn a subject like exploratory analysis. Analytical Demonstration notebooks demonstrate an analytical technique, such as a method for filtering out bad data. Technology Demonstration notebooks demonstrate how to use a particular technology, such as TensorFlow. Finally, Educational Material notebooks support the conduct of a course, serving as "literate computing" lecture notes, textbooks, or labs, for example. There are two sub-genres here: notebooks teaching exploratory analysis and those teaching a traditional academic subject, such as chemistry.

Selecting Notebooks for the Study
Targeting a sample of 200 notebooks for analysis, we first randomly sampled 15,000 Jupyter notebooks from the 4,343,212 publicly available on GitHub at the time, October 2019. We then rejected checkpoint files and repositories with fewer than 10 commits, leaving us with 6,694 notebooks. We downloaded these notebooks, allowing us to automatically cull those with fewer than 10 nonempty commits, leaving us with 278. We then employed random sampling until we reached 200 notebooks, not only rejecting notebooks for insufficient evolution as in the prestudy, but also for having the same primary author as a previously selected notebook, as our goal was to identify a diverse set of notebooks.
Encountering two notebooks from an author was likely because GitHub users commonly copy someone else's repository, with all its history, called a clone. Cloning is particularly likely for online textbooks and other educational materials, such as labs, which may have thousands of clones. These repositories can contain hundreds of notebooks, most by the same author, further increasing the chance of sampling two notebooks by the same author. The primary authorship of a notebook was determined by visual inspection, as described below in Section 3.4.
In the end, to reach 200 notebooks, we sampled 273 of the 278, rejecting 66 for insufficient evolution and 7 for repeat authorship. We encountered no repeat (cloned) notebooks. Six of the 7 repeat authors were for Educational Material notebooks. The 7th was a low-probability event: a second notebook from an author who had published only three notebooks, none cloned. The 14 notebooks remaining unclassified after negotiation were classified in a discussion with the third author.
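The final selection stage amounts to sampling without replacement while rejecting unsuitable candidates, as sketched below. In the study both rejection criteria were judged by visual inspection; here they are replaced by hypothetical stand-in fields (`evolved`, `author`) on synthetic candidates, so the sketch shows only the control flow and the bookkeeping.

```python
import random


def select_notebooks(candidates, target, seed=0):
    """Visit candidates in random order until `target` notebooks are accepted.

    A candidate is rejected for insufficient evolution or for sharing a
    primary author with an already selected notebook.
    """
    rng = random.Random(seed)
    order = list(candidates)
    rng.shuffle(order)

    selected, seen_authors = [], set()
    rejected = {"insufficient_evolution": 0, "repeat_author": 0}
    for notebook in order:
        if len(selected) == target:
            break
        if not notebook["evolved"]:          # stand-in for visual inspection
            rejected["insufficient_evolution"] += 1
            continue
        if notebook["author"] in seen_authors:
            rejected["repeat_author"] += 1
            continue
        seen_authors.add(notebook["author"])
        selected.append(notebook)
    return selected, rejected


# Synthetic candidates (hypothetical data, not the study's notebooks).
candidates = [
    {"id": i, "author": f"author-{i % 8}", "evolved": i % 3 != 0}
    for i in range(20)
]
selected, rejected = select_notebooks(candidates, target=5)
sampled = len(selected) + sum(rejected.values())

# Bookkeeping reported in the text: 273 sampled, 66 + 7 rejected, 200 kept.
reported_kept = 273 - 66 - 7
```

The same accounting identity holds for the reported figures: the notebooks sampled equal those kept plus those rejected for either reason.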

Identifying Refactorings
We used nbdime on the 200 selected notebooks to find all the refactorings as identified in the prestudy. Each refactoring was recorded as an entry in a master table, consisting of the unique notebook ID, notebook genre, commit hash, and refactoring operation code. Due to the large number of commit diffs inspected, identifying refactorings was subject to error due to fatigue. Consequently, a second author audited the accuracy of the inspector's identification of refactorings by randomly sampling and reexamining 10% of the notebooks (20) and their commit diffs (345). As the data is segmented according to commit diff, the error rate is calculated as the number of reinspected diffs that contained any kind of coding error (missed refactoring, extra refactoring, or misidentified refactoring) divided by the total number of commit diffs that were reinspected. Using the same Negotiated Agreement protocol employed exhaustively for genre, described immediately below, 34 diffs were determined to contain some kind of error, an error rate of 9.9%. Of those errors, over half were due to missing a refactoring (21 commit diffs). Next were extra refactorings (10 commit diffs), with only 1 commit diff containing a mislabeled refactoring and 2 commit diffs containing multiple of these errors. The most common missed refactoring was Rename (11 commit diffs, including 1 containing multiple different errors), perhaps because the textual difference of a Rename is more visually subtle than others. Still, Rename is ranked as one of the top refactorings in our results.
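The audit arithmetic can be reproduced with a short sketch. The counts are those reported above; the function and the error-kind labels are hypothetical names introduced for illustration. A reinspected diff counts as erroneous if it contains at least one coding error of any kind.

```python
def diff_error_rate(errors_per_diff):
    """Fraction of reinspected diffs that contain at least one coding error.

    `errors_per_diff` holds one list of error labels per reinspected diff;
    an empty list means the diff was coded correctly.
    """
    erroneous = sum(1 for errors in errors_per_diff if errors)
    return erroneous / len(errors_per_diff)


# Reported audit results: 345 reinspected diffs, 34 of them with errors.
reinspected_diffs = 345
diffs_by_error_kind = {"missed": 21, "extra": 10, "mislabeled": 1, "multiple": 2}

diffs_with_error = sum(diffs_by_error_kind.values())
error_rate = diffs_with_error / reinspected_diffs
missed_share = diffs_by_error_kind["missed"] / diffs_with_error

# The same rate via the per-diff representation (labels are illustrative).
audit = [["missed"]] * 21 + [["extra"]] * 10 + [["mislabeled"]] + [["missed", "extra"]] * 2
audit += [[]] * (reinspected_diffs - len(audit))
```

With these inputs the rate rounds to 9.9%, and missed refactorings account for more than half of the erroneous diffs, matching the figures in the text.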

Identifying Genre and Author Background
Due to the complexity of determining a notebook's genre, we applied a coding methodology called Negotiated Agreement [12], which produces 100% agreement through a labor-intensive process. First, two authors made their own passes over all the notebooks. At a minimum, they examined the entirety of each notebook at its first and last commits, as well as examining the repository's README file for clues as to the intended use of the repository. The two authors then discussed each notebook on which they disagreed to attempt to reach a common conclusion. For the remaining notebooks where disagreement or uncertainty persisted, the third author was brought into the discussion for a final decision. Table 1 details the results of each pass.
We also wanted to learn the technical background (primary expertise) of each notebook author, especially whether an author was a computer scientist or not. For this purpose, the categories of background were construed broadly (e.g., Life Sciences). To determine a notebook author's background, one of the authors of this article inspected each notebook and its repository for author information.
The first consideration was to identify the primary author of each notebook. In many cases the owner of a repository is not the notebook author, so it was necessary to inspect the notebook's commit history. In some cases one person originated a notebook and then later someone took it over. In most cases, the originating author wrote the vast majority of the notebook and the authorship was easily attributed to the originating author. In the case of relatively equal contributions, the tie went to the originating author.
The next consideration was actually determining the primary author's background. We decided that the best indication of one's primary expertise is their profession, as indicated by their role or job title, as they are quantifiably expert enough to get paid to use their expertise. An author's educational background was consulted when the job title was vague, such as "Senior Engineer." Students were classified according to their program-level academic affiliation. This process required extensive research. The inspector, the same author who determined each notebook's primary author, started with the author's GitHub profile, and, if necessary, e-mail addresses extracted from commits. The inspector then searched the internet for publicly available data including personal websites, blog posts, company and university websites, and public information provided by LinkedIn (i.e., without use of a login). The inspector was able to determine the backgrounds of 189 of the 200 notebook authors. The results are shown in Table 5a.
To assess the accuracy of the result, a second author audited the inspector's determination of primary author and their background by randomly sampling and reexamining 20% of the notebooks, 40 total. Using the same Negotiated Agreement protocol described above, the error rate was determined to be 5%. The two misclassified backgrounds were for notebook authors with a multidisciplinary job role and education.

Extraction of Notebook Statistics
To perform automated analyses such as the density of commits over time, we mined the Git history of the notebooks and their repositories and then developed Jupyter notebooks to analyze the mined data. Analyses of code employed the RedBaron library [30].

Data Availability
Data relating to our notebook selection process and supporting the results in this and the following sections are available under a CC-BY 4.0 license at https://figshare.com/s/4c5f96bc7d8a8116c271.

RQ-RF How Much: Computational Notebook Authors Refactor, with High Variation.
To understand whether computational notebook authors refactor and how much, we extracted the number of refactorings performed and the number of commits containing those refactorings, per genre. Recall that, with this methodology, it is not possible to detect refactorings that occur in the same commit in which the refactored code is introduced, so the total amount of refactoring is underreported. Table 2 shows that notebook authors refactor, with 13% of commits containing a refactoring (about 1 in 7.7). Those commits contain an average of 1.30 refactorings each. Figure 1 enumerates the frequencies of refactorings within refactoring commits, with a maximum of 6 in a single commit. Over 100 notebooks contain 2 or fewer refactorings. The maximum number of refactorings in a single notebook is 21.

RQ-RF Operations: Computational Notebook Authors Favor a Few Non-object-Oriented Refactorings.
To learn what refactoring operations are employed by notebook authors, we clustered our extracted refactorings by operation type and sorted them by frequency, as shown in the rightmost column in Table 4. Notably, just a few refactoring operations account for most of the applied refactorings. The top four operations account for over 57% of refactorings: Change Function Signature, Extract Function, Reorder Cells, and Rename. Including the next one, Split Cell, reveals that a third of refactoring operations account for over two-thirds of refactorings. Notably, two of these top five refactoring operations (and three of the top six) refactor cells, which are unique to computational notebooks. All of the cell-refactoring operations together comprise 40% of the total number of observed refactorings.

RQ-GR: PAs Appear Exploratory; Exploratory and Expository Genres are Distinct
As discussed in Section 2.3, Exploratory Analysis (EA) notebooks are the classic use-case for computational notebooks. The genres of Educational Materials (EMs), Technology Demonstrations (TDs), and Analytical Demonstrations (ADs) are different in their focus on exposition for a wide audience rather than the exploration of a novel dataset. This raises two questions: (a) Are Programming Assignments (PAs) refactored more like Exploratory Analyses or the exposition-focused genres, and (b) to what extent are these genres distinct from each other when it comes to refactoring? Regarding the first, although Programming Assignments are often exploratory analyses, they are not truly open-ended, novel investigations. Regarding the second, evolution (refactoring) in exposition-oriented notebooks would be expected to be driven more by changes in technology or the goal of communicating clearly to a wide audience, for example, than the effects of exploration, which would be expected to be generally absent. Due to the small number of Analytical Demonstration notebooks, we omit them from the following analysis.
At a first level of analysis, we see similarities in refactoring between the Exploratory Analysis and Programming Assignment genres, and differences with the expository genres. Referring back to Table 2, for one, 16% of their commits contain refactorings, whereas for the others, just 9% do. Two, 13% and 10% of their notebooks, respectively, have no refactorings, whereas the expository genres hover around 30%. Their percentage of refactoring commits that contain only refactorings is well below the other genres. The expository genres exhibit similarities among each other as well, but there are some differences. Of the Technology Demos' refactoring commits, 7.8% were refactoring-only commits, compared to 5.6% for Educational Materials. Likewise, for the Technology Demos, Rename refactoring occurs three times as often and Reorder Cells a quarter as often.
To further explore this possible exploratory-expository genre clustering, we applied statistical tests for both refactoring rate and the profile of selected refactorings. Note that these tests are employed here as descriptive statistics, as we did not have a specific hypothesis at the outset.
To reveal differences in refactoring rates between genres (excluding AD), we computed the average rate of refactoring per commit for each notebook and ran a Kruskal-Wallis H test followed by pairwise post hoc tests [9,21]. We report these results in Table 6 (A). There is strong evidence that the rates of refactoring differ for pairwise combinations that are not between exploratory genres (EA vs. PA) and expository genres (EM vs. TD). Within the like-genre pairings, we find no evidence to suggest a difference in refactoring rates.
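For readers unfamiliar with the test, the Kruskal-Wallis H statistic can be computed from pooled ranks as in the sketch below. This is a plain-Python illustration rather than the analysis code used in the study; the per-notebook rates are hypothetical, and the p-value (obtained by comparing H against a chi-square distribution with one fewer degrees of freedom than the number of groups) is omitted. In practice a statistics library routine such as `scipy.stats.kruskal` would typically be used.

```python
from collections import Counter


def average_ranks(values):
    """Rank values 1..N, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1          # positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H for a list of samples.

    Assumes the pooled values are not all identical (otherwise the tie
    correction is zero and H is undefined).
    """
    pooled = [value for group in groups for value in group]
    n = len(pooled)
    ranks = average_ranks(pooled)

    weighted_rank_sums, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted_rank_sums += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n * (n + 1)) * weighted_rank_sums - 3 * (n + 1)

    tie_sizes = Counter(pooled).values()
    correction = 1 - sum(t ** 3 - t for t in tie_sizes) / (n ** 3 - n)
    return h / correction


# Hypothetical refactorings-per-commit rates for notebooks in three genres.
rates_by_genre = {
    "EA": [0.20, 0.15, 0.30, 0.10],
    "EM": [0.05, 0.00, 0.10, 0.08],
    "TD": [0.02, 0.12, 0.00, 0.07],
}
h_statistic = kruskal_wallis_h(list(rates_by_genre.values()))
```

Because the statistic depends only on ranks, it does not require the per-notebook rates to be normally distributed.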
To examine the distinctiveness of the refactoring profile of each genre (excluding AD), we ran Pearson's χ² tests of independence for each pair of genres [27]. These tests excluded the Analytical Demos and were run for the top 10 refactorings overall (excluding the bottom 5) because Pearson's χ² test assumes non-zero observation counts (cf. Table 4; see footnote 2). We report these results in Table 6 (B). Our results suggest that, except for the Exploratory Analysis/Programming Assignment pairing (p = 0.192), all genres' refactoring distributions are pairwise distinct. According to the obtained Cramér's V values, the effect sizes are modest.
The pattern observed in the pairwise post hoc tests suggests that the exploratory genres might be distinct from the expository genres in terms of refactoring rates and types. Analogous to the above, a 2-way Kruskal-Wallis H test comparing aggregated exploratory (EA+PA) and expository (EM+TD) notebooks supports this, as these two groupings refactor at different rates (p < 0.001), while a χ² independence test for the distinctiveness of these meta-genres' refactoring profiles was significant (p = 0.002).
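As an illustration of how a pairwise comparison of refactoring profiles yields a test statistic and an effect size, the following sketch computes Pearson's chi-square statistic and Cramér's V for a contingency table of genres by refactoring operations. It is a simplified sketch under stated assumptions: the counts are hypothetical, no continuity correction is applied, and the p-value lookup is omitted.

```python
import math


def chi_square_and_cramers_v(table):
    """Pearson's chi-square statistic and Cramér's V for an r x c count table.

    Assumes every row and column total is non-zero, so that all expected
    counts are positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    smaller_dim = min(len(table), len(table[0]))
    cramers_v = math.sqrt(chi2 / (n * (smaller_dim - 1)))
    return chi2, cramers_v


# Hypothetical counts: rows are two genres, columns are refactoring operations.
profile_counts = [
    [30, 25, 20, 15, 10],   # genre A
    [12, 18, 9, 20, 16],    # genre B
]
chi2, v = chi_square_and_cramers_v(profile_counts)
```

For a comparison of two genres the smaller table dimension is 2, so V reduces to the square root of the chi-square statistic divided by the total number of observed refactorings.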
Given the consistent patterns observed between the overall distribution of refactorings and the profile of selected refactoring operations, we find strong evidence that Programming Assignments, from a refactoring perspective, cluster with Exploratory Analyses into an exploratory meta-genre, and that the Educational Materials and Technology Demonstrations cluster into an expository meta-genre.

RQ-BG: Computational Notebook Authors with a Computing Background Refactor Differently, but Not More
Table 3 summarizes refactoring behavior according to author background, and Table 5b further breaks down the use of refactoring operations. For completeness, Table 7 provides a breakdown of author background according to notebook genre. Overall, we see a striking similarity between the Data Science category and the Computer Science category. The major difference that we observe is that Data Scientists tend to favor Extract Module over Extract Class (6.4% versus 1.7%), compared to Computer Scientists (1.5% versus 6.6%). This could be attributed to computer scientists being influenced by their wide use of object-oriented practices in traditional development. Table 9 highlights this, showing that CS & IT authors use classes much more than others. Still, even CS & IT authors use classes lightly: Only 6 of their 40 notebooks contain even a single class declaration. Those outside of Data Science and Computer Science ("Other" in Table 5b) are not especially distinctive, either, except that they employed Change Function Signature and Split Cell rather more, and Extract Function, Reorder Cells, and Rename rather less. The prevalence of Change Function Signature is almost entirely due to those with backgrounds in Mathematics and Finance, a small minority of the notebook authors in the Other category.
As shown in Table 8, pairwise χ² tests on the refactoring distributions by author category bear out these observations. Only the Data Science and Other pairing presents a statistically significant difference (p = 0.037), with a modest effect size (V = 0.19). There is scant evidence to suggest that Computer Scientists refactor distinctly from Others (p = 0.166, V = 0.16) or Data Scientists (p = 0.473, V = 0.12). We additionally compared all of those with computational backgrounds (CS+DS) and those with non-computational backgrounds (Other), and find similarly significant evidence to suggest differences in refactoring (p = 0.037, V = 0.19). Overall, then, we find strong evidence that notebook authors with computing-related backgrounds refactor differently than their non-computing counterparts, but they do not refactor more. This suggests that the rate of refactoring is mostly influenced by the evolutionary characteristics of the notebook genre, such as exploration, exposition, and changes to underlying technology.
Footnote 1: The overall distribution of refactorings per commit fails a normality test, so ANOVA is inappropriate.
Footnote 2: We additionally applied Yates' continuity correction to avoid overestimation of statistical significance or effect size in light of the small frequencies (< 5) for some of the less common refactorings [35].

RQ-CC: Computational Notebook Authors Mostly Tidy Code as They Go; EA Code Cleaned Up More
We sought to quantify the self-reports of incremental tidying and bigger cleanups for Exploratory Analysis notebooks [18,31] as discussed in Section 2.3, with respect to the evolution of code, and compare these to other genres. Already, in Section 4.3, we observed that Programming Assignments exhibit an exploratory character when it comes to the distribution of refactorings across commits and the selection of refactoring operations.

Exploratory Notebooks Are Not Refactored the Most.
We take it as a given that refactoring is an act of tidying or cleaning: an attempt to alter the look and arrangement of code to better reflect the concepts embodied in the code or ease future changes. As shown in Table 2, Exploratory Analyses do not stand out in the amount of refactoring applied to them, whether measured relative to the number of commits or notebook code size (Educational Materials do).

Computational Notebook Authors "Floss" Much More than "Root Canal."
Murphy-Hill and Black distinguish refactorings that are employed to keep code healthy as a part of ongoing development and refactorings that repair unhealthy code [25]. They name these two strategies floss refactoring and root-canal refactoring, respectively. The distinction is important, because their relative frequency says something about developer priorities regarding the ongoing condition of their code. One might suppose that the typical notebook author is unaware of the importance of keeping their notebook code structurally healthy, and so might postpone refactoring until the problems are acutely interfering with ongoing development.
Murphy-Hill et al. operationalized floss refactorings as those that are committed with other software changes, and root-canal refactorings as those that are committed alone [26]. Although this is perhaps a low bar for identifying root-canal refactoring, only 4.8% of notebook commits containing refactorings are refactoring-only, with the remaining 95.2% of commits containing refactorings being classified as flossing (Table 4, column 8). Of the 23 refactoring-only commits, 10 come from Exploratory Analysis notebooks, 4 from Programming Assignments, 4 from Educational Material, 4 from Technical Demonstrations, and 1 from Analytical Demonstrations (see Figure 2). Exploratory Analyses' 10 refactoring-only commits, at 4.5% of all their refactoring commits, are slightly below the average of 4.8%. The 23 refactoring-only commits contain 28 refactorings in total, 1.22 refactorings per commit on average, a bit lower than the rate of 1.30 for commits containing both refactorings and other changes. The frequency of refactoring operations is shown in Table 10. It is notable that Extract Function did not occur in the refactoring-only commits, given its popularity generally (Table 4). A commit that contains many refactorings rather than solely refactorings (a quasi-root-canal classification) could also be a sign of cleanups. As shown in Figure 1, less than a quarter of refactoring commits contain two or more refactorings, only 25 contain three or more, and the most refactorings in any one commit is six. Exploratory Analysis notebooks have a slightly above average number of multi-refactoring commits, and the 6-refactoring commit belongs to an Exploratory Analysis notebook. Although it is hard to define what would be enough refactorings in one commit to be evidence of a concerted cleanup, three would seem to be a low bar. Five or six is more interesting, but we saw just four of these.
Overall, then, there is substantial evidence of tidying via floss refactoring, and little evidence of cleanups via root canal refactoring.
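The floss versus root-canal operationalization described above can be sketched as a simple classification over commits. This is an illustrative sketch only: the `Commit` record and its field names are hypothetical, and the sample commits are invented rather than drawn from the study.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    refactorings: int    # number of refactorings identified in the commit
    other_changes: bool  # True if the commit also contains non-refactoring changes


def classify(commit):
    """Label a commit as 'none', 'floss', or 'root-canal'."""
    if commit.refactorings == 0:
        return "none"
    return "floss" if commit.other_changes else "root-canal"


def root_canal_share(commits):
    """Fraction of refactoring commits that are refactoring-only."""
    labels = [classify(c) for c in commits]
    refactoring_commits = sum(1 for label in labels if label != "none")
    return labels.count("root-canal") / refactoring_commits


# Invented history: three floss commits, one root-canal commit, one plain commit.
history = [Commit(1, True), Commit(2, True), Commit(1, True),
           Commit(2, False), Commit(0, True)]
share = root_canal_share(history)
```

Applied to the counts reported in this article, 23 refactoring-only commits out of 475 refactoring commits gives roughly 4.8%, and 28 refactorings over those 23 commits gives about 1.22 per commit.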

Computational Notebook Authors Do Not Perform Much Architectural Refactoring.
Architectural refactorings could be a signal of code cleanups, as they alter the global name scope of the notebook by creating a new scope and/or adding/removing entities from the global scope. The architectural refactorings observed are Extract Function, Extract Module (which extracts functions or classes into a separate Python file), Extract Class, and Extract (Global) Constant. They account for 25% of Exploratory Analysis refactorings, a bit more than the 19% of other genres. Extract Module and Extract Class are especially interesting, as they gather and move one or more functions into a new file and new scope, respectively. Twenty-three Exploratory Analysis refactorings (7%) come from this category, compared to 22 (6%) for the other genres. Although 78% of observed refactorings are non-architectural, we see some support for cleanup behavior, and more so for Exploratory Analyses.
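As a small illustration of the categorization used above, the share of architectural refactorings in a list of labeled refactorings can be computed as follows. The label strings and the sample list are hypothetical; only the four-way membership of the architectural set comes from the text.

```python
# The four refactorings the text treats as architectural (label spelling is an assumption).
ARCHITECTURAL = {"Extract Function", "Extract Module", "Extract Class", "Extract Constant"}


def architectural_share(refactorings):
    """Fraction of the given refactoring labels that are architectural."""
    if not refactorings:
        return 0.0
    hits = sum(1 for name in refactorings if name in ARCHITECTURAL)
    return hits / len(refactorings)


# Invented sample of refactorings observed in one notebook's history.
sample = ["Rename", "Extract Function", "Split Cell", "Extract Module"]
sample_share = architectural_share(sample)
```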

Computational Notebook Authors Refactor Throughout.
Rule et al.'s interviewees reported performing cleanups after analysis was complete [31], whereas Kery et al.'s interviewees reported more periodic cleanups in response to accrued technical debt [18]. As such, we assessed when refactoring takes place over time.
We plotted refactoring commits and non-refactoring commits over time, shown in Figure 2. Time is binned for each notebook according to the observed lifespan of the notebook. Interestingly, we see proportional spikes of both refactoring and non-refactoring commit activity in the first and last 5% of observed notebook lifetimes. This can be seen numerically in Table 11: commits during the middle 90% of notebook lifetime are 7.7 times less frequent than during the first 5% and 3.5 times less frequent than during the last 5%. Over one-third of notebook activity happens during these two periods. Overall, commits containing refactorings occur at a rate that closely tracks the overall commit rate. The relative commit rate for Exploratory Analysis notebooks is nearly identical. By treating each commit on a notebook as a tick of a virtual commit clock, as shown in Figure 2's inset, we can see that refactoring activity, on average, is much more uniform over observed notebooks' commit-lifetimes.
Using this notion of commit-time, Figure 3 plots the frequency of actual refactorings, both architectural (dark blue) and non-architectural (light blue). Since the average number of refactorings per commit is 1.30, it is not surprising that the trend looks highly similar to Figure 2-inset. We observe a slight hump in the middle for both types of refactorings, and the rate of architectural refactoring closely tracks the non-architectural refactoring rate. Also, the commit-time rate of Exploratory Analysis architectural refactorings closely mirrors that of all notebooks (not shown).
The rate of refactoring over time supports the hypothesis of ongoing tidying. To the extent that architectural refactoring is a signal of cleaning, the data provides evidence for ongoing cleaning as opposed to post-analysis cleanups.
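The two notions of time used in this analysis can be sketched as follows: wall-clock time normalized to each notebook's observed lifespan, and a virtual commit clock that ticks once per commit. This is a minimal sketch under stated assumptions: the function names, the handling of the 5% and 95% boundaries, and the example timestamps are all hypothetical.

```python
def lifetime_bin(timestamp, first, last):
    """Place a commit in the first 5%, middle 90%, or last 5% of the observed lifespan."""
    if last == first:
        return "first 5%"  # degenerate lifespan; boundary choice is an assumption
    position = (timestamp - first) / (last - first)
    if position <= 0.05:
        return "first 5%"
    if position >= 0.95:
        return "last 5%"
    return "middle 90%"


def commit_clock(timestamps):
    """Map each commit to [0, 1] on a virtual clock that ticks once per commit."""
    n = len(timestamps)
    order = sorted(range(n), key=timestamps.__getitem__)
    positions = [0.0] * n
    for tick, index in enumerate(order):
        positions[index] = tick / (n - 1) if n > 1 else 0.0
    return positions


# Invented commit times (arbitrary units) clustered at the start and end of a lifespan.
times = [0, 1, 2, 50, 98, 100]
bins = [lifetime_bin(t, times[0], times[-1]) for t in times]
ticks = commit_clock(times)
```

On this invented history, five of the six commits fall in the first or last 5% of wall-clock time, yet the commit clock spaces them evenly, which is why activity looks more uniform in commit-time.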

Computational Notebook Authors Modestly Comment Out and Delete Code.
In the abovementioned interview studies, Exploratory notebook authors attested to commenting out deprecated code as part of cleaning [18,31]. We observed 151 commits in 92 notebooks in which code was commented out. Although consequential, this rate is much lower than the number of refactoring commits (475) and notebooks that contain refactorings (160). Still, 82 of those commits occurred in Exploratory Analysis notebooks, 5.8% of their commits, twice the rate of the other notebook genres. Although the numbers are small, they suggest that Exploratory Analysis notebooks are undergoing more tidying or cleaning up of this sort.
Deletion of code can also be cleaning. We measured deletions of non-comment code in commits (Figure 4). The median commit on an Exploratory Analysis notebook that results in a net deletion of code deletes a sizable 3.1% of a notebook's code. The next highest is Technology Demonstrations at 1.8%. The outliers (diamonds) are especially interesting, as they suggest large deletions indicative of cleanups. Each genre has about 20% outliers, suggesting that Exploratory notebooks do not stand out in this regard. Exploratory Analyses have 48 outlier deletions, 0.61 per notebook.
Finally, we measured how much smaller a notebook's non-comment code size is between its maximum and its last commit. If code size shrinks substantially, then that is a sign of cleanups. Following a Pareto-like distribution, 20% of Exploratory Analysis notebooks shrink more than 22%, whereas 20% of other genres shrink only a little more than 10%. This is the strongest case for Exploratory Analysis notebooks undergoing cleanups distinctly from the other genres.
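The shrinkage measure described above compares a notebook's peak non-comment code size with its size at the last commit. A minimal sketch follows; the function name, the unit of size, and the example sizes are assumptions made for illustration.

```python
def shrinkage(code_sizes):
    """Fraction by which the final code size falls below the maximum observed size.

    code_sizes lists the non-comment code size at each commit, oldest first.
    """
    peak = max(code_sizes)
    if peak == 0:
        return 0.0
    return (peak - code_sizes[-1]) / peak


# Invented history: the notebook grows to 400 units of code, then is trimmed to 300.
sizes = [100, 250, 400, 300]
fraction = shrinkage(sizes)
```

A notebook whose last commit is also its largest has a shrinkage of zero under this definition.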

Summary for RQ-CC.
Taken together, we see ample evidence of code tidying and some evidence of code cleanups across all genres. Exploratory Analysis notebooks see more code deleted and commented out. On average, they apply more architectural refactorings and slightly more multiple-refactoring commits. The preponderance of observed code tidying is especially notable, because any code that was introduced and refactored in the same commit (de facto code tidying) was not observable. This is discussed further in Section 6.

Refactoring: Intrinsic to Notebook Development Despite Small Size
Genres exhibit unique refactoring profiles, perhaps due to their distinct evolutionary drivers, but refactoring was observed in most notebooks of all genres. We were surprised to see substantial similarities in refactoring behavior among data scientists, computer scientists, and those from other backgrounds such as physical scientists. Likewise, our observation of a broad practice of floss refactoring, a best practice in traditional software development, is notable, as was the broader pattern of ongoing maintenance over big clean-ups. These suggest that the pressures of technical debt are motivating notebook authors to perform regular notebook maintenance.
We surmise that refactoring is intrinsic to notebook development, despite the small size of notebooks. Belady and Lehman's model for software development predicts that entropy increases exponentially with each change [2], and exploratory development tends to introduce many (often small) changes. Their model predicts that computational notebooks will experience increasing difficulty in making changes, eventually being forced to refactor or abandon the notebook (perhaps by copying essential code into a new notebook).

Notebook Authors Refactor Less, and Differently, than Traditional Developers
As observed in Section 4.2.1, 13% of commits for the notebooks in this study contain refactorings. This contrasts with the 33% found by Murphy-Hill et al. in their study of CVS logs of Java developers [26, Section 3.5]. Since we found that those with a computer science background actually refactored their notebooks a little less than others (see Section 4.4), it appears that unique characteristics of computational notebooks such as their typical small size are influential. Another influence could be that notebook authors commit their changes to GitHub less frequently, hiding more floss refactorings (see Section 6.1).
Furthermore, as observed in Section 4.2.2, notebook authors rely heavily on a few basic refactorings. The same trend is seen amongst traditional Java developers, but with a distinct selection of refactoring operations. In Murphy-Hill et al.'s analysis of CVS commit logs of manual refactorings of Java code, the top four refactoring operations were, in order, Rename (29.7%), Push Down Method/Field (14.5%), Generalize Declared Type (9%), and Extract Function/Method (6.9%) [26, Figure 2]. We observe the following:

Rename Is Much Less Common in Notebook Development.
For the traditional developers, Rename occurred more than twice as often as the next refactoring, whereas in our sample of notebooks Rename came in fourth at 10.5%. Although the small size of the typical notebook compared to a Java application might be a partial explanation, the typical notebook makes heavy use of the global scope, putting pressure on the notebook author to maintain good naming as the notebook grows and evolves. We hypothesize that the difficulty of renaming identifiers in the Jupyter IDE, as discussed in Section 2.4, is a deterrent to renaming in notebooks.

Object-oriented Refactorings Are Much Less Common in Notebook Development.
For traditional developers, the next two most common refactorings, Push Down Method/Field and Generalize Declared Type, are core to object-oriented development. Python supports object-oriented development, but as discussed in Section 4.4, it is not widely practiced in the notebooks in this study.

Cells Are a Key Structural Affordance in Notebook Evolution.
Although the wide use of Jupyter's unique cell construct is not surprising, it is perhaps more surprising that the refactoring of cells is so common, at 40% of the total. However, as shown in Table 8, the occurrence of code cells is over seven times greater than functions for the notebooks in this study, the next most common construct. What we are likely observing is that cells are displacing functions, in comparison to traditional development. Another factor is that Reorder Cells, Split Cell, and Merge Cells are supported in the Jupyter IDE, unlike the other observed refactoring operations.

Need for Better Refactoring Support in Notebook IDEs
Implementing tool assistance for refactoring is non-trivial, and IDE developers might be hesitant to do so without evidence that the benefits outweigh the costs. The presence of refactoring in 80% of the computational notebooks sampled for this study argues for at least some support for traditional refactoring tools that enable author-directed refactorings. In particular, our results suggest that notebook authors using Jupyter would benefit from environment support for at least the three refactorings in the top six that are not yet automated: Change Function Signature, Extract Function, and Rename.
Rename is particularly challenging to perform manually in a computational notebook, because old symbols retain their values in the environment until the kernel is restarted or the symbol is removed by the notebook author. A proper Rename refactoring would also rename the symbol in the environment, not just the code.
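The hazard described above can be demonstrated with a plain dictionary standing in for the kernel's global namespace. This is a simplified sketch, not how Jupyter or any IDE implements Rename: `run_cell` and `rename_symbol` are hypothetical helpers, and `exec` is used only to mimic running a cell's code in a shared environment.

```python
def run_cell(namespace, source):
    """Mimic executing a notebook cell in a shared kernel namespace."""
    exec(source, namespace)


def rename_symbol(namespace, old, new):
    """Move a live binding so the environment matches the renamed code."""
    if old in namespace:
        namespace[new] = namespace.pop(old)


# Manual rename: the author edits the cell text and re-runs it.
kernel = {}
run_cell(kernel, "total = 10")
run_cell(kernel, "grand_total = 10")        # same cell after renaming total by hand
stale_binding_survives = "total" in kernel  # the old symbol is still defined
run_cell(kernel, "report = total + 1")      # a missed reference runs without error

# Environment-aware rename: the binding itself is moved as well.
fresh = {}
run_cell(fresh, "total = 10")
rename_symbol(fresh, "total", "grand_total")
```

In the first scenario the missed reference to `total` silently uses the stale value. In the second, the same missed reference would raise a `NameError`, exposing the incomplete rename immediately.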
Similarly, for Reorder Cells, which is supported only syntactically in computational notebooks, support for checking definition-use dependencies between reordered cells [14] could help avoid bugs for the many computational notebook authors who use that refactoring.
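A check of definition-use dependencies between cells could be approximated with Python's standard `ast` module, as sketched below. This is a deliberately coarse illustration and not the analysis of [14]: it treats every assigned, imported, or declared name as a cell-level definition, ignores the order of statements within a cell, and does not model function scopes. The function names are hypothetical.

```python
import ast
import builtins


def cell_defs_uses(source):
    """Return (names defined, names read) in one cell, ignoring scoping details."""
    defs, uses = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defs.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                uses.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defs.add(node.name)
        elif isinstance(node, ast.arg):
            defs.add(node.arg)  # coarse: parameters count as definitions
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defs.add((alias.asname or alias.name).split(".")[0])
    return defs, uses


def broken_dependencies(cells):
    """List (cell index, name) pairs where a name is read before any cell defines it."""
    defined = set(dir(builtins))
    problems = []
    for index, source in enumerate(cells):
        defs, uses = cell_defs_uses(source)
        for name in sorted(uses - defs - defined):
            problems.append((index, name))
        defined |= defs
    return problems


original = ["x = 1", "y = x + 1", "print(y)"]
reordered = ["y = x + 1", "x = 1", "print(y)"]
```

Here `broken_dependencies(original)` reports nothing, while the reordered notebook is flagged because its first cell reads `x` before any cell defines it.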
As observed in Section 2.4, JetBrains' new DataSpell IDE provides a subset of the refactorings supported by their Python refactoring engine. Among these are three of the top six refactorings observed in our study, mixed in with several less-useful object-oriented refactorings. The Rename provided by DataSpell does not rename the symbol in the runtime kernel.
Although our results document that phases of code cleanups are not especially common, especially as compared to tidying, one possible reason is the lack of tool support. Without tool support or test suites, a notebook cleanup is high risk, as the substantial changes to a complex notebook could lead to hard-to-fix bugs. In this regard, our results argue for cleanup support like code gathering tools [15] (see Section 2.4).
However, a customized mix of refactoring assistance for different genres or authors of differing backgrounds is not strongly supported. Although we observed differences, the effect sizes of the statistical tests are modest and the overall top refactorings observed were performed in sizable percentages in notebooks of both exploratory and expository characters, as well as by authors of differing backgrounds.

Multi-notebook Refactoring
In interviews, notebook authors said that they sometimes keep exploration and exposition in separate notebooks [31], split a multiple-analysis notebook into multiple notebooks [18], or drop deprecated code into a separate notebook [18,31]. We saw evidence of these in the ample code deleted from notebooks (see Section 4.5.5). Such actions create relationships among notebooks akin to version control. Extrapolating from a suggestion from one of Head et al.'s study participants [15], it may be valuable for notebook refactoring tools to be aware of these relationships, for example, having an option for Rename to span related notebooks.

Future Work on Studies of Notebook Refactoring
The present study focused on refactorings introduced between commits on a notebook. Future work could, for example, use logs from instrumented notebook IDEs to reveal the full prevalence of refactoring, as well as study fine-grained evolution behaviors such as how refactoring tools are used (as Murphy-Hill et al. did for Java developers [26]) or track activity among related notebooks (such as those being used for ad hoc version control). Future work could also investigate which situations motivate the use of refactoring in computational notebooks, as well as the effectiveness of refactoring. The latter could be studied for both manual and assisted refactoring, examining properties such as improved legibility and cell ordering better reflecting execution dependencies, versus, say, the introduction of bugs or a worse design.

LIMITATIONS AND THREATS

Limitations Due to Use of Source Control
Many computational notebook authors may be unaware of source code control tools. This study omits their notebooks, and our results may not generalize to that population. However, numerous articles in the traditional sciences advocate for the use of tools like GitHub for the benefits of reproducibility, collaboration, and protecting valuable assets [24,28], establishing it as a best practice, if not yet a universal one. Notebook authors who know how to use source code control are more likely to know software engineering best practices as well, for example if they had encountered Software Carpentry [33,34]. By factoring our analysis according to author background, we partially mitigated this limitation.
Notebooks may have been omitted because authors deemed them too trivial to commit to source control. We also excluded notebooks with fewer than 10 commits, the vast majority of notebooks on GitHub. Still, our study includes notebooks with a wide range of sizes and number of commits.
Our dependence on inter-commit analysis also presents limitations. Refactorings that occur in the same commit in which the refactored code is introduced are undetectable. This underreporting may not be uniform across refactoring operations, because intra-commit refactorings are by definition floss refactorings, which could be less architectural in nature. Even so, we observed little root canal refactoring. Additionally, our analysis is more likely to underreport for notebooks whose histories have a relatively small number of large commits rather than many small commits.
Relatedly, some notebooks may not be committed early in their history, resulting in an initial large commit, hiding refactorings. Also, a few of the notebooks in our study are not "done": still in active development or in long-term maintenance. Given the marked bursts in overall commit rate early and late in the histories we captured (Figure 2 and Table 11), we are confident that we observed meaningful lifetimes for the vast majority of notebooks. To further quantify this, we performed an analysis of the similarity of the first and last commit of each notebook. We took a conservative approach, extracting the bag of lexical code tokens for each, and then calculated the similarity as the size of the (bag) intersection of the two commits, divided by the size of the larger bag. Half of the notebooks' first-last commits are less than 34% similar. At the extremes, 47 notebooks' first-last commits are less than 10% similar, whereas 14 notebooks' first-last commits are more than 90% similar.
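The bag-based similarity described above can be sketched with Python's standard `tokenize` module and `collections.Counter`. This is an illustrative sketch under stated assumptions: it handles Python source only, the choice to drop comments and layout tokens is an assumption about what counts as a lexical code token, and the function names and example sources are hypothetical.

```python
import io
import tokenize
from collections import Counter

# Token types treated as layout or commentary rather than code (an assumption).
SKIPPED = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
           tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def token_bag(source):
    """Return the multiset of lexical code tokens in a piece of Python source."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return Counter(tok.string for tok in tokens if tok.type not in SKIPPED)


def bag_similarity(first_source, last_source):
    """Size of the bag intersection divided by the size of the larger bag."""
    first, last = token_bag(first_source), token_bag(last_source)
    larger = max(sum(first.values()), sum(last.values()))
    if larger == 0:
        return 1.0  # two empty sources are treated as identical
    return sum((first & last).values()) / larger


first_commit = "x = 1\ny = x + 1\n"
last_commit = "x = 1\n"
similarity = bag_similarity(first_commit, last_commit)
```

In this example the first commit has eight code tokens and the last has three, all of which also occur in the first, so the similarity is 3/8. Dividing by the larger bag makes the measure conservative: a notebook that grew or shrank substantially scores low even if all of the smaller version's tokens survive.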
A future study could avoid the limitation of using source control snapshots by investigating notebook evolution through an instrumented notebook IDE (see Section 5.5).

Limitation Due to Use of Public Notebooks
Our study examined publicly available notebooks. Notebooks authored as private assets may be evolved differently, perhaps due to having smaller audiences or fewer authors. Likewise, many programming assignments may be completed within a private repository due to academic integrity requirements. These limitations are partially mitigated by factoring our analysis by author background and notebook genre.
In a separate vein, public notebooks on GitHub are frequently cloned, notably textbooks and fill-in-the-blank homework assignments, creating the possibility that our random sample might have selected the same notebook multiple times, skewing our results. Our exclusion of student fill-in-the-blank notebooks (Section 3.4) eliminated one class of possibilities. In the end, as discussed in Section 3.2, our random sample contained no duplicate notebooks.

Limitation Due to Single-notebook Focus
Our analysis detected when Extract Module moved code out to a library package and referenced it by import, and we analyzed code deletion in the context of cleanups. Deleted code may have been pasted into another notebook (see Sections 4.5.5 and 5.4), but its destination was not tracked in our analysis. As mentioned in Section 5.5, a future study could track the copying and movement of code across notebooks to better understand their relationship to refactoring.

Internal Validity Threats Due to Use of Visual Inspection
We classified refactorings through visual inspection, which is susceptible to error. Some previous studies of professional developers used automated methods, but they were shown to detect only a limited range of refactorings [26]. We used visual inspection to enable the detection of idiomatic expressions of refactorings in Jupyter notebooks, regardless of the programming language used. Five notebooks did not employ Python. Refactorings have been standardized and formalized in the literature, and the authors are expert in software refactoring. The methods we practiced as described in Sections 3.1 and 3.2 further controlled for mistakes. An audit, as described in Section 3.3, found a 9.9% error rate.
Determining notebook genre was more difficult, as there is no scientific standard for these. As described in Section 3.4, we employed Negotiated Agreement to eliminate errors. Although we can claim our classifications to be stable, others could dispute our criteria for classification. As described in the same section, for notebook author background, an audit found a 5% error rate.

External Validity Threat Due to Sample Size
Finally, to enable a detailed and accurate inspection of each notebook, its commits, authorship, and containing repository, this study was limited to studying 200 notebooks. We randomized our selection process at multiple stages to ensure a representative sample.

CONCLUSION
Computational notebooks have emerged as an important medium for developing analytical software, particularly for those without a background in computing. In recent interview studies, authors of notebooks conducting exploratory analyses frequently spoke of tidying and cleaning up their notebooks. Little was known, however, about how notebook authors in general actually maintain their notebooks, especially as regards refactoring, a key practice among traditional developers. This article contributes a study of computational notebook refactoring in the wild through an analysis of the commit histories of 200 Jupyter notebooks on GitHub. In summary:

RQ-RF (Notebook Refactoring):
Despite the small size of computational notebooks, notebook authors refactor, even if they lack a background related to computing. From this, we surmise that refactoring is intrinsic to notebook development. Authors depend primarily on a few non-object-oriented refactorings (in order): Change Function Signature, Extract Function, Reorder Cells, Rename, Split Cell, and Merge Cells. Traditional developers coding in languages like Java refactor more than twice as much. They prioritize the same non-cell operations as notebook authors, but apply Rename most frequently and favor a more object-oriented mix of refactorings.

RQ-GR (Refactoring by Genre):
Computational notebooks of all genres undergo consequential refactoring, suggesting that the messiness of exploration often discussed in the literature is not the only driver of refactoring. Programming assignments (e.g., term projects) appear rather similar to Exploratory Analyses with respect to how they are refactored, despite their different end purpose. Overall, refactoring behaviors are differentiated by the notebook's exploratory versus expository purpose.

RQ-BG (Refactoring by Background):
Computational notebook authors with a computing background (computer scientists and data scientists) seem to refactor differently than others, but not more. This adds weight to the conclusion above that refactoring is intrinsic to computational notebook development.

RQ-CC (Tidying vs. Cleanups of Code):
Computational notebook authors exhibit a pattern of ongoing code tidying. Cleanups, cited in interview studies, were less evident, although they occur more in exploratory analyses. Cleanups appear to be achieved more often by moving code into new notebooks, a kind of ad hoc version control.
Our results suggest that notebook authors might benefit from IDE support for Change Function Signature, Extract Function, and Rename, with Rename taking the kernel state into account. Also, given the frequency of use of the Reorder Cells operation, notebook authors might benefit from it being extended to check for definition-use dependencies.
To replicate and extend these results, future work could instrument notebook IDEs to log fine-grained evolution behaviors and how refactoring tools are used, including across related notebooks. Future work could also study the circumstances that motivate the use of refactoring in computational notebooks, as well as the net benefits of refactoring with respect to factors such as legibility, cell ordering reflecting execution dependencies, and the accidental introduction of bugs.