Distributed Scrum: A Case Meta-analysis

Distributed Scrum adapts the Scrum project management framework for geographically distributed software teams. Experimentally evaluating the effectiveness of Distributed Scrum is impractical, but many case studies and experience reports describe teams and projects that used Distributed Scrum. This article synthesizes the results of these cases using case meta-analysis, a technique for quantitatively analyzing qualitative case reports. On balance, the evidence suggests that Distributed Scrum has no impact, positive or negative, on overall project success. Consequently, claims by agile consultants who present Distributed Scrum as a recipe for project success should be treated with great caution, while researchers should investigate more varied perspectives to identify the real drivers of success in distributed and global software development.


INTRODUCTION
Distributed or global software development refers to any software process in which the staff working on a software project are divided among two or more geographically separated offices. These offices could be in the same city, different cities, or different countries. The term "distributed" usually refers to geographic distribution, but frequently also involves temporal, cultural, and organizational boundaries [64]. Global sourcing and distributed software development have become a common business reality and a strategy to develop and improve software systems rapidly [127,128]. Many software companies attempt to increase efficiency and productivity by scaling software teams temporally and geographically [110,157]. The Covid-19 pandemic appears to have accelerated the already established trend toward more distributed software teams [50,105].
Meanwhile, agile methods have become commonplace in the software industry over the past two decades. Agile methods such as Scrum emerged in the early 1990s to address issues with the plan-driven methods that were prevalent at the time [139]. These issues remain important today: projects are frequently late, over budget, and often do not deliver stakeholder value. These problems can lead to project failure, a topic that has garnered significant interest for decades [1,27,28,44]. Agile methods were thus seen as a way to be more responsive to changing circumstances and customer demands, which could help achieve higher levels of project success.
As both agile methods and distributed software development became essential to the software industry, practitioners and researchers began discussing many aspects of their characteristics and dimensions, as well as their combined effects on individuals, teams, and software artifacts [19,54]. Much of the discussion focused on the socio-technical practices needed to enact agile processes in geographically, temporally, and culturally distributed environments. These discussions eventually gave rise to Distributed Scrum [157].
Distributed Scrum is a software project management framework for geographically distributed environments comprising sets of roles, rituals, and socio-technical practices [153,154,157]. Like standard Scrum, it emphasizes rapid iteration, daily communication, and flexible planning. Distributed Scrum attempts to adapt Scrum practices to distributed software projects, which are often more difficult to manage due to the barriers associated with physical distance, multiple time zones, and cultural conflict [64]. Sutherland et al. positioned Distributed Scrum as a "Secret Sauce for Hyperproductive Offshored Development Teams" [154]. While previous reviews (see Table 4) in this area have offered rich insights into the various intricacies of Distributed Scrum, its impact on project success, and indeed whether or not Distributed Scrum is a "secret sauce," remain unclear.
A major challenge in determining whether Distributed Scrum can be linked to project success is the practical operation of such a study. Distributed Scrum and most of its components are, in principle, denied critical empirical verification in the form of randomized controlled experiments [123]. It is not practical to recruit thousands of professional developers, randomly organize them into numerous treatment groups, and have them build the same project for the same stakeholders using different combinations of practices. Even if we accepted undergraduate students as reasonable proxies for professional developers, the extent to which realistically complex, distributed software projects could be simulated with students is limited [149]. Unsurprisingly, then, an early review found little empirical evidence supporting the usefulness of Scrum practices in global software development [67]. Rather, most empirical evaluations are case studies or experience reports of adapting regular Scrum to distributed settings. While the case study is a legitimate research method, it is ineffective for testing a hypothesis such as "Distributed Scrum enhances software project success." However, a sufficient number of such cases now exist to conduct a case meta-analysis, also called a case survey. Case meta-analysis is a type of systematic review [124] in which qualitative data are quantitatively synthesized to facilitate null hypothesis testing [78,91]. In this article, we therefore report on a case meta-analysis addressing the following research question:

Research question: Which Distributed Scrum practices are associated with software project success?
In the remainder of this article, we discuss Distributed Scrum and previous reviews of agile methods in distributed software development (Section 2). We then describe our approach to case meta-analysis, including the development of hypotheses, the literature search process, data extraction, and analysis procedures (Section 3). Sections 4 and 5 present and discuss our results, respectively. Finally, Section 6 summarizes our contributions.

Table 1. Scrum roles

- Scrum Master: The Scrum Master helps the team understand and implement Scrum and manages the events. The Scrum Master is a facilitator, not a manager, whose role is to remove any friction that a team might experience [107].
- Product Owner: The Product Owner represents the customer of the software and manages and prioritizes the product backlog [14].
- Developers: Developers build and test the software; work items are taken from the sprint backlog and marked complete when they comply with a "Definition of Done" [146] that the team defines.

Table 2. Scrum artifacts

- Product Backlog: "a list of work items (e.g., user stories, outstanding bugs, various chores) used by software teams to coordinate work to be done" [140].
- Sprint Backlog: "the Sprint Goal (why), the set of Product Backlog items selected for the Sprint (what), as well as an actionable plan for delivering the Increment (how)" [155].
- Increment: the set of Backlog items that the team expects to complete during the current Sprint.
- Definition of Done: "a formal description of the state of the Increment when it meets the quality measures required for the product" [155].

BACKGROUND AND RELATED WORK

Distributed Scrum
Scrum was originally proposed as a lightweight and flexible approach to software development for small, co-located teams [130,139,156]. The original Scrum framework defines numerous events (or "ceremonies"), roles, and artifacts that provide a basic framework for a software team to follow. Scrum has become hugely popular worldwide and has been adapted for large-scale projects [132] and regulated domains [49]. The Scrum framework is frequently adapted to fit specific development contexts, for example, by adding or removing events, roles, or artifacts. This tailoring of methods is ubiquitous [33,47]; the resulting tailored method has been labeled "method-in-action" [46]. As larger, globally distributed software organizations adopted Scrum, various adaptations of the original Scrum framework emerged. Sutherland et al. [157] synthesized these adaptations into what they called Distributed Scrum. However, just as Scrum tends to be adapted by software organizations to specific contexts and preferences [33,47,71], Distributed Scrum is not a fixed, singular approach; it refers to the use of Scrum in a distributed setting while acknowledging that not all organizations adapt it in the same way. That is, organizations select specific practices and events to suit their needs. For example, some organizations combine the Sprint retrospective (focused on the process) with the Sprint review (focused on the software product).
Distributed Scrum has essentially the same roles (Table 1), artifacts (Table 2), and events (Table 3) as regular Scrum. However, whereas co-located teams can have physical backlogs (e.g., using index cards), Distributed Scrum requires virtual backlogs (e.g., using cloud-based project management software). Similarly, having staff in multiple locations leads to variation in how meetings are held.

Table 3. Scrum events

- Sprint: A fixed-length period during which the development team works on pre-selected work items, which are identified in the Sprint Backlog.
- Sprint planning: Prior to each sprint, the team determines the work to be done; the work is selected from the product backlog, and the selected work items are gathered into a sprint backlog.
- Daily stand-up: A brief (e.g., 15-minute) meeting during which the team discusses their progress, short-term plans, and blockers. The goal of these short meetings is exclusively to share information on progress, rather than solving any issues or otherwise making any decisions.
- Retrospective: A meeting during which the team reflects on past events and the Scrum process to improve quality and effectiveness.
- Sprint review: A meeting during which the team "presents the results of their work to key stakeholders and progress toward the Product Goal is discussed" [155].
- Scrum-of-Scrums (SoS): A brief, regular meeting among Scrum Masters to facilitate inter-site communication and coordination.

Note: All events listed in the table have a dedicated time frame and are planned ceremonies. Other activities that are frequently undertaken within the Scrum approach include, for example, backlog refinement, which does not have an official time frame but is a more continuous activity led by the Product Owner.
Sutherland et al. [153,154,157] suggested three basic variations: the "Isolated model," the "Distributed Scrum-of-Scrums model," and the "Totally Integrated model" (which they later re-labeled the "Fully Distributed Scrum model"); see Figure 1. We consider these archetypes: common variants that can themselves be tailored further.
In the Isolated model, sites are separated, minimizing dependencies and interaction. Each site has its own, separate meetings and artifacts; some sites may not be using Scrum at all.
In the Distributed Scrum-of-Scrums model, each site holds its own meetings, but the Scrum Masters from each site meet "regularly" to synchronize and coordinate. Whereas Sutherland et al. seem to suggest that they "meet across geographies" [154], the practical ability to do so will depend on the distance. The basic format of the Scrum-of-Scrums is similar to that of the Daily Stand-up meeting, "except that it deals with teams instead of team members" [113]. Sutherland et al. [153] do not comment on whether or not sites using Scrum-of-Scrums share artifacts. It is not clear whether the Scrum-of-Scrums also covers sprint planning, retrospective, and sprint review meetings, but teams using the Scrum-of-Scrums model are supposed to have a shared definition of done and make occasional site visits for face-to-face interaction.
In the Integrated model, teams are distributed, which means that team members are based at different sites. As such, team members conduct meetings using videoconferencing and share a single set of artifacts, such as a single, cloud-based product backlog. In other words, the integrated model entails one team with members in different physical locations.
An obvious problem with the isolated model is that sites may not communicate enough to maintain group cohesion and shared goals. An obvious problem with the integrated model is that, the greater the time zone differences between sites, the more difficult it will be to schedule meetings [168].
These three models can be further tailored to specific contexts (creating "methods-in-action" [46]). Further, as the tables above indicate, teams may conduct certain meetings jointly and others separately. For example, a team may conduct a synchronized sprint planning meeting, which involves the joint participation of project members across different sites, via video conference. Given the number of standard events (Table 3), many different combinations, and thus "models," of Distributed Scrum may exist in practice. Evaluating Distributed Scrum as a single method is impossible; rather, we must consider its constituent roles, artifacts, events, and practices.

Prior Reviews of Agile Methods in Distributed Software Development
Several systematic literature reviews have investigated the use of agile methods in distributed software development. Table 4 presents an overview. A recurring theme in the findings of these reviews is that using Scrum in distributed settings comes with major challenges, which jeopardize project success. Noted challenges include:

- communication, personnel selection, work culture, different time zones, trust, and knowledge management [21];
- communication, collaboration, tool support, and team management [67,94];
- communication challenges, including people-, distance-, team-, technology-, architecture-, process-, and customer-related challenges [5,6];
- communication, collaboration, coordination, and cultural differences [131].
Meanwhile, Vallon et al. presented an extensive review of agile practices used in distributed settings [163], and Dreesen et al. [42] reviewed communication practices in distributed agile projects.
Several of the reviews provide recommendations to address some of these challenges. While many of these recommendations appear insightful and may be practically useful to managers, none of these reviews systematically investigated which Distributed Scrum practices are associated with project success. They mainly used qualitative synthesis methods such as thematic analysis. The present study differs from previous reviews in that it quantitatively tests specific hypotheses concerning the effectiveness of specific practices.

CASE META-ANALYSIS
To address our research question, we conducted a case meta-analysis. Case meta-analysis is especially appropriate for our research question at this time, for three reasons:

(1) A sufficient mass of published cases now exists to facilitate quantitative analysis.
(2) Scrum remains popular among both co-located and distributed teams.
(3) We aim to test specific hypotheses.
The case meta-analysis method was originally proposed by researchers at the Rand Corporation as a "way to aggregate existing research" [97]. It was subsequently elaborated in management [25,91] and more recently in software engineering [100,117]. Our approach was mainly guided by the management guidelines [24], then refined post hoc based on Melegati et al.'s guidelines [100] and the Case Survey Empirical Standard [125].
Most of the research on Distributed Scrum takes the form of case studies and experience reports, which report predominantly qualitative data. There are several ways of synthesizing such research [35], most of which are qualitative in nature; that is, researchers generate higher-level themes that describe prior research (see, for example, Hossain et al. [67]).
Case meta-analysis is neither intrinsically superior nor inferior to meta-synthesis; they simply pursue different aims using different techniques. Meta-synthesis is used to organize research into themes; case meta-analysis is used to test hypotheses [78,91]. Case meta-analysis is widely used in management and information systems [79] and increasingly in software engineering [100]. For example, case surveys have been used to investigate strategic pivots at software start-ups [10] and how organizations select component sourcing options [118]. Overall, there has been considerable work on case meta-analysis, which reinforces the method's consistency and validity, especially considering its importance in identifying and statistically testing patterns across case studies [79,91,100].
To conduct the case meta-analysis, we performed the following steps:

(1) Generate hypotheses based on prior literature;
(2) Identify primary studies using keyword searches and reference snowballing;
(3) Create a coding scheme focusing on facts and minimal interpretation;
(4) Survey the original authors of the selected primary studies for additional information;
(5) Extract data from the studies based on the coding scheme;
(6) Test the hypotheses.
The remainder of this section describes each of these steps in more detail.

Step 1: Generate Hypotheses
We developed six hypotheses based on the Distributed Scrum and global software development literature. Several tactics for generating research questions (or hypotheses) have been proposed; for example, facet analysis considers dimensions of constructs of interest and their properties [25].
A second tactic is Bullock and Lawler's model, which considers three main questions: what was done, how was it done, and in what situation was it done [25].We took a third approach: the "homegrown model" [25], which may rely on prior theory or on the researchers' experience in a field.
Prior to the study, two of the authors were familiar with the literature on distributed software development. One of them had conducted an unpublished pilot study with a graduate student on a smaller sample of articles. Early discussions fed into a series of four workshops with the whole research team, each of which took a few hours. The hypotheses were developed and refined during these workshops, informed by prior literature on distributed software development [2,26,58].
Previous research has categorized issues in global software development into communication, coordination, and control [2]. As projects spread across greater geographic, temporal, and cultural barriers, teams tend to experience more problems communicating and coordinating, while managers struggle to control the development process. Furthermore, an important determinant of success in a software project is trust [41,102], because distributed environments are often characterized by "little or unpredictable communications, lack of conflict handling, and disparities in the work practices" and fewer face-to-face interactions that typically help people build trust [102]. Insufficient trust among team members undermines the development of cohesive work practices [41]. An organization's culture holds it together; "it is related to the institutionalized way of thinking and acting of people" [115]. In software companies, culture can make or break the business because it defines the collective behavior of everyone, as well as how the organization is built and how it functions. Our first hypothesis captures these various themes, as follows:

Hypothesis 1. Software project success is associated with fewer: (A) communication issues, (B) coordination issues, (C) control issues, (D) cultural issues, and (E) trust issues.
Next, we consider factors that might affect each of our hypothesized antecedents of success. Intuitively, the more sites that are involved in collaborative development, the more difficult maintaining good communication will be. Similarly, greater time differences between sites should harm communication by hindering the scheduling of synchronous meetings, as there is less overlap in the working hours of people across different sites. The way teams perform daily stand-up meetings should also affect communication. Teams using Scrum-of-Scrums should have fewer communication problems than teams using isolated stand-up meetings because of the additional information sharing between Scrum Masters in the Scrum-of-Scrums.
Longer sprints may undermine communication because team meetings (other than stand-ups, which usually take place on a daily basis) typically happen once per sprint. A longer sprint duration implies that more communication breakdowns can happen before teams re-synchronize. In contrast, synchronized sprints (i.e., each site's sprint starts on the same day and has the same duration), synchronized meetings, and distributed project management tools should improve communication. Likewise, if all sites have the same definition of done (i.e., the same delivery pipeline and a shared model of who is responsible for what), then they should have fewer miscommunications. Hypothesis 2 summarizes these issues:

Hypothesis 2. Communication issues are directly proportional to: (A) number of sites, (B) temporal difference, and (C) sprint length. Communication issues are inversely proportional to: (D) synchronized sprints, (E) synchronized stand-ups, (F) synchronized sprint planning, (G) synchronized sprint reviews, (H) synchronized retrospectives, (I) shared definition of done, and (J) distributed project management tools.
While synchronizing sprints and meetings should help with communication, they may exacerbate coordination problems, because coordinating meetings across sites and time zones is intrinsically challenging [135]. For the same reason, totally integrated stand-ups (i.e., all staff in all locations attend one big, synchronous stand-up meeting) may cause more coordination issues than a Scrum-of-Scrums [157].
Intuitively, coordination difficulty should grow with the number of sites and staff. The greater the temporal differences, the more challenging the coordination issues will be [168]. Finally, having a shared definition of done should help with coordination. Hypothesis 3 captures these issues:

Hypothesis 3. Coordination issues are directly proportional to: (A) number of sites, (B) temporal differences, (C) synchronized sprints, (D) synchronized stand-ups, (E) synchronized sprint planning, (F) synchronized sprint reviews, (G) synchronized retrospectives, and (H) total staff. Coordination issues are inversely proportional to: (I) shared definition of done and (J) distributed project management tools.
Intuitively, the more people one manages and the more sites they are spread across, the more difficult it will be to maintain control [103]. While Scrum de-emphasizes centralized managerial control in favor of self-organizing teams, Scrum teams can still experience control issues. For example, the team may adopt a particular coding standard, a test-first unit testing strategy, or a no-overtime policy, but an individual team member might disagree with and ignore these choices. The more people involved, the more likely individuals will go against team decisions; the more sites they are spread across, the more difficult it will be to detect and manage any deviant behavior. This leads to our next hypothesis:

Hypothesis 4. Control issues are directly proportional to: (A) total staff and (B) number of sites.
Hofstede (and later, Hofstede's son) were among the first to isolate dimensions of culture empirically [59-61], resulting in the six dimensions listed in the hypothesis below (see Table 5). The greater the cultural diversity of a team, the more cultural issues we expect the team to experience. This does not mean that more diverse teams are less effective. The relationship between diversity and team performance is complex and depends on many moderating factors [65,66]. However, it seems reasonable to hypothesize that, despite their many advantages, more culturally diverse teams will have more cultural conflict and inter-cultural misunderstandings.

Table 5. Hofstede's cultural dimensions

- Power Distance: "The extent to which a society accepts the fact that power in institutions and organizations is distributed unequally" [59].
- Individualism: "Loosely knit social framework in which people are supposed to take care of themselves and of their immediate families only" [59].
- Masculinity: "The extent to which the dominant values in society are 'masculine'-that is, assertiveness, the acquisition of money and things, and not caring for others, the quality of life or people" [59].
- Uncertainty Avoidance: "The extent to which society feels threatened by uncertain and ambiguous situations and tries to avoid these situations by providing greater career stability, establishing more formal rules, not tolerating deviant ideas and behaviors, and believing in absolute truths and the attainment of expertise" [59].
- Long-term Orientation: "related to the choice of focus for people's efforts: the future or the present and past" [61].
- Indulgence: "related to the gratification versus control of basic human desires related to enjoying life" [61].

However, software engineering case studies rarely discuss the cultural backgrounds of individuals. The only commonly reported indicator of culture is site locations. Therefore, to examine culture, we consider differences between the national cultures of the countries in which the sites are located, as follows. Section 3.3 explains this analysis in more detail.

Hypothesis 5. Cultural issues are directly proportional to the maximum difference, among the countries in which the sites are located, in each of: (A) power distance, (B) individualism, (C) masculinity, (D) uncertainty avoidance, (E) long-term orientation, and (F) indulgence.
Finally, the more people who are involved in a project, the more difficult it will likely be to build and maintain trust [3]. Conversely, physical visits between sites should help build trust [9,58], as should airing and resolving grievances during synchronous retrospective meetings (with all sites participating).
Staff dispersion imbalance refers to the degree to which staff members are spread unevenly; that is, a higher level of imbalance indicates a larger difference between larger and smaller sites. In this study, we measure staff dispersion imbalance as the difference between the number of staff at the largest site and the number of staff at the smallest site. For instance, if the project is distributed across three offices with 12, 10, and 8 members, then we calculate staff dispersion imbalance as 12 − 8 = 4. We hypothesize that greater staff dispersion imbalance is associated with more trust issues because staff at smaller sites may feel like outsiders or may feel unheard. Meanwhile, the more sites staff are spread across, the more difficult building trust will be; trust issues might also be more prevalent if development happens across many different sites [74]. Site visits should help staff to build rapport and mutual understanding, and help managers to gauge morale and become more familiar with their employees' work styles, capabilities, and interests [58]. Moreover, synchronized retrospective meetings, in which teams identify and resolve conflicts, should also help to improve trust. Hypothesis 6 summarizes these issues.

Hypothesis 6. Trust issues are directly proportional to: (A) total staff, (B) imbalanced staff dispersion, and (C) number of sites. Trust issues are inversely proportional to: (D) shared definition of done, (E) site visits, and (F) synchronized retrospectives.
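As a minimal sketch, the staff dispersion imbalance measure defined above reduces to a range computation over site sizes (the function name is ours; the site sizes are the illustrative example from the text):

```python
def dispersion_imbalance(site_sizes):
    """Staff dispersion imbalance: number of staff at the largest
    site minus number of staff at the smallest site."""
    return max(site_sizes) - min(site_sizes)

# Example from the text: three offices with 12, 10, and 8 members.
print(dispersion_imbalance([12, 10, 8]))  # 4
```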

Step 2: Identify Primary Studies
To identify articles that report case studies of Scrum in distributed settings, we started with an online search in Scopus. Scopus is one of the most comprehensive databases available [98] and, as such, indexes most venues that we could expect to have published relevant papers. We used the following search string, which resulted in a set of 435 papers (Figure 2): TITLE-ABS-KEY ( ( "scrum" OR "agile" ) AND ( "dispersed" OR "global" OR "distributed" ) AND ( "case" OR "cases" ) AND "software" ).
The term "software" was included to exclude studies of agile methods in non-software settings (e.g., agile manufacturing).
We defined the following inclusion criteria:

- the paper reports a case study or experience report; that is, an account of real events, investigated in a real-world context with no intervention;
- the study reports on the use of Scrum, a Scrum hybrid (e.g., Scrum-ban), or an agile method heavily influenced by Scrum and containing features commonly associated with Scrum (e.g., sprint planning, daily stand-ups, retrospectives, product backlog);
- the study reports on a distributed project;
- the paper was published on or before December 31, 2020.
We also defined several explicit exclusion criteria:

- studies of students or in educational settings;
- studies using sample data such as cross-sectional surveys (e.g., Hummel et al. [72], Jain and Suman [73]);
- papers that do not have sufficient information to code any of the variables but use distributed agile as a study context (e.g., [13,37,84]).
The selected literature was carefully and iteratively examined to identify sources that contained cases that met our criteria, while eliminating those that did not. We inspected each paper's title and abstract. We removed any paper that had no reference to agile methods or distributed development, or that did not report a case study or experience report, leaving 106 papers.
As expected, none of the cases provided information on all variables. We sought to include as many cases as possible to maximize the potential for capturing data of interest to this study. The decision to exclude papers based on a lack of sufficient information was made when it appeared that almost none of the key variables could be coded, apart from demographic data such as the locations involved in a case.
We identified 32 more papers using exhaustive forward and backward reference snowballing. We then performed another round of screening based on a more rigorous inspection of the full text of each paper. We also consulted previous reviews (see Table 4). During the review process, we identified several papers that reported on the same case study. For example, the papers by Hossain et al. [67] and Bannerman et al. [12] both report on the same four cases. We observed the same issue for some PhD dissertations, whereby one or more cases were also reported in a paper.
We completed the selection process with a final set of 88 selected sources reporting 119 cases (several papers report multiple case studies). Figure 2 summarizes the search process. Appendix A lists all included cases.

Step 3: Creating a Coding Scheme and Data Extraction
We created a coding scheme, implemented as a Microsoft Excel spreadsheet, to facilitate data extraction. The coding scheme includes variables corresponding to all of the hypotheses identified above, along with typical general information (e.g., paper title, year of publication). A coding scheme captures decision rules that describe how qualitative case descriptions can be systematically converted to quantified variables [91]. Bullock and Tubbs describe two general principles to guide the creation of a coding scheme: (1) "simpler is better" and (2) "explicit written documentation is necessary" [25, p. 189]. We briefly discuss these principles.
A key decision involves defining the number of levels or categories for each variable. The process of converting qualitative descriptions into quantified variables involves categorizing findings, organizing cases into buckets for a variable of interest. Consider the variable project success: a two-point scale would only allow us to capture "success" (1) or "no success" (0). A five-point scale would allow for considerably finer granularity: 5 = very successful, 4 = successful, 3 = neither successful nor failed, 2 = failed project, 1 = major failure. Unfortunately, it is frequently not possible to map qualitative descriptions reliably onto such a fine-grained classification. Even a five-point scale lends itself to false precision. Thus, we followed Bullock and Tubbs' recommendation to develop a simple coding scheme [25]. We kept the scales for anything that requires interpretation coarse (e.g., "success" is coded 0 for no, 1 for mixed, 2 for yes, or NA for Not Available) to maximize reliability. Factual information, such as the number of sites involved in the distributed project, was recorded in its natural format. We also attempted to classify each case using the archetype models identified by Sutherland et al. [157]; Table 6 presents the distribution.
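As a sketch of how such a coarse decision rule can be operationalized (the mapping mirrors the scale described above, but the function and label strings are illustrative, not the study's actual tooling):

```python
# Coarse coding of "success", as described above:
# 0 = no, 1 = mixed, 2 = yes, None = Not Available (NA).
SUCCESS_SCALE = {"no": 0, "mixed": 1, "yes": 2, "NA": None}

def code_success(judgment):
    """Map a coder's judgment to the coarse success scale.
    Unknown labels raise immediately so coding errors surface early."""
    if judgment not in SUCCESS_SCALE:
        raise ValueError(f"unknown judgment: {judgment!r}")
    return SUCCESS_SCALE[judgment]

print(code_success("mixed"))  # 1
```

Keeping the scale this coarse trades granularity for inter-rater reliability, which is the rationale given in the text.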
The second principle states that explicit written documentation is necessary. An initial coding scheme was created prior to the coding, and additional decision rules and notes were added during the process whenever two coders disagreed. Decision rules both document the subjective judgments necessary for the coding and guide future judgments. Appendix B includes the full coding scheme.
While most of our variables were extracted directly from the primary studies, quantifying cultural differences required some cross-referencing. First, we extracted the list of countries in which the sites were located (see Table 7). Then, we looked up each country in the Hofstede Insights Online Database and recorded its values for each cultural dimension (power distance, individualism, etc.). To estimate cultural distance for a given dimension, we subtracted the lowest value from the highest value, as exemplified in Table 8. This approach makes the simplifying assumption that all people involved in the various cases work in their native country.
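As a minimal sketch of this estimate, the cultural distance for one dimension is simply the range of the sites' national scores. The scores below are illustrative placeholders, not values from the Hofstede database:

```python
# Hypothetical power-distance scores for the countries hosting a case's sites
power_distance = {"Germany": 35, "India": 77, "USA": 40}

def cultural_distance(scores: dict, countries: list) -> int:
    """Highest minus lowest national score among the involved countries."""
    values = [scores[c] for c in countries]
    return max(values) - min(values)

print(cultural_distance(power_distance, ["Germany", "India", "USA"]))  # 42
```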

Step 4: Data Collection from Original Authors
Larsson [91] recommends having the original authors participate in the coding process for two reasons: (1) they can serve as an independent "third party rater for their own studies" [100]; and (2) they can provide additional data missing from their manuscripts. To collect information from the authors, we created a questionnaire (see Appendix C) and sent it to all the authors of each primary study. We created two versions of the questionnaire: one for papers that report a single case and one for papers that report multiple cases. For the latter, authors answered all of the questions for the first case, then again for the second case, and so on. The collected survey data are available (see Appendix D).
For the single-case papers, 102 authors were invited; 22 completed the questionnaire and 21 provided partial responses. For the multiple-case papers, 37 authors were invited; 3 completed the questionnaire and 6 provided partial responses. In summary, we received responses for 31 of 119 cases (26%) and 21 of 89 papers (24%).

Step 5: Data Extraction and Multi-Rater Coding
Inter-Rater Agreement (IRA) refers to the extent to which two or more independent analysts, assessing a property of an object, select the same point on a measurement scale. When IRA is high, we are more confident that the measurements are reliable (and the study is replicable). When IRA is low, we worry that the measurements are not reliable (and the study is not replicable), possibly because of underlying construct validity or measurement validity problems. Therefore, we had multiple analysts independently extract the data.
To improve measurement validity, we extracted data over four rounds, stopping when Percent Agreement (PA) was at least 80% for all items. (We explain below why we use percent agreement rather than an adjusted measure such as Krippendorff's α.)

Round 1: Author Questionnaire
The first coding round involved the authors of the primary studies, who provided information on 21 papers (see Step 4), and one of this paper's authors, who extracted data from the same set of 21 papers. We then assessed IRA using percent agreement (PA) and Krippendorff's α (see Table 9). IRA is plainly low here, which is unsurprising: we can only extract what is reported in the papers, while the primary study authors can fill in missing data. Following this logic, we resolved disagreements by simply accepting the data provided by the primary study authors.
As reported above, the response rate for the questionnaire was low. Some authors did not answer all the questions and some were not allowed to disclose information about their case studies. For situations like this, where the original authors cannot participate in the coding process, Beierle [16] suggests that two researchers independently code a set of cases and then compare results to check the level of agreement.
Round 2: Independent Coders
We selected 20 of the remaining papers at random (using Microsoft Excel's random number generator). Two authors independently extracted data from these papers. We then compared their results and calculated percent agreement and Krippendorff's α (see Table 9). As expected, the results were better than in Round 1, but still short of our 80% stopping rule.
The two coders resolved disagreements through a discussion workshop. Prior to this meeting, we discussed the potential for power imbalances within the team to manifest as bias during the reconciliation process, and the importance of genuine agreement rather than one coder deferring to the other. The reconciliation was an amicable process without any vociferous debate, so we do not think power imbalance played a significant role; however, these kinds of biases can be unconscious and difficult to quantify. Following Bullock and Tubbs' recommendations [25], as we resolved disagreements, we added decision rules to our coding scheme, both to document our coding process and to guide future, similar decisions. For example, we coded any case reporting multiple Scrum Masters coordinating regularly as following the Scrum-of-Scrums model. The decision rules are enumerated in Appendix D.

Rounds 3 and 4: Reaching Consensus
We randomly selected another 10 of the remaining papers; two of the authors independently extracted data from these papers, calculated IRA, resolved disagreements, and captured more decision rules. As we still had not reached our target agreement level, we repeated these steps one more time. This time, all items had at least 80% agreement (see Table 9), so we concluded that the coding scheme was reliable and moved on with just one author extracting data from the remaining 19 papers.
The Round 4 column of Table 9 illustrates why using adjusted measures of IRA such as Krippendorff's α is impractical. Krippendorff's α is popular for assessing IRA because, unlike other coefficients, it can be used with any number of observers, categories, measures, and values; it can handle missing or incomplete data; and it is valid for both large and small sample sizes [87]. The value of α ranges from −1, indicating total disagreement, to 1, indicating total agreement (i.e., perfectly reliable). Krippendorff [88] suggested that α > 0.8 indicates sufficient reliability for scientific conclusions, while 0.667 < α < 0.8 supports "tentative" conclusions.
However, some characteristics of our data (e.g., binary variables, high missingness, skewed distributions) make the usual α standard of 0.8 nearly impossible to achieve. For example, even when the coders agreed on the assessment of site visits and success in 90% of cases, α was still zero. This problem is known as the "Paradox of Kappa" [45] (because it affects Cohen's and Fleiss' Kappa as well as Krippendorff's α). Expecting coders to agree 100% of the time is unrealistic, so we used percent agreement to assess IRA and report α for comparative purposes.
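The paradox is easy to reproduce. The sketch below computes percent agreement and Krippendorff's α for two coders on nominal data, using a textbook construction with synthetic ratings rather than our actual data: when nine of ten units agree on the dominant category, agreement is 90%, yet α is exactly zero.

```python
from collections import Counter

def percent_agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def krippendorff_alpha_nominal(a, b):
    """Krippendorff's alpha for two coders, nominal data, no missing values."""
    n = 2 * len(a)                  # total pairable values
    freq = Counter(a) + Counter(b)  # pooled value frequencies
    # Observed disagreement: fraction of within-unit value pairs that differ
    d_o = 2 * sum(x != y for x, y in zip(a, b)) / n
    # Expected disagreement by chance, given the pooled frequencies
    d_e = sum(freq[c] * freq[k] for c in freq for k in freq if c != k) / (n * (n - 1))
    return 1 - d_o / d_e

coder_a = [0] * 10       # e.g., ten cases coded "no site visits reported"
coder_b = [0] * 9 + [1]  # disagrees on a single case

print(percent_agreement(coder_a, coder_b))           # 0.9
print(krippendorff_alpha_nominal(coder_a, coder_b))  # 0.0
```

Because the pooled frequencies are so skewed, chance disagreement is as low as observed disagreement, so α collapses to zero despite near-perfect agreement.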

Step 6: Hypothesis Testing
As is common in case surveys, our dataset is too sparse to support multivariate comparisons with list-wise deletion (i.e., none of the rows were complete, so regression analysis would exclude all of the rows). While techniques exist to impute missing data, these become less reliable as datasets become more sparse. We therefore used Spearman's ρ correlation to test our hypotheses individually. The main risk of this approach is false positives due to the large number of comparisons; however, as we report below, this can be mitigated through careful interpretation. The analyses were performed in the open source statistical package JASP [76]; the JASP file is available (see Appendix D).

RESULTS
Table 10 shows the correlations between each type of issue and overall success. Recall that, since we consider the presence of issues rather than the quality of communication, coordination, and so on, we expect negative correlations. This section discusses each issue type individually.
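For readers who want to reproduce this kind of analysis outside JASP, Spearman's ρ is simply the Pearson correlation of rank-transformed data. The sketch below uses made-up issue and success codes for illustration, not our dataset:

```python
def rankdata(xs):
    """Assign average ranks, handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed data."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

issues  = [0, 2, 1, 2, 0, 1, 2, 0]  # 0 = none, 1 = some, 2 = considerable
success = [2, 0, 1, 0, 2, 2, 0, 1]  # 0 = no, 1 = mixed, 2 = yes

print(round(spearman_rho(issues, success), 3))  # -0.833
```

Because it operates on ranks, the coefficient requires neither normality nor interval-scaled data, which is what makes it suitable for coarse ordinal codes like ours.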

Communication
Hypothesis 1A, that success is inversely proportional to communication issues, is supported (ρ = −0.710, p < .001). However, none of the correlations between communication issues (i.e., the presence of communication problems) and the hypothesized practices (e.g., synchronized stand-ups) or properties of the distributed setting (e.g., the number of sites) are statistically significant (see Table 11). That said, the p-value for Hypothesis 2A is .057, which is close to the predetermined alpha level of .05, and most cases reporting this variable involved only 2-4 sites (see Figure 3), so this relationship may be worth investigating further. We also investigated the relationship between the Distributed Scrum model (see Section 2) and the presence of communication issues. Figure 4 presents a descriptive plot, which suggests that communication issues are most prevalent under the Isolated model and least prevalent under the Scrum-of-Scrums model. A post hoc Kruskal-Wallis non-parametric test indicates a significant difference (χ² = 11.400, df = 2, p = .003). Dunn's post hoc comparison test indicates significant differences between the Isolated and Scrum-of-Scrums models (using the Holm correction to adjust p: p = .010) and between the Scrum-of-Scrums and Integrated models (p = .019).
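This omnibus comparison can be reproduced with SciPy's Kruskal-Wallis implementation; the three groups below are fabricated communication-issue scores for illustration, not our coded data:

```python
from scipy.stats import kruskal

# Hypothetical 0-2 communication-issue scores per Distributed Scrum model
isolated        = [2, 2, 1, 2, 2]
scrum_of_scrums = [0, 0, 1, 0, 0]
integrated      = [1, 2, 1, 1, 2]

h, p = kruskal(isolated, scrum_of_scrums, integrated)
print(f"H = {h:.3f}, p = {p:.4f}")
```

A significant H only says that at least one group differs; a post hoc procedure such as Dunn's test (with a Holm correction) is still needed to locate the pairwise differences.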

Coordination
Hypothesis 1B, that success is negatively related to coordination issues, is supported (ρ = −0.507, p < .001). However, Hypotheses 3A-J are not supported (Table 12): none of the correlations between coordination issues (i.e., the presence of coordination problems in a project) and the various hypothesized practices (e.g., synchronized Sprint Reviews) or properties (e.g., number of sites) that could affect coordination issues are statistically significant. This suggests that synchronizing practices across sites neither harms nor helps to address coordination issues.
To further inspect coordination issues by Distributed Scrum model (see Section 2), Figure 5 shows the average score on "coordination issues" per model. While the figure suggests considerable differences between the Isolated model and the Scrum-of-Scrums model, a post hoc Kruskal-Wallis non-parametric test indicates that this difference is not statistically significant (χ² = 1.323, df = 2, p = .52).

Control
Hypothesis 1C, that success is negatively related to control issues, is supported (ρ = −0.616, p < .001). However, Hypotheses 4A and 4B are not supported: neither total number of staff nor number of sites has a significant correlation with control issues (see Table 13). Similarly, a post hoc Kruskal-Wallis non-parametric test indicated no significant differences in control issues across the three Distributed Scrum models (χ² = 4.609, df = 2, p = .10; see Figure 6).

Culture
Hypothesis 1D, that success is negatively related to cultural issues, is not supported (ρ = −0.304, p = .067). Further, Hypotheses 5A-F, which link Hofstede's six culture dimensions to cultural issues, are not supported: none of the correlations between cultural issues and Hofstede's culture dimensions (e.g., indulgence) are statistically significant (see Table 14).
Figure 7 shows the descriptive plot for cultural issues for each of the three Distributed Scrum models. A post hoc Kruskal-Wallis non-parametric test indicates no significant differences (χ² = 2.989, df = 2, p = .224).

Trust
Hypothesis 1E, that success is negatively related to trust issues, is supported (ρ = −0.626, p < .001). However, Hypotheses 6A-F are not supported (see Table 15). There was not enough data to test Hypothesis 6E, which links the number of site visits to the presence of trust issues. It is worth noting that, while no significance was observed for H6D (shared definition of done), the sample size (n = 8) is small and the p-value fell between .05 and .10; the possibility that a shared definition of done promotes trust may therefore warrant further investigation. Figure 8 shows average trust issues for each of the three Distributed Scrum models (see Section 2). A post hoc Kruskal-Wallis non-parametric test indicated no significant differences (χ² = 3.209, df = 2, p = .201).

Post Hoc Tests
Above, we propose a two-level model in which various practices and variables affect success through one or more issue categories. We formulated the model this way to tease out paradoxical effects (e.g., synchronized meetings could have been directly related to coordination issues while inversely related to communication issues). However, some Distributed Scrum practices might affect success directly or through another, unknown mediator. To examine this possibility, we computed correlations between all of the exogenous variables and success. No new significant correlations were evident. Further details of this analysis are available in the supplementary materials.

DISCUSSION
We hypothesized that success with Distributed Scrum would be inversely related to problems with communication, coordination, control, culture, and trust, and that myriad variables, including the use of Distributed Scrum's various practices, would be inversely related to one or more of these problems. The first half of our model was mostly supported; the second half mostly was not. At least three explanations can account for this pattern: (1) our analysis is flawed; (2) the case study reports we reviewed are flawed; or (3) Distributed Scrum is not a "secret sauce" [154] that ensures project success. We discuss each of these in turn.

Limitations of This Study
We carefully followed established guidelines for case surveys and made available a detailed replication package so that researchers can independently audit or recreate our work (see Appendix D). We also studied previous case surveys in other domains [16,91]. Nevertheless, this study is limited in several ways. Case meta-analysis is a positivist approach based on null hypothesis testing and is therefore subject to typical positivist quality criteria. Our use of multiple coders, rounds of analysis, dispute-resolution meetings, decision rules, and analysis of inter-rater agreement are common practices for improving reliability, but they cannot guarantee perfect coding.
The simplicity of our analysis supports high conclusion validity. Spearman correlation and the Kruskal-Wallis test make minimal assumptions (e.g., independent observations) and do not assume normality. The main problem with this approach is that performing many statistical comparisons could inflate positive results, which we mitigate through conservative interpretation. Furthermore, this kind of study cannot demonstrate that a Scrum practice causes fewer issues, or causes projects to succeed, so internal validity is either low or inapplicable, depending on one's perspective. However, it can demonstrate that Distributed Scrum practices do not cause success, because causality requires correlation.
We endeavored to identify all relevant cases within the stated publication time frame and can therefore make a breadth argument for generalizability; however, primary study authors do not select sites at random, so we cannot make a randomness argument for generalizability [11]. We do not know how projects that become case studies differ from projects that do not. One thing that is obvious from Table 7, though, is that Africa, the Middle East, South America, and most of Asia are under-represented in the primary studies compared to Europe, North America, and India.
Furthermore, since our initial keyword searches were performed on the Scopus database, articles not indexed by Scopus are less likely to be included. We partially mitigated this threat using reference snowballing. Moreover, our search string and selection criteria may have led to omitting studies in which some Distributed Scrum practices were used in non-Scrum, non-agile projects.
Regarding construct validity, we note the following potential issues:
- Since no definitive, universally accepted account of Distributed Scrum exists, different researchers could define it differently and therefore come to different conclusions.
- We estimate cultural differences by comparing national cultures, not the cultural backgrounds of individual team members. Being located in Zurich does not mean every team member is Swiss, and even if they were, individuals can differ from national averages in power distance, indulgence, and so on. Moreover, we relied on Hofstede's culture dimensions, which are controversial [145].
- The various issue variables do not measure the degree of each issue at the site, because case meta-analysis does not support such measures. Rather, we indicate whether the case report mentions or implies each class of issues.
- We do not measure success; we simply record whether primary study authors state or imply that the projects were successful, unsuccessful, or mixed.
- Our staff dispersion metric does not take into account the number of sites and has not been independently validated.
- Data collected directly from primary study authors is subject to the limitations of human memory.

Limitations of the Primary Studies
Furthermore, systematic problems with the primary studies could bias our results. We did not attempt to exclude low-quality studies for two reasons: (1) No objective or widely accepted basis for measuring case study quality exists; there are attempts (e.g., Ralph et al. [125]), but they are not widely accepted. (2) Case meta-analysis synthesizes basic facts rather than nuanced interpretations. A highly rigorous case study is not necessary to accurately report most of the data we analyze (e.g., the team had sites in Vancouver, Jakarta, and Pretoria; they experienced communication problems). That is why case surveys include experience reports [25].
Anecdotally, the quality of the papers appears mixed: many primary studies appear quite rigorous, others less so. Many primary studies do not provide rich descriptions of their sites and the development practices they use. For example, 21 of 119 cases did not explain how stand-up meetings were conducted (i.e., isolated, distributed, or Scrum-of-Scrums), and only a quarter indicated how many people work at each site. It is not enough to state that a team "uses Scrum," because that means different things to different people in different contexts. Case studies about software development methods and project management frameworks need vivid, detailed descriptions of developers' ways of working to facilitate transferability and secondary synthesis.

Is Distributed Scrum Effective?
Notwithstanding the limitations discussed above, do our results justify the conclusion that Distributed Scrum has no effect on overall project success? Two issues warrant discussion.
First, this is a correlational study. Identifying an inverse correlation between synchronized sprints and coordination issues, for example, is not sufficient to demonstrate that synchronizing sprints causes better coordination. However, the absence of a correlation does imply the absence of a causal effect, because causation requires correlation. Therefore, our results do warrant the conclusion that, with the possible exception of the Scrum-of-Scrums model, the practices that constitute Distributed Scrum are not antecedents of project success and do not ameliorate communication, coordination, control, cultural, or trust issues.
Second, does "most practices associated with Distributed Scrum are ineffective" imply that "Distributed Scrum is ineffective"? Because different teams adopt different combinations of Distributed Scrum practices, the cases cumulatively act like a within-subjects quasi-experiment. This allows us to model, quantitatively, the effectiveness of each practice comprising Distributed Scrum, such that the teams that did not adopt some practice P act as a control group for the teams that did adopt P. Therefore, showing that almost none of its practices are correlated with success is strong evidence that Distributed Scrum is ineffective, because the only alternative explanation, that Distributed Scrum works independently of the practices constituting it, is not credible.
In sum, we find no evidence that Distributed Scrum enhances project success.

Implications for Practice
Our findings have three main implications for practitioners:
(1) Issues with communication, coordination, control, and trust are inversely related to success and therefore should not be ignored. Project failure happens after issues have arisen; that is, project failure did not cause these issues, so it is more likely that these issues cause failures, unless some third variable accounts for the correlation. Unfortunately, Distributed Scrum seems ineffective for addressing these issues.
(2) The Scrum-of-Scrums model, in which each site has its own meetings and then a representative from each site meets with the others to synchronize and coordinate, is associated with better communication. Our findings support recommending this approach.
(3) Unfounded claims that evangelize Distributed Scrum as a "secret sauce" should be treated with great caution. It is incumbent on those making a claim, such as "Distributed Scrum [is] the secret sauce for hyperproductive offshored development teams" [153], to provide evidence supporting that claim, and this study suggests that Distributed Scrum is not a secret sauce. The balance of evidence does not support the effectiveness of Distributed Scrum.
This last point bears elaboration. The commercial complex of agile consultants, coaches, educators, and content creators has clear business interests in the widespread adoption of agile methods. Agility is a team's ability to respond and adapt quickly [32]. Proponents of agile methods did not create a scale to measure team agility and run a series of studies showing that adopting their various practices increases agility. Instead, they labeled their lightweight approaches to software development "agile" and subtly redefined "agility" as the degree to which teams adopt practices that are intended to increase agility, leading scholars to bemoan the problem of "doing Scrum rather than being agile" [173]. Wufka and Ralph noted that "labeling as agile every method, practice and value that is intended to increase agility undermines the responsibility to empirically evaluate whether these things actually cause nimbleness and responsiveness to increase" [170].
Meanwhile, some organizations are making a great deal of money from training and certifying agile and Scrum professionals.
Our comments above should not be misinterpreted as a wholesale critique of agile methods. We do think agile methods can help organizations if used appropriately, which may involve tailoring to the context at hand. Rather, we observe a "cargo cult" approach to agile method adoption: adopting Scrum practices, such as the daily stand-up, without appreciating their purpose or respecting their underlying principles only causes frustration and misplaced criticism of agile methods, without any genuine attempt to become agile.

Implications for Researchers and Avenues for Future Research
We are not surprised that the various practices comprising Distributed Scrum have little effect, but we are surprised by the seeming irrelevance of number of staff, number of sites, staff dispersion, and temporal differences. It seems intuitively obvious that larger projects spread across more sites in more time zones would be more difficult to manage. Perhaps, however, the importance of these variables has been overstated. This raises a critical question for future research: if neither the practices popularized by Scrum nor the dispersion of staff across time and space drives success, what does? Indeed, we could ask the same question of any development approach, for example, the more recent trend of continuous software engineering [48]. When software engineering research investigates success, it tends to adopt an engineering perspective; that is, developing ostensibly helpful technical artifacts (e.g., programming languages) or socio-technical practices (e.g., pair programming, test-driven development). It assumes that success is predominantly driven by technical and socio-technical factors. But there are other perspectives, and embracing them might produce better explanations.
For example, an economic historian might expect success to be driven by the balance of power or resources among competitors (i.e., the competitor with the most resources usually wins). From an entrepreneurial perspective, we might expect the most innovative organization, or the one with the best leadership (cf. Gren and Ralph [53]), to win. From a management science perspective, we might examine the ratio of resources required to resources available and expect the best-resourced projects to succeed. From a design perspective, we might expect the process of determining what features to build, which is essentially ignored by Scrum, to determine the success of the product. A psychologist might expect the most cohesive team to succeed. We do not have the answer; rather, we suggest approaching the issue from a great diversity of perspectives to develop a better theory of success in distributed software projects.
Regardless of the chosen perspective, more research is clearly needed to generate a testable and comprehensive theory of software engineering success. In particular, research should evaluate how proposed artifacts, tools, techniques, and so on affect overall project success, rather than being content with mediating variables (e.g., fault density, design quality, requirements traceability) when we do not really know how important these mediating variables are for what we really care about: the overall success of the project or product.

CONCLUSION
In summary, we conducted a case meta-analysis, a quantitative method of testing hypotheses on data extracted from qualitative case studies, to examine the antecedents of success in software projects that use Distributed Scrum. We found:
- evidence that the presence of issues with communication, coordination, control, and trust is inversely proportional to project success;
- no evidence that any specific practices associated with Distributed Scrum are effective in reducing these issues, except that the Scrum-of-Scrums model is associated with better communication;
- no evidence that differences between the national cultures of sites are associated with cultural issues, or that cultural issues affect success;
- no evidence that the number of staff, the number of sites, or the distribution of staff across sites has any significant effect.
Our results suggest that, on balance, Distributed Scrum neither affects project success nor mitigates communication, coordination, control, trust, or cultural problems. Without any empirical evidence that Distributed Scrum enhances success, the continued evangelizing of Distributed Scrum is rooted in dogma, not science. Scrum proponents may counter that Scrum helps some other important dependent variables, or criticize our study in some way; however, it is their job to provide evidence to support their claims of effectiveness, and they have not done so.
However, our findings should not be misconstrued as a wholesale rejection of agile methods or, worse, an endorsement of pre-agile (e.g., plan-driven) methods. Plan-driven approaches may very well negatively affect success. What we wish to repudiate is the tendency to take action based on the evidence-free proclamations of agile advocates and consultants; what we wish to endorse is evidence-based practice [83].
Some practices associated with agile methods may deliver substantial benefits when used appropriately (which may involve tailoring to the context at hand). Rather than thoughtfully adopting practices supported by evidence, we often observe a cargo-cult approach: teams adopt Scrum practices such as the daily stand-up without appreciating their purposes or underlying principles, leading to developer frustration and misplaced critique that the practices "don't work." Pair programming, for example, has limited uptake despite a substantial base of supporting evidence (cf. Zieris [172]), while daily stand-ups are ubiquitous despite no clear evidence that their benefits outweigh their drawbacks. Other teams adopt a performative approach; for instance, "estimating" stories after they have been completed because some external manager or client demands estimates.
Looking forward, more research is clearly needed to identify the main drivers of success in distributed software development, and current trends toward working from home, which disperse developers even further, make this issue all the more pressing.

Success
Successful. Note that a project can still be a success despite having all sorts of issues; this depends on the coder's judgment of the severity of the issues that were reported in the case. Further, a project can evolve to become successful after, for example, changes to how the project is run.

Communication issues
No issues. Some issues. Considerable issues.

C AUTHOR SURVEY
The following survey was sent to authors of the cases that we identified, inviting them to answer the following questions:

D DATA AVAILABILITY
We have shared a JASP file that contains both the data and the analyses. JASP is an open source statistical package [76]. The data file is available on an Open Science Framework (OSF) page: https://osf.io/ux9s5/?view_only=82636ca82fb74433afec5157fd7d54cd. The responses to the author survey are also available for download from the same OSF page.

Table 3 .
Distributed Scrum Events

Table 4 .
Overview of Previous Reviews on Agile Methods in Distributed Software Development

Table 5 .
Hofstede's Cultural Dimensions

Table 6 .
Frequency of Distributed Scrum Models Observed

Table 7 .
Site Locations

Table 11 .
Predictors of Communication Issues

Table 12 .
Predictors of Coordination Issues

Table 13 .
Predictors of Control Issues

Table 14 .
Predictors of Cultural Issues

Table 15 .
Predictors of Trust Issues (* insufficient data)

Sprint Length
Q1: What Scrum model would you use to describe how the teams were working?
Isolated Scrums - Teams are isolated across geographies
Distributed Scrum of Scrums - Scrum teams are isolated across geographies and integrated by a Scrum of Scrums that meets regularly across geographies
Totally Integrated Scrums - Scrum teams are cross-functional with members distributed across geographies

What was the total number of site visits? (That is, times that participants visited an office other than their main office during the period of the case study? Approximations OK.)

Across how many offices were the studied teams distributed?
Q12: What was the total number of staff on all locations?
Q13: How many people work at the location with the largest number of people?
Q14: How many people work at the location with the smallest number of people?
Q15:

ACM Computing Surveys, Vol. 56, No. 4, Article 100. Publication date: November 2023.