Outcome-based Evaluation of Systematic Review Automation

Current methods of evaluating search strategies and automated citation screening for systematic literature reviews typically rely on counting the number of relevant and non-relevant publications. This established practice, however, does not accurately reflect the reality of conducting a systematic review, because not all included publications have the same influence on the final outcome of the systematic review. More specifically, if an important publication gets wrongly excluded or included, this might significantly change the overall review outcome, whereas wrongly excluding or including less influential studies may only have a limited impact. However, in terms of evaluation measures, all inclusion and exclusion decisions are treated equally and, therefore, failing to retrieve publications with little to no impact on the review outcome leads to the same decrease in recall as failing to retrieve crucial publications. We propose a new evaluation framework that takes into account the impact of the reported study on the overall systematic review outcome. We demonstrate the framework by extracting review meta-analysis data and estimating outcome effects using predictions from ranking runs on systematic reviews of interventions from the CLEF TAR 2019 shared task. We further measure how close the outcomes obtained from arbitrary rankings are to the outcomes of the original review. We evaluate 74 runs using the proposed framework and compare the results with those obtained using standard IR measures. We find that accounting for the difference in review outcomes leads to a different assessment of the quality of a system than if traditional evaluation measures were used. Our analysis provides new insights into the evaluation of retrieval results in the context of systematic review automation, emphasising the importance of assessing the usefulness of each document beyond binary relevance.


INTRODUCTION
A systematic literature review is a well-established and rigorous methodology for synthesising and evaluating the evidence on a specific research question, which is particularly important in the field of medicine [13]. However, it is also gaining importance in other areas such as social sciences and engineering [2,3,17,28]. The process involves a systematic search, critical appraisal, and synthesis of the available literature on a topic. During the critical appraisal step, every included publication has its weight and effect calculated based on the outcomes reported by that publication. This information influences the final outcome of the review.
One of the essential steps in conducting a systematic review is the process of citation screening, in which a large number of publications are initially identified through a literature search and then screened to determine those relevant to the review [1,33]. This process can be time-consuming and labour-intensive, involving thousands of eligibility decisions. Given the importance of citation screening in systematic literature reviews, there have been numerous attempts to automate the process [27]. Previous studies have investigated the use of automated citation screening methods for systematic literature reviews by utilising various natural language processing (NLP), machine learning (ML), and information retrieval (IR) methods to retrieve, rank, or classify references [11, 16, 19, 27, 29-31, 34, 38-40].
To understand the effectiveness of automated citation screening methods, practitioners have relied on metrics based on the notions of recall, precision and cost, and on a binary assessment of relevance [19,27,34]. This practice assigns the same importance to every publication to be included in the review. So, for example, if method $M_1$ identifies as potentially relevant publications $\{A, B, C\}$ while method $M_2$ identifies publications $\{A, B, D\}$, and the ground truth is that the relevant publications are $\{A, B, C, D\}$, then $M_1$ and $M_2$ achieve the same recall, precision and cost. However, we argue that the two sets $\{A, B, C\}$ and $\{A, B, D\}$ may not be equally important, and thus identifying either of $C$ or $D$ may not be equivalent if the outcomes of the review were considered. In fact, if excluded, some publications can significantly change a review's conclusion to the extent that the conclusion might be the opposite (e.g., from favouring a drug to favouring a placebo) [25,26]. On the other hand, not including other publications might have only a small quantitative impact on the outcomes of the review.
We argue that a holistic evaluation of retrieval and automated citation screening methods for systematic review creation should not only consider the concepts of recall, precision and cost, but also the quality of the outcomes generated from the analysis of the automatically included publications. Following this direction, we propose a new evaluation framework that considers inclusion and exclusion information and meta-analysis data from reviews created by Cochrane, the largest organisation responsible for creating systematic literature reviews in medicine, to estimate outcomes and weights of included publications. This information can be used to assess the quality of ranking and classification methods. This framework allows for assessing automatic approaches from the angle of how close their outcomes (not just their set of included publications) are to the outcomes of the original review. By comparing the outcomes of the automated model to those of the original review, we can gain a better understanding of the quality of the automated approach and its effect on the final outcome of the review.
We propose five aspects of analysis focusing on different features of review outcomes. We run initial experiments on the CLEF TAR 2019 dataset [16]. Our simulation results show that after randomly removing one publication per review (average publication recall of 92%), 95% of outcomes remain unchanged. However, after removing five publications (average recall of 63%), 76% of the outcomes are still the same, showing that the relationship between recall and achieved outcomes is not linear. We also show that the outcome-based evaluation emphasises different aspects of the models' performance than the traditional IR evaluation measures. We finally propose multi-objective optimisation to handle the problem of non-estimable outcomes.
We believe this new evaluation approach will provide a better understanding of the impact of automatic literature screening methods on the outcome of systematic literature reviews and help identify areas in which these methods can be improved.

RELATED WORK
The effectiveness of automatic approaches for search strategy creation and systematic review screening has been traditionally evaluated using binary relevance ratings [16,27,34], often sourced at the title and abstract screening level, rather than at the full-text level.
When the screening problem is treated as a ranking task (e.g., for the sub-task of screening prioritisation or stopping prediction), rank-based metrics and metrics at a fixed cut-off are commonly used, as is last relevant found [12,32]. Cost-based and economic-based metrics are also used, especially in the context of the query formulation task in the CLEF TAR shared task [14][15][16], e.g., total cost (TC) or total cost with a weighted penalty (TCW). The TREC Total Recall track [9] also used a cut-off based metric, recall@$aR+b$, which is defined as the recall achieved when $aR + b$ documents have been identified, where $R$ is the number of relevant documents in the collection and $a$ and $b$ are parameters. When $a = 1$ and $b = 0$, recall@$aR+b$ is equivalent to R-precision. In the patent domain, the PRES score has been proposed, which takes into account achieved recall and the user's search effort [23].
When the screening problem is treated as a classification task, metrics based on the confusion matrix and the notion of Precision and Recall are commonly used [27,34]: aside from Precision and Recall, metrics include variations of the harmonic mean between the two, i.e., the $F_\beta$-score, utility, U19 [35][36][37], sensitivity-maximising thresholds [6], and AUC [4]. Another metric, Work Saved over Sampling (WSS), measures the amount of work saved when using machine learning models to screen out irrelevant publications [5,18,19,24]. The True Negative Rate (TNR) and nreTNR (normalised rectified TNR) were proposed as alternatives, as they address some of the limitations of WSS regarding averaging scores from multiple datasets [20,21].
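As a point of reference for the comparisons made later, WSS at a target recall level can be sketched for a ranked list as follows. This is a minimal illustration that assumes the commonly cited definition, WSS@r = (N − k)/N − (1 − r), where k is the number of documents screened before recall r is reached; the function name and the rounding of the required number of relevant documents are assumptions of this sketch, not something specified in this paper.

```python
import math


def wss_at_recall(ranked_labels, target_recall=0.95):
    """Work Saved over Sampling for one ranked list.

    ranked_labels: 1 (relevant) or 0 (irrelevant), in ranked order.
    Assumed definition: (N - k) / N - (1 - r), with k the number of
    documents screened until the target recall r is first reached.
    """
    n = len(ranked_labels)
    total_relevant = sum(ranked_labels)
    if n == 0 or total_relevant == 0:
        raise ValueError("need a non-empty ranking with at least one relevant item")
    # Small epsilon guards against floating-point noise before rounding up.
    needed = math.ceil(target_recall * total_relevant - 1e-9)
    found = 0
    screened = 0
    for screened, label in enumerate(ranked_labels, start=1):
        found += label
        if found >= needed:
            break
    return (n - screened) / n - (1.0 - target_recall)


example_ranking = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
wss_100 = wss_at_recall(example_ranking, target_recall=1.0)
wss_95 = wss_at_recall(example_ranking, target_recall=0.95)
```

Note that every relevant document contributes equally to reaching the target recall here, which is exactly the property the outcome-based view questions.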
Nussbaumer-Streit et al. [26] compared repeated literature searches using 14 abbreviated approaches (combinations of various databases with and without searches of reference lists) on a sample of 60 Cochrane systematic reviews of clinical interventions. They recalculated the main summary-of-findings table of each Cochrane review and asked original review authors whether the conclusions changed compared to the original review. They found that in only 2% of cases (95% CI: 0%-9%) did combining one database with another or with searches of reference lists falsely reach an opposite conclusion compared to comprehensive searches. This outcome shows that identifying all relevant studies is not always crucial for obtaining the same review results.
Automated citation screening has become increasingly popular in systematic literature reviews due to its potential to reduce the time and cost required. However, current approaches to evaluating these methods are limited to binary relevance assessment, where each publication is considered either relevant or irrelevant, and do not account for the impact of each publication on the review outcome. This is a vital issue, as the assumption that all relevant publications are equally important to the final outcome of the systematic review is not necessarily valid. Without an accurate assessment of the importance of each document, the conclusions of a systematic review may be biased or incomplete. To address this issue, in this paper, we propose a novel methodology for assessing citation screening based on evaluating outcome differences, which enables us to determine the impact of each publication on the systematic review.

EVALUATION FRAMEWORK
This paper proposes a new evaluation framework for automated citation screening. Our framework includes three steps, which are detailed in the following subsections. The first step is data extraction, where we extract statistics of the studies included in the review and match studies to publications. The second step is model evaluation, where we use the extracted data to estimate the review's outcomes for rankings or classifications of the citation list. The third step is the analysis of the results, where we compare the outcomes obtained from the alternative rankings to the outcomes of the original review. Our proposed framework allows for a more nuanced evaluation of automated citation screening methods. By considering the impact of each publication on the review's outcomes, we can identify which publications are most important to retrieve and prioritise them accordingly. Next we describe each step in detail.

Data Extraction
Cochrane systematic reviews distinguish between study and publication. A study is a distinct piece of research conducted to answer a specific research question or investigate a particular hypothesis. It typically involves a group of participants, data collection methods, and specific objectives. Publications, on the other hand, are the atomic units which reviewers screen. Each study can be reported by several publications, such as journal articles, conference proceedings, or research reports. Each publication may present different aspects or findings of the same study, but they are all derived from the same underlying research. We assume that a study has been found if at least one publication reporting it was successfully retrieved.
For every review, based on its Cochrane review ID, we identify its corresponding RevMan file and list of included publications. A RevMan file is the format used by Cochrane containing all statistical data about studies and outcomes included in the review. Outcomes of Cochrane reviews are reported in the following hierarchy: one comparison can have several outcomes, and one outcome can consist of a few subgroups. We extract all metadata from the RevMan files, such as the comparisons, outcomes and subgroups and the results of every included study. Note that the use of RevMan files is for experimental convenience, but is not a requirement of the framework: the required data could be provided in other formats. Furthermore, Cochrane recently announced that future systematic literature reviews would contain statistical data in the more common csv and ris formats.
Cochrane reports a list of included publications and the studies to which they correspond. Traditionally, retrieval was conducted at the level of publications [14][15][16]. In order to be able to re-use previous relevance judgments, we need to assign PubMed IDs to these publications. Our process for matching PubMed IDs to publications is based on four steps in the following order:
• We check if the PubMed ID information is provided on the Cochrane references webpage.
• We conduct a search in PubMed using Entrez by searching for the same title and authors.
• We search for the PubMed ID in SemanticScholar using the publication DOI from the Cochrane references webpage.
• We search again in PubMed, this time with a relaxed requirement by searching for an exact match in the title only.
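The ordered fallback described above can be sketched as a simple cascade that stops at the first step returning an identifier. The lookups below are hypothetical stand-ins backed by in-memory dictionaries; they are not the actual Cochrane, Entrez or SemanticScholar calls, and the field names (`pmid`, `title`, `authors`, `doi`) are assumptions made only for illustration.

```python
from typing import Callable, Iterable, Optional

Lookup = Callable[[dict], Optional[str]]


def match_pubmed_id(reference: dict, lookups: Iterable[Lookup]) -> Optional[str]:
    """Try each lookup in order and return the first PubMed ID found."""
    for lookup in lookups:
        pmid = lookup(reference)
        if pmid:
            return pmid
    return None  # the publication stays unmatched


def build_lookups(title_author_index: dict, doi_index: dict, title_index: dict):
    """Hypothetical stand-ins for the four matching steps, in order."""
    return [
        # Step 1: ID already listed with the reference.
        lambda ref: ref.get("pmid"),
        # Step 2: strict search on title and authors.
        lambda ref: title_author_index.get(
            (ref.get("title"), tuple(ref.get("authors", ())))
        ),
        # Step 3: lookup by DOI.
        lambda ref: doi_index.get(ref.get("doi")),
        # Step 4: relaxed search on exact title only.
        lambda ref: title_index.get(ref.get("title")),
    ]


lookups = build_lookups(
    title_author_index={("Trial X", ("Smith", "Lee")): "111"},
    doi_index={"10.1000/xyz": "222"},
    title_index={"Trial Y": "333"},
)
matched = match_pubmed_id({"title": "Trial Z", "doi": "10.1000/xyz"}, lookups)
```

Ordering the steps from most to least reliable means a relaxed title-only match is used only when the stricter sources fail.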

Model Evaluation
When conducting a meta-analysis, for every outcome, each study has its weight and effect size calculated first (respectively columns 6 and 7 on example forest plots in Figure 1). Effect size is an essential statistical concept in the analysis of research data [10]. It is a measure that quantifies the magnitude of difference between two groups in a study. Researchers use a variety of effect measures to compare outcome data between two intervention groups, including odds ratios and mean differences. For instance, in ratio effect measures, a value of 1 represents no difference between the groups [7,8]. On the other hand, in difference measures, a value of 0 represents no difference between the groups. Values higher or lower than these "null" values may indicate either benefit or harm of an experimental intervention, depending on the order of the interventions in the comparison and the nature of the outcome. Every estimate is expressed with a measure of uncertainty, such as a confidence interval (CI) or standard error (SE).
Effects depend on the number of events reported by that study, whereas weights assigned to each study are influenced by other studies included in this outcome. So when removing one study from the meta-analysis, only the weights of the remaining studies will change, but their effect sizes will stay the same (compare Figures 1a and 1c). There are several types of outcomes reported by Cochrane; in our study, we focus on the dichotomous and continuous outcomes only and calculate them following the approach by Deeks and Higgins [8].
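The interplay between per-study effects and weights can be illustrated with a small sketch. It assumes a dichotomous outcome summarised as a risk ratio and pooled with fixed-effect inverse-variance weighting on the log scale; this is a simplification chosen for brevity and not necessarily the exact method used for every outcome in this paper. The study counts are made up, and the sketch assumes no zero cells.

```python
import math


def log_risk_ratio(events_exp, n_exp, events_ctrl, n_ctrl):
    """Return (log risk ratio, variance of the log risk ratio) for one study."""
    log_rr = math.log((events_exp / n_exp) / (events_ctrl / n_ctrl))
    variance = 1 / events_exp - 1 / n_exp + 1 / events_ctrl - 1 / n_ctrl
    return log_rr, variance


def pool_fixed_effect(studies):
    """studies: {study_id: (events_exp, n_exp, events_ctrl, n_ctrl)}.

    Returns (pooled risk ratio, per-study weights, per-study risk ratios).
    """
    stats = {sid: log_risk_ratio(*counts) for sid, counts in studies.items()}
    inverse_variance = {sid: 1 / var for sid, (_, var) in stats.items()}
    total = sum(inverse_variance.values())
    # Weights depend on all studies in the outcome, effects only on one study.
    weights = {sid: iv / total for sid, iv in inverse_variance.items()}
    pooled_log_rr = sum(weights[sid] * stats[sid][0] for sid in stats)
    effects = {sid: math.exp(log_rr) for sid, (log_rr, _) in stats.items()}
    return math.exp(pooled_log_rr), weights, effects


# Illustrative, made-up counts for three studies.
studies = {
    "A": (10, 100, 20, 100),
    "B": (5, 50, 10, 50),
    "C": (30, 200, 30, 200),
}
pooled_full, full_weights, full_effects = pool_fixed_effect(studies)
without_c = {sid: counts for sid, counts in studies.items() if sid != "C"}
pooled_reduced, reduced_weights, reduced_effects = pool_fixed_effect(without_c)
```

Dropping study C leaves the effect sizes of A and B untouched but redistributes the weights, which in turn shifts the pooled estimate.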
Our framework takes arbitrary ranking or classification runs and calculates the final outcomes of the review based on publications included in the run. When evaluating a classification run or a search result, we take all publications predicted as relevant. When evaluating ranking runs, we need to assume a cut-off point. Previous studies working on systematic review automation used a cut-off either at r% of recall [5,20] or at d% of the total dataset size [14,15].
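A minimal sketch of how a ranking run might be turned into a set of found studies at a cut-off of d% of the dataset size is shown below. The function name, the rounding up of the cut-off position, and the toy identifiers are assumptions of this sketch; the rule that a study counts as found once any of its publications is retrieved follows the assumption stated in the data extraction step.

```python
import math


def studies_found(ranking, pub_to_study, cutoff_fraction):
    """Return the studies covered by the top part of a ranking.

    ranking: publication IDs in ranked order (all candidates of one review).
    pub_to_study: maps included publication IDs to their study ID.
    cutoff_fraction: e.g. 0.2 for a cut-off at 20% of the dataset size.
    """
    k = math.ceil(cutoff_fraction * len(ranking))  # rounding up is an assumption
    top = ranking[:k]
    # A study is found if at least one publication reporting it is retrieved.
    return {pub_to_study[pub] for pub in top if pub in pub_to_study}


ranking = ["p1", "p9", "p2", "p8", "p3", "p7"]
pub_to_study = {"p1": "S1", "p2": "S1", "p3": "S2"}
found_at_half = studies_found(ranking, pub_to_study, 0.5)
found_at_full = studies_found(ranking, pub_to_study, 1.0)
```

The outcomes of the review would then be recalculated from the found studies only.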

Results Analysis
We analyse the results by examining the outcomes generated by the run and compare them with the outcomes obtained by the original review (Figure 1). We extend the analysis done by Nussbaumer-Streit et al. [26], who proposed two categories of "changed conclusions": (1) if the new review drew the opposite conclusion, (2) if it is not possible to draw a conclusion or the new conclusion has less certainty. We distinguish five aspects of analysis for review outcomes against the original review (Figure 1a). The first two of these aspects are real-valued, whereas the remaining three are categorical:
(1) Magnitude of difference: By how much do the outcomes differ in their effect size (Figure 1a versus 1b)? In other words, what is the numerical impact on the review outcome when certain studies are not included? This is measured by calculating the relative difference in effect size between the original outcome $O_o$ and the predicted outcome $O_p$: $d = \frac{\lVert O_o - O_p \rVert}{\lVert O_o \rVert}$. When $O_o = 0$ and $O_p \neq 0$, we assume $d = 100\%$; otherwise, when $O_o = O_p = 0$, we set $d = 0\%$. Similarly, when the predicted outcome cannot be estimated, we assume $d = 100\%$.
(2) Distance from CI: Is the new outcome within the confidence interval (CI) of the original outcome (Figure 1c)? The answer is the distance between the predicted outcome $O_p$ and the closer of the two bounds $(CI_{low}, CI_{high})$ of the original outcome's CI.
(3) Overestimation/underestimation: Is the outcome overestimated or underestimated compared to the original one (Figure 1d)? We first check if the calculated outcome is equal (due to the limits of precision of data reported in RevMan files, we use a relative and an absolute tolerance of $10^{-5}$ and $10^{-6}$, respectively). Then, if the outcome is different, we check if the result is higher than the original (overestimation) or lower (underestimation). The answer has three options: "overestimated", "underestimated", and "equal".
(4) Sign: Does the outcome have the same sign as the original one (Figure 1e)? In other words, are the new conclusions opposite to the original ones? The answer is binary: "yes"/"no". This aspect corresponds to the first category from Nussbaumer-Streit et al. [26].
(5) Estimability: Is it possible to calculate the outcome (Figure 1f)? An outcome cannot be calculated if there are no included studies concerning it. The answer is binary: "yes"/"no".
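The five aspects can be sketched for a single scalar outcome as follows. Several details are assumptions of this sketch rather than statements of the paper: a non-estimable prediction is represented as `None`, the distance from the CI is taken to be zero when the prediction lies inside the interval, and the sign is judged relative to a hypothetical `null_value` parameter (0 for difference measures, 1 for ratio measures).

```python
import math
from typing import Optional


def relative_difference(original: float, predicted: Optional[float]) -> float:
    """Aspect (1), returned as a fraction (1.0 corresponds to 100%)."""
    if predicted is None:  # non-estimable outcome
        return 1.0
    if original == 0:
        return 0.0 if predicted == 0 else 1.0
    return abs(original - predicted) / abs(original)


def distance_from_ci(predicted: float, ci_low: float, ci_high: float) -> float:
    """Aspect (2); zero inside the interval is an assumption of this sketch."""
    if ci_low <= predicted <= ci_high:
        return 0.0
    return min(abs(predicted - ci_low), abs(predicted - ci_high))


def direction(original: float, predicted: float) -> str:
    """Aspect (3), using the relative and absolute tolerances from the text."""
    if math.isclose(predicted, original, rel_tol=1e-5, abs_tol=1e-6):
        return "equal"
    return "overestimated" if predicted > original else "underestimated"


def same_sign(original: float, predicted: float, null_value: float = 0.0) -> bool:
    """Aspect (4): both outcomes lie on the same side of the null value."""
    return (original > null_value) == (predicted > null_value) and (
        original < null_value
    ) == (predicted < null_value)


def is_estimable(predicted: Optional[float]) -> bool:
    """Aspect (5)."""
    return predicted is not None
```

For example, with an original effect of 2.0 and a predicted effect of 1.5, the relative difference is 0.25 and the direction is "underestimated".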

EXPERIMENT SETUP
Contrary to the traditional evaluation based on retrieving relevant publications, with our framework we envision the evaluation in an outcome-based approach. Specifically, we do not treat a dataset as a collection of systematic reviews but rather as a collection of outcomes. The problem of conducting a systematic review is multi-dimensional. One can think of it as having several outcomes reporting different dimensions of the review, and the evaluation of the user's needs is conducted independently from each outcome's perspective. We do not want to average across reviews, each containing a different number of outcomes. We add or average these outcome-level results instead. Before we present the results, we first discuss the used dataset and models.
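The difference between averaging over outcomes and averaging over reviews can be seen in a small sketch. The function names and the toy scores are illustrative assumptions; the scores stand for any per-outcome value, such as a relative difference.

```python
def outcome_level_mean(scores_by_review):
    """Pool all outcomes from all reviews and average once."""
    all_scores = [s for scores in scores_by_review.values() for s in scores]
    if not all_scores:
        raise ValueError("no outcome scores to average")
    return sum(all_scores) / len(all_scores)


def review_level_mean(scores_by_review):
    """Average within each review first, then across reviews (for contrast)."""
    per_review = [sum(s) / len(s) for s in scores_by_review.values() if s]
    if not per_review:
        raise ValueError("no outcome scores to average")
    return sum(per_review) / len(per_review)


# Toy example: a small review with 2 outcomes and a large one with 8.
scores = {"small_review": [1.0, 0.0], "large_review": [0.0] * 8}
pooled = outcome_level_mean(scores)
per_review = review_level_mean(scores)
```

In this toy case one changed outcome out of ten gives an outcome-level mean of 0.1, whereas averaging per review first would give 0.25 because the small review counts as much as the large one.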

Dataset and Models
We used a collection of 38 systematic reviews of interventions from the CLEF TAR 2019 training and test datasets [16]. Each review consists of a Cochrane ID, a protocol, and a list of publications described by their PubMed IDs, with qrels at both the title-and-abstract level and the full-text level. We enhanced the dataset by collecting RevMan files and information about the data and analysis as described in Section 3.1.
Out of 38 reviews in CLEF TAR 2019, our script found studies and outcomes for 32 reviews (17 in the training subset and 15 in the test subset). We summarise the statistics of the 32 reviews we consider in Table 1. There is a significant discrepancy in the number of outcomes reported by the reviews, ranging from as few as 2 or 3 outcomes in small reviews to 128 outcomes in the largest one. Moreover, the majority of these outcomes come from just one or two studies, which presents an additional challenge.
These 32 reviews report 1640 included publications, for 1175 of which (71.6%) we managed to find PubMed IDs. Next, we matched the publications identified with our script to the CLEF TAR 2019 qrels based on the PubMed ID. In total, 778 relevant documents at the full-text level were identified in CLEF TAR for these 32 reviews. We successfully merged 741 publications (95.2% of the total in CLEF TAR); there are only 37 publications in the CLEF TAR 2019 qrels which we do not have in our records.
We use 34 official CLEF TAR 2019 runs from three teams. The teams used a variety of ranking methods, including traditional BM25, interactive BM25, continuous active learning, relevance feedback, and various stopping criteria. Additionally, we included 40 runs based on the reproducibility of the active learning method by Yang et al. [41]. In total, we evaluate 74 runs, but for the sake of brevity, in this paper, we present the results on a subset of 28 runs, as some of the runs were very similar to each other. Our model requires full-text assessments, and thus, we use qrels from the full-text level, despite the fact that runs have been trained on titles and abstracts. While this might not be fair towards the evaluated systems, our experiments aim not to establish which systems are better but to provide an example of the operationalisation of our framework and its implications.

OUTCOME-BASED EVALUATION
We first run a simulation study to better understand the results of our evaluation framework in a controlled manner. Then, we discuss the usage of the evaluation framework with retrieval and classification runs on the CLEF TAR 2019 collection.

Preliminary Simulation
We are interested in executing a preliminary study to understand the effect our outcome-oriented evaluation has on the analysis of systematic review automation methods.
We simulate the evaluation framework by taking the set of included publications for each review and randomly removing [1,2,3,4,5,10,15,20,30,50,100] publications from the set and then recalculating the outcomes. In other words, we are interested in exploring the impact of false negatives on the final review outcome. We compare the outcomes with the 'gold' outcomes from the original review. Results from all 32 systematic reviews are reported in Table 2. In our analysis, we consider the metrics from all five analysis aspects (Section 3.3), as well as the Recall.
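One step of such a simulation can be sketched as below. The function name is hypothetical, and capping the number of removed publications at the size of the review is an assumption of this sketch for reviews with fewer included publications than the requested removal size; recalculating the outcomes from the kept publications is left out.

```python
import random

REMOVAL_SIZES = [1, 2, 3, 4, 5, 10, 15, 20, 30, 50, 100]


def simulate_removal(included_publications, n_remove, seed):
    """Randomly drop n_remove included publications from one review.

    Returns the kept publications and the resulting publication-level recall.
    """
    rng = random.Random(seed)  # seeded so that repetitions are reproducible
    publications = sorted(included_publications)
    k = min(n_remove, len(publications))  # cap is an assumption of this sketch
    removed = set(rng.sample(publications, k))
    kept = [pub for pub in publications if pub not in removed]
    recall = len(kept) / len(publications) if publications else 0.0
    return kept, recall


review_publications = [f"pub{i}" for i in range(10)]
kept, recall = simulate_removal(review_publications, n_remove=3, seed=0)
```

Repeating this for every removal size and several seeds, and recomputing the outcomes from `kept` each time, yields the aggregated comparison against the original outcomes.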
Figure 2 presents box plots of averaged relative difference (aspect (1)) values from our simulation at a cut-off at 20% of the total number of documents. These results validate our expectations regarding the behaviour of this aspect of analysis, as the relative difference grows with the number of removed publications. On the other hand, the distance to confidence intervals (aspect (2), Figure 3) does not show any specific trend on the CLEF TAR 2019 reviews.
Out of all the metrics, the one that changes the most when varying the number of removed publications is estimability (5).
As more publications are removed, it becomes more and more challenging to calculate outcomes, predominantly because half of the original outcomes relied on one or two studies. At the very extreme, when 100 publications are removed from every review, only 15% of outcomes are still estimable. The measure of overestimation and underestimation (3) grows as more publications are removed. Not including even one publication per review (achieving an average recall of 92% for publications and 97% for studies) already changed 38 outcomes (4.6% of the total number of outcomes). This shows that the commonly used threshold of 95% Recall does not guarantee that the outcomes of the review are preserved. We also notice that the sign (4) aspect is not very descriptive across the simulations, as it is mainly influenced by non-estimable outcomes.

Evaluation with actual runs
In this section, we use the predictions on the test subset of the dataset from the runs described in Section 4.1 and evaluate them using our framework. We further consider two baselines:
• gold: the best possible run, which returns all relevant studies from the original review first.
• max-with-qrels: this run takes into account the limitations of the CLEF TAR collection and our PubMed articles matching process. It uses all relevant studies identified in the CLEF TAR 2019 qrels as relevant and places them first.
We follow the evaluation procedure of CLEF TAR and calculate the following traditional evaluation measures: Mean Average Precision (MAP), last relevant found, Recall@k% of top-ranked publications with k in [5,10,20,30,50], Work Saved over Sampling at r% of recall with r in [95%, 100%] (WSS@95%, WSS@100%), nDCG@20% of top-ranked publications, and Area Under Recall Curve. CLEF TAR used MAP as its primary reporting measure; therefore, we treat MAP as the reference measure when sorting runs. We do not evaluate baselines with traditional measures, yet for the purpose of sorting, we assume that they achieved the highest MAP score.
We calculate the relative difference in study outcomes (analysis aspect (1) in Section 3.3) for every outcome in all reviews. The lower the average score, the better the run, as its effect differs less from the original review effect. As the considered runs were rankings, we follow the same procedure as for Recall and nDCG, namely we calculate the relative difference at k% of top-ranked publications with k in [5,10,20,30,50].

Figure 4: Box plot presenting runs with their relative difference in study outcomes for an evaluation with a cut-off at 30% of the total number of documents for each review. Runs are sorted by their MAP score. The orange circle denotes the mean relative difference @30%. The x-axis is cut at 30 for visualisation purposes, while outliers exist up to the value of 100.
It is also notable that the mean relative difference at the 30% cut-off for the max-with-qrels baseline run equals 6.24. Furthermore, for the relative difference score calculated at 100% of documents, this baseline score is also not equal to 0. This means that the limitations of the CLEF TAR collection and qrels establish a lower bound for the best achievable value of relative difference.
Figure 5 presents the correlation between the relative difference calculated at a cut-off of 20% of the dataset size and the evaluation measures used at CLEF TAR 2019. The score correlates positively with the last relevant found, but there is a negative correlation with all other measures. This confirms our intuition that a higher average relative difference score across outcomes means worse model effectiveness, as the ideal 'best' model should achieve a difference of 0.

Pareto Frontier Optimisation
Based on the simulation results, we note a problem with non-estimable outcomes. Should these outcomes be assigned a zero score or maybe an infinite value? This raises the issue of handling these values in the evaluation process for calculating relative difference scores. In our study, we assigned a zero value to non-estimable outcomes, which allowed us to assume that the relative difference equals 100%. Nevertheless, this is problematic when the actual outcome is itself equal to zero (i.e., the study favours neither the experimental nor the control group), as the difference, in this case, would also be zero. One way to overcome the issue of non-estimable outcomes would be to evaluate estimability and relative difference jointly, for instance using the Pareto frontier [22].
Figure 6 presents the Pareto frontier evaluated at a cut-off at 5% of the total number of documents. On the x-axis, we show the number of non-estimable outcomes for each run. On the y-axis, there is the sum of relative differences for estimable outcomes. We min-max normalise the sums including the gold baseline run (gold represents the best achievable score of (0, 0)). Both objectives should be minimised, i.e., we want to have as few non-estimable outcomes as possible and, for all estimated outcomes, a difference as close to zero as possible. Contrary to the previous evaluations, we can notice that no single run dominates on both dimensions.
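Identifying the non-dominated runs for two objectives that are both minimised can be sketched as follows. The run names and scores are made up for illustration, and the quadratic pairwise comparison is just one straightforward way to compute the frontier, not necessarily how the figure was produced.

```python
def min_max_normalise(values):
    """Scale values to [0, 1]; constant inputs map to 0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def pareto_front(points):
    """points: {run: (non_estimable_count, summed_relative_difference)}.

    A run is dominated if another run is no worse on both objectives
    and strictly better on at least one; both objectives are minimised.
    """
    front = []
    for name, (x, y) in points.items():
        dominated = any(
            ox <= x and oy <= y and (ox < x or oy < y)
            for other, (ox, oy) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Illustrative, made-up scores for three runs.
runs = {"run_a": (2, 5.0), "run_b": (5, 1.0), "run_c": (6, 6.0)}
front = pareto_front(runs)
normalised_sums = min_max_normalise([0.0] + [y for _, y in runs.values()])
```

Here `run_a` and `run_b` are both on the frontier because each is better on one objective, while `run_c` is dominated by both; the leading `0.0` in the normalisation stands for the gold baseline.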

LIMITATIONS
The primary objective of this paper was to introduce the concept of evaluating automated methods for systematic reviews based on their impact on review outcomes, rather than relying on binary qrels. In this section, we reflect on the potential limitations that arise when attempting to fully operationalise our proposed framework.
Do not optimise models using this measure. A practice that can be observed across the field is treating evaluation measures as an optimisation objective. We believe that our evaluation approach should not be used for optimising models. The notion of difference in study outcomes is only known a posteriori, when the review is completed. Using absolute differences in study outcomes as an optimisation objective might lead to over-fitting to biases in data.
Other types of systematic reviews. We focus only on systematic reviews of interventions, which have a clear structure and evaluate the effectiveness of specific treatments, programs, or policies by comparing experimental setups with control groups. However, there are several other types of systematic reviews, such as diagnostic test accuracy reviews, prognostic reviews, and qualitative research reviews, each of which presents unique challenges for automation and evaluation [16]. Future work should investigate how this outcome-based evaluation framework can be extended to these other types of reviews.
Different outcome types. While our proposed evaluation framework focuses on continuous and dichotomous outcomes, other types of outcomes may be reported in systematic reviews, including ordinal, count, and time-to-event data. In our analysis, however, we found that continuous and dichotomous outcomes comprised most of the outcomes in the dataset we studied, accounting for 92% of all reported outcomes across 32 CLEF TAR 2019 reviews.
We believe that our evaluation framework could be generalised to incorporate other types of outcomes. Additionally, while we attempted to closely follow the evaluation protocols from the Cochrane handbook, some shortcuts were taken during the implementation process (for 2.4% of outcomes our effect calculations yielded marginally different results). In future work, ideally, access to RevMan or another official program for calculating study outcomes would be needed to make sure that all outcome types are covered.
Title and abstract screening. We work on the outcomes extracted from the full-text screening and use relevance judgments from full-text screening to judge the runs. However, most models are trained on titles and abstracts, which might make this an unfair comparison.

CONCLUSION
This paper puts forward a novel, outcome-based evaluation framework for assessing the effectiveness of automatic search strategies and citation screening methods in the context of systematic literature reviews. Our proposed framework evaluates the quality of these methods based on how closely the outcomes of their included publications match the actual review outcomes. We believe that this approach offers a more accurate reflection of real-world scenarios where not all included publications have the same impact on the final review outcome.
In addition to proposing the framework, we explore five analysis aspects that it enables, including measuring the numerical difference in predicted systematic review outcomes. We run initial experiments to simulate the impact of false negatives on reviews' outcomes, showing that five missing publications per review can change 24% of outcomes. We also compare the evaluation results obtained using our framework with those obtained using traditional evaluation methods on CLEF TAR 2019 runs, highlighting the differences in focus between the two approaches.
Overall, we believe this framework represents a step forward in developing more effective and realistic methods for evaluating automation methods in the context of systematic literature reviews in medicine and in other domains in which the importance of systematic reviews is increasing.

Figure 1: Different versions of review outcomes represented as forest plots. Each row is a single study. Columns, from left to right, represent, respectively: (1) the study identifier, (2) number of events in the experimental group (e.g., patients with specific symptoms or adverse events), (3) experimental group size, (4) number of events in the control group, (5) control group size, (6) the weight of a study, and (7) effect size of a study: a difference (e.g., risk ratio or standardised mean difference) in events between experimental or control group. Simulations and figures done using RevMan Web, available at http://revman.cochrane.org.

Figure 2: Box plots presenting relative difference values from 20 simulations on the publication level. Note that the x-axis does not preserve the linear step.

Figure 3: Box plots presenting distance to confidence intervals values from 20 simulations on the publication level. Note that the x-axis does not preserve the linear step.

Figure 5: Linear regression fits between relative difference at 20% cut-off of documents and other evaluation measures scores. Correlations for relative difference at other cut-offs follow similar trends.

Figure 6: Visualisation of the Pareto frontier for two objectives: (1) number of non-estimable outcomes on the x-axis and (2) sum of relative difference for estimable outcomes on the y-axis. Both objectives are to be minimised. Runs are evaluated at a cut-off at 5% of the total number of documents for each review. Non-dominated runs are marked with a blue colour.
Not including study C will overestimate the review outcome, yet it will be within the 95% CI range.
Not including studies A, C, D and E will overestimate the review outcome, and it will be above the 95% CI range of the original outcome.
(e) Not including studies A, B, C and D will change the study outcome from 'favours control' to 'favours experimental'.
(f) Not including any study makes the outcome non-estimable.

Table 1: Statistics of the considered dataset.

Table 2: Initial results of the simulation on the publication level. Outcomes are aggregated across 32 systematic reviews and are averaged from 20 different random seeds.