Stopping Methods for Technology-assisted Reviews Based on Point Processes

Technology-assisted Review (TAR), which aims to reduce the effort required to screen collections of documents for relevance, is used to develop systematic reviews of medical evidence and identify documents that must be disclosed in response to legal proceedings. Stopping methods are algorithms that determine when to stop screening documents during the TAR process, helping to ensure that workload is minimised while still achieving a high level of recall. This article proposes a novel stopping method based on point processes, which are statistical models that can be used to represent the occurrence of random events. The approach uses rate functions to model the occurrence of relevant documents in the ranking and compares four candidates, including one that has not previously been used for this purpose (hyperbolic). Evaluation is carried out using standard datasets (CLEF e-Health, TREC Total Recall, TREC Legal), and this work is the first to explore stopping method robustness by reporting performance on a range of rankings of varying effectiveness. Results show that the proposed method achieves the desired level of recall without requiring an excessive number of documents to be examined in the majority of cases and also compares well against multiple alternative approaches.


INTRODUCTION
Technology Assisted Review (TAR) aims to minimise the manual effort required to screen a collection of documents for relevance. Applications of TAR include scenarios in which the aim is to retrieve as many documents that meet an information need as possible, and preferably all documents. For example, in systematic reviewing, a key foundation of evidence-based medicine that is also common in other fields, research questions are answered based on information from the scientific literature [25]. The standard approach to identifying relevant literature is to construct a Boolean query designed to optimise recall over precision and manually screen the results, an expensive and time-consuming process that can involve manual assessment of tens of thousands of documents [43,53]. In the legal domain, electronic discovery (eDiscovery) is the identification of documents for legal purposes, such as disclosure in response to litigation [24,46,49] or to meet the requirements of freedom of information (FoI) legislation [8,42]. In eDiscovery, it is important to identify as many relevant documents as possible given the resources available to ensure compliance with legal obligations and avoid potential penalties. Identifying relevant information in response to FoI requests ensures that sensitive information is not released inadvertently. In Information Retrieval (IR), test collections are a key component of the standard evaluation methodology. Maximising the number of relevant documents identified reduces the potential for bias when evaluating retrieval models [19] but becomes more difficult to achieve as increasing volumes of information become available in electronic format and the size of these collections increases.
Approaches to the TAR problem generally focus on the development of efficient ranking approaches that aim to rank relevant documents highly, thereby ensuring that they are discovered as early as possible. Continuous Active Learning (CAL) has proved to be a successful version of this approach [13,14,16,39]. CAL relies on a classifier to rank documents in the collection. Initial training of the classifier can be achieved in various ways, such as using a small number of relevant documents (often referred to as "seeds") or using the query as a pseudo-document. The classifier is then applied to the document collection and some portion of the documents examined. The relevance judgements produced by this process are used to re-train the classifier, which is then used to re-rank the remaining documents. The classifier's accuracy improves as the process is repeated, which leads to the relevant documents being identified early in the ranking.
However, even the most effective document ranking does not reduce reviewer workload if they are still required to screen all documents in the collection. A key problem within TAR is therefore deciding when a reviewer can stop examining documents [15]. This leads to the need for effective stopping methods which reviewers can combine with ranking approaches, such as CAL, to inform their decision about whether to stop examining documents. A reviewer's target recall for a TAR problem is the minimum percentage of relevant documents in a collection that they aim to identify before they cease examining documents. Following Cormack and Grossman [15], a TAR stopping method is a mechanism to predict when a reviewer has examined a sufficient number of documents to achieve the target recall while also minimising the total number of documents examined.
A range of stopping methods has been proposed in the literature (see Section 2 for a more complete review). The simplest of these are based on ad-hoc methods to identify the point in the ranking where the target recall has been reached, e.g. [40,50]. These approaches do not provide the reviewer with any indication of confidence in their decision and often rely heavily on parameters being set to appropriate values. Another, more common, approach is to attempt to estimate the total number of relevant documents in the collection and inform the user when they have reached the target recall. This may be achieved by training a classifier (such as one developed for a CAL-type approach) and using it to estimate the number of relevant documents in the unexamined portion of the ranking, e.g. [11,28,63]. However, these approaches generally assume that the rate at which relevant documents occur in the unexamined portion is the same as in the portion that has been observed, which is unlikely to be the case for any reasonable document ranking. This paper builds on previous work on TAR stopping methods, particularly Cormack and Grossman [15] and Li and Kanoulas [39], to develop a novel approach based on point processes [55,56]. Point processes are well understood statistical models which use information about the rate at which relevant documents are observed in a ranking to make inferences about the total number in (part of) a collection. They have the advantage of being able to model the fact that, in any reasonable ranking, relevant documents are more likely to appear early in the ranking. This paper develops a stopping method based on two types of point processes: Poisson Processes and Cox Processes. It also compares four approaches to modelling the rate at which relevant documents occur, including three that have been used in previous work on stopping methods and one which has not. These methods are evaluated and compared against alternative approaches on a range of datasets used to evaluate TAR approaches: the CLEF Technology-Assisted Review in Empirical Medicine [30][31][32], the TREC Total Recall tasks [24] and the TREC Legal tasks [17]. These experiments include evaluation using the complete set of runs submitted for the CLEF Technology-Assisted Review in Empirical Medicine dataset, allowing the assessment of their effectiveness over rankings of varying effectiveness. The contributions of this work can be summarised as follows:
• Proposes a novel stopping method for TAR based on point processes.
• Introduces the hyperbolic function to model the rate at which relevant documents are found in a ranking.
• Carries out experiments on a range of benchmark data sets to verify the effectiveness of the proposed approach and compare it against several alternative methods, including a generalised version of Cormack and Grossman's target method [15].
• Explores various configurations of the point process approach to discover the most effective. These configurations include two types of point process (Poisson Process and Cox Process) and four functions to model the rate at which relevant documents appear in a ranking (hyperbolic and three that have previously been used for this task: the exponential function [55], power law [70] and AP-Prior distribution [39]).
• Applies the proposed approach to a range of rankings of varying effectiveness to demonstrate its robustness.

PREVIOUS WORK
The problem of stopping methods for TAR has been discussed both within the literature associated with Information Retrieval and areas where document review tasks are commonly carried out such as eDiscovery [37,62,65] and systematic reviewing, both in medicine [54,60] and other areas such as software engineering [69] and environmental health [28].
Perhaps the most obvious approach to developing a stopping method is to estimate the total number of relevant documents in the collection, R, and then stop when ℓR have been identified, where ℓ is the target recall. A number of approaches have been developed based on this strategy and these are discussed in Section 2.2. Stopping methods that do not attempt to estimate R directly have also been described in the literature and we start by discussing these in Section 2.1.

Stopping without Estimating R
The simplest stopping methods are based on heuristics such as stopping after a sequence of irrelevant documents has been observed; for example, Ros et al. [50] stop after 50 consecutive irrelevant documents are observed. A range of similar approaches has been employed for the problem of deciding when to stop assessing documents during test collection development [40], including stopping after a fixed number of documents has been examined, stopping after a defined portion of the entire collection has been examined, stopping after a fixed number of relevant (or non-relevant) documents has been observed and stopping after a sequence of non-relevant documents has been observed. These approaches have the advantage of being straightforward to understand and implement.
The knee method [15] is designed to exploit the fact that relevant documents tend to occur more frequently early in the ranking and is based on the observation that examining additional documents often leads to diminishing returns.
The approach makes use of a "knee detection" algorithm [51] to identify an inflection point in the gain curve produced by plotting the cumulative total of relevant documents identified against the rank. The slope ratio, ρ, at a point in the gain curve is computed as the gradient preceding that point divided by the gradient immediately following it. A suitable stopping point is one where the gradient drops quickly, i.e. a high slope ratio. Cormack and Grossman [15] suggest 6 as a suitable value of ρ in experiments where the target recall is 0.7. The effectiveness of the knee method depends heavily on the value of ρ used, which may vary according to target recall and TAR problem; for example, later work found that different values of ρ were more effective [39].
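The slope ratio described above can be illustrated as follows. This is a minimal sketch: the function, window size and example data are hypothetical, and the full knee-detection algorithm [51] involves more than this single ratio.

```python
def slope_ratio(gain, i, window=1):
    """Slope ratio at rank i of a cumulative gain curve: the gradient
    just before rank i divided by the gradient just after it.  A simple
    sketch of the idea, not the full knee-detection algorithm [51]."""
    before = (gain[i] - gain[i - window]) / window
    after = (gain[i + window] - gain[i]) / window
    # No relevant documents after rank i gives an infinite slope ratio
    return before / after if after > 0 else float("inf")

# Cumulative count of relevant documents found at each rank:
# relevance dries up after the fourth document, producing a sharp knee
gain = [1, 2, 3, 4, 4, 4, 4, 4]
print(slope_ratio(gain, 3))  # gradient 1 before, 0 after -> inf
```

A stopping rule would trigger once the ratio at the current rank exceeds a threshold such as the value 6 suggested by Cormack and Grossman [15].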
Di Nunzio [23] made use of the scores produced by a ranking algorithm (BM25) to predict the conditional probabilities of each examined document being relevant (or irrelevant) given the set of relevant (or irrelevant) documents identified so far.These values are then used to represent each document in a 2-dimensional space in which the stopping problem becomes one of finding a decision line in this space.
A disadvantage of all these approaches is that they do not provide the reviewer with any information about the level of recall that has been achieved, or confidence that the target recall has been achieved, at the point that stopping is recommended. They may also rely on the values of key parameters (e.g. the length of the sequence of irrelevant documents observed) and the most suitable values for these may vary between TAR problems.
The target method [15] attempts to overcome these limitations with an approach that guarantees a target recall will be achieved with a specified confidence level. The approach proceeds by randomly sampling documents from the collection to identify a "target set" of relevant documents. Once these have been identified, all documents in the ranking are examined up to the final one in the target set. The number of relevant documents required for the target set is informed by statistical theory; Cormack and Grossman [15] state that a target set size of 10 is sufficient to guarantee recall of 0.7 with 95% confidence. The Quantile Binomial Coefficient Bound (QBCB) [36] approach is a variant of the target method which assumes that a control set of labelled documents is available. Like the target method, this approach specifies a minimum number of relevant documents that has to be identified from the control set before the method stops. This number is determined in a different way to the target method to avoid potential statistical bias from sequential testing. However, identifying a suitable control set can be a challenge in practice, particularly when prevalence is low, as is often the case in TAR problems.
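The intuition behind the target set size can be sketched numerically. Under a simplified with-replacement argument, stopping below the target recall would require every randomly drawn target document to lie in the first ℓ fraction of the relevant documents, an event whose probability shrinks exponentially with the target set size. This is only an illustration of that decay; the exact analysis in [15], which yields the figure of 10, differs slightly.

```python
def target_set_size(recall, confidence):
    """Smallest target set size k such that the probability of all k
    randomly drawn relevant documents lying within the first `recall`
    fraction of relevant documents falls below 1 - confidence.
    Simplified with-replacement sketch; the exact treatment in [15]
    gives 10 for recall 0.7 at 95% confidence."""
    k = 1
    while recall ** k > 1 - confidence:
        k += 1
    return k

print(target_set_size(0.7, 0.95))  # 9 under this simplification
```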
A significant advantage of the target and QBCB methods is the probability guarantee they provide that the target recall will be achieved. However, one of their underlying assumptions is that the probability of a document being relevant does not vary through the ranking, which is unlikely to be the case in any reasonable ranking, leading to more documents than necessary being examined.

Stopping by Estimating R
An approach that has been explored by several researchers has been to examine documents up to a particular point in the ranking and then estimate the number of relevant documents remaining in some way, such as examining a sample (Section 2.2.1), applying a classifier trained on the examined documents (Section 2.2.2) or using ranking scores (Section 2.2.3). Each approach is now discussed in turn.

Sampling Approaches.
Much of the work on sampling approaches for estimating R has been carried out within the context of work on systematic reviews in medicine. Shemilt et al. [54] estimate the number of relevant documents remaining by sampling the unexamined ones. A statistical power calculation was used to determine the size of the sample required to ensure that the estimate is within a desired level of confidence. Their approach was evaluated on two scoping reviews in public health, each of which involved the screening of extremely large sets of documents returned by queries (>800,000). However, such an approach is sensitive to the estimate of the prevalence of relevant documents in the unexamined portion.
Howard et al. [28] describe a similar approach in which the number of relevant documents remaining is modelled using the Negative Binomial distribution. The number of relevant documents remaining is estimated simply as the total number of documents multiplied by the estimated probability of relevance, which is itself estimated by examining the documents most recently examined in the ranking. Callaghan and Müller-Hansen [11] point out that the hypergeometric distribution is more appropriate for sampling without replacement and therefore better suited to model the situation that occurs when unexamined documents are sampled (since it would make no sense to return a document to the set of unexamined ones after a judgement on its relevance has been made). They combined the hypergeometric distribution with statistical hypothesis testing to develop a stopping rule that takes account of the desired confidence. Their approach was evaluated on a set of 20 systematic reviews from medicine and Computer Science that had been used in previous research on stopping criteria. These approaches use established statistical theory to estimate the number of relevant documents remaining. However, they do not make use of the fact that, for any reasonable ranking, the probability of observing a relevant document decreases as the rank increases, so they risk examining more documents than necessary.
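The hypergeometric ingredient of such a test can be sketched directly. The function below computes the lower tail of the hypergeometric distribution, which is what a test in the spirit of Callaghan and Müller-Hansen's rule would compare against a significance level; the parameter names and the worked example are illustrative, not their exact procedure.

```python
from math import comb

def hypergeom_tail(N, K, n, x):
    """P(at most x relevant documents in a sample of size n drawn
    without replacement from N documents of which K are relevant):
    the lower tail of the hypergeometric distribution."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(x + 1)) / comb(N, n)

# If 10 of 50 unseen documents were relevant, a sample of 20 containing
# none of them would be surprising, so the hypothesis that 10 or more
# remain can be rejected at the 5% level:
p = hypergeom_tail(50, 10, 20, 0)
print(p < 0.05)
```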
The S-CAL [16] and AutoStop [39] approaches address this by estimating R using nonuniform sampling strategies to reduce the number of documents that need to be examined. S-CAL [16] was developed within the context of a CAL system [14] to produce an algorithm designed to achieve high recall for very large (potentially infinite) document collections. Rather than applying CAL to the entire collection, S-CAL examines a stratified sample across the collection in which the inclusion probability decreases as the rank increases. A classifier is used to carry out an initial ranking of the sample, which is then split into batches. Relevance judgements are then obtained from a subset of documents within each batch and used both to estimate the number of relevant documents within that batch and as additional training data for the classifier. The algorithm proceeds until the number of relevant documents within each batch has been estimated and these figures are combined to estimate the total number of relevant documents. Similarly, AutoStop [39] makes use of Horvitz-Thompson and Hansen-Hurwitz estimators [27,59] to provide unbiased estimates of R that take account of the decreasing probability of relevant documents being observed. (The Horvitz-Thompson estimator had previously been used to estimate the prevalence of relevant documents [60], where it was shown to be more accurate than uniform random sampling, although that work did not go on to use the information provided to develop a stopping method.) Stopping rules are based on either the estimator's direct output or this value with the variance added (to account for the estimate's uncertainty). The estimators employed by this approach rely on a suitable distribution for the sampling probabilities of each stratum of the sample, that is, the probability of each document within that sample being relevant. Li and Kanoulas [39] found the AP-Prior distribution [6,47] to be the best performing.

Classification-based Approaches.
As an alternative to sampling, which requires additional documents to be screened, recent approaches [63,69] have used the relevance judgements from the observed documents as training data for a supervised classifier which is then used to estimate the number of relevant documents in the unobserved portion without the need for additional manual examination. These approaches are developed within Active Learning frameworks, which already require a classifier to rank unexamined documents, so the extension to stopping rules represents limited additional effort. Yu and Menzies [69] use the Support Vector Machine model employed within their Active Learning system to add "temporary labels" to the unexamined documents, which are used to train a logistic regression classifier that estimates the total number of relevant documents in the unexamined portion. Yang et al. [63] present a similar approach in which a logistic regression classifier is trained on the observed documents and applied to the unobserved portion. A point estimate of the total number of relevant documents is calculated together with an estimate of its variance and used to produce two stopping rules: one based on the point estimate of the total number of documents and another where twice the variance of this estimate is added (equating to approximately a 95% confidence interval on the estimate). This method is essentially an example of the "classify and count" approach to the more general problem of volume estimation. However, del Coz et al. [22] pointed out that this approach is sub-optimal, not least because the prevalence of relevant documents in the observed and unobserved portions is likely to differ.
Score Distribution Approaches.
Hollmann and Eickhoff [26] made use of the scores assigned by a ranking algorithm to estimate R. Following a standard approach [2], the distribution of relevant documents is modelled as a Gaussian random variable and used to compute the probability of each document being relevant based on the score assigned to it by the ranking algorithm. The total number of relevant documents at each point in the ranking can then be estimated by summing these probabilities and this information is used to identify when a particular level of recall has been achieved. Cormack and Mojdeh [18] fitted a normal distribution to the scores of the relevant documents that had been identified and used the area under the curve to estimate R. These approaches, and Di Nunzio [23] (see Section 2.1), are applications of score distribution methods [2,29] to the stopping problem.

Summary
Some approaches to the TAR stopping problem are based on simple heuristics that may be effective under certain circumstances but are not likely to be generally reliable (see Section 2.1). Attempts have been made to develop more robust stopping rules that offer some assurance that the target recall has been reached. The most straightforward way of achieving this is to estimate the total number of relevant documents, but this generally proves to be expensive, with large numbers of documents having to be examined to achieve the levels of statistical reliability that are sought (see Section 2.2). Current methods that provide assurance without estimating R directly (e.g. the target method [15] and QBCB [36]) do not model the fact that the prevalence of relevant documents is likely to decrease substantially through the ranking, which also leads to more documents being examined than necessary.
This paper provides an alternative approach to the stopping problem that makes use of a well-established stochastic model (counting processes) to estimate R and thereby produce a stopping criterion.The approach has the advantage that the estimate can be made by examining the top ranked documents, which are most likely to be relevant, thereby reducing the overall number of documents that need to be examined.

POINT PROCESSES
In their most general sense, point processes can be viewed as a stochastic model of a random element (i.e. a generalised random variable) defined over a mathematical space and with values that can be considered as "points" within that space [21,57]. They are often used in spatial data analysis and have been applied across a wide range of disciplines including epidemiology, seismology, astronomy, geography and economics. Point processes defined over the positive integers have proved to be particularly useful since they can be used to model the occurrence of random events in time, for example, the arrival of customers in a queue, emissions of radioactive particles from a source or impulses from a neuron. Applications within Computer Science include queuing theory [10], computational neuroscience [44], social media analytics [41] and modelling user interaction with recommendation systems [58].
In the application of point processes described here, the space is a ranking of documents and the random event is the occurrence of a relevant document in this ranking. The description of point processes which follows therefore focuses on this application rather than considering more general types of point process.

Manuscript submitted to ACM
We begin by introducing the point processes used in this paper (Sections 3.1 and 3.2) and then describe candidate models for the occurrence of relevant documents (Section 3.3).

Poisson Processes
Poisson Processes [35] are an important type of point process which assume that events occur independently of one another and that the number of occurrences in a given interval follows a Poisson distribution. They are suitable for situations that can be modelled as a large number of Bernoulli trials with a low probability of success in each trial [57], such as TAR problems, where the prevalence of relevant documents is normally very low. Poisson Processes can be used to estimate the number of relevant documents found within some portion of the ranking. The average frequency with which relevant documents are observed is denoted by a parameter λ, referred to as the rate, which is assumed to be greater than 0.
In addition, N(i, j), the number of relevant documents between ranks i and j, follows a Poisson distribution with parameter λ(j − i), with the probability that this number equals n given by:

P(N(i, j) = n) = (λ(j − i))^n e^(−λ(j − i)) / n!

For example, if λ = 0.05 then the number of relevant documents between ranks 10 and 100 follows a Poisson distribution with parameter 0.05 × (100 − 10) = 4.5, i.e. a Poisson distribution with a mean of 4.5.
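The homogeneous case can be computed directly; a minimal sketch using the example rate of 0.05 relevant documents per rank:

```python
from math import exp, factorial

def poisson_pmf(n, mean):
    """P(N = n) for a Poisson random variable with the given mean."""
    return mean ** n * exp(-mean) / factorial(n)

# Homogeneous Poisson Process with rate 0.05 relevant documents per rank:
# the count between ranks 10 and 100 is Poisson with mean 0.05 * 90 = 4.5
mean = 0.05 * (100 - 10)
probs = [poisson_pmf(n, mean) for n in range(10)]
print(mean, round(sum(probs), 3))  # 4.5, with most probability mass below 10
```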

Inhomogeneous Poisson Processes.
Assuming that the rate at which relevant documents are observed stays constant is not reasonable in practice since, for any reasonable retrieval system, relevant documents are more likely to be found earlier in the ranking. This can be taken account of by using a rate that varies as a function of the ranking to produce an Inhomogeneous Poisson Process. Let λ(x) be a rate function where x is a position in a ranking, i.e. x ∈ {1, 2, 3, . . . , D} for a ranking of D documents. Λ(i, j) is defined as the integral of the rate function, λ(x), between ranks i and j, i.e.

Λ(i, j) = ∫_i^j λ(x) dx
Then N(x) is modelled as a Poisson distribution with parameter Λ(0, x), that is, the probability of N(x) having the value n is given by:

P(N(x) = n) = Λ(0, x)^n e^(−Λ(0, x)) / n!

In addition, the number of relevant documents between ranks i and j, N(i, j), is a Poisson random variable with parameter Λ(i, j), so the probability of observing n relevant documents is given by:

P(N(i, j) = n) = Λ(i, j)^n e^(−Λ(i, j)) / n!

For example, if the rate function is λ(x) = x^(−2) and we, again, wish to estimate the number of relevant documents between ranks 10 and 100, then Λ(10, 100) = 1/10 − 1/100 = 0.09, so N(10, 100) ∼ Poisson(0.09).
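The integral Λ(i, j) can be checked numerically; a minimal sketch using midpoint-rule integration (treating rank as continuous, as the point process does):

```python
def big_lambda(rate, i, j, steps=100_000):
    """Midpoint-rule approximation of the integral of a rate function
    between ranks i and j."""
    h = (j - i) / steps
    return sum(rate(i + (k + 0.5) * h) for k in range(steps)) * h

# The worked example: lambda(x) = x**-2 between ranks 10 and 100
lam = big_lambda(lambda x: x ** -2, 10, 100)
print(round(lam, 4))  # 1/10 - 1/100 = 0.09
```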

Cox Processes
In our application, the rate function, λ(x), represents the probability of a relevant document being observed at a particular rank, which is not straightforward to estimate. Cox Processes account for this uncertainty by treating the rate function itself as a random variable drawn from a distribution over possible rate functions. The probability of observing n relevant documents between ranks i and j is then obtained by marginalising over the possible rate functions:

P(N(i, j) = n) = ∫ P(N(i, j) = n | λ) f(λ) dλ

where f(λ) is the probability of the rate function taking a particular value, so that P(N(i, j) = n) is estimated by integrating over all possible values of λ.
In practice, a general form is chosen for the rate function, for example λ(x) = x^(−a), where a is a parameter used to select particular functions. So, a = 2 would give the function λ(x) = x^(−2) used in the Inhomogeneous Poisson Process example (Section 3.1.2). The parameters of the rate function are assigned values from some probability distribution, which produces a distribution over possible rate functions, f(λ).
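The marginalisation can be approximated by Monte Carlo sampling over the rate parameter; a sketch under illustrative assumptions (the choice of rate family x^(−a), the uniform distribution over a, and all parameter values are hypothetical, not the paper's configuration):

```python
import random
from math import exp, factorial

def poisson_pmf(n, mean):
    return mean ** n * exp(-mean) / factorial(n)

def cox_pmf(n, i, j, draw_a, samples=20_000, seed=0):
    """Monte Carlo sketch of a Cox Process: the rate lambda(x) = x**-a
    has a random exponent a drawn by `draw_a`, and P(N(i, j) = n) is
    the average Poisson probability over draws of a."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        a = draw_a(rng)
        lam = (j ** (1 - a) - i ** (1 - a)) / (1 - a)  # integral of x**-a
        total += poisson_pmf(n, lam)
    return total / samples

# Exponent a uniform on [1.5, 2.5]; deep in the ranking (ranks 10-100)
# such steep rates usually leave no relevant documents at all
p0 = cox_pmf(0, 10, 100, lambda rng: rng.uniform(1.5, 2.5))
print(0.8 < p0 < 0.95)
```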

Rate Functions
Selecting an appropriate general form for the rate function is a key decision in the application of point processes. An appropriate function should reflect the fact that a suitable ranking has, in accordance with the probability ranking principle [48], succeeded in placing documents that are more likely to be relevant higher in the ranking than those less likely to be and, consequently, that the rate at which relevant documents occur decreases as the position in the ranking increases. A range of suitable functions exists, which we now discuss.

Exponential function.
The mathematical properties of the exponential function make it a convenient choice of rate function. It is defined as:

λ(x) = a e^(bx)

where x is an index in a ranking (i.e. x ∈ {1, 2, . . . , D} for a collection of D documents) and a, b ∈ R are parameters controlling the function's shape (with b < 0 for a declining rate). Substituting into Equation 4, the expected number of relevant documents between index i and index j is given by:

Λ(i, j) = (a/b) (e^(bj) − e^(bi))

Combining Equations 6 and 9, the probability of observing n relevant documents between ranks i and j is given by:

P(N(i, j) = n) = Λ(i, j)^n e^(−Λ(i, j)) / n!

Equation 10 provides a convenient and easily computable closed-form solution for estimating the number of relevant documents.
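The closed form above translates directly into code; a minimal sketch with illustrative (not fitted) parameter values:

```python
from math import exp, factorial

def big_lambda_exp(i, j, a, b):
    """Closed-form integral of the exponential rate lambda(x) = a*exp(b*x)
    between ranks i and j (b < 0 gives a declining rate)."""
    return (a / b) * (exp(b * j) - exp(b * i))

def prob_n_relevant(n, i, j, a, b):
    """Poisson probability of seeing n relevant documents in ranks i..j."""
    m = big_lambda_exp(i, j, a, b)
    return m ** n * exp(-m) / factorial(n)

# Illustrative parameters, not fitted to any real ranking
m = big_lambda_exp(1, 100, a=0.5, b=-0.05)
print(round(m, 3))  # expected relevant documents in ranks 1..100: 9.445
```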

Hyperbolic Decline.
The hyperbolic decline function also meets the criteria for a suitable rate function. It is widely used in the field of petroleum engineering to model the declining productivity of oil and gas wells in order to predict future output [3] but, to the best of our knowledge, has not previously been used in IR. The function is defined as:

λ(x) = a (1 + bcx)^(−1/b)

where x is, once again, an index in a ranking and a, b and c are parameters controlling the shape of the function, with 0 ≤ b ≤ 1. Note that when b = 0 Equation 11 becomes equivalent to exponential decline (Section 3.3.1) while b = 1 produces a harmonic decline function.
Integrating Equation 11 produces:

Λ(i, j) = (a / (c(b − 1))) ((1 + bcj)^((b−1)/b) − (1 + bci)^((b−1)/b))

Equation 12 can be substituted into Equation 6 in a similar way to the exponential function (see Section 3.3.1) to create a random variable that estimates the number of relevant documents in a portion of the ranking.
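The closed-form integral can be verified against numerical integration; a sketch that assumes the standard Arps parameterisation of hyperbolic decline with arbitrary illustrative parameter values:

```python
def lam_hyp(x, a, b, c):
    """Hyperbolic decline rate (standard Arps parameterisation, an
    assumed form): lambda(x) = a * (1 + b*c*x)**(-1/b), with 0 < b < 1."""
    return a * (1 + b * c * x) ** (-1.0 / b)

def big_lambda_hyp(i, j, a, b, c):
    """Closed-form integral of the hyperbolic rate between ranks i and j."""
    antideriv = lambda x: a * (1 + b * c * x) ** ((b - 1) / b) / (c * (b - 1))
    return antideriv(j) - antideriv(i)

# Check the closed form against midpoint-rule numerical integration
closed = big_lambda_hyp(10, 100, a=1, b=0.5, c=0.1)
steps, h = 10_000, (100 - 10) / 10_000
numeric = sum(lam_hyp(10 + (k + 0.5) * h, 1, 0.5, 0.1) for k in range(steps)) * h
print(round(closed, 4), round(numeric, 4))  # both 10.0
```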

Power Law.
Power laws have been proposed as a suitable model of the rate at which relevant documents are observed in a ranking [70] and have been shown to be useful for estimating the number of relevant documents remaining during test collection development, e.g. [40]. Power laws have the form:

λ(x) = a x^(−b)

where x is an index in the ranking and the parameters a, b ∈ R determine the function's shape. Substituting this into Equation 4 produces:

Λ(i, j) = (a / (1 − b)) (j^(1−b) − i^(1−b))

which can also be substituted into Equations 6 and 7 in a similar way to the previous rate functions.
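A minimal sketch of the power-law integral, which with a = 1 and b = 2 reproduces the earlier x^(−2) example:

```python
def big_lambda_pow(i, j, a, b):
    """Integral of the power-law rate lambda(x) = a * x**(-b) between
    ranks i and j (b != 1)."""
    return a * (j ** (1 - b) - i ** (1 - b)) / (1 - b)

v = big_lambda_pow(10, 100, a=1, b=2)
print(round(v, 4))  # 0.09, matching the earlier worked example
```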

AP Prior Distribution.
The AP-Prior distribution [6,47] has been applied in IR evaluation and demonstrated to be a suitable prior for the relevance of documents in a ranked list [4,5,7,38,67,68]. It was also used in the AutoStop algorithm [39] (see Section 2.2). The AP-Prior distribution models the probability of relevance at each rank based on its contribution to the average precision score:

p(x) = (C / 2D) (1 + Σ_{k=x}^{D} 1/k)

where x is (again) an index in the ranking, D the total number of documents in the collection and C a normalisation factor. The integral of Equation 15 is easier to derive after some rearrangement, approximating the harmonic sum by a logarithm:

p(x) ≈ (C / 2D) (1 + log D − log x)

So,

∫_i^j p(x) dx ≈ (C / 2D) [x (2 + log D − log x)]_{x=i}^{x=j}

Unlike the other rate functions, the AP-Prior is a probability distribution, i.e. it sums to 1 over all documents in the ranking. To provide a point process rate function it needs to be scaled based on the expected total number of relevant documents in the ranking, which can be achieved by multiplying Equation 17 by a scalar, s. The value of s then becomes a parameter controlling the function's shape, similar to the parameters that control the shape of the exponential, hyperbolic and power law rate functions. This rate function can then be combined with Equations 6 and 7 to produce a point process using the same approach that was used for the other rate functions.

STOPPING ALGORITHM
The point process framework described in the previous section allows us to define a stopping method. Briefly, the approach operates by screening the top ranked documents, from rank 1 to rank n, and counting the number of relevant documents found, referred to as R(1, n). The point process is then used to estimate the number of relevant documents in the remaining (i.e. unexamined) part of the ranking (i.e. ranks n + 1 . . . D, where D is the total number of documents), denoted as N(n + 1, D). An estimate of the total number of relevant documents in the entire ranking, R, is then given by R = R(1, n) + N(n + 1, D) and this value is used to estimate the number of relevant documents required to reach a given target recall, ℓ, i.e. ℓR. The algorithm stops if a sufficient number of relevant documents has been found to reach the desired level of recall (i.e. R(1, n) ≥ ⌈ℓR⌉, where ⌈.⌉ is the ceiling function); otherwise the process is repeated after screening more of the documents (i.e. increasing the value of n).
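The overall loop can be sketched as follows. The estimator is passed in as a hypothetical callable standing in for the point process estimate of N(n+1, D), and the batch size, toy ranking and parameter values are illustrative assumptions rather than the paper's implementation.

```python
import math

def stopping_point(relevance, estimate_remaining, target_recall, batch=100):
    """Sketch of the stopping loop.  `relevance` holds the 0/1 judgement
    for each ranked document (revealed only as documents are screened);
    `estimate_remaining(n)` stands in for the point process estimate of
    the relevant documents beyond rank n."""
    D = len(relevance)
    n = 0
    while n < D:
        n = min(n + batch, D)
        found = sum(relevance[:n])                 # R(1, n)
        total_est = found + estimate_remaining(n)  # estimate of R
        if found >= math.ceil(target_recall * total_est):
            return n
    return D

# Toy ranking: 20 relevant documents, all in the top 20 of 1000.
# With a (hypothetical) estimator that always predicts 2 more remain,
# screening stops after the first batch of 100 documents.
ranking = [1] * 20 + [0] * 980
print(stopping_point(ranking, lambda n: 2, target_recall=0.9))  # 100
```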
A key part of this process is using the point process to estimate R. Documents that have been examined are analysed to estimate the probability of a relevant document being encountered at each point in the ranking and one of the rate functions described in Section 3.3 fitted (see Section 4.1 for additional details about this process).This rate function is then used to produce a point process that estimates the number of relevant documents that will be encountered by any point in the unexamined documents found later in the ranking by modelling this value as a Poisson random variable.
If the first n documents in a ranking of D documents have been screened then the number of relevant documents in the unscreened portion of the ranking is modelled as the random variable N(n + 1, D) ∼ Poisson(Λ(n + 1, D)). By examining its cumulative distribution function (CDF), it is possible to estimate the maximum value of N(n + 1, D) with some desired level of probability, P.
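This CDF-based bound can be sketched with a direct computation of the Poisson upper quantile (stdlib only, using the pmf recurrence rather than a statistics library):

```python
from math import exp

def poisson_upper_quantile(mean, p=0.95):
    """Smallest m with P(N <= m) >= p for N ~ Poisson(mean): an upper
    bound, holding with probability at least p, on the number of
    relevant documents in the unscreened portion."""
    pmf = exp(-mean)        # P(N = 0)
    cdf, m = pmf, 0
    while cdf < p:
        m += 1
        pmf *= mean / m     # Poisson pmf recurrence
        cdf += pmf
    return m

# With Lambda(n+1, D) = 4.5, at most 8 relevant documents remain
# unscreened with 95% probability
print(poisson_upper_quantile(4.5, 0.95))  # 8
```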
A visualisation of the process is shown in Figure 1.

Fitting the Rate Function
A set of points representing estimates of the probability of encountering a relevant document in the screened documents (i.e. those ranked from 1 to n) is created by averaging the number of relevant documents observed within a sliding window.
Non-linear least squares [34] is then used to find the parameters of the rate function being used that best fit these data.
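As an illustration (not the reference implementation), sliding-window probability estimates can be fitted with SciPy's non-linear least squares; the exponential rate shape, the synthetic labels, and all names here are assumptions for the sketch:

```python
import numpy as np
from scipy.optimize import curve_fit

def window_probabilities(labels, window=50):
    """Estimate P(relevant) at each screened rank by averaging 0/1
    relevance labels within a sliding window."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(labels, dtype=float), kernel, mode="valid")

def exponential_rate(x, a, b):
    return a * np.exp(-b * x)

# Synthetic screened prefix: relevance thins out deeper in the ranking.
rng = np.random.default_rng(0)
ranks = np.arange(2000)
labels = (rng.random(2000) < 0.5 * np.exp(-ranks / 500)).astype(int)

probs = window_probabilities(labels)
params, _ = curve_fit(exponential_rate, np.arange(len(probs)), probs, p0=(0.5, 0.001))
```

The fitted parameters should recover the generating decay rate (b ≈ 1/500) despite the binomial noise in the window-averaged estimates.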
While the rate function is fitted using information derived from the first n documents, it is then extrapolated across the entire ranking using the parameters produced by the fitting process, and the estimates of the probability of observing a relevant document that it produces are used by the point process. Since it is important to ensure that these estimates are reliable, the rate function is only fitted once a sufficient number of relevant documents has been observed in the first n documents. Two different approaches to determining this were applied. Firstly, the "static" approach checked whether a fixed number of relevant documents had been observed. The values 10 and 20 were explored, so the rate function would only be fitted if rel(1, n) ≥ 10 and rel(1, n) ≥ 20, respectively. The second, "dynamic", approach reduced the number of relevant documents that had to be observed as screening progressed deeper into the ranking, so that the required number decreased gradually as n approached N, the total number of documents in the ranking.
Manuscript submitted to ACM

If enough relevant documents have been observed and a rate function fitted, then an additional check is carried out by measuring the difference between the observed values and those predicted by the rate function using the Normalised Root Mean Squared Error (NRMSE). NRMSE measures the difference between observed values and those predicted by a model by computing the average of the squared differences between them and then normalising that value by the range of the observed data. More formally, NRMSE is computed as

NRMSE = sqrt( (1/n) Σᵢ (yᵢ − ŷᵢ)² ) / (y_max − y_min)

where yᵢ and ŷᵢ are (respectively) the observed and predicted values for the probability of document relevance, while y_max and y_min are (respectively) the highest and lowest observed probabilities of document relevance. Note that NRMSE is computed using only the observed and predicted values for the first n documents (i.e. those which have already been screened). If the NRMSE value exceeds a threshold, then the fitted curve is not considered to be an accurate model of the true rate. If this happens the algorithm does not attempt to compute the point process, since its results may not be reliable, and there is no attempt to estimate the total number of relevant documents until further documents have been screened. Several values for this threshold are explored in the experiments reported later (see Section 6.1.1).
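The check can be written down directly from the definition (function name illustrative):

```python
import math

def nrmse(observed, predicted):
    """Root mean squared error between observed and predicted relevance
    probabilities, normalised by the range of the observed values."""
    n = len(observed)
    mse = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)) / n
    return math.sqrt(mse) / (max(observed) - min(observed))
```

For example, observed probabilities [0.5, 0.4, 0.3, 0.2, 0.1] against predictions that are each 0.1 too low give an RMSE of 0.1 over a range of 0.4, so NRMSE = 0.25; a threshold of 0.1 would reject this fit.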

If the point process being used is an inhomogeneous Poisson process (Section 3.1.2), then the point estimates of the rate function parameters can simply be supplied to the closed form of the relevant integral shown in Section 3.3.
However, when a Cox process (Section 3.2) is being used, the point process also considers the estimated variance of these parameters. Each parameter of the rate function is modelled using a normal distribution, N(μ, σ²), where μ is the least squares estimate of the parameter value and σ² its estimated variance. Unfortunately, no convenient closed forms exist for the integrals required to compute Equation 7 and, instead, they are computed using numerical integration (Simpson's rule).
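The marginalisation over parameter uncertainty can be sketched with a hand-rolled composite Simpson's rule; the linear form of Λ(a) and the values of μ and σ below are stand-ins for the rate-function integrals of Section 3.3:

```python
import math

def simpson(f, lo, hi, n=200):
    """Composite Simpson's rule with n (even) subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Expected remaining count, integrating a rate-function parameter `a`
# against its fitted distribution N(mu, sigma^2):
mu, sigma = 0.5, 0.05
expected = simpson(lambda a: normal_pdf(a, mu, sigma) * (a * 100),
                   mu - 5 * sigma, mu + 5 * sigma)
```

Because the stand-in Λ(a) = 100a is linear, the exact answer is 100μ = 50, which the quadrature recovers closely; for the actual rate functions the same scheme applies with their closed-form integrals inside the integrand.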

Pseudocode for the stopping method is shown in Algorithm 1. The method is provided with several pieces of information (line 1). The target recall (ℓ) and confidence level (p) indicate the desired level of recall and the algorithm's confidence that this has been achieved prior to stopping. The total number of documents in the ranking (N) must also be provided, together with the parameters controlling the number of documents that are examined between each application of the point process to check whether the stopping point has been reached (the initial sample proportion and the batch proportion). These two parameters could be adjusted to check whether the stopping point has been reached as frequently as required, potentially for every document in the ranking, although more frequent estimates would increase the computational cost. The algorithm outputs a rank at which it estimates that the recall and confidence targets (ℓ and p) have been met, so that the screening of documents can cease (line 2).
The algorithm begins by obtaining relevance judgements for the top ranked documents (lines 3 and 4) and counting the number which are relevant (line 6). At this point there is a check of whether enough relevant documents have been found to attempt to fit a rate function (line 7); if not, the number of screened documents is gradually increased until enough have been found. Assuming that a sufficient number of relevant documents has been found, a rate function is fitted (line 8) and checked by computing the NRMSE (line 9) (see above). A point process is run and its output used to estimate the number of relevant documents in the portion of the ranking that has yet to be screened (lines 10 and 11).
This information is then used to estimate the total number of relevant documents in the entire ranking (line 12). The algorithm stops and returns the current rank if enough relevant documents have already been observed to achieve the target recall (line 14); otherwise the next highest ranked documents are screened and the process is repeated (lines 18 and 19). The process of increasing the number of screened documents continues until either the algorithm concludes that the target recall has been reached or all documents have been screened.
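The overall loop can be sketched as follows. This is a simplification of Algorithm 1: the hypothetical `estimate_remaining` callback stands in for the rate-function fitting, NRMSE check and point process combined, returning None while no reliable fit is available:

```python
import math

def run_stopping(labels, target_recall, estimate_remaining,
                 init_prop=0.025, batch_prop=0.025):
    """Screen the ranking in batches; stop once enough relevant documents
    have been observed to reach the target recall against R-hat."""
    total = len(labels)
    n = max(1, int(init_prop * total))  # initial sample
    while n < total:
        rel = sum(labels[:n])
        remaining = estimate_remaining(labels[:n], n, total)
        if remaining is not None:  # None => rate function not yet fitted
            r_hat = rel + remaining
            if rel >= math.ceil(target_recall * r_hat):
                return n
        n = min(total, n + max(1, int(batch_prop * total)))
    return total
```

With a hypothetical perfect estimator, the loop stops at the earliest batch boundary at which the target recall is satisfied; with an estimator that never produces a fit, it screens the whole collection.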

Properties of Approach
The stopping method outlined above has a number of advantages. Firstly, the screening effort is focused on the top ranked documents, in other words those which are most likely to be relevant. Unlike some other approaches (e.g. [15, 39]), there is no need to obtain relevance judgements for documents sampled from across the ranking, a requirement that adds effort while being unlikely to identify additional relevant documents. Secondly, the proposed method provides an estimate of the number of relevant documents at any point in the ranking using a well understood statistical model. Consequently, recall can be estimated for any point of the ranking, unlike approaches that identify a stopping point only for a pre-specified ranking (e.g. [15]) or that provide an estimate of the recall achieved at a particular rank (possibly with an associated confidence value) but no information about the recall that is likely to be achieved after further documents have been screened (e.g. [54, 63]). Finally, the computational effort required is relatively modest. The stopping point is identified by examining the distribution of the estimated number of relevant documents, which can be calculated from the expression produced by combining one of the rate functions with the point process (see, for example, Equation 10).

Baselines
Comparison of stopping algorithms has been a challenging problem since they are deployed within different retrieval frameworks, each of which uses its own ranking, and with differing implementations. The situation has improved recently with the release of reference implementations for a range of stopping methods [39, 64]. Although we found that integrating new approaches into these frameworks was less straightforward than we had hoped, we were able to extract the rankings used by the reference implementations described by Li and Kanoulas [39], which allowed us to directly compare the performance of our approach against a range of alternative approaches, particularly those of Cormack and Grossman [15] and Li and Kanoulas [39] (see Section 5.4).
We compared the proposed method against several previous approaches described in Section 2, with results produced using the reference implementation provided by Li and Kanoulas [39]: target [15], knee [15], SCAL [16], AutoStop [39], and SD-training and SD-sampling [26]. Results for the QBCB method [36] are also reported. We were unable to find a reference implementation for this approach, so results were produced using a modified version of the Li and Kanoulas [39] implementation of the target method. The QBCB method requires the size of the control set to be specified; we chose a value of 50 to balance accuracy and cost, with the relevant documents included in the set identified by random sampling. The QBCB method assumes that the control set is provided to the algorithm, but constructing one can be challenging for high target recalls: for a target recall of 0.9, for example, the control set would need to contain 49 relevant documents.
We obtain these by randomly sampling until the required number of relevant documents for the control set has been identified, but do not include any sampled documents that occur after the algorithm stops in the calculation of the algorithm's cost. This optimistic assumption about the availability of a control set benefits the QBCB algorithm.

Adapted target method (TM-adapted).
We also experimented with a more generalised version of the target method. The original description of this approach shows that a target size of 10 relevant documents is sufficient to achieve recall ≥ 0.7 with 95% confidence [15] but does not state the number that needs to be identified for other recall and confidence levels. By generalising the argument in Cormack and Grossman [15], it can be shown that the required number is ⌈−ln(1 − c) / (1 − ℓ)⌉, where ℓ is the desired level of recall (e.g. 0.7) and c is the confidence in this level of recall being achieved (e.g. 0.95).
(See Appendix A for details of how this result was derived.) For example, 30 relevant documents must be observed in the random sample when ℓ = 0.9 and c = 0.95. Results for this approach are generated by varying the target number using the same reference implementation used for the standard target method [39].
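The generalised target size can be computed directly (using the natural logarithm; the function name is illustrative):

```python
import math

def target_size(recall, confidence):
    """Relevant documents required in the random target set so that
    finding all of them implies the given recall at this confidence."""
    return math.ceil(-math.log(1 - confidence) / (1 - recall))
```

This recovers 10 for (ℓ = 0.7, c = 0.95), 30 for (0.9, 0.95) and 300 for (0.99, 0.95), matching the figures quoted in the text.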

Oracle.
Results of an Oracle approach are also reported. The oracle starts at the top ranked document and continues through the ranking until enough documents have been observed to reach the desired recall. This approach is not practically feasible since it assumes complete information about the ranking (including the relevance of documents beyond those that have been observed). However, it is useful for providing context for other methods by indicating the minimum number of documents that need to be examined in a fixed ranking in order to achieve the desired recall. This number will vary according to the individual ranking - it will be lower when relevant documents have been ranked highly and higher when they have not - and places a limit on how early a method can stop while achieving the target recall.
It is worth noting that the recall achieved by the oracle method can be higher than the target recall under certain circumstances.This happens when the number of documents in the topic makes it impossible to stop at the target recall exactly and in these cases the oracle stops at the lowest possible recall above the target.For example, if the target recall is 0.8 and a topic contains 11 relevant documents then the oracle method will stop after 9 relevant documents have been identified, representing a recall of approximately 0.818.
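The oracle can be sketched as follows, assuming the full relevance labels are known (which is exactly what makes it infeasible in practice):

```python
import math

def oracle_stop(labels, target_recall):
    """Earliest rank at which the target recall is reached, together
    with the recall actually achieved at that rank."""
    total_rel = sum(labels)
    needed = math.ceil(target_recall * total_rel)  # smallest count >= target
    found = 0
    for rank, rel in enumerate(labels, start=1):
        found += rel
        if found >= needed:
            return rank, found / total_rel
```

For the example above (11 relevant documents, target recall 0.8), the oracle needs ⌈0.8 × 11⌉ = 9 of them and so achieves recall 9/11 ≈ 0.818.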

Evaluation Metrics
Stopping methods aim to identify a set proportion (possibly all) of the relevant documents in a collection, the target recall, while requiring that as few documents as possible are manually examined. This can be viewed as a multi-objective optimisation problem that aims to both (1) maximise the probability that the proportion of relevant documents identified is at least the target recall, and (2) minimise the number of documents that need to be manually reviewed. These objectives are generally in opposition, since increasing the probability of achieving the target recall normally requires more documents to be reviewed, and vice versa, making it difficult to summarise them using a single metric.

A wide range of metrics has been used to evaluate previous work on stopping methods. To simplify comparison with previous work, we adopt the same metrics as those reported previously [39]. All metrics were computed using the tar_eval.py script provided for the CLEF Technology Assisted Review in Empirical Medicine tasks.
Recall: The recall metric is the proportion of relevant documents identified. It is defined as recall = r / R, where r is the number of relevant documents identified and R is the total number of relevant documents in the collection.
Cost: The cost metric is the proportion of the total documents in the collection that have to be manually reviewed before a stopping point is identified. It is defined as cost = n / N, where n is the number of documents that need to be examined and N is the total number of documents in the collection.
Reliability: The reliability metric, due to Cormack and Grossman [15], is the proportion of topics in a collection for which an approach achieves the target recall. Let C be a collection of topics; the reliability of an approach over C is then given by reliability = |{T ∈ C : recall(T) ≥ ℓ}| / |C|, where ℓ is the target recall. The reliability metric is unique among those used in this work in that it is defined over a collection of topics, rather than a single topic.
Relative error: The relative error metric is the normalised absolute difference between the recall achieved by a stopping method and the target recall. It is defined as RE = |recall − ℓ| / ℓ, where ℓ is the target recall.
loss_er: The loss_er metric [15] is designed to be a single metric that captures the two objectives of the stopping task.
Its development was informed by experience from the TREC Total Recall tracks [24, 49] and it was also adopted by the CLEF Technology Assisted Reviews in Empirical Medicine tracks [30][31][32]. The loss_er measure is the sum of two components: loss_r and loss_e. The first of these is defined as a quadratic loss function that penalises a method for failing to achieve 100% recall: loss_r = (1 − recall)². It is worth mentioning that loss_r assumes that a method aims to achieve 100% recall. While this might be desirable in many circumstances, it might not always be the case. It would be straightforward to adapt loss_r to penalise a method only when its recall is below a set target recall (i.e. ℓ), but we chose not to adjust the metric, both to simplify comparison with previous work and because the relative error measure already captures information about the difference between the achieved and target recall. The second component, loss_e, motivated by experience from the TREC 2015 Total Recall Track [49], is defined as loss_e = ((100 / R) · (n / (aR + b)))². This metric is motivated by the observation that a "reasonable" effort might be given by aR + b, where aR represents an effort proportional to the total number of relevant documents and b a fixed cost. (Note that a = 1 and b = 0 is the ideal scenario where effort is minimised as far as possible.) The values of a and b are somewhat arbitrary; previous work [15] suggested that a ≤ 2 and b ≤ 1000 would be a reasonable effort to achieve recall ≥ 0.7 with 95% confidence.
We follow the CLEF Technology Assisted Reviews in Empirical Medicine tracks [30][31][32] and Li and Kanoulas [39] in choosing a = 1 and b = 100. The n / (aR + b) element then represents the proportion of documents examined relative to a "reasonable" effort, while the 100 / R element is a weight that determines the importance of this type of loss. (See Cormack and Grossman [15] for further discussion of the motivation behind loss_e.) The loss_er measure itself is defined as the sum of the loss_r and loss_e components: loss_er = loss_r + loss_e.
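The per-topic metrics can be written down directly. All function names are illustrative, and loss_e follows the a = 1, b = 100 setting used in this work:

```python
def recall_metric(found_rel, total_rel):
    """Proportion of relevant documents identified."""
    return found_rel / total_rel

def cost_metric(examined, total_docs):
    """Proportion of the collection manually reviewed before stopping."""
    return examined / total_docs

def relative_error(achieved, target):
    """Normalised absolute difference between achieved and target recall."""
    return abs(achieved - target) / target

def loss_er(achieved_recall, examined, total_rel, a=1, b=100):
    """loss_r penalises missing 100% recall; loss_e penalises effort
    beyond a "reasonable" a*R + b documents, weighted by 100/R."""
    loss_r = (1 - achieved_recall) ** 2
    loss_e = ((100 / total_rel) * (examined / (a * total_rel + b))) ** 2
    return loss_r + loss_e
```

For example, a method achieving full recall on a topic with R = 100 after examining 200 documents incurs loss_er = 0 + ((100/100) · (200/200))² = 1.0, entirely from the effort component.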

Datasets
Evaluation is carried out using common benchmark data sets representing TAR problems from a range of domains: the CLEF Technology-Assisted Review in Empirical Medicine, the TREC Total Recall Tasks and the TREC Legal Tasks.The data sets used are the same as those used in previous work [39] to facilitate comparison.
CLEF Technology-Assisted Review in Empirical Medicine. The CLEF task on TAR in empirical medicine focused on the identification of evidence for systematic reviews. These reviews support evidence-based approaches to medicine by identifying, appraising, synthesising and summarising current knowledge in relation to a research question, for example "Rapid diagnostic tests for diagnosing uncomplicated P. falciparum malaria in endemic countries" [1].
Identification of as much relevant evidence as possible is a key priority in systematic review development.
The task was run from 2017 to 2019 and three data sets were produced, one for each year the task was run: CLEF2017, CLEF2018 and CLEF2019. The first two data sets contained exclusively Diagnostic Test Accuracy reviews (the goal of which is to determine the effectiveness of a medical diagnosis method). The CLEF2019 data set extended this to several other review types: Intervention, Prognosis and Qualitative. Following Li and Kanoulas [39], only the Diagnostic Test Accuracy reviews are used for the experiments reported here, yielding 30 reviews (topics) from CLEF2017, 30 from CLEF2018 and 31 from CLEF2019.
Each topic in the CLEF2017, CLEF2018 and CLEF2019 data sets was derived from a systematic review produced by the Cochrane Collaboration. The document collection was the Medline database, containing abstracts of scientific publications in the life sciences and associated fields. Topics consist of a topic/review title, a Boolean query developed by Cochrane experts, and the set of PubMed Document Identifiers (PMIDs) returned by running the query over Medline.
The goal of the task is to identify the PMIDs of scientific papers that were included in the review, a time-consuming task that is normally carried out manually. The topic titles are generally significantly longer and contain more technical terminology than the queries normally submitted to search engines.
TREC Total Recall. The goal of the TREC Total Recall track is to assess TAR methods in which a human assessor forms part of the retrieval process (so the ground-truth relevance of each document is revealed immediately after its retrieval) and which aim to achieve very high recall (as close to 100% as possible). Following Li and Kanoulas [39], the athome4 dataset from the TREC 2016 Total Recall track [24] is used to test approaches. This data set consists of 34 topics.
The document collection for the data set consists of 290,099 emails from Jeb Bush's eight-year tenure as Governor of Florida and was also used for the previous year's Total Recall exercise [49]. Each topic is based on an issue associated with Jeb Bush's governorship, e.g. Felon disenfranchisement and Bottled Water. Topics consist of a short title, normally a few words long and similar to the queries typically submitted to search engines (e.g. Olympics), and a slightly longer textual description (e.g. Bid to host the Olympic games in Florida).
TREC Legal. The TREC Legal track [17] focuses on TAR in the eDiscovery process, where the aim is to identify (nearly) all documents relevant to a request for production in civil litigation while minimising the number of non-relevant documents examined. Topics 303 and 304 from the interactive task of the TREC 2010 Legal track are used.
The document collection is a version of the ENRON data set based on the emails captured and made public by the Federal Energy Regulatory Commission as part of its investigation into the collapse of Enron. This version contains 685,592 documents, comprising 455,449 email messages and 230,143 attachments. Topics in this data set take the form of mock legal complaints that request disclosure of documents containing specific information (e.g. topic 303 requests documents containing information related to the lobbying of public officials). In addition, topic 304 is a "privilege" topic intended to model a search for documents that could be withheld from a production request on the basis of legal privilege.

Rankings
Stopping methods operate over a ranking of documents in a collection. Some approaches choose to closely integrate the stopping method with the ranking process (e.g. [16, 26, 39, 63]) while others, including the one presented here, can be applied to any ranking of the collection (e.g. [11, 15, 28]). The goals of the evaluation include comparing the proposed approach against existing methods and determining how robust approaches are across a range of rankings.
Ideally, it would have been possible to evaluate the approaches against multiple rankings for each data set; however, these are not always available, so evaluation was carried out using two sets of rankings: • The first ranking used for the evaluation is produced by the AutoStop system [39]. The AutoStop ranking algorithm is a CAL approach based on AutoTAR [16] and represents state-of-the-art performance. The reference implementation of AutoStop provided by Li and Kanoulas [39] was used to produce a ranking for each of the datasets used in the evaluation. These rankings allow the approaches to be evaluated and directly compared with existing approaches on multiple datasets. This ranking is used for the experiments reported in Sections 6.1, 6.2, 6.3, 6.4, 6.5 and 6.7.
• The second set of rankings was produced by participants in the CLEF Technology-Assisted Review in Empirical Medicine task [30, 31]. Rankings produced by systems that took part in the evaluations were made available by the task organisers. The description of the CLEF2017 evaluation [30] states that 33 of the runs made available ranked the full set of documents returned by the Boolean query; however, four of these appear to contain fewer documents than the others and were therefore excluded from the experiments. Similarly, for the CLEF2018 task, 22 rankings were made available but documents were missing from eleven of these, leaving the remaining eleven for the experiments. (Rankings from CLEF2019 were not included in the experiments, given the small number of participants in the task's final iteration and the limited number of rankings available.)

Stevenson and Bin-Hezam
Results of the CLEF Empirical Medicine evaluations revealed that the rankings submitted varied considerably in their effectiveness, which is to be expected since the submissions ranged from applications of state-of-the-art approaches to experimental systems and (in the case of two runs) baseline approaches designed to provide context.These rankings can therefore be used to explore how the stopping approaches are affected by ranking effectiveness.Results of experiments using these rankings can be found in Section 6.6.It is worth mentioning that previous stopping methods have been evaluated against single rankings.Evaluation using the multiple rankings available for this dataset provides valuable insight into the relationship between ranking and stopping effectiveness.

Baseline Comparison
6.1.1 Hyperparameter Tuning. The approach proposed in Section 4 includes hyperparameters whose values have to be chosen before it can be compared against baseline methods. To ensure a fair comparison, values for these hyperparameters were selected by carrying out a grid search over the training portion of the CLEF 2017 dataset for three different levels of target recall: 1.0, 0.9 and 0.8. A single set of hyperparameters was used for all experiments. While it would have been possible to select a different set of hyperparameters for each dataset, potentially improving performance, doing so would have produced a less generalised model.
The following hyperparameters were included in the grid search: counting process model ∈ {Inhomogeneous Poisson (IP), Cox}, rate function ∈ {exponential, power law, hyperbolic, AP-prior}, threshold for NRMSE fit ∈ {0.05, 0.1, 0.15} and minimum number of relevant documents in the sample ∈ {10, 20, gradient decreasing}. The parameters controlling the initial sample size and the batch size between stopping checks were not included in the grid search, both to reduce the computational cost and because altering them appeared to have limited effect on performance. Both were set to 0.025, values which lead the algorithm to check whether to stop at regular, small intervals. Selecting the best hyperparameter values is not straightforward, since the stopping problem has multiple objectives (i.e. achieving the target recall while minimising the number of documents examined). For each target recall, the set of configurations forming the Pareto frontier was identified, and the hyperparameter values that appeared most frequently in this set were chosen, giving the following selection; counting process model: Inhomogeneous Poisson, rate function: hyperbolic, NRMSE threshold: 0.1, minimum number of relevant documents in the sample: gradient decreasing. The confidence parameter (p) was set to 0.95 for all experiments except those in Section 6.5.
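Identifying the Pareto frontier over (reliability, cost) pairs can be sketched as follows; the dictionary-based configuration records are hypothetical, not the format used in the experiments:

```python
def pareto_frontier(configs):
    """Keep configurations not dominated by any other: a dominating
    configuration has reliability at least as high, cost at least as
    low, and is strictly better on at least one of the two."""
    def dominates(x, y):
        return (x["reliability"] >= y["reliability"] and x["cost"] <= y["cost"]
                and (x["reliability"] > y["reliability"] or x["cost"] < y["cost"]))
    return [c for c in configs if not any(dominates(o, c) for o in configs)]
```

For example, among configurations scoring (reliability, cost) of (0.90, 0.3), (0.95, 0.5) and (0.80, 0.4), the third is dominated by the first, leaving the first two on the frontier.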
Overall, the choice of hyperparameters had a limited effect on performance. The most significant were the choice of rate function (explored later in Section 6.2) and the minimum number of relevant documents in the sample. For the second of these, the flexibility of the gradient decreasing approach appeared useful in helping the method adapt to the varying number of relevant documents across topics.

6.1.2 Comparison with Alternative Approaches. Figure 2 compares the performance of our approach against the various baselines described in Section 5.1. Results for the majority of methods are those reported by Li and Kanoulas [39]; the exceptions are our own approach, the oracle, QBCB and adapted target methods. More detailed results, including additional metrics, are provided in Tables 6, 7 and 8 (Appendix B). (It is not possible to set the target recall to 1.0 using the adapted target method (see Section 5.1.1). Instead, the target recall was set to 0.99, which restricted the number of relevant documents required to a reasonable number (i.e. 300). Increasing the target recall further would have required a larger number of relevant documents to be found, e.g. a target recall of 0.999 would require 2996, often more than the number of relevant documents in the collection.) Figure 2 shows the results for the various target recalls (0.8, 0.9, 1.0) along each row, with the different datasets (CLEF2017, CLEF2018, CLEF2019, TREC Total Recall and TREC Legal) in each column. Performance of the oracle approach (shown as a blue circle) indicates the minimum number of documents that need to be examined to reach the target recall. The oracle's reliability is always 1.0 since this approach is guaranteed to achieve the target recall.

Comparing the results over all configurations (datasets and target recalls), the proposed model (denoted by a cyan star) performs well in terms of balancing reliability and cost. It was able to achieve the target recall with high reliability and lower cost than other approaches the majority of the time. The proposed approach is also Pareto optimal in the majority of cases, and is the only Pareto optimal approach in two cases: the Total Recall dataset with target recall 0.8 or 0.9 (Figures 2d and 2i). The only case where our approach is not Pareto optimal occurs with the TREC Legal dataset when the target recall is 1.0 (Figure 2o). The reliability scores for our approach are low for this dataset; however, reliability scores are also low for several other approaches, and the overall pattern of results is somewhat different compared with other datasets. The proposed model was very close to reaching the target recall of 1.0 for one of the two topics in this collection (the recall achieved was 0.999) and would also have been Pareto optimal in this case if it had been able to identify the last few relevant documents. Analysis of the rankings for this dataset showed that the majority of the relevant documents were found very quickly, but one topic also contained a long tail of relevant documents (leading to the high oracle cost when the target recall is 1.0, see Figure 2o). The hyperbolic rate function produced a reasonable fit to the true rate during the early part of this ranking but underestimated the rate at which relevant documents occurred later, leading to premature stopping. In fact, for this topic the proposed approach stopped at the earliest opportunity, after only the initial sample of 2.5% of the documents had been analysed. This could potentially have been avoided by increasing the initial sample size, effectively applying a heuristic that stopping should only be considered after a certain portion of the documents has been examined. Choosing a different rate function increased reliability on this dataset, albeit at increased cost (see Section 6.2).
6.1.3 Target Set Methods. The adapted target method and QBCB are also Pareto optimal in many cases. The approaches used by these two methods are very similar, since both rely on a target set of relevant documents and stop when the last of these has been found in the ranking. These target set methods outperform the proposed approach in terms of reliability but not cost. In fact, the number of extra documents that have to be examined by these methods is often considerable, and in some cases amounts to the entire collection. The most likely reason for this higher cost is that target set methods do not take account of the fact that the likelihood of observing relevant documents decreases later in the ranking, leading them to sample large numbers of non-relevant documents.
The difference between the performance of the two target set methods is most pronounced for target recall 1.0.The adapted target method is reliable but has a very high cost, requiring all documents to be examined for most collections.
While the cost for the QBCB method is much lower, reliability substantially reduces (although results in Table 6 show that the recall is close to the target).
Figure 2 shows that the difference between the cost of these methods and the oracle varies between collections.It is highest for the smallest collections (the three CLEF collections) and lowest for the largest collection, TREC Legal.A possible explanation for this pattern is that larger collections provide the opportunity for more accurate estimates of the number of relevant documents during the random sampling used to create the target set.
Figure 2 also allows comparison of both the original and adapted versions of the target method for target recall 1.0 (represented respectively as a pink triangle and grey cross).The adapted target method is more reliable than the approach used by Li and Kanoulas [39].This is perhaps unsurprising since the statistical theory behind the approach requires an appropriate number of relevant documents to be found in order to provide theoretical guarantees about the recall levels achieved.On the other hand, the adapted version is more costly (due to the increased target size).

Comparison of Rate Functions and Point Processes
One of the goals of this work is to compare the various rate functions and point processes described in Section 3.3.This was explored by running the proposed approach with each rate function using both the Inhomogeneous Poisson and Cox processes while fixing all other hyperparameters to the values described in Section 6.1.1.
Table 1 shows results for a target recall of 0.9 (similar patterns of results were observed for other target recalls). Scores for the five metrics described in Section 5.2 are shown, followed by the standard deviation across topics in the relevant collection. (Note that standard deviation is not included for the reliability metric since, unlike the other metrics, it is defined across all topics in a collection rather than for each topic individually.) The results show how the behaviour of the proposed approach varies according to the rate function applied. The statistical significance of the differences in performance between the four rate functions is shown in Table 2.
Overall, the hyperbolic decline rate achieved the target recall with minimal cost and high reliability in the majority of cases. The power law rate function was the most reliable but also had the highest cost. Performance of the other two rate functions lay between that of the hyperbolic and power law.
For the majority of the collections, all variants achieve average recall above (or very close to) the target and achieve the target recall with high reliability. For the three CLEF collections, this is always achieved by examining no more than one third of the collection, and in many cases substantially less. For the Total Recall collection, recall and reliability are high (with near-perfect recall) at a very low cost: all relevant documents are identified while examining at most 6% of the documents. However, the Legal collection presents more of a challenge to the approach, and the effect of varying the rate function is more pronounced. Recall and reliability are higher for the power law, but only at the cost of requiring an order of magnitude more documents to be examined. This difference in the performance of the various rate functions is likely due to the rankings produced for this collection (see Section 6.1.2), since the rate functions fitted to these have the potential to decrease very rapidly, but the actual rate at which this happens depends upon the particular function being used.
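To illustrate how candidate rate functions can be fitted to the observed ranking, the sketch below fits simple decline curves to synthetic per-rank counts. The parameterisations below, and the use of SciPy's `curve_fit` rather than the paper's own fitting procedure, are assumptions made purely for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Candidate rate functions for the rate at which relevant documents are
# encountered at rank x. These parameterisations are illustrative sketches,
# not the paper's exact definitions.
def exponential(x, a, b):
    return a * np.exp(-b * x)

def power_law(x, a, b):
    return a * np.power(x, -b)

def hyperbolic(x, a, b):
    # Simplified hyperbolic-style decline (the harmonic special case);
    # it decays more slowly than the exponential for large x.
    return a / (1.0 + b * x)

def fit_rate(func, ranks, counts):
    """Fit a rate function to per-rank counts of relevant documents."""
    params, _ = curve_fit(func, ranks, counts, p0=(1.0, 0.1), maxfev=10000)
    return params

# Synthetic example: relevant documents become rarer further down the ranking.
ranks = np.arange(1, 200, dtype=float)
observed = 5.0 / (1.0 + 0.05 * ranks)  # noiseless hyperbolic decline
params = fit_rate(hyperbolic, ranks, observed)
```

Because the synthetic data is noiseless, the fit recovers the generating parameters; with real per-batch counts, the quality of the fitted curve (and hence the stopping decision) depends on which functional form best matches the ranking's behaviour.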
Table 1 also highlights the similarity between the results produced using the Inhomogeneous Poisson and Cox processes. The differences between the results produced by the two processes were found to be statistically significant for all metrics with the exception of cost (paired t-test, p < 0.05). On average, the Inhomogeneous Poisson process achieved higher recall and reliability and lower relative error than the Cox process. Although the Inhomogeneous Poisson process had a higher cost than the Cox process, the difference between them was not statistically significant.
The Cox Process is more computationally expensive than the Inhomogeneous Poisson Process since the integral over the potential parameters of the rate function (see Section 3.2) cannot be expressed in a convenient closed form, as it can for the Poisson Process, and is instead estimated using numerical integration. Given this trade-off, the Inhomogeneous Poisson Process may be preferable to the Cox Process in most circumstances and is used for the remainder of the experiments.
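The computational difference between the two processes can be sketched as follows: for an Inhomogeneous Poisson process with a simple rate function the expected count over an interval has a closed form, while a Cox process averages that quantity over a distribution of rate-function parameters. The rate function, parameter values, and the discrete stand-in for the parameter distribution below are all illustrative assumptions.

```python
import numpy as np

# For an Inhomogeneous Poisson process with rate lambda(x) = a * exp(-b * x),
# the expected number of events between ranks s and t is the integral of the
# rate, which has a closed form.
def expected_count_poisson(a, b, s, t):
    return (a / b) * (np.exp(-b * s) - np.exp(-b * t))

# A Cox process treats the rate parameters as uncertain: the expectation is
# taken over a distribution of parameters. Here a discrete weighted set of
# values for `a` stands in for the distribution f(lambda), averaged
# numerically since no closed form is available in general.
def expected_count_cox(a_values, a_weights, b, s, t):
    counts = [expected_count_poisson(a, b, s, t) for a in a_values]
    return float(np.average(counts, weights=a_weights))

closed = expected_count_poisson(2.0, 0.01, 100, 500)
# A degenerate "distribution" concentrated on a=2.0 recovers the Poisson value.
cox = expected_count_cox([2.0], [1.0], 0.01, 100, 500)
```

With a non-degenerate parameter distribution the Cox estimate incorporates parameter uncertainty, at the price of evaluating the expectation numerically for every candidate stopping point.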

Performance Across Topics
An analysis of performance across individual topics was also carried out, with the results for the CLEF 2017 collection shown in Figure 3. Results of the oracle method are shown in the top row (subfigures 3a, 3b and 3c) to provide context for the performance of the other approaches. The number of documents examined must be at least as high as the oracle cost to achieve the target recall. Results show that this number (indicated by a grey bar) varies considerably between topics when the target recall is set to 1, and almost all documents need to be examined for one topic (see Figure 3a).
There is also a substantial drop in this number for lower target recalls; for many topics, examining fewer than 20% of the documents is sufficient to achieve a recall of 0.9 or 0.8 (see Figures 3b and 3c).
Each column in Figure 3 shows performance obtained using the four rate functions for target recalls 1.0, 0.9 and 0.8. These figures reflect the overall pattern of results for the CLEF 2017 dataset shown in Table 1. For example, the power law is reliable but also has higher cost than other rate functions. Figures 3d, 3e and 3f also show the algorithm is somewhat overcautious when this rate function is used, since the cost is noticeably higher than for the oracle and other rate functions. In addition, there is little reduction in the number of documents examined when target recall is reduced.
The other rate functions also tend to overshoot the target recall, although to a lesser extent than when the power law is used. The hyperbolic rate function is the only one which fails to reach the target recall for some topics (i.e. reliability < 1) for this dataset. Topics where the target recall is not achieved tend to be larger ones (shown towards the right of each figure), with the amount by which the recall falls short of the target varying by topic (see Figures 3m, 3n and 3o).
The set of topics for which the achieved recall falls short of the target is similar across the rate functions, suggesting that some topics are more problematic for the proposed approach than others. Additional analysis was carried out on the three topics where the target recall was not reached when the hyperbolic rate function was used (CD011975, CD011984 and CD010339). In all three rankings, the last relevant documents in the ranking were preceded by long sequences of irrelevant documents, causing the rate function to underestimate the probability of finding relevant documents later in the ranking which, in turn, caused the algorithm to stop before the target recall was reached.
Six of the topics in this dataset (around 20% of the total) have a small number of relevant documents (between 1 and 10), which causes the algorithm to overshoot for all rate functions since it requires a minimum number of relevant documents to be identified before considering stopping (see Section 4.2).

Estimation of Number of Relevant Documents
The next experiment assesses the accuracy of the estimation of the number of relevant documents remaining. Although determining this value is not the main goal of our approach, observing it provides useful information about its behaviour.
The normalised difference between the actual number of relevant documents remaining and the number predicted by our approach is computed at each iteration. The average of these values over all iterations using the IP-H approach is computed for each topic. Results are shown in Figure 4, where each collection is represented as a boxplot.
The figure shows that the normalised difference is relatively small for the majority of topics, indicating that the estimation of the number of relevant documents remaining is broadly accurate in the majority of cases. Where the estimates are not accurate, the model tends to overestimate, which provides some explanation for why it sometimes overshoots the optimal stopping point. However, this is preferable to undershooting, since our aim is to develop a method where the target recall is achieved.
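The diagnostic above can be sketched as follows. The exact normalising factor used in the paper is not reproduced here; dividing the signed error by the actual number remaining (when non-zero) is one plausible choice and is an assumption of this sketch.

```python
# Signed, normalised difference between predicted and actual numbers of
# relevant documents remaining. Positive values indicate overestimation,
# which the analysis above reports as the more common (and safer) failure
# mode for a recall-targeting stopping method.
def normalised_difference(predicted, actual):
    if actual == 0:
        return 0.0 if predicted == 0 else float("inf")
    return (predicted - actual) / actual

# Example: one overestimate, one underestimate, one exact prediction.
diffs = [normalised_difference(p, a) for p, a in [(12, 10), (8, 10), (10, 10)]]
```

Averaging such values over all iterations for a topic, as done for Figure 4, summarises whether the model systematically over- or under-predicts for that topic.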
In some cases the overestimation is substantial, most notably in two topics in the CLEF 2018 collection (CD011431 and CD008122). These topics were found to have a high prevalence of relevant documents and unusual patterns in the ranking, with high numbers of relevant documents appearing late in the ranking, meaning that the rate function fitted to the earlier part of the ranking did not provide a good indication of later behaviour.

Effect of Varying Confidence Levels
The proposed approach allows the confidence level in the estimated total number of relevant documents to be varied.
Experiments were carried out using a range of values for the confidence level and target recall: confidence level ∈ {0.8, 0.6, 0.4, 0.2} and target recall ∈ {1.0, 0.9, 0.8}. Results for the CLEF 2017 data set are shown in Table 3. The effect of varying the confidence level is greatest when the target recall is high and diminishes as it is reduced. For a target recall of 1.0, reducing the confidence level leads to a reduction in cost and reliability with only a small reduction in recall, indicating that the approach becomes less conservative in deciding when to stop examining documents. However, even for a target recall of 1.0 this effect is minor, and it is smaller still for lower target recalls. The reason for this limited effect is likely to be the steps taken to ensure that the algorithm does not stop too early because it has made predictions based on limited or unreliable evidence, e.g. few relevant documents or a badly fitted rate function (see Section 4.1), and may also be linked to the tendency to overestimate the number of relevant documents for some topics (see Section 6.4).
These results show that the reliability of our approach tends to exceed the confidence level and remains high even when that level is reduced. They also show that the confidence parameter in our approach should not be interpreted in the same way as the confidence guarantees in some previous stopping algorithms, e.g. [11,15,36,39], where it can be interpreted as the proportion of cases in which the target recall will be reached (i.e. its reliability). The link between this probability and the algorithm's behaviour is less direct in our approach, although it still provides a mechanism through which the behaviour can be influenced.

Performance on Multiple Rankings
The next set of experiments explore the effect of ranking effectiveness on performance. The proposed approach is applied to the set of rankings made available for the CLEF 2017 and CLEF 2018 data sets (see Section 5.4).

However, the most successful TAR methods are based on Active Learning approaches, e.g. [16,39], that also analyse documents in batches which determine the number of documents to be manually screened before re-training the classifier. These batches generally increase in size; for example, in AutoTAR [16] the batch size, B, is initially set to 1 and then increased by ⌈B/10⌉ each iteration. Our approach can be naturally adapted to this scenario by altering when the point process is applied to match the batches used by the CAL process. The next experiment explores the effect of making this change.
Our approach was adapted to follow the same batches used by AutoTAR [16]. The IP-H model was used, applied to the ranking produced by AutoStop, which also follows AutoTAR batches. Results are shown in Table 5, where the figures in brackets indicate the difference from the corresponding scores obtained using uniform batch sizes. The overall results show that changing to dynamic batches tends to produce a small decrease in recall, cost and reliability. The drop in reliability is fairly substantial in some cases, but the corresponding differences in recall indicate that the number of relevant documents identified was similar. It is worth noting that, although the batch sizes used by AutoTAR are well suited for CAL frameworks, they are not ideal for the stopping problem. AutoTAR batches are independent of the collection size and start very small before gradually increasing in size; since they dictate the set of candidate stopping points, this reduces the number of places at which the algorithm can stop later in the ranking. This mismatch is the likely reason for the reduction in performance when the AutoTAR batches are used.
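The AutoTAR batch schedule described above (batch size starting at 1 and growing by ⌈B/10⌉ each iteration) can be sketched as follows; the helper name and collection size are illustrative.

```python
import math

# AutoTAR-style batch schedule: batch size starts at 1 and grows by
# ceil(B/10) each iteration. Because these batches also serve as the
# candidate stopping points, they become increasingly sparse later in
# the ranking.
def autotar_batches(collection_size):
    batches, b, screened = [], 1, 0
    while screened < collection_size:
        batch = min(b, collection_size - screened)  # truncate final batch
        batches.append(batch)
        screened += batch
        b += math.ceil(b / 10)
    return batches

batches = autotar_batches(5000)
```

The early batches are tiny (1, 2, 3, ...) while the later ones span hundreds of documents, which illustrates why a stopping algorithm constrained to these boundaries has fewer opportunities to stop precisely late in the ranking.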

CONCLUSION
This paper explored the problem of deciding when to stop examining documents in a ranked list so that a specified portion of the relevant ones has been identified. The proposed approach is based on point processes, which can be used to model the occurrence of random events over time, and are applied to model the rate at which relevant documents are encountered in a ranked list of documents. Two point processes (Inhomogeneous Poisson and Cox Processes) and four rate functions (exponential, power law, AP Prior and hyperbolic decline) were compared and evaluated using five data sets (CLEF Technology-Assisted Review in Empirical Medicine 2017-2019, TREC Total Recall and TREC Legal).

Experiments demonstrated that in the majority of cases, the proposed approach is able to identify a stopping point that achieves the target recall without requiring an excessive number of documents to be examined. It also performed well in comparison to a range of alternative stopping methods. Two of these alternative methods, the generalisation of the target and QBCB methods, were more likely to achieve the target recall than our proposed approach, but at the cost of requiring more documents to be examined. Results also showed that employing different rate functions varied the behaviour of the proposed approach, with hyperbolic decline leading to a balance between reaching target recall and the number of documents examined. Using the power law as a rate function was more reliable but required more documents to be examined. Results also showed that there was little difference in performance between the Inhomogeneous Poisson Process and the more computationally expensive Cox Process.

Further experiments were carried out using a range of rankings of varying effectiveness. They demonstrated that the number of documents that need to be examined to reach a particular recall increases for less effective rankings. They also showed that the proposed approach remains reliable across a wide range of rankings when the power law rate function is used, while the reliability tends to drop (often substantially) when other rate functions are used.

Discussion and Future Work
This work has demonstrated the importance of the ranking in stopping algorithm effectiveness. While this relationship is perhaps unsurprising it has, to the best of our knowledge, not previously been demonstrated empirically. This highlights a more general issue with the evaluation of stopping algorithms, since previous approaches have been evaluated using different rankings (not all of which are generally available), with each algorithm invariably being evaluated using a single ranking.
The community would therefore benefit from access to a common set of retrieval problems and rankings against which stopping algorithms could be evaluated.These rankings should include those generated by neural methods, which have recently shown promise for high-recall tasks [61,66].
The proposed approach models the number of relevant documents remaining using a Poisson distribution, which has the highly restrictive assumption that the variance equals the mean. This can be problematic, particularly in situations where the estimated number of documents is high, since the variance will also be high. Future work will explore ways to mitigate this limitation.
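The variance-equals-mean constraint, and one common relaxation of it, can be illustrated numerically. Using a negative binomial distribution here is only an example of how extra dispersion could be represented; it is not a method proposed in the paper.

```python
from scipy import stats

# For a Poisson distribution the variance always equals the mean, so a
# large estimate of remaining relevant documents forces a large variance.
mu = 100.0
poisson = stats.poisson(mu)

# A negative binomial with the same mean can represent additional
# dispersion. SciPy parameterises it by (n, p) with mean n*(1-p)/p;
# the values below are chosen so the mean matches mu exactly.
n, p = 50, 50 / (50 + mu)
negbin = stats.nbinom(n, p)
```

Here both distributions have mean 100, but the negative binomial's variance exceeds the Poisson's, showing how a more flexible count model could decouple the uncertainty from the point estimate.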
Another potential avenue for future work would be to integrate a classifier into the stopping algorithm, similar to Yang et al. [63] and Yu and Menzies [69] (see Section 2.2). The classifier could be trained using the relevance judgements available from the screened part of the ranking and then applied to the unobserved part. Its output would provide information about the likelihood of those documents being relevant, which could then be used by the point process to improve the estimate of the number of relevant documents remaining.
In common with previous work on stopping methods for TAR, the work described here focuses on the problem of achieving a specified target recall, i.e. identifying a set portion of the relevant documents. However, recall does not take account of the effort required to identify relevant documents, which can vary considerably depending on their prevalence. An alternative approach to developing stopping algorithms could be to continue until the effort required becomes excessive. A potential method for assessing effort is available from the field of systematic reviews, where the number needed to read metric measures the number of documents that need to be examined in order to find a single relevant one, i.e. the reciprocal of precision [12]. In addition, the recall achieved is often less important than whether an information need has been met. For example, in medicine, Diagnostic Test Accuracy systematic reviews aim to quantify the effectiveness of medical tests (in terms of specificity and sensitivity). Norman et al. [45] developed stopping criteria based on the reliability (or variance) of these estimates, rather than on whether a specified proportion of the evidence has been identified. Another possible route for future work would be to extend the approaches described in this paper to estimate the amount of information remaining, and the possibility that it would alter the conclusions drawn from the documents examined so far.
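The number needed to read metric mentioned above is straightforward to compute; this minimal sketch (with an illustrative helper name and example figures) shows it as the reciprocal of precision over the screened documents.

```python
# "Number needed to read" (NNR) from systematic reviewing: documents
# examined per relevant document found, i.e. the reciprocal of precision
# at the current point in the ranking. Rising NNR late in screening is
# one possible signal that continued effort is becoming excessive.
def number_needed_to_read(examined, relevant_found):
    if relevant_found == 0:
        return float("inf")
    return examined / relevant_found

# Example: 400 documents screened to find 16 relevant ones.
nnr = number_needed_to_read(400, 16)
```

A stopping rule based on effort might, for instance, track this quantity over recent batches rather than over the whole ranking, though the appropriate windowing is an open design question.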
Finally, work on stopping methods, including the approach presented here, relies on the assumption that relevance judgements provided by assessors are reliable and consistent. However, it has long been known that this is not the case, e.g. [9,33,52], which could have a significant effect on stopping algorithms since their decisions may be based on relatively small numbers of relevant documents. Exploring the relationship between relevance judgement consistency and the effectiveness of stopping algorithms represents an interesting direction for future work.

Stevenson and Bin-Hezam
For this to be true there must be relevant documents not included in the sample, the probability of which can be estimated and combined with the earlier result.¹⁵

¹⁵ This proof follows Cormack and Grossman [15] in modelling this probability using a Binomial distribution, i.e. assuming sampling with replacement. Sampling without replacement would arguably be more appropriate, since it is unlikely that a document would be examined for relevance more than once, but we choose to follow the previous approach as closely as possible.

Consider a simple illustrative example where we wish to estimate the number of relevant documents between ranks 10 and 100.

Fig. 1. Representation of a Point Process applied to a ranked set of 12,807 documents, of which 114 are relevant. The figure is divided into two parts by the vertical line just below rank 4000. Documents to the left of this line have been screened for relevance, and the figure shows the cumulative number of relevant documents identified at each point of the ranking. Documents to the right of the line have not yet been examined, and the figure illustrates a Point Process used to estimate the number of relevant documents. The shaded area represents the number of relevant documents predicted by the Poisson Process in the 5% to 95% confidence range. Taking the upper bound of this estimate for the final document in the ranking produces a prediction of the total number of relevant documents in the collection.

Fig. 2. Cost vs. reliability for a range of approaches on multiple datasets. Pareto optimal points are linked by a grey line.

Fig. 3. Details of performance for each topic in the CLEF 2017 collection. For each topic, grey bars indicate the cost and the black line represents the recall. The dotted horizontal line indicates the target recall. Topics are sorted by the number of documents they contain (ascending from left to right).

Fig. 4. Distribution of normalised differences between the actual and predicted number of relevant documents remaining using IP-H. Boxes extend between the first and third quartiles with the median indicated by a green line. The whiskers extend the box by 1.5 × (Q3 − Q1), i.e. 1.5 times the inter-quartile range. Outliers beyond this range are indicated by circles.

3.1.1 Homogeneous Poisson Processes. The simplest type of Poisson Process, a homogeneous Poisson Process, is produced when the value of the rate λ is constant. Then the number of relevant documents that have occurred at point t in the ranking, N(t), is modelled by a Poisson distribution with parameter λt, i.e. the probability that n relevant documents have been observed by rank t is given by [20]

P(N(t) = n) = (λt)^n e^{−λt} / n!

3.1.2 Cox Processes. Cox processes [20], also known as doubly stochastic Poisson processes, are an extension of Poisson Processes that take account of uncertainty about the rate function. Rather than being a fixed function, as in a Poisson Process, the rate function in a Cox Process is modelled as a probability distribution over possible rate functions, f(λ). The random variable representing the number of relevant documents that occur between ranks a and b is then defined by computing the expected value of Equation 6 given f(λ).
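As a quick numerical check of the homogeneous Poisson count distribution described above, the sketch below compares the closed-form probability with SciPy's implementation; the rate and rank values are illustrative.

```python
import math
from scipy import stats

# With a constant rate lam, the number of relevant documents observed by
# rank t follows a Poisson distribution with parameter lam * t.
lam, t, n = 0.02, 500, 12
mu = lam * t  # expected number of relevant documents by rank t

# Closed-form Poisson probability mass: P(N(t) = n) = (mu^n * e^-mu) / n!
manual = math.exp(-mu) * mu**n / math.factorial(n)

# The same probability via the library implementation.
library = stats.poisson.pmf(n, mu)
```

The two values agree to floating-point precision, confirming the formula; for the inhomogeneous case, mu would instead be the integral of the rate function up to rank t.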

Table 1. Comparison of performance of rate functions for 0.9 target recall. ↑ and ↓ indicate metrics where higher and lower scores are preferred (respectively). IP = Inhomogeneous Poisson Process, CX = Cox Process, P = power law, H = hyperbolic, E = exponential and A = AP Prior; e.g. "CX-P" indicates a Cox Process with the power law rate function.

Table 4. Averaged performance over multiple runs over the CLEF 2017 and CLEF 2018 collections. All differences between IP-* and OR results are statistically significant (paired t-test, p < 0.05) with the exception of those indicated by an asterisk (*).

Table 5. IP-H model performance following AutoTAR batches. Figures in brackets indicate the difference from using uniform batch sizes.