Estimates of Temporal Edge Detection Filters in Human Vision

Edge detection is an important process in human visual processing. However, as far as we know, few attempts have been made to map the temporal edge detection filters in human vision. To help fill this gap, we devised a user study and collected data from which we derived estimates of human temporal edge detection filters based on three different models, including the derivative of the infinite symmetric exponential function and the temporal contrast sensitivity function. We analyze our findings using several different methods, including extending the filter to higher frequencies than were shown during the experiment. In addition, we show a proof of concept that our filter may be used in spatiotemporal image quality metrics by incorporating it into a flicker detection pipeline.


INTRODUCTION
Models of human vision [47] are the basis of several popular image [2, 5, 25] and video quality metrics [24, 38]. Such metrics can be used to compare competing rendering algorithms, to analyze compression quality [51], or to allocate resources in offline or real-time rendering [15, 21]. A cornerstone in many of these models is the family of contrast sensitivity functions (CSFs) [3] that describe the relative response of the visual system to a periodic spatial or temporal stimulus. Early CSF models described sensitivity as a function of spatial and temporal frequencies only [18, 34]. Since then, researchers, for example, Rovamo and colleagues, have produced spatial models that incorporate the background luminance [29], the size of the stimuli [35], as well as the eccentricity at which the stimuli are viewed [36]. Other works consider how human contrast sensitivity depends on the temporal frequency and background luminance of the stimulus [48]. However, CSF models handling all five dimensions have been proposed only recently [23, 24].
Edge detection processes in the human visual system help us determine the shape of objects by finding their boundaries. Therefore, edge detectors find use in several areas of image processing [2, 20, 43, 52]. In spatial domain image processing models, we see use of both spatial CSFs and edge detectors. Similarly, temporal image processing models use temporal CSFs. To our knowledge, however, no such model has used temporal edge detectors. In this paper, we attempt to fill this gap by estimating the temporal edge detection filter in human vision. Our estimated filter is acausal. However, for our target application, a full-reference spatiotemporal image metric, acausal filters are applicable (Section 4.2).
We conducted a psychophysical experiment to measure and model the shape of the visual system's temporal edge detection filter (Section 3). Our process was analogous to McIlhagga's spatial work [27]; participants were asked to mark the point in time when they perceived a luminance edge embedded in a sequence of frames whose luminance otherwise followed a Brown noise distribution (Figure 1). User responses were used to estimate the shape of the temporal edge detection filter of the visual system (Section 4). In Section 5, we examine how well our estimated filters generalize to higher temporal frequency content and provide an example use case for our filter. Finally, we offer some conclusions and ideas for future work in Section 6.

PREVIOUS WORK
Metrics. Image and video quality metrics are important in fields such as computer graphics, compression, computer vision, and machine learning. For example, in computer graphics, comparing the metric outputs of two approximative rendering algorithms can help decide which algorithm would be preferred by human consumers. There are several existing image and video metrics; Kazmierczak et al. [17] provide a detailed review of those. Here, we will only discuss two examples. For rendering, FLIP [2] is a metric that attempts to estimate the error that a user perceives when alternating between a reference image and a test image, which is a common viewing protocol in the rendering community. FLIP also uses the human spatial edge detection filter [27] to increase the reported error when it is caused by a difference in edge content, e.g., when an edge can be observed in one of the images but not the other. A recent video metric is FovVideoVDP [24], which models spatiotemporal aspects of perception. It also takes peripheral vision into account, which is important for wide field-of-view displays, e.g., for AR and VR. However, it does not include an explicit edge detector. Our long-term goal is to extend FLIP to handle image sequences. A likely component of such an extension is the human temporal edge detection filter, the estimation of which is the target of the work in this article. Because we aim to make the extended metric useful at several different frame rates, the evaluation in this article emphasizes examination of the estimated filters' ability to generalize to various frame rates. Our evaluation suggests that the filters do generalize well, and our proof-of-concept study (Section 5.2) indicates that they may indeed benefit spatiotemporal image quality metrics.
Edge Detection. Edge detection in images is a common tool used for image alignment, correspondence matching, and object recognition, among others. Standard algorithms are described in Szeliski's book [43], but there are also more advanced techniques such as subpixel-accurate edge detection methods [52] and transformer-based models [33], the latter of which is currently considered state of the art for edge detection in images. Other works investigate edge detection from a perception point of view [22, 27, 44]. Schmittwilken and Maertens [37] present a spatiotemporal edge detection framework augmented with a model for fixational eye movements, taking into account that our eyes are constantly moving. Van Hateren and Ruderman [45] use independent component analysis on natural video sequences and generate spatiotemporal filters whose results are similar to those found in the primary visual cortex. Furthermore, because edges are salient features in visual stimuli, detection of them is often a part of saliency predictors [20].
McIlhagga [27] describes a method to derive a spatial edge detection filter based on human observations. At a high level, in McIlhagga's method, users observe images consisting of a stack of horizontal lines of varying luminance. The luminance of each line is determined by the sum of two signals: a randomly placed edge (Heaviside function) and Brown noise. Brown noise is used because it is what is often observed in natural images, and the optimal edge detectors in such noise are localized [26], meaning that their second moment is finite. In the experiment, the users' task was to mark the line on which they perceived the edge. The assumption, an analogy of which we adopt in this work, was that the probability that an observer marked a certain line in the image depended on the response of the edge detection function in their visual system to that line.
Our experiment, described in detail in Section 3, is similar to McIlhagga's, except for a few key differences, the main one being that our stimulus is a uniform gray patch whose luminance varies in time instead of an image of gray lines of different luminance. This results in the estimation of temporal edge detection filters.

METHODOLOGY
In this section, we first describe the experiment stimulus (Section 3.1), then the task (Section 3.2), followed by details about the participants and the experimental setup (Section 3.3).

Stimulus
Participants observed a uniform gray patch on screen, whose luminance varied in time following a Brown noise distribution, into which a randomly placed temporal edge was inserted. This was motivated by the fact that the temporal frequency content found in natural image sequences with low spatial frequency is similar to that of Brown noise [9, 41] and that it leads to localized edge detectors [26]. The Brown noise was generated by taking the cumulative sum of white noise samples independently drawn from a zero-mean Gaussian distribution with standard deviation σ = 0.02 in contrast units. Here, contrast is defined as relative luminance, i.e., the contrast of a patch with luminance L is (L − L_background)/L_background, where L_background is the luminance of the background. The standard deviation was chosen such that the Brown noise sequence generally contained several perceivable intensity edges that were small enough to avoid uncomfortable levels of flicker.
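As a concrete illustration, the following minimal Python sketch generates such a noise sequence. The function name and interface are ours, not taken from the authors' code.

```python
import numpy as np

def brown_noise(n_frames: int, sigma: float = 0.02, rng=None) -> np.ndarray:
    """Brown noise as the cumulative sum of zero-mean Gaussian white noise.

    sigma is the white noise standard deviation in contrast units.
    """
    rng = np.random.default_rng() if rng is None else rng
    white = rng.normal(loc=0.0, scale=sigma, size=n_frames)
    return np.cumsum(white)
```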
To generate a stimulus, a temporal edge was synthetically added to the Brown noise sequence, as illustrated in Figure 1. The edge was inserted at a randomly chosen frame number, Ĩ, from the set {c_r ñ_f, ..., (1 − c_r) ñ_f}. In our setup, the number of frames in the sequence was ñ_f = 100 and the factor c_r = 0.2. This factor avoids inserting the edge in the first 20% and the last 20% of the frames. At the randomly chosen frame, Ĩ, we add or subtract a predetermined step-change contrast value, c, to the cumulative sum in addition to adding a white noise sample, i.e., we generate a sequence, b̂, containing an edge of height c embedded in Brown noise, as

b̂_i = Σ_{j=1}^{i} ( w_j ± c δ(j − Ĩ) ),  (1)

where the w_j are the white noise samples, δ is the Kronecker delta function, i.e., δ(0) = 1 and δ(x) = 0 otherwise, and the ± is either + or − depending on the outcome of a fair coin flip.

Fig. 2. An example of a sequence, b, determining a stimulus shown in our experiment (solid blue line). The relative luminance is a number between −1.0 (black) and +1.0 (white), where 0.0 represents the background luminance, set to the luminance in the middle between black and white. The y-axis was cut at [0.50, 1.00] for visibility. The dotted gray line shows the relative luminance level at which the sequence starts and ends. Notice that the user marked a frame earlier in the sequence than the inserted edge (Î < I). This is due to motor error (Section 4.1), making it hard to mark the edge exactly despite observing it. Because |I − Î| < 5, the user was deemed to have marked the inserted edge in this trial.

After generating the sequence b̂, we remove its mean, b̄, and add a random shift, Δ, drawn from a uniform distribution. This random shift was added so that we would include sequences of low, medium, and high average luminance, rather than only medium average luminance, which would otherwise have been the case. The uniform distribution was chosen such that no values in the shifted sequence would be outside the [−1, +1] range after the shift was applied. Note that −1 represents minimum luminance and +1 represents maximum luminance in PsychoPy [31], which was the framework in which we implemented the experiment. If the largest element of the sequence was greater than +1 or the smallest was below −1, we would generate new candidate sequences until we produced a valid one. The shifted sequence is then b̃_i = b̂_i − b̄ + Δ.

In our experiment, we show the sequences repeatedly. To avoid a possibly large intensity jump when the sequence restarts, we add a linear ramp of length 2n_r = 20 frames going from the last to the first value in the sequence. The final stimulus, b, is produced by inserting half of the ramp at the start and half of the ramp at the end of the intermediate sequence, b̃. Given a sequence, the user's task is now to find the location, I = Ĩ + n_r, of the added edge of height c. Figure 2 shows the values of an example sequence. The sequences are shown at 60 frames per second. With n_f = ñ_f + 2n_r = 100 + 20 = 120 frames, this implies a sequence length of 120/60 = 2.0 seconds.

In Figure 3, we show the window that is displayed to the user during the experiment. The background (1) color was set to middle-gray (0 in the PsychoPy RGB color space). The luminance of the stimulus (2a and 2b) at frame i was determined by the value of the generated sequence at the same frame, i.e., b_i. As many displays are unable to provide the same intensity over the entire screen, we showed the stimulus on only a part of it, namely in the form of a square whose sides corresponded to half of the participant's display height. The stimulus was split into two equal parts (2a and 2b) by a green progress bar (3), which started at the left border of the stimulus, whose length increased uniformly for each frame shown, and which reached the right border of the stimulus after n_f frames. After that, the progress bar and the sequence reset and were shown again. The color beneath the progress bar was always the same constant gray as the background. The yellow line (4) could be moved by the user via either the mouse or the LEFT/RIGHT arrow keys of the keyboard.

For our method, the most information about the filter comes either from stimuli with a low-contrast step-change for which the observer marks the edge correctly, or from stimuli where the observer gets it wrong because part of the noise was detected as a step-change. High-contrast steps provide little information, since almost any reasonable filter could predict observer responses in that case. Ideally, for estimation, we should present the stimulus at low levels of detection, but this is frustrating, and the observer's model of the stimulus can drift. We employ a staircase procedure [4] to ensure that the observer is reminded of what they are looking for when this occurs. In particular, the edge height, c, is chosen adaptively during the course of the experiment. At the start of the experiment, its value is c = 0.8. We then apply staircasing to it, decreasing the value by 20% if the user marked the inserted edge in two consecutive trials, and increasing the value by 25% if the user did not mark the inserted edge. The user was deemed to have marked the inserted edge if the marked frame, Î, was within τ = 5 frames of the edge frame, I, i.e., if |Î − I| < 5. The choice of τ = 5 frames was made somewhat arbitrarily, but in a way that made the task reasonably hard. A smaller number would lead to the inserted edge being marked less frequently, thus keeping the edge height large during the trials and causing the more informative, low-contrast edges to be displayed in fewer sequences. On the other hand, a larger number would require less precision from the users, possibly leading to increased noise.
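The staircase update is simple enough to state in a few lines of code. The sketch below is our own illustration of the rule just described; variable names are hypothetical.

```python
def update_edge_height(c: float, correct: bool, streak: int) -> tuple[float, int]:
    """One step of the staircase: c is the current edge height, streak the
    number of consecutive correct trials so far."""
    if correct:
        streak += 1
        if streak == 2:
            return 0.8 * c, 0   # two correct in a row: make the task harder
        return c, streak
    return 1.25 * c, 0          # incorrect: make the task easier
```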
In each run of the experiment, the participant was shown a randomly generated set of n_t = 30 sequences, with the embedded edge's height, c, generated using the staircase procedure explained above. In our main study, each participant ran the experiment three times. During the experiment, for each sequence shown to a participant, we stored the user-marked frame index, Î, together with the true edge frame index, I. Thus, we stored an array of the user-marked indices, Î = {Î_1, Î_2, ..., Î_{n_s}}, and an array of the true edge frame indices, I = {I_1, I_2, ..., I_{n_s}}, where n_s = 3 × n_t × n_p is the total number of trials in the experiment, with n_p being the number of participants. These were used to estimate temporal edge detection filters, as described in Section 4. In addition, we collected the time it took for the users to complete each trial. Data for individual users, including times and their accuracy, is presented in Appendix A.

Task
Participants were asked to mark the time of the largest edge (intensity jump) in each sequence. This was done at the same time as the sequences were presented. The yellow bar (marker; see Figure 3) was moved using the mouse or the LEFT/RIGHT arrow keys of the keyboard such that the time at which the progress bar hit the yellow line coincided with when the participant saw the largest intensity jump in the stimulus. If the sequence were shown only once, the task would put requirements on the participant's memory, as they would have to place the marker after watching the entire sequence, which would likely have introduced severe noise levels in the markings. Instead, the sequences were shown repeatedly during the trials, and the participants were allowed to observe the sequences, and move the marker, as many times as they found necessary. As the stimuli used in the experiment vary in time and are therefore difficult to convey with an illustration, we also provide a video showing the experiment in our supplemental material. Notice that, because users not only choose one frame but also reject multiple others, our task may be considered to be N-alternative forced choice (N-AFC), where N = n_f is the number of frames in the sequence. An alternative version of our experiment could consist of a 2-interval forced choice (2-IFC) task, where users are asked to mark whether the edge was present in the first or second part of the sequence. An analogous approach (2-AFC) was used successfully by McIlhagga and Mullen in showing evidence for chromatic spatial edge detectors in human vision [28]. We chose the N-AFC version for our experiment as it provides more information per trial for our analysis, compared to the 2-IFC approach. In addition, results by Jäkel and Wichmann suggest that inexperienced observers, such as those who participated in our study, perform better with more alternatives [14], although Jäkel and Wichmann's study does not use as many alternatives as ours.

Participants and Experimental Setup
In our main user study, all stimuli were displayed on a Dell UP3216Q monitor. The surrounding environment was dim, with no lights directly facing the display. Linearization was done through an inverse gamma transform, estimated using a ColorChecker Display Pro colorimeter and the DisplayCAL software. The whitepoint was set to D65. The stimuli were presented at 60 frames per second. Users were positioned 0.7 meters from the display, implying that the stimuli (Figure 3) covered approximately 16 degrees of visual angle.
To encourage the participants to perform as well as they could throughout the experiment, we implemented a scoring system. Participants were awarded one point for each correct answer, i.e., when their marker was placed within τ = 5 frames of the frame corresponding to the inserted edge, while they lost two points for each incorrect answer. Their score, and whether or not they had answered correctly, were shown after each trial. The score screen also acted as a resting period between trials. Before their first trial, participants were asked to read a consent form and consent to participate, and were shown a short tutorial video explaining the task.
In total, n_p = 10 participants, aged 25-56, took part in the main experiment. The majority of the participants were computer graphics and computer vision experts. Participants ran the study three times each, resulting in n_s = 3 × 30 × 10 = 900 trials. The trials were authorized by the Centre for Mathematical Sciences at Lund University. In addition, a separate study was carried out online in Sweden and the United Kingdom (UK), including a larger number of participants. The UK trials were approved by the Biomedical, Natural, Physical and Health Sciences (BNPHS) Research Ethics Panel at the University of Bradford. The results of the online experiment are presented in Section B.2 of Appendix B. Our main experiment was conducted using a photometrically calibrated monitor with a fixed surrounding environment and a set distance between monitor and participant. However, for the online version of the experiment, those constraints were not guaranteed.

FILTER ESTIMATION
In this section, we estimate the temporal edge detection filter in human vision based on the experimental data collected as described in Section 3. First, we explain the theory of the filter derivation (Section 4.1), then we discuss the limitations of our methodology (Section 4.2), before presenting the resulting estimates in Section 4.3.

Theory
Suppose that, on trial s, the observer views the stimulus represented by the vector b_s = {b_{s1}, b_{s2}, ..., b_{s n_f}} (Section 3.1), where n_f is the number of frames in a trial. We assume that observers use a linear filter, f = {f_1, f_2, ..., f_{n_l}}, where n_l is the filter's length, to detect temporal edges in the stimulus, so that, on the s:th trial, the observer convolves the stimulus with the filter to yield a vector of filter responses,

r_s = b_s ∗ f.  (2)

If the i:th element of this response, r_si, has a large magnitude, that indicates there might be a step-change in luminance at frame i. Because the step-change could be either negative or positive (Equation (1)), i.e., going from brighter to darker or darker to brighter, the large value in the filter response could be either negative or positive, so we assume the observer looks for large values in the absolute response |r_s| rather than in r_s. Because of internal noise, the observer does not necessarily decide that the step-change occurred at the frame i which maximizes |r_si|. Instead, we assume that the probability, p_si, that the observer marks frame i as containing the step-change is given by the soft-max of the filter responses, i.e.,

p_si = e^{|r_si|} / Σ_{k=1}^{n_f} e^{|r_sk|}.  (3)

However, even when observers can see the step-change clearly, they may not always be able to move the marker to exactly the right place on the progress bar. This motor error may be due to personal errors, including mechanical proficiency, which have been shown to have a significant impact on other temporal marking-based experiments [10], as well as the marking procedure possibly being affected by the flash-lag effect [16]. Early versions of our experiment included sound cues, but these were later removed, as the personal errors induced by the combination of auditory and visual stimuli could further increase the motor error [10]. The motor error, which we assume is independent of the filter, was modeled by a Gaussian with mean μ_m and standard deviation σ_m frames. We estimate the motor error by considering the user markings for sequences where the edge height is largest (c = 0.8). In these, there was always one edge that was significantly more prominent than the others, so the errors in the users' markings, compared to where we had inserted an edge, were largely due to the motor error. For details about the motor error estimation procedure and how the motor error affected the estimated filters, see Appendices C and D, respectively. The final probabilities for the responses are thus the convolution of the filter probabilities p_s = {p_{s1}, p_{s2}, ..., p_{s n_f}} with a motor error Gaussian m = {m_1, m_2, ..., m_{2r}}, yielding

π_s = p_s ∗ m,  (4)

where r = 3σ_m is the radius of the motor error kernel. The log-likelihood of the observers' responses over the entire experiment is thus

L̂ = Σ_{s=1}^{n_s} log π_{s,Î_s},  (5)

where Î_s is the frame that was marked in the s:th trial and n_s = 3 × n_p × n_t is the total number of trials, with n_p and n_t being the number of participants and trials per participant, respectively. The degrees of freedom in this log-likelihood are the filter values, f, and so a maximum-likelihood estimate of the filter can be obtained by optimizing L̂ with respect to the filter values. Unfortunately, unconstrained optimization typically produces filter estimates that are noisy and difficult to interpret. Thus, we penalized the fit of the filter with a smoothness regularizer,

S = Σ_i (f_{i+1} − f_i)²,  (6)

where the three middle indices in f were excluded from S in order to not penalize sharp edges at the center of the filter, and zeros were padded at either end of the filter to push the endpoints of the filter toward zero. In addition, we include a penalty for non-odd filters, because early results indicated that the sought filter may be odd. This loss, O, was given by

O = Σ_{i=1}^{n_l} (f_i + f_{n_l + 1 − i})².  (7)

Finally, we optimized the filter elements to maximize

L̂ − ζ S − ξ O.  (8)

Values of ζ, ξ, and the filter length, n_l, were chosen through four-fold cross-validation. Appendices D and F contain further details about the filter estimation and cross-validation results.

We close the section by summarizing the key differences between our optimization and McIlhagga's [27]. First, we consider both dark-to-bright and bright-to-dark step-changes, while McIlhagga considered only the dark-to-bright case. This was accounted for by using the absolute value of the filter response in Equation (3). Second, McIlhagga accounted for the motor error only after having approximated the filter with a derivative of a Gaussian, whereas our solution incorporated the motor error into the optimization process. Consequently, we account for the motor error independently of the filter type. Third, we included the mean of the motor error Gaussian, allowing us to account for possible biases in the user markings (Equation (4) and Appendix D). Fourth, our smoothness penalty excluded the middle indices in the loss function sums (Equation (6)), allowing for sharp edges at the center of the filter. Finally, we included a penalty for non-odd filters.
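To make the model concrete, the following Python sketch computes the (clamped) log-likelihood contribution of a single trial under Equations (2)-(5). It is our own illustration, not the authors' code; the boundary handling and kernel discretization are simplifications (the authors use circular padding and renormalization, as described in Appendix D).

```python
import numpy as np

def trial_log_likelihood(b_s, f, mu_m, sigma_m, marked):
    """Log-likelihood of one trial: b_s is the stimulus, f the filter,
    (mu_m, sigma_m) the motor error parameters in frames, and marked the
    frame index the user marked."""
    r = np.convolve(b_s, f, mode="same")        # filter responses (Eq. (2))
    a = np.abs(r)
    p = np.exp(a - a.max())                     # numerically stable soft-max
    p /= p.sum()                                # Eq. (3)
    radius = int(np.ceil(3 * sigma_m))          # motor error kernel radius
    t = np.arange(-radius, radius + 1)
    m = np.exp(-0.5 * ((t - mu_m) / sigma_m) ** 2)
    m /= m.sum()
    padded = np.concatenate([p[-radius:], p, p[:radius]])   # circular padding
    pi = np.convolve(padded, m, mode="same")[radius:-radius]
    pi /= pi.sum()                              # renormalize after Eq. (4)
    return np.log(max(pi[marked], 1e-3))        # clamp, as in Appendix D
```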

Limitations
The validity of our results is conditional on the validity of the model. While our model is an oversimplification of the underlying biological processes, it is simple and can account for the results.
We note that the estimated filters are not causal, a consequence of the fact that we could not conceive of an experimental procedure that allows marking the time of the edge accurately while relying purely on a causal filter. This could pose problems when attempting to control a system in real time based on the outcome of the edge detection filter, since future information would not be available. A possible solution would be to make the filter causal by shifting it back in time by an amount corresponding to its radius. This would add a lag to its response, a compromise that may be acceptable in some applications. In our future target application, a full-reference spatiotemporal image metric, the acausality is not an issue. The reason for this is that such a metric compares a ground-truth reference image sequence to an approximative one. Due to the high quality requirement of the former, the sequences are generated offline, as each reference frame could require minutes, or even hours, to create. Therefore, users of the metric may assume that the full image sequences are available, implying that future information is accessible and an acausal filter may be used.
Another item worth considering is the situation when the stimulus contains two or more edges close in time. Since Stroud's perceptual moment theory [42], there has been an ongoing debate on the discrete vs. continuous nature of temporal perception [1, 39, 46]. Our experiment does not contribute to this topic; however, the filters we derive are more compatible with the "travelling moment" theory [1]. As discussed in the mentioned literature, similar events occurring closely in time, such as multiple edges in our experiment, could be perceived as one. This may have increased the measurement noise in our collected data. The measurement noise is likely further increased by temporal persistence [8, 50]. We account for these factors as part of our "motor error" (Section 4.1).

Results
In this section, we present four different estimates of the temporal edge detection filter in human vision: t_all, d_all, t_low, and c_all. They are estimated using either different models or, in the case of t_low, using only a subset of the data collected during the main user study.
Fig. 4. The temporal edge detection filter, t_all, estimated with the full set of results from the main user study. The smoothness parameter was ζ = 0.005, the non-oddness parameter was ξ = 0.1, and the filter length was n_l = 21 samples (Equations (6)-(8)), while the motor error shift and scale were estimated to μ_m = 0.18 and σ_m = 2.22 frames, respectively. The light blue area indicates the 68% interval [11] of the filter. The asymmetry of the area relative to the solid line is a consequence of the interval used (see Section 4.3).
First, we present t_all, referred to as a "Free filter," as all of its elements are free parameters in the optimization (Section 4.1). The filter is shown in Figure 4.
In our four-fold cross-validation, where 20% of the trials had been excluded for later evaluation, we found that the smoothness parameter ζ = 0.005, the non-oddness parameter ξ = 0.1, and the filter length n_l = 21 samples yielded the highest scores (see details in Appendix F). The motor error shift and scale were estimated to μ_m = 0.18 and σ_m = 2.22 frames, respectively. Those motor error distribution values were used to estimate the filters presented in this section. We note that the small value of μ_m indicates a minimal effect of flash-lag [16] on our results, although observers may be compensating.
In Figure 4, we also present the bootstrapped 68% interval [11] of the filter, which is equivalent to ±1 standard error for a normal distribution. The interval was computed by sampling 100 sets of n_s trials by drawing from the original set of trials, with replacement, estimating one filter for each set, sorting the bootstrapped filters' values at each index, and taking the corresponding 16- and 84-percentiles as the lower and upper bounds of the interval, respectively. As seen in Figure 4, the resulting interval has a shape similar to t_all, indicating that, while the filter may vary depending on data, its shape can be assumed to be similar to our estimate.
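The bootstrap just described is straightforward to implement; a minimal sketch follows, assuming a function estimate_filter that maps a list of trials to a filter (our naming, not the authors' code).

```python
import numpy as np

def bootstrap_filter_interval(trials, estimate_filter, n_boot=100, rng=None):
    """Bootstrapped 68% interval of the filter: resample trials with
    replacement, re-estimate the filter, and take per-index percentiles."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(trials)
    filters = []
    for _ in range(n_boot):
        resampled = [trials[i] for i in rng.integers(0, n, size=n)]
        filters.append(estimate_filter(resampled))
    filters = np.stack(filters)                 # shape: (n_boot, n_l)
    lower = np.percentile(filters, 16, axis=0)  # 16th percentile per index
    upper = np.percentile(filters, 84, axis=0)  # 84th percentile per index
    return lower, upper
```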
We note that the shape of the t_all filter is similar to that of the derivative of the infinite symmetric exponential function (DISEF), which has been found to be the optimal edge detection filter under certain criteria [26, 40]. The DISEF is defined as

d(t) = a_d sgn(t) e^{−|t|/s_d},  (9)

with amplitude a_d and scale s_d. Based on the above, we fit a DISEF filter (a_d and s_d in Equation (9), as well as the filter radius, r_d) to our data, by letting the filter f in Section 4.1 be a DISEF, with free parameters a_d and s_d, and then following the optimization procedure also described in Section 4.1. The criteria under which the DISEF is optimal for edge detection are (i) when the edge is embedded in white and Brown noise and (ii) when a high signal-to-noise ratio and a localized filter are desired [26]. Our setting agrees to a large extent with these criteria, meaning that the Free filter we have estimated (t_all), being similar to a DISEF, might just be the theoretical optimum and not necessarily related to human edge detection. To investigate this, we estimate another filter, denoted t_low, for which we only consider the subset of trials for which the step-change contrast, c, was low, as those are the trials that contain the most information about the human edge detection filter, as was noted in Section 3.1. During optimization, we used the same hyperparameters and motor error estimate for t_low as we did for t_all.
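Sampling a DISEF at a given frame rate takes only a few lines; the sketch below is ours and assumes the functional form in Equation (9) (the overall sign is immaterial here, since the model uses the absolute filter response).

```python
import numpy as np

def disef(n_l: int, a_d: float, s_d: float, fps: float = 60.0) -> np.ndarray:
    """Sample a DISEF filter with n_l taps at the given frame rate;
    s_d is the scale in seconds and a_d the amplitude."""
    radius = (n_l - 1) // 2
    t = np.arange(-radius, radius + 1) / fps   # tap times in seconds
    return a_d * np.sign(t) * np.exp(-np.abs(t) / s_d)
```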
To choose the subset of trials containing only those with low step-change contrasts, we need to decide what we mean by "low" step-change contrast. For this purpose, we examine how the participants' accuracy, i.e., the ratio of correct markings, depends on the difficulty of the trial, as measured by the height, c, of the edge. Figure 5 shows this relation in a bar plot. Remember that the step-change contrast was lowered by 20% after a participant answered correctly in two consecutive trials and increased by 25% whenever they answered incorrectly. With a maximum of c = 0.8, this scheme yielded edges of height c = 0.8^h, where h ∈ {1, 2, ...}. In Figure 5, we include the number of times different contrasts were used in trials. All participants, independent of how they did in the experiment, had their trials included in the dataset. For the four largest step-change contrasts, the accuracy is almost constant and at a high level, after which we see a small decrease toward lower contrasts. Between c = 0.13 and c = 0.11, the accuracy decreases more significantly, and we interpret this as the point where trials become challenging for most users and step-change contrasts are low. This is further indicated by the fact that the subsequent, lower step-change contrasts were used significantly fewer times. As such, we use all data with c ≤ 0.13 to estimate the filter for low step-change contrasts, t_low. In total, this dataset contained n_s^low = 7 + 24 + 56 + 84 = 171 trials. In Figure 6, we show the Free filters together with the DISEF filter. Like the t_all filter, the filter for the low step-change contrast dataset also resembles a DISEF. This suggests that the DISEF-like behavior of our estimated filters is not just due to the DISEF being optimal in the experimental setting, but also that the temporal edge detection function in our visual system behaves similarly. As the two Free filters are similar, we will only include t_all in our further analysis.
For the final filter estimate, we consider adapting the temporal contrast sensitivity function (TCSF). When humans look at a stimulus undergoing periodic change in time, the amplitude of the change (expressed using the contrast) and the frequency of the periodic change (f) together determine whether the observer will perceive the flicker. For each f, there exists a threshold contrast, c_t, below which flicker is not perceived. The TCSF plots sensitivity (1/c_t) as a function of frequency, f. Worth noting is that, unlike edge detectors, contrast sensitivity functions do not capture phase information.

Fig. 6. An illustration of the filters estimated from our main user study data. The blue, solid line depicts the filter consisting of only free weights (t_all), the red, dashed line shows the DISEF estimate (d_all; Equation (9)), the khaki, dot-dashed line shows the TCSF estimate (c_all), and the gray, dotted line shows the estimate using only data with low step-change contrast, t_low.

Because of its relation to temporal edge detection, we compare our estimated filter to the TCSF. As our filter is defined in the time domain, while the TCSF usually is given in the frequency domain, we need to transform the TCSF to the time domain. We chose the transient channel of the TCSF in stelaCSF [23], which captures high-frequency information, R_T (see Equation (12) in the stelaCSF paper), as our stimuli mainly contain high frequencies. We first sample R_T in the frequency domain. While the TCSF does not contain phase information, we assume, based on early observations of the estimated Free filter's shape, as well as that of spatial edge detectors, that the human temporal edge detection filter is odd. Under that assumption, we can retrieve an odd, time-domain version of the TCSF by applying the inverse Fourier transform to the sampled R_T. We then interpolate the time-domain version to retrieve a filter corresponding to 60 Hz. After the transform, the TCSF filter is defined between t = −1 and t = 1 second. We reduce its length by removing parts of its tails. The length of the TCSF filter is set to 41 samples, which results in approximately 96% of its energy being preserved. Finally, we note that the TCSF's values are relative and that the filter's amplitude, a_c, is a free parameter. We optimized the amplitude to fit our data, similar to how we optimized the DISEF filter parameters, again with ζ = ξ = 0. The optimization yielded a_c = 1932. The resulting time-domain filter is c_all. For details on how the filter was generated, see Appendix E.
Like the Free and DISEF filters, the TCSF filter is also presented in Figure 6. We note that the filters' amplitude affects the sharpness of the probability distributions (Equation (3)), implying that larger amplitudes indicate a more confident filter. We see that the TCSF filter has a smaller amplitude than the other filters. This may be due to the small overshoots it has for times |t| > 7/60 seconds. If the amplitude were increased during optimization, the overshoots would also increase, possibly causing larger loss values. It is noteworthy that the filter estimated with data containing only low-contrast step-changes has the largest amplitude despite being based on more difficult sequences. The reason may be that the amplitude of the estimated filter can increase depending on the subset of data used for estimation, as seen in Figure 4. Finally, the filter lengths vary, though the tails of the Free filters likely mainly contain noise. In the next section, we will see that the longer filters do not necessarily perform better than the shorter DISEF filter in applications.

Table 1. Average log-likelihood on the test trials for the three filters. The ↑ is used to indicate that a higher number is better, and the best result is written in bold. Here, the three filters performed similarly.
To conclude this section, we show, in Table 1, the values of the average log-likelihood, computed on the set of test trials that were excluded from training and cross-validation, using the filters introduced above. The filters were trained using all training data. We note that the three filters perform similarly on the test data.

EVALUATION
In this section, we evaluate our estimated filters. First, we examine how well the filters generalize to sequences shown at 120 Hz (Section 5.1). Second, we investigate the filters' usefulness in a flicker detection model (Section 5.2). The results of these investigations serve as an indication of the filters' usefulness in applications.

Generalization to Higher Temporal Frequencies
Our main experiment was done on monitors showing stimuli at 60 frames per second (FPS), implying that our fit is to such data. Here, we investigate whether our estimated filters generalize to higher-frequency stimuli. In particular, we consider sequences shown at 120 FPS, as this frame rate may generate stimuli that are close to the flicker fusion threshold of approximately 60 Hz [6]. For the investigation, we employed the double-pass method [12], which entails having participants run the same experiment twice and examining how their answers differed between the two runs. The result gives an indication of the intra-person variance. We can compare that variance to the difference between our model's output and the human responses. If the two are similar, it would imply that the model is able to predict human answers. In our particular case, such similarity can be interpreted as evidence of how well our filters generalize to 120 FPS stimuli.
For our study, one author ran the experiment 20 times and then another 20 times again with the same sequences. The stimuli (including the ramp) for these 40 trials were changed to include twice the number of frames, while the replay speed was increased from 60 to 120 FPS, again resulting in sequences with a period of 2.0 seconds. These trials were run using a linearized ACER XB272 monitor, with the distance to the monitor adjusted to have the stimuli cover the same 16 degrees of visual angle as in the main experiment. To obtain identical stimuli for the first and second sets of trials, the staircasing for the second set was determined by that of the first. We collect the differences between the participant's answers from their first and second sets of trials. In addition, we compute the differences between the participant's answers and the answers the filters provide. For a trial s, we set the filter's answer to be the frame î_s where the probability of detecting an edge is largest, according to the filter, i.e., î_s = argmax_k π_sk (see Equation (4)). Note that, because the frame rate is now doubled, our filters must be upsampled 2×, so that the time-delta between their values is 1/120 seconds, before they are applied to a sequence. For the non-parameterized filter, t_all, we use linear interpolation between the filter values and generate t̄_all^120, where the bar indicates linear upsampling. For the DISEF filter, we double the sampling rate of the analytical function (Equation (9)), using the same parameters as for d_all, to generate d_all^120. Similarly, we sample the time-domain TCSF at twice the rate that we did for c_all to generate a 120-Hz version of it, denoted c_all^120. We use the same amplitude factor, a_c, as for c_all. An alternative to the above would be to linearly upsample the DISEF and the TCSF as well. However, doing so reduces the sharpness of the edge around the filter center, and we found that the linearly upsampled versions performed worse in our evaluation.

Let v¹ = {v¹_1, v¹_2, ..., v¹_n} and v² = {v²_1, v²_2, ..., v²_n} denote the author's answers from the first and second runs, respectively, where n = 20 × 30 = 600 is the total number of trials shown to the author in one of the double-pass sets. Furthermore, let î = {î_1, î_2, ..., î_n} denote the answers given by a filter. Different metrics may be used to compute the differences between v¹, v², and î. In the main experiment, we considered an answer to be incorrect if it was more than τ = 5 frames from the inserted edge. In the 120-FPS version, this tolerance was doubled to 2τ = 10 frames to account for the increase in frame rate. We employ the same strategy for correct/incorrect here, and compute the differences as

m_1 = (1/n) Σ_{s=1}^{n} 1(|v¹_s − v²_s| ≥ 2τ),  (10)

with analogous quantities computed between a filter's answers, î, and each of the participant's two runs, where 1(x) is the indicator function, which is 1 if x is true and 0 otherwise. This means, for example, that the indicator function in m_1 is 1 if the difference between v¹_s and v²_s is greater than or equal to 2τ and 0 otherwise. The results are presented in Table 2. The results indicate that intra-person variance is present in the experiment, as m_1 = 0.125 > 0. All three filters show similar results, but they all differ somewhat compared to the participant. This suggests that the filters are able to predict answers reasonably well, though not quite at the level of a human. This could be due to the limitations of our model (see Section 4.2) as well as the limited dataset used for this particular part of our investigation. Furthermore, these results suggest that the filters generalize well to higher frame rates. In our flicker detection study (Section 5.2), we see further evidence of generalization for frame rates in the 60 to 120 FPS range.
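Both the 2× upsampling and the disagreement measure are small operations; the sketch below shows one way to implement them (our naming; disagreement applied to the two runs corresponds to m_1 in Equation (10)).

```python
import numpy as np

def upsample_filter_2x(f: np.ndarray) -> np.ndarray:
    """Linearly upsample a 60 Hz filter to 120 Hz, as done for t_all above."""
    x = np.arange(len(f))
    x2 = np.arange(0.0, len(f) - 1 + 0.5, 0.5)  # samples 1/120 s apart
    return np.interp(x2, x, f)

def disagreement(a: np.ndarray, b: np.ndarray, tol: int = 10) -> float:
    """Fraction of trials where two sets of markings differ by at least
    tol frames (tol = 2 * tau = 10 at 120 FPS)."""
    return float(np.mean(np.abs(a - b) >= tol))
```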

Application
Part of our motivation was to aid the design of video metrics. In this section, we explore using the new filters for flicker detection, as a substitute for a more traditional multi-channel frequency-domain model.
Because flicker consists of a sequence of edges in time, we assume that our filter's output can aid in flicker detection. To apply the filter, we first construct a video cube: a three-dimensional box of pixels that corresponds to the sequence of frames that are displayed on screen one after another (Figure 7). This video cube can be convolved with the temporal edge detection filter along the time dimension. Strong absolute responses from the filter imply a high probability that a human observer will perceive an edge.
As a baseline, we consider the flicker detection model by Denes and Mantiuk [7], and use their dataset for validation. Their model takes a pair of color images as input and estimates the probability of observers detecting flicker when the two images are shown alternately at a given refresh rate. The estimate is given as a probability map. One advantage of using this dataset is its simplicity: the stimuli contain no motion.
After transforming the input images to luminance, Denes and Mantiuk's flicker detection model performs a multi-scale decomposition of the images using a Laplacian pyramid, computes the differences between each spatial frequency band, and then modulates the result with the spatiotemporal contrast sensitivity function as approximated by the pyramid of visibility [48]. The shape of the spatiotemporal CSF depends on the adapting luminance, which was approximated using the average luminance of the two input images. After modulation, the differences are put through a psychometric function to produce probability-of-detection estimates at each pixel for the different frequency bands. These are then pooled over the layers to yield a probability map. To fit the free parameters of their model, as well as to establish its performance, Denes and Mantiuk conducted a user study where participants were shown alternating images and were tasked with marking where they noticed flicker. Because the brush used for marking had limited precision, the probability maps that the model estimated were blurred using a Gaussian kernel with a standard deviation that corresponded to the brush size.
We propose a simplified model, which replaces the multi-scale decomposition as well as the CSF model with our newly derived temporal edge detection filters. Instead of looking at a single alternation of the images, we consider a full second of frames. Specifically, we sample the video cube at double the refresh rate of the displayed signal (i.e., two samples for each frame). We also sample the temporal edge detection filter at the same time points, then convolve the video cube with the temporal edge detection filter along the time dimension. Values in the output that would require data from outside the time domain range are omitted. To avoid scaling issues introduced by the changing refresh rates and the discrete convolution, we first normalize the edge detection filter such that all positive elements sum to 1 and all negative elements sum to −1.
The filter response varies along the time dimension, and we take the maximum value for each pixel along the time axis to get the largest response perceived during the second. Before feeding the result of this operation into the psychometric function, we apply the same blur that Denes and Mantiuk apply to their probability map. We moved the blur to before the psychometric function, as the latter applies a non-linearity prior to filtering, and filtering in nonlinear space is not desirable in general. While Denes and Mantiuk found that their results were improved when blurring in nonlinear space, this was not the case for us. Next, we feed the blurred result into the psychometric function. As our model output is at a relative scale at this point, we fit the parameters of the psychometric function (α and β) to the data from Denes and Mantiuk, yielding the probability of detection

P = 1 − exp( −( g ∗ max_t |B ∗ f| / α )^β ),  (11)

where B is the video cube (3D), f is one of our derived edge detection filters, sampled at the appropriate rate, g is the Gaussian kernel used to blur the filtered result, and ∗ denotes discrete convolution (along the time dimension for f). Because there is no multi-scale decomposition, we do not need to pool the result over different layers; instead, we can use the output of the psychometric function directly as a probability map.

Table 3 summarizes the main results of this section. For an extended version of the table, see Appendix G. Because the dataset was collected at several different temporal frequencies, R ∈ [60, 120] [7], our filters were upsampled accordingly, following the methods in Section 5.1. The adjusted filters use the subscript R in Table 3. The proposed simplified model performs comparably to the original multi-scale model by Denes and Mantiuk.

Table 3. Fitness (measured as mean log-likelihood) when trained on the entire dataset. The parameters α and β are those of the Weibull psychometric function (Equation (11)). On this data, the TCSF-based edge detector (c_all^R) performs best. The ↑ is used to indicate that a higher number is better, and bold marks the best result.
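The whole simplified pipeline fits in a short function. The sketch below is our own reading of the steps above (filter normalization, temporal convolution, per-pixel maximum, blur, psychometric function); the names and the SciPy blur are our choices, not the authors' code.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def flicker_probability(video, f, alpha, beta, brush_sigma):
    """video: (T, H, W) luminance cube sampled at twice the refresh rate;
    f: temporal edge detection filter sampled at the same rate."""
    # Normalize so positive taps sum to 1 and negative taps sum to -1.
    pos_sum = f[f > 0].sum()
    neg_sum = -f[f < 0].sum()
    f_norm = np.where(f > 0, f / pos_sum, f / neg_sum)
    # Convolve along time, omitting outputs needing data outside the cube.
    T = video.shape[0]
    n = len(f_norm)
    responses = np.stack([
        np.tensordot(f_norm, video[t:t + n], axes=(0, 0)) for t in range(T - n + 1)
    ])
    strongest = np.abs(responses).max(axis=0)          # max response per pixel
    blurred = gaussian_filter(strongest, brush_sigma)  # blur before psychometric fn
    return 1.0 - np.exp(-((blurred / alpha) ** beta))  # Weibull (Eq. (11))
```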
In summary, the use of the temporal edge detection filter simplified the flicker prediction model design by removing the need for the multi-scale decomposition and the subsequent pooling. After fitting the psychometric function to the data, all three proposed filters performed comparably to the original multi-scale model on the example data, with the TCSF-based filter performing best. There are further advantages to the edge detection-based approach, as it is likely to predict step-changes in luminance more accurately than frequency-domain, CSF-based models, since the latter are often constrained by the periodic definition of the CSF. However, multi-scale CSF models remain popular for good reason, as they can be extended to capture cross-channel masking. As such, concurrent use of a temporal edge detection filter and a multi-scale CSF could capture the advantages of both approaches. For the purpose of reproducibility, the code and dataset used for the study in this section have been made available.¹

CONCLUSIONS AND FUTURE WORK
We set up and carried out a psychophysical experiment to retrieve estimates of the temporal edge detection filters in human vision. Three different models were compared: Free (each element of the filter is a free parameter during optimization), the derivative of the infinite symmetric exponential function (DISEF; Equation (9)), and a model based on the temporal contrast sensitivity function (TCSF; Section 4.3). The three estimates were similar, and so was their performance in our evaluation, with the DISEF (with a_d = 148 and s_d = 0.016) and TCSF (with a_c = 1932) estimates showing moderately better generalization to refresh rates higher than those they were estimated for. In addition, those filters showed promising results in an experiment where they replaced a multi-channel spatiotemporal contrast sensitivity model in a flicker detection framework. As they perform similarly in our evaluation, either filter can be used in future applications, though we note that the DISEF is simpler to generate.
In future work, we would like to incorporate our temporal edge detection filter into a spatiotemporal extension of the spatial image metric FLIP [2]. We believe the filter will be an important building block for that extension. In addition, it would likely be useful to extend our experiment to handle spatiotemporal patterns, though both the experiment and analysis may become significantly more challenging in that setting. Finally, while natural image sequences tend to exhibit Brown noise, rendered sequences may include, e.g., blue or white noise [49]. Therefore, it would be of interest to investigate the effect on the estimated filters when the color of the noise in the stimulus is changed.

APPENDICES

A RESULTS FROM INDIVIDUAL PARTICIPANTS
In this appendix, we present data that was collected in the main user study but not directly used to estimate filters.

Figure 8 shows how long, on average, each participant needed to complete their trials. Because each sequence was two seconds long, the number of sequence repetitions those times correspond to is simply half the number of seconds. On average, the participants needed 16 seconds (or eight repetitions) to set their marking.

The average accuracy of the individual participants is shown in Figure 9. We see that accuracy was generally higher than 80%, with a mean accuracy of 84%.

Figure 10 summarizes the mean step-change contrast used in each participant's set of trials. The mean contrast was 0.35. In Table 4, we show the average time required to complete the trials as a function of the step-change contrast. We see that the time increases as the contrast decreases and the difficulty of the task increases. Finally, the relation between accuracy and contrast was presented in Figure 5.

B MISESTIMATED DISPLAY GAMMA
As mentioned in Section 2, we aim to use the filters estimated in this article for a spatiotemporal image metric. Ideally, this metric should be robust to small changes in the gamma transform of the display, so that its results are valid for users who do not necessarily use a calibrated monitor. A requirement for such a metric to be robust is that the filters estimated here are robust. In this appendix, we therefore examine how our estimated filter changes due to misestimated display gammas during analysis. First, we consider small changes in display gamma. Second, we investigate how the filter is affected when the observer's monitor is neither linearized nor calibrated. We investigate only the Free filter, assuming that the outcome would be similar for the DISEF and TCSF filters.

B.1 Robustness Analysis
To investigate the robustness of our filters to misestimations of the display gamma, we use the results of our main user study, where the display gamma transform was known (a_m = 0.0, k_m = 1.01, and γ_m = 2.08 in the simple gamma transform a_m + (k_m b)^{γ_m}). We first apply the inverse display transform to the sequences, as we did in the main analysis (Section 4.1). Then, we raise the sequence values to a set of gammas, γ ∈ {2.0, 2.…}. For each γ, we then estimate a filter using the resulting sequences and the procedure described in Section 4.1. In Figure 11, we show those filters and the one estimated in the main section of this article, which used the correct gamma of the display. We see that the results are similar, indicating that the filters are robust to small changes in display gamma.

B.2 Online Experiment
To allow people outside our laboratory environment to participate in the experiment, and thereby increase the number of users, we hosted a version of our experiment online. As a consequence, we could not guarantee that the participants in this version of the experiment were looking at monitors with photometric calibration. Still, participants were asked to have their display in a dim environment, without any direct sunlight shining on it. In total, 57 people, aged 24-58, took part in this version of the experiment. The set of participants included people from a variety of fields, including computer graphics and computer vision experts, mathematicians, and administrative personnel. Some of the computer graphics and vision experts who participated in the main experiment were also part of this online version. Each participant was asked to read a consent form and give consent to participate before starting the experiment. Each participant was asked to run the experiment once.
A critical difference between this version of the experiment and the main study is that we did not make any attempt to linearize the displays for this version. Consequently, the gamma of our sequences was approximately a power of 2.2 off compared to linear, assuming sRGB displays. As our stimuli are supposed to constitute an edge added to Brown noise, and because we here pass the sequence through a nonlinearity before it is shown on the display, we lose the additive property. However, after examining the amplitude spectrum of Brown noise that had a power function applied to it, we found that the noise preserves the property of being Brown (the spectrum was proportional to 1/f [9]), though we note that it, in general, will no longer be Gaussian.
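A check of this kind is easy to reproduce; the sketch below (our code, not the authors') estimates the log-log slope of the amplitude spectrum, which should stay near −1 for Brown noise even after the power function is applied.

```python
import numpy as np

def amplitude_spectrum_slope(x: np.ndarray) -> float:
    """Log-log slope of a signal's amplitude spectrum; Brown noise
    gives a slope near -1 (amplitude proportional to 1/f)."""
    spectrum = np.abs(np.fft.rfft(x - x.mean()))[1:]  # drop the DC term
    freqs = np.fft.rfftfreq(len(x))[1:]
    slope, _ = np.polyfit(np.log(freqs), np.log(spectrum), 1)
    return slope

rng = np.random.default_rng(0)
brown = np.cumsum(rng.normal(0.0, 0.02, 100_000))
brown -= brown.min() - 1.0  # shift so all values are positive before the power
print(amplitude_spectrum_slope(brown), amplitude_spectrum_slope(brown ** 2.2))
```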
Given that the noise is still Brown, the sequences shown in this version of the experiment had almost the same properties as those in the main experiment. To investigate the effect of the differences between this version of the experiment and the main one, we apply the 2.2-power transform to the sequences used for the online study, estimate a filter using the procedure described earlier (Section 4.1), and compare it to the filter found with the main study data. The results are presented in Figure 12. While the amplitudes of the filters differ, and the tails of the online study filter vary more, their overall shapes are similar. Together with the robustness results in Section B.1, we have now seen that the estimated filters are robust both to small and large differences in display gamma, indicating that they can be used without knowing the exact gamma of the user's display.
Because the pool of participants was more varied in the online version of the experiment compared to the main one, we include Figure 13, which shows how the online experiment's participants' accuracy varied with the step-change contrasts. Figure 13 is analogous to Figure 5 for the main experiment. However, note that the step-change contrasts described on the x-axis in Figure 13 do not match the relative luminance contrasts that participants were presented with in the online experiment. This follows from the fact that the displays used for the online experiment were not linearized, implying that the sequences' relative luminance values, as defined in Section 3.1, were raised to a power of approximately 2.2 before they were presented. This transformation changes the values in the entire sequences, including the added step. After being transformed, the sequences are encoded in units that are closer to perceptually uniform than the original, relative luminance values [32]. Whether this fact made the task easier or more difficult requires an analysis that could be insightful but which we leave for future work. Nonetheless, comparing Figures 5 and 13, we see that, in the online version, many trials were done at the largest contrasts, suggesting that some of the participants in that study found the mechanics of the task difficult.

Fig. 12. Comparison of our estimated filter using the data from the main, controlled user study (dotted gray) versus that using the data from the online user study (solid blue), where the displays were neither calibrated nor linearized.

Fig. 13. Bar plot showing how the online study's participants' accuracy depends on the step-change contrasts, c, as they were prior to the display applying a power function of approximately 2.2 to the sequences. Note that the displays in this study were not linearized. Because of this, as explained in the main text, the relative luminance of the displayed sequences in the online study was raised to a power of approximately 2.2 compared to those in the main study. Furthermore, in these results, accuracy was relatively low at the largest contrast, suggesting that some of the people who participated in the online study found the mechanics of the experiment difficult. The yellow numbers inside the bars represent the number of times a given step-change contrast was used to define a sequence. The plot in this figure is analogous to that in Figure 5 for the main experiment.

C THE IMPACT OF THE MOTOR ERROR FACTOR
In this appendix, we compare our estimated Free filter when the motor error (Section 4) is accounted for during filter estimation versus when it is not. Figure 14 shows the filter estimated in the main part of this article ($t^{\text{all}}$) as well as the filter we estimate when the motor error convolution (Equation (4)) is excluded during optimization. The results indicate that the motor error has a significant impact on our data and that neglecting it results in a poor filter estimate. The resulting filter is relatively flat near its center and shows two relatively small peaks around $|t| = 6/60$ seconds. We note that, for the optimization without motor error, we relaxed the regularization of our filter (Equations (6) and (7)) because doing so resulted in better fits.
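For concreteness, the following hypothetical sketch shows the kind of motor-error step that Equation (4) describes: the model's per-frame marking probabilities are circularly convolved with a Gaussian and renormalized. The function and variable names are illustrative and not taken from our implementation.

import numpy as np

def apply_motor_error(p, mu_m, sigma_m):
    # Build a Gaussian kernel over frame offsets, centered on the motor-error mean.
    n = len(p)
    offsets = np.arange(n) - n // 2
    kernel = np.exp(-0.5 * ((offsets - mu_m) / sigma_m) ** 2)
    kernel /= kernel.sum()
    # Circular convolution via FFT, matching the circular sequences in the study.
    blurred = np.real(np.fft.ifft(np.fft.fft(p) * np.fft.fft(np.fft.ifftshift(kernel))))
    blurred = np.clip(blurred, 0.0, None)
    return blurred / blurred.sum()  # renormalize so the result is again a distribution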

D FILTER THEORY DETAILS
In this appendix, we present details on the filter estimation, including how we estimated the participants' motor error.
Motor Error Estimation. Before computing the distance between a marked frame, $\hat{I}$, and the frame containing the inserted edge, $I$, we subtracted 0.5 from the latter so that the edge location lies between the two frames that constitute the edge. To reduce the impact of outliers in the motor error estimation, we used robust estimators for the mean and standard deviation. In particular, we used the median absolute deviation for $\sigma_m$ and the Huber estimator [13] (with 10 iterations and regression constant 1.345, resulting in 95% efficiency) for $\mu_m$. While not explicitly stated in Equation (8), we note that the final probability estimates were clamped to a minimum of 0.001 before being input to the logarithm in order to avoid large changes in loss value between very small and extremely small probabilities. Furthermore, because our sequences are circular (Section 3.1), we use circular padding for our convolutions, and we renormalize the probabilities after applying the motor error Gaussian (Equation (4)) because the probabilities do not necessarily sum to one after that convolution. The optimization was performed by a Python program using the PyTorch library [30] and the Adam optimizer [19] with learning rate 0.01 and 200,000 training epochs.
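The robust estimators above can be sketched as follows; this is a minimal illustration (assuming signed marking errors measured in frames), not our exact program.

import numpy as np

def mad_std(x):
    # Median absolute deviation, scaled for consistency with a Gaussian standard deviation.
    return 1.4826 * np.median(np.abs(x - np.median(x)))

def huber_location(x, c=1.345, iters=10):
    # Huber M-estimate of location via iteratively reweighted averaging.
    mu = np.median(x)                  # robust starting point
    scale = max(mad_std(x), 1e-12)
    for _ in range(iters):
        r = (x - mu) / scale
        w = np.where(np.abs(r) <= c, 1.0, c / np.maximum(np.abs(r), 1e-12))  # Huber weights
        mu = np.sum(w * x) / np.sum(w)
    return mu

# Example use (hypothetical variable names):
# errors = marked_frames - (edge_frames - 0.5)   # signed marking errors, in frames
# mu_m, sigma_m = huber_location(errors), mad_std(errors)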

E GENERATING THE TIME-DOMAIN VERSION OF THE TCSF
Given a TCSF, defined analytically in the frequency domain, we here describe how we generate time-domain versions of that TCSF. In particular, we generate time-domain versions sampled at frequencies $R \in [60, 120]$ Hz.
First, let the sampling frequency be $F_s = 960$ Hz, chosen because it is well above the Nyquist limit for our target sampling rates. The number of samples is set to $N_s = 2F_s + 1$. Next, we sample the TCSF at $\omega = \{0, 1/(F_s - 1), \ldots, (F_s/2)/(F_s - 1)\}$ and call the result $R_T$. To make the filter odd, we create the vector $\hat{R}_T = \{0, iR_T, -i\,\mathrm{flip}(R_T)\}$, where $i = \sqrt{-1}$ and $\mathrm{flip}(x)$ reverses the order of the elements in $x$, apply the inverse Fourier transform to it, and shift the result so that $t = 0$ is at the center of the filter. Let the result be $F^{-1}(\hat{R}_T)$. The time-domain samples are now $1/F_s = 1/960$ seconds apart, sampled at $t = \{-1, (-F_s + 1)/F_s, (-F_s + 2)/F_s, \ldots, (F_s - 1)/F_s, 1\}$. Finally, we resample $F^{-1}(\hat{R}_T)$ at $t_R = \{-1, (-R + 1)/R, (-R + 2)/R, \ldots, (R - 1)/R, 1\}$ using linear interpolation and multiply by the estimated amplitude, $a_c$, to create $c^{\text{all}}_R$. For example, with $R = 60$ Hz, we get $c^{\text{all}}_{60}$, and with $R = 120$ Hz, we get $c^{\text{all}}_{120}$.
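The construction can be sketched as follows, assuming a hypothetical function tcsf(f) that returns the TCSF magnitude at frequency f in Hz; the sample grids are slightly simplified relative to the description above, so this is illustrative rather than our exact code.

import numpy as np

def tcsf_time_domain(tcsf, R, a_c=1.0, Fs=960):
    # Sample the TCSF from DC up to the Nyquist frequency Fs/2.
    freqs_hz = np.linspace(0.0, Fs / 2, Fs // 2 + 1)
    RT = tcsf(freqs_hz)

    # Build the odd, purely imaginary spectrum {0, i*RT, -i*flip(RT)} so the
    # inverse transform is a real, odd (edge-detector-like) time-domain filter.
    spectrum = np.concatenate(([0.0 + 0.0j], 1j * RT, -1j * RT[::-1]))
    h = np.fft.fftshift(np.real(np.fft.ifft(spectrum)))

    # Resample onto the target grid t_R = {-1, (-R+1)/R, ..., (R-1)/R, 1} seconds.
    t = np.linspace(-1.0, 1.0, len(h))
    t_R = np.arange(-R, R + 1) / R
    return a_c * np.interp(t_R, t, h)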

F CROSS-VALIDATION RESULTS
For completeness, we here present the full set of results of our four-fold cross-validation study. Parameter fitness was determined by the value of the objective function (Equation (8)) applied to the training and validation sets after the filter had finished training. The smoothness and non-oddness penalties were excluded during validation.
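This scoring scheme can be illustrated with the following hypothetical sketch; the penalty forms S and O below are assumed for illustration and are not necessarily Equations (6) and (7) verbatim.

import torch

def training_score(log_lik, filt, zeta, xi, n_train):
    S = torch.sum((filt[1:] - filt[:-1]) ** 2)               # smoothness penalty (assumed form)
    O = torch.sum((filt + torch.flip(filt, dims=[0])) ** 2)  # non-oddness penalty (assumed form)
    return log_lik / n_train - zeta * S - xi * O

def validation_score(log_lik, n_val):
    # Penalties are excluded during validation.
    return log_lik / n_val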
The results of the cross-validation are reported in Figure 15. The motor error estimates for the different cross-validation splits are recorded in Table 5. The highest average validation score was achieved for smoothness weight $\zeta = 0.005$, non-oddness weight $\xi = 0.1$, and filter length $n_l = 21$ samples. For the DISEF fit, we conducted the same cross-validation as for the Free filter, comparing filter lengths of 11, 21, 31, and 41. The scores were $-2.7267$, $-2.7274$, $-2.7274$, and $-2.7274$, respectively, suggesting that a length of 11 samples (and thus a radius of $r_d = \lfloor 11/2 \rfloor = 5$ samples) was sufficient and that increasing the filter length did not notably impact the best DISEF fit.

G FLICKER-DETECTION RESULTS
In Table 6, we present the complete version of Table 3, including the mean log-likelihoods for each of the three-fold cross-validation splits and the corresponding $\alpha$ and $\beta$ values. Three-fold cross-validation was used instead of four-fold cross-validation due to the limited size of the dataset. As opposed to the cross-validation used for the filter estimates, all data was used for the cross-validation, i.e., no test dataset was excluded.

Table 6. Complete Version of Table 3, Showing Fitness of Different Models to the Flicker Dataset Collected by Denes and Mantiuk [7]

Fig. 1. A simplified illustration of the construction of the stimuli used to estimate the temporal edge detection filters in human vision. Gaussian Brown noise is generated by taking the cumulative sum ($\int dt$) of Gaussian white noise. The final stimulus consists of the Gaussian Brown noise added ($+$) to a step edge, with the step occurring at a random time. Participants in our study were asked to mark the point in time with the largest intensity change. Full details are in the article.

Fig. 3. One frame of an example stimulus. Each of the parts 1-4 is explained in the text.

Fig. 5. Bar plot showing how the participants' accuracy depends on the step-change contrasts, c. The yellow numbers inside the bars represent the number of times a given step-change contrast was used for a trial. For results from individual participants, see Appendix A.

Fig. 7. Illustration of how we adapted Denes and Mantiuk's flicker detection model [7] to use our proposed temporal edge detection filter instead of the multiscale decomposition and spatiotemporal contrast sensitivity modulation used originally. This should be compared to Figure 1 in Denes and Mantiuk's paper.

Fig. 8. Average time each participant needed over their set of trials. The dashed line shows the average time required over all participants.

Fig. 9. Average accuracy over each participant's set of trials. The dashed line shows the average accuracy over all participants.

Fig. 10. Average step-change contrast over each participant's set of trials. The dashed line shows the average step-change contrast over all participants.

…2, 2.4\} (assuming the $a$ and $k$ values are the same as the measured values $a_m$ and $k_m$, respectively), yielding relative luminance values of the form $b_\gamma = b^{\gamma/\gamma_m}$.

Fig. 15. Cross-validation results. The five leftmost columns show results on the training sets, including regularizer penalties ($\hat{L}/n_s^{\text{train}} - \zeta S - \xi O$, where $n_s^{\text{train}} = 0.6\,n_s = 540$), while the five rightmost columns show results on the validation sets ($\hat{L}/n_s^{\text{val}}$, where $n_s^{\text{val}} = 0.2\,n_s = 180$), for the four different folds and their average. The marker's shape indicates the choice of $\zeta$ (more corners correspond to higher values of $\zeta$), its color encodes the choice of $\xi$ (a brighter color indicates a higher value of $\xi$), and the brightness of its outline shows the length, $n_l$, of the filter (a brighter outline implies a longer filter). In this plot, larger y-values are better. Our chosen configuration was $\zeta = 0.005$ (triangle), $\xi = 0.1$ (brightest face color (yellow)), and $n_l = 21$ (dark gray outline), as it scored the highest on average on the validation sets (highest value in the rightmost column).

Table 1. Average Log-likelihood that the Estimated Filters (Free: $t^{\text{all}}$, DISEF: $d^{\text{all}}$, and TCSF: $c^{\text{all}}$) Achieved on the Test Set ($n_s^{\text{test}} = 180$ Sequences)

Table 2. Results of the Double-pass Analysis, where $m_1$ is the Rate at which a Participant Makes a Significantly Different Marking on a Second Run through Their Trials and $m_2$ is the Rate at which the Model Predicts an Answer Significantly Different from the Participant's Marking (Equation (10)). How similar a filter's $m_2$ value is to $m_1$ can be interpreted as how well the filter predicts human responses; the closer, the better. The best result is written in bold.

Table 3. Fitness of Different Models to the Flicker Dataset Collected by Denes and Mantiuk [7]

Table 5. Motor Error Estimates for Each Cross-validation Split