Known-Item Search in Video: An Eye Tracking-Based Study

Deep learning has revolutionized multimedia retrieval, yet effectively searching within large video collections remains a complex challenge. This paper focuses on the design and evaluation of known-item search systems, leveraging the strengths of CLIP-based deep neural networks for ranking. At events like the Video Browser Showdown, these models have shown promise in effectively ranking video frames. While ranking models can be pre-selected automatically based on a benchmark collection, the selection of an optimal browsing interface, crucial for refining top-ranked items, is complex and heavily influenced by user behavior. Our study addresses this by presenting an eye tracking-based analysis of user interaction with different image grid layouts. This approach offers novel insights into search patterns and user preferences, particularly examining the trade-off between displaying fewer but larger images versus more but smaller images. Our findings reveal a preference for grids with fewer images and detail how image similarity and grid position affect user search behavior. These results not only enhance our understanding of effective video retrieval interface design but also set the stage for future advancements in the field.


INTRODUCTION
Content-based image retrieval has rapidly evolved since the beginning of the millennium [21] to employ deep learning-based approaches [5]. Current approaches allow for easy ranking of image/video datasets with respect to an example query or a text description. With just a few lines of code, developers can reach impressive search effectiveness as long as pre-trained models are available [10]. These models provide comfortable joint embedding spaces [18], suitable for both text-image and image-image retrieval using the cosine similarity $s_{\cos}$. Specifically, it is sufficient to compute $s_{\cos}(F_q(q), F_I(f))$ for all video frames $f$ in the collection, where $F_q$ is a feature extraction function for a query object $q$ (e.g., text), while $F_I$ is a feature extraction function for images. While more effective image-image search models exist [20, 29], the simple system design and low memory footprint make such a single similarity model attractive.
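As a hedged illustration of this single-similarity-model design, the following sketch ranks a set of frames against a text query using open CLIP; the concrete model variant ("ViT-B-32" with LAION weights) and all function names are assumptions for illustration rather than the exact setup of the systems discussed here.

```python
# Minimal ranking sketch: embed a text query and candidate frames with an
# open CLIP model, then sort frames by cosine similarity s_cos(F_q(q), F_I(f)).
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")   # assumed model variant
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def embed_text(query: str) -> torch.Tensor:
    with torch.no_grad():
        v = model.encode_text(tokenizer([query]))
    return v / v.norm(dim=-1, keepdim=True)        # unit norm -> cosine = dot product

def embed_image(path: str) -> torch.Tensor:
    with torch.no_grad():
        img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        v = model.encode_image(img)
    return v / v.norm(dim=-1, keepdim=True)

def rank_frames(query: str, frame_paths: list[str]) -> list[tuple[str, float]]:
    q = embed_text(query)                                        # F_q(q)
    frames = torch.cat([embed_image(p) for p in frame_paths])    # F_I(f) for all frames f
    sims = (frames @ q.T).squeeze(1)                             # s_cos per frame
    order = sims.argsort(descending=True)
    return [(frame_paths[i], float(sims[i])) for i in order]
```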
Since no similarity model can provide a perfect ranking of the search results, the design of the search interface is relevant for the effective usage of similarity search systems. However, designing optimal interfaces for result set visualization and browsing is a difficult task [15, 24] due to the dependence on user preferences, behavior, and the search task itself. In this paper, we focus on known-item search (see Section 3.1), which remains challenging [10] even in the era of highly effective machine learning models. More specifically, we focus on the browsing phase once a result set is obtained for a query object and analyze user interaction and search strategies.
Multiple studies on search interface design [2, 4, 8, 9, 12, 13, 25-28] analyze aspects such as positioning and content-task-dependence, and several of these include eye tracking analysis to directly assess user behavior. Eye tracking is beneficial to understand how visualization and interface design influence user behavior, enabling a more detailed, quantitative analysis of behavioral and pre-attentive patterns compared to solely qualitative analyses.
For our analyses, we use an eye tracker to inspect the behavior of users searching for one particular video frame in a vertically scrollable result set, in order to better understand the reasons for the search outcomes (item found or overlooked). The left part of Figure 1 shows a heat map of collected eye gazes for one user in one task. Instead of images, the grid cell colors indicate the similarity of the grid items to the searched video frame (white means most similar). The right part of the figure shows an example of collected data for 15 known-item search tasks performed by one user, where each search session is represented as one column. It shows captured eye gazes as colored paths over the whole result grid (including scrolling), while the correct target is presented as a green rectangle. Green dots show filtered activity (e.g., the submission form). The eye gaze path is colored based on the elapsed time. Grids with 8 columns are indicated with a smaller background rectangle. Based on this eye tracking data, our analysis aims to support the understanding of video known-item search browsing effectiveness, contributing insights that address the following essential questions:
• Is it better to use a 4×50 or an 8×25 grid for known-item search, given the top-ranked 200 video frames for a query?
• Are there overlooked target video frames, and what is their impact on the overall search performance?
• Which regions in a grid attract the most attention and which receive almost none? Does this change with scrolling?
• Do users mostly ignore video frames (no eye gazes) that are highly dissimilar to the target, and what is pre-attentively perceived as dissimilar?

RELATED WORK
Applying eye tracking to search engine results displays is a popular technique utilized to analyze contextual, resource, and individual factors [8]. In this paper, we focus on known-item search in an "alternative" grid interface design (i.e., not a list), which is popular for image search result presentation. Image grid interfaces were recently studied with eye tracking devices [27], showing that images in middle columns have a lower first arrival time and a higher fixation duration. Furthermore, the study confirmed that the content of images affects behavior as well as positioning. This study used three types of tasks, namely specific, generic, and abstract. Lu et al. [13] also focused on the analysis of user attention in image grids. The authors analyzed the effect of grid cell position and task type, finding that cell positions in the top center area receive more attention and that the type of search task influences user behavior. Another study analyzed users' attention in interfaces [2]. Two exploratory tasks (i.e., finding diverse images matching some topic) were tested for different interfaces, and the authors present heat maps of user attention and time spent in areas of interest.
Eye tracking can also be used in image retrieval to directly improve search effectiveness [16]. In [9], a method combining eye tracking data and image segmentation is used to design a more effective similarity retrieval model. The model uses gaze duration to estimate region importance in images. In [28], the authors use eye tracking-based heat maps to classify images that are relevant for implicit relevance feedback. Results show that this non-intrusive method yields results similar to explicit relevance feedback by mouse click. In [4], the authors successfully combined eye tracking data with EEG signals to discover the intended query interpretation. Another study [25] focused on the correlation between eye movements while remembering an image and the content of the image. These eye movements were tested as an additional descriptor for image retrieval.
Eye tracking devices offer numerous possibilities and incentives for examining user behavior to enhance search systems. Our study utilizes eye tracking data to better understand search outcomes in different grid configurations. The goal is to perform additional analyses in the context of known-item search [12] and with more similarity search insights based on the popular open CLIP model [7, 18]. While not predominantly focused on eye tracking, there are also studies that analyze user behavior within image search engines [26].
Our study aims to complement these works with additional findings obtained for state-of-the-art joint embedding models.

USER STUDY SPECIFICATION
Previous work on efficient and effective video retrieval systems mainly focused on the process of retrieving the best matching images for a given user-defined query. The arrangement of these images, and the role of the user in checking them, still remains under-explored. With our work, we aim to fill this gap and provide insights regarding user behavior when analyzing the retrieved images. To this end, we apply an eye tracking device to analyze scanning and inspection strategies, detect differences in user attention based on image positions, and investigate reasons for overlooking the target image. Further, we compare two different grid structures, namely a four-column layout and an eight-column layout. Thus, we investigate the trade-off between placing more images on a screen page, at the cost of smaller images with fewer perceivable details, and showing fewer but larger images with more detail, at the cost of more required scrolling and less overview.

Data & Task
For the study, we selected the domain-specific marine video dataset MVK [23], which has been used as a benchmark dataset at the Video Browser Showdown [10, 12] competition since 2023. The motivation for using a domain-specific dataset in the study is to focus on search tasks where result sets are more homogeneous. For each video in the MVK dataset, representative keyframes were extracted, and for each of them, CLIP-based embeddings [7, 18] were computed.
As a search task, we consider visual known-item search [12] (KIS), where users want to find one particular image but remember only its content (i.e., no filename or ID is available). In our case, users are required to find one particular image in the dataset of extracted representative keyframes based only on their memories. The target image is presented to the users at the beginning of each search task, or later on demand in a pop-up window. In addition, users do not have to browse the whole dataset to find the target but are required to inspect just a smaller predefined ranked set of images. Each ranked set $S_i$ is defined as the 200 nearest neighbors of a representative image $q_i$ in the dataset. In other words, a fixed initial image query $q_i$ starting the search is assumed. In total, 77 sets $S_i$ were prepared in advance, aiming to cover as much diversity of the MVK dataset as possible. Target images $t_i$ were randomly sampled from $S_i$ (items closer to the top of the list had a slightly higher chance of being sampled, corresponding to common patterns in known-item search results [11]). This means that the target image was always present in the shown result list, but study participants were not aware of this fact. Hence, we provided an answer option stating that the target item was not in the list.
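A minimal sketch of how such ranked sets and the rank-biased target sampling could be constructed is shown below; the geometric decay weighting and all variable names are assumptions for illustration, not the authors' exact procedure.

```python
# Sketch: ranked set S_i = 200 nearest neighbors of a seed frame q_i (cosine
# similarity over CLIP embeddings), target t_i sampled with a slight top bias.
import numpy as np

def build_ranked_set(seed_vec: np.ndarray, all_vecs: np.ndarray, k: int = 200) -> np.ndarray:
    """Return indices of the k nearest neighbors of the seed embedding."""
    seed = seed_vec / np.linalg.norm(seed_vec)
    vecs = all_vecs / np.linalg.norm(all_vecs, axis=1, keepdims=True)
    return np.argsort(-(vecs @ seed))[:k]           # most similar first

def sample_target(ranked_set: np.ndarray, rng: np.random.Generator,
                  decay: float = 0.99) -> int:
    """Sample a target index with a slightly higher chance for top-ranked positions."""
    weights = decay ** np.arange(len(ranked_set))   # assumed geometric rank bias
    return int(rng.choice(ranked_set, p=weights / weights.sum()))

# rng = np.random.default_rng(0)
# S_i = build_ranked_set(embeddings[seed_idx], embeddings)
# t_i = sample_target(S_i, rng)
```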
Since the study also aims to compare two different display settings (4- vs. 8-column layout), we use a randomization strategy to avoid any systematic bias in later tasks, employing the following shuffling procedure between individual users for assigning task subsets and display settings: All users received 30 search tasks in total. The first task was meant only to familiarize users with the environment and was excluded from the results. The next six tasks were the same for all users, while the remaining 23 tasks were assigned randomly for each pair of users. The layout variants were assigned to every odd user at random (ensuring an equal number of layout tests per user), while each even user received the inverse layouts of the previous (odd) user. For example, if tasks <5,8,6,1> were assigned to user three to solve using grids <4,8,8,4>, then user four received tasks <5,8,6,1> using grids <8,4,4,8>.
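The odd/even counterbalancing can be sketched roughly as follows; the task-pool representation, helper name, and seeding are assumptions and only illustrate the pairing logic described above.

```python
# Sketch: one warm-up + six shared tasks for everyone, the remaining tasks
# randomized per user pair, layouts balanced for the odd user and inverted
# for the even user.
import random

def assign_pair(task_pool: list[int], seed: int, n_tasks: int = 30):
    rng = random.Random(seed)
    fixed = task_pool[:7]                                  # warm-up + 6 shared tasks
    randomized = rng.sample(task_pool[7:], n_tasks - 7)    # remaining 23 tasks
    tasks = fixed + randomized
    layouts_odd = [4] * (n_tasks // 2) + [8] * (n_tasks - n_tasks // 2)
    rng.shuffle(layouts_odd)                               # random but balanced
    layouts_even = [8 if c == 4 else 4 for c in layouts_odd]
    return (tasks, layouts_odd), (tasks, layouts_even)     # odd user, even user
```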

Apparatus
For the user study, we invited the participants to individual sessions of approximately one hour, taking place in a university laboratory. We used a traditional desktop setup, as used in common video retrieval settings. Thus, the participants were seated at a desk with a monitor, mouse, and keyboard. We used a 16:9 Dell U2715H monitor with a diagonal of 27 inches and a resolution of 2560 × 1440 px. Right under the monitor, we placed the eye tracker by The Eye Tribe [22] and made sure that it faced the user at the optimal angle. While many eye tracking user studies incorporate a chin rest for head stabilization, we omitted head fixation, as it can significantly reduce the participants' comfort and is mainly applied when very high precision is required. However, during the study, the accuracy was continuously monitored, and in the rare cases (3 times) where repositioning was necessary, it was done before a task started. We used the API of the eye tracker to retrieve the current tracking state and eye pixel location at 30 Hz, processing the results in a backend and transmitting them to the frontend. Our frontend (see Figure 2) consists of a website connecting to the backend via WebSocket. It loads the correct configuration for the current participant, displays the image grids for the corresponding tasks, and logs all user actions, interface states, answer times, and the eye tracking information. Further, users could press the space bar at any time to view the target image again. To ensure instantaneous feedback of the eye tracking results to the user, we applied subtle border highlighting of the focused image grid cells and asked participants to inform the study supervisor when the highlighting no longer matched the actually focused images.
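For illustration, the mapping from a raw gaze coordinate to the highlighted grid cell can be sketched as a purely geometric computation; the cell geometry, scroll handling, and function name below are assumptions and do not reflect the actual tracker API or frontend code.

```python
# Sketch: convert a gaze point (screen pixels) to the (row, column) of the
# grid cell it falls into, taking the current vertical scroll offset into account.
def gaze_to_cell(x: float, y: float, scroll_y: float, n_cols: int,
                 cell_w: float, cell_h: float,
                 screen_w: int = 2560, screen_h: int = 1440):
    if not (0 <= x < screen_w and 0 <= y < screen_h):
        return None                          # tracker reported an off-screen point
    col = int(x // cell_w)
    if col >= n_cols:
        return None                          # gaze right of the grid area
    row = int((y + scroll_y) // cell_h)      # screen -> page coordinates
    return row, col
```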

Participants & Procedure
We invited 20 participants (ten male, ten female) with normal or corrected-to-normal vision. Their age varied between 19 and 32 years (M = 25.1, SD = 4.43). We did not require prior knowledge, and most participants had little or no previous experience in video retrieval or eye tracking (see Figure 3). Every participant performed the following procedure: After signing a consent letter, the participant received an explanation of the task and the testing application. After a short question period, calibration of the eye tracker was performed and microphone recording started to capture comments during the study. Then, the 30 tasks were performed with a fixed break after 15 tasks and possible further breaks at any time on request. After the study, the participant filled out a questionnaire and received the standard monetary compensation of 10 EUR.
To validate our setup, we conducted a small-scale pre-study with five users. The participants had normal or corrected-to-normal vision (three with glasses, two without optical aids). For each participant, the eye tracking device was calibrated, followed by a brief introduction to their task. The participants observed multiple image grids with four and eight columns. Images were highlighted according to the eye gaze signal received. We asked the participants to look at different images in the grid and validate whether the focused images matched the visually indicated images. All users reported that the eye tracker measurement of their focused images worked robustly for both grid structures. Thus, the eye tracking setup was used without adaptations for the user study.
The collected gaze data was pre-processed as follows (a minimal filtering sketch is shown after this list):
• Gaze data was only logged during a task. If there was a restart in a task (rare, due to eye tracker re-calibration), only the latest start is considered as the task start.
• Events where the eye tracker reported gaze coordinates outside the screen boundaries were filtered out.
• Lost tracking, indicated by zero coordinates [17], is filtered.
• In order to help users remember the target image, they were allowed to look at the target video frame by pressing the space bar. Users could also click on a promising video frame to show the submit form. During these "dialog" periods, the result grid was hidden and gazes were ignored.
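A minimal filtering pass implementing the rules above could look like the following sketch; the sample record layout (dictionaries with x, y, and timestamp fields) and the interval representation are assumptions.

```python
# Sketch: keep only valid on-screen gaze samples recorded while the result grid
# was visible during the (latest) task run.
SCREEN_W, SCREEN_H = 2560, 1440

def clean_gaze_log(samples, task_start, task_end, dialog_intervals):
    cleaned = []
    for s in samples:
        if not (task_start <= s["t"] <= task_end):
            continue                                    # outside the latest task run
        if s["x"] == 0 and s["y"] == 0:
            continue                                    # lost tracking reported as (0, 0)
        if not (0 <= s["x"] < SCREEN_W and 0 <= s["y"] < SCREEN_H):
            continue                                    # coordinates outside the screen
        if any(a <= s["t"] <= b for a, b in dialog_intervals):
            continue                                    # target pop-up or submit form open
        cleaned.append(s)
    return cleaned
```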

Gaze Data Limits
To assess the limits of the eye tracking and study framework, we checked whether tasks were solved without detecting eye gazes on the target item. A solved task implies that the user clicked on the searched video frame, and thus, through missing eye gazes, we may estimate the level of framework robustness. Overall, there were no missing eye gazes for the 4-column grid layout, but some solved tasks in the 8-column grid layout occurred without a registered eye gaze (see Table 1). The majority of these cases (15 out of 17, assuming a 10 px margin) were caused by only four users. For both settings, there were also cases with just a few (1, 2, 4) consecutive eye gazes over a target image while the task was still solved. We hypothesize that there are two key reasons for this finding: First, eye tracking devices (especially low-cost ones) do not have 100% pixel point detection accuracy, and thus, eye gazes can be detected with a small amount of noise [17], especially towards the screen boundaries. If users reach a marginal region of an image, there is still a chance that eye gazes are not detected for this image due to these inaccuracies. Second, users could identify the images with peripheral vision and click on them before an eye gaze was detected. The submit form (in the center of the screen) contained a larger version of the image, so the users did not have to inspect a promising item in detail within the grid. On the other hand, in most cases, there were recorded eye gazes in close proximity to the target items, as reported in Table 1. Therefore, we can assume that the detection noise is not very large. Overall, we observe that the recorded gaze dataset is not 100% complete and accurate, while it still provides a solid approximation of the performed searches.
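This check can be expressed as counting the longest run of consecutive gaze samples inside the target's bounding box extended by a pixel margin; the following sketch assumes page-coordinate samples and an (x0, y0, x1, y1) box representation.

```python
# Sketch: longest run of consecutive gaze samples inside the target frame's
# bounding box enlarged by a detection margin (e.g., 10 px).
def longest_run_on_target(samples, target_box, margin: int = 10) -> int:
    x0, y0, x1, y1 = target_box
    x0, y0, x1, y1 = x0 - margin, y0 - margin, x1 + margin, y1 + margin
    best = run = 0
    for s in samples:
        if x0 <= s["x"] <= x1 and y0 <= s["y"] <= y1:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best     # 0 for a solved task means the target was never gazed at
```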
Table 1: The number of solved tasks with more than k consecutive eye gazes in the target frame + detection margin (in pixels). Blue color indicates that all targets were detected.

Grid Setting Comparison
Every solved search task ended in one of three possible states, namely, the target item was found (i) correctly, the user submitted an (ii) incorrect item, or the user submitted that the (iii) target was not in the list. In addition to these states, we computed an indicator approximating a search session where users first overlook the target item but later find it. The approximation was computed as $y_{\max} - y_{\text{target}} > \tau$, where $y_{\max}$ represents the maximal reached/observed $y$ coordinate in the grid, $y_{\text{target}}$ represents the bottom $y$ coordinate of the target image, and $\tau$ was set to 1440, corresponding to the utilized vertical resolution of the screen. A softer version of overlooking was tested as well, with $\tau = 720$. Table 2 summarizes the overall search outcomes with the average answer times for both grid types. For the answer times, Shapiro-Wilk tests indicated non-normal distributions. Thus, we applied Mann-Whitney U tests, finding significant differences in total, as well as for the correct and incorrect cases ($\alpha = 0.05$).
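The overlook indicator and the answer-time comparison follow directly from this description; in the sketch below, the per-condition time arrays and function names are assumptions, while the threshold and the test choices match the text.

```python
# Sketch: tau-based overlook indicator and Mann-Whitney U comparison of answer
# times (Shapiro-Wilk used only to check the normality assumption first).
from scipy.stats import shapiro, mannwhitneyu

def overlooked(y_max: float, y_target_bottom: float, tau: float = 1440) -> bool:
    """Target counted as overlooked if the user scrolled more than tau px past it."""
    return y_max - y_target_bottom > tau

def compare_answer_times(times_4c, times_8c, alpha: float = 0.05) -> dict:
    min_normality_p = min(shapiro(times_4c).pvalue, shapiro(times_8c).pvalue)
    u_stat, p = mannwhitneyu(times_4c, times_8c, alternative="two-sided")
    return {"min_normality_p": min_normality_p, "U": u_stat,
            "p": p, "significant": p < alpha}
```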
We observe a high success rate for both types of grids, with 253 solved tasks for 4 columns and 240 solved tasks for 8 columns. This could be attributed to the simplicity of some tasks and to the careful inspection of the grids by user study participants, who often inspected the same grids in multiple iterations. Nevertheless, it is evident that in real search scenarios, although users may initially overlook the target and discover it later, the viability of this search strategy is questionable, since users usually have the option to refine their search query. From this perspective, the 8-column setting has almost twice as many overlooked targets as the 4-column setting in both the soft and hard versions of $\tau$-based overlooking. Regarding search times, it is obvious that once a target is overlooked, it takes considerably more time to find it (even though the average target row number is similar for found-directly and found-overlook). The average times depend on the overlooking threshold $\tau$, with lower threshold values typically moving shorter tasks into the overlooked category. In the found category, the search time is significantly lower for the 4-column setting, also due to a lower number of overlooked items.
Surprisingly, the number of incorrect submissions is not negligible (6% to 8%), even though users were allowed to easily view the searched target image on demand by pressing the space bar. We manually inspected all 39 cases, and almost all incorrect submissions corresponded to near-duplicate images from the same video (see the first two columns in Figure 4); only in five cases was the submitted image from a different video than the target, and in only one case (8 columns) was the submitted image highly different from the target (in terms of dominant colors, see the last column in Figure 4). These findings illustrate the difficulties in identifying exactly the same video frame using just (even immediate) memories. In the future, the testing interface could provide feedback about correctness, returning users to the search process, as the "true" target image was further ahead in the list in about 50% of these cases. Lastly, the table shows the number of searches where users pressed the "not in list" button. This category has the highest average search times in both grid settings. In almost all cases, there was a detected eye gaze in the target item area or its direct grid neighbor during a search session (see Table 3). In some cases, there were even several consecutive detected eye gazes in the target region. Yet, the user did not submit the target video frame. We observe a 40% higher number of not-found items for the 8-column setting, with slightly lower answer times than the 4-column setting. We hypothesize that users may be more frustrated when searching through a larger number of smaller images on a single page, especially when no additional image sorting strategy [3] is used.
Overall, the search outcomes in Table 2 reveal indicators (average search time and success rate) leaning towards the 4-column setting over the 8-column setting for the utilized dataset and tasks. Especially when only one pass is expected as the dominant search strategy, the 4-column grid has significantly more solved tasks (Fisher's exact test). The collected data also confirmed that in some cases, there was a sequence of more than eight tracked coordinates in the correct item area, yet users did not recognize it and later selected the "not in list" option. For the 4-column grid setting, it was more likely that a task was solved if the target was not close to the grid boundary, as reported in Table 4. For the 8-column grid setting, this pattern was observed only during the first pass, assuming a soft overlook. However, users were able to find many searched items near the right boundary in the subsequent search iterations.

Fixation Statistics
We first consider page-wise and screen-wise fixation distributions before analyzing row- and column-wise patterns, and finally the impact of displayed content on the recorded data.

4.3.1 Page-wise and Screen-wise Statistics
The overall fixation pattern (see Figure 5) shows that in both layouts, users exhibited a centralized focus with a clear distinction between individual columns. That is, the highest fixation density could be observed around the horizontal center of each column, while columns in the central part of the screen were overall more fixated than those closer to the edges. The inter-column differences were rather substantial, as the least fixated column received only 60% of the most fixated column's fixations in the 4-column grid and 56% in the 8-column grid.
The fixation volumes tend to decrease steadily with the row ID, starting from the second row. However, in the 4-column grid, the first row was considerably less fixated than the second row, while in the 8-column grid, even the third and fourth rows attracted more fixations than the first one. Further, as target items were distributed throughout the list of results, many users did not scroll to the lower sections of the page because they found the target image earlier.
As for the vertical fixation distribution over the screen (i.e., scrolling ignored), users exhibited a central or slightly lower focus on average. The mean vertical gaze coordinate was 774 and 693 for the 4-column and 8-column grid layouts, respectively, corresponding to 54% and 48% of the screen height. This was somewhat surprising for us, as we expected more focus on the upper parts, similar to the page-wise results. We hypothesize that after observing the initially visible section of the page, users mostly keep their focus closer to the bottom of the screen to check for newly emerging items. To verify this, we divided the gaze data into entries recorded while the page was in the initial position (i.e., no scrolling) and the rest. With no scrolling, the means of the vertical gaze coordinates were 550 and 525 for the 4- and 8-column grids, respectively, while with scrolling the means were 805 and 769, respectively. Note that during the scrolling phase, users could also scroll back to the top of the page (more often for 8-column grid layouts), which might have had an inverse effect.

4.3.2 Fixation Existence Probability for Rows and Columns
Given the highly diversified distribution of fixations across the page, one may wonder what the chances are that items at certain positions are not perceived at all (i.e., had zero recorded fixations). In this section, we specifically focus on per-row and per-column aggregated statistics of whether items from the specified row/column had at least one fixation. However, we first need to filter out the trivial cases where users omitted whole sections of the page, typically because the target item was found directly during the first pass. In such cases, we mostly observed a strict row-wise eye gaze pattern, i.e., the row with the target item (or, sometimes, 1-2 rows below) is rather thoroughly explored, but none of the following ones are (see Figure 6). Therefore, we decided to pre-process the data for this experiment as follows. For each user and solved task, we detected the last row with a recorded fixation and then discarded all items displayed on the subsequent rows. For all the remaining items (i.e., items displayed in the top-k rows, where k corresponds to the last observed row), we checked whether the item received at least one fixation or not. Figure 7 (right) depicts the probability that an image was fixated as a function of its row, while Figure 7 (left) depicts the same statistic with respect to columns. There are three important observations regarding the row-wise results. First, throughout all rows, the fixation probability is substantially higher for the 4-column grid than for the 8-column grid layout. This may be the main factor explaining the differences in overlook counts (see Table 2). Second, for both layouts, items displayed in the first row have the lowest chance of being observed in the given setting. This also corresponds to the lower ratio of target items that were found immediately (i.e., correct without soft overlooks). Ratios for the first row were 57% and 43% for the 4-column and 8-column grids, respectively, while the overall rates were 74% and 56%, respectively. Therefore, it might be advisable not to use a classical relevance-wise ordering of items in the full-screen grid mode, as those with the highest estimated relevance are displayed at positions with the lowest probability of being noticed (especially in scenarios where users are not expected to scroll back very often). However, the percentage of all solved tasks (including overlooks) with the target in the first row is rather high compared to other rows. We hypothesize that the low diversity of the first k video frames for the initial query object might, in the end, allow for efficient peripheral filtering. Third, surprisingly, the fixation probability tends to increase slightly as the row ID increases (especially in the 4-column grid layout). This is the inverse of what we expected. The increase is not very large but significant: for the 4-column grid, the fixation probabilities are 0.49 and 0.56 for the first and second halves, respectively (Fisher's exact test p-value: 7.8e-19). For the 8-column grid, the probabilities are 0.33 and 0.35 (p-value: 3.4e-5).
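The first-half vs. second-half comparison of fixation existence can be reproduced along the lines of the following sketch; the per-item tuple layout is an assumption.

```python
# Sketch: Fisher's exact test on fixation existence for items in the first vs.
# second half of the rows retained after the pre-processing described above.
from scipy.stats import fisher_exact

def compare_row_halves(items):
    """items: iterable of (row_id, last_fixated_row, was_fixated) per retained item."""
    first, second = [0, 0], [0, 0]              # [fixated, not fixated] per half
    for row, last_row, fixated in items:
        half = first if row <= last_row / 2 else second
        half[0 if fixated else 1] += 1
    _, p = fisher_exact([first, second])
    p_first = first[0] / sum(first)
    p_second = second[0] / sum(second)
    return p_first, p_second, p
```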
We came up with two hypotheses for the observed behavior. First, after the target item is not found within the first couple of seconds, users might get more anxious, resulting in a more thorough screening of the images. Second, given that displays were constructed as lists with decreasing seed image similarity, items displayed on the later rows were, on average, more diverse from each other than items displayed early on. Thus, users might need a more thorough observation and, e.g., could not rely so much on peripheral vision.

4.3.3 Fixations-Content Relation
We also investigate how the content of the page altered the observed behavior of users. We hypothesize that while searching for the target item, users would not screen all images equally (or only based on some global patterns), but rather prioritize those that seem more similar to the target item.
To verify this, we focus on correlations between the image-target cosine similarity of the corresponding CLIP embeddings and both the fixations' duration and the volume of raw gaze data recorded within the image bounding boxes. Note that we removed completely unexplored page sections in the same way as in Section 4.3.2. To capture different possible flavors of the relation, we considered two data variants. The first, denoted Filter: All, contains all items not removed in the previous step, including those that did not attract any fixation/gaze (i.e., detecting both the existence and the magnitude of gaze behavior). The second, denoted Filter: Nonzero, only contains items with an existing fixation/gaze record, i.e., conditional dependencies on the magnitude of gaze behavior given that it exists.
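A minimal sketch of this correlation analysis with the two filter variants is shown below; the pandas column names are assumptions.

```python
# Sketch: Pearson correlations between item-target CLIP similarity and two gaze
# measures, once over all retained items and once over items with a nonzero record.
import pandas as pd
from scipy.stats import pearsonr

def gaze_similarity_correlations(df: pd.DataFrame) -> dict:
    """Assumed columns: 'clip_sim' (item-target cosine similarity),
    'gaze_count' (raw gaze samples in the item box), 'fix_dur' (fixation duration)."""
    results = {}
    for measure in ("gaze_count", "fix_dur"):
        for flt, sub in (("All", df), ("Nonzero", df[df[measure] > 0])):
            r, p = pearsonr(sub["clip_sim"], sub[measure])
            results[(measure, flt)] = (r, p)
    return results
```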
The results are depicted in Table 5, separately for 4-column and 8-column grids as well as for all data. Overall, we detected modest positive correlations between all phenomena. The correlations were stronger for raw gaze data than for fixation durations, for the conditional magnitude rather than for all data, and for 4-column grids rather than 8-column grids. The highest reported correlation was 0.311 for 4-column grids and conditional magnitude data. We also evaluated whether items more similar to the target have a higher chance of being fixated at least once. This is indeed true, as the mean item-target similarity equals 0.824 and 0.797 for items with and without a fixation, respectively. However, although the difference is significant, it is only moderately substantial (the mean values correspond to the 44th and 57th percentile). Furthermore, in three of the 77 sets, participants displayed a rather distinctive behavior (see Figure 8): low attention to the top rows, with the few fixation points concentrating exclusively in the middle columns. Additionally, the scrolling speed for these rows is almost four times higher than on average (1.30 vs. 0.33). This scrolling pattern occurs solely in the 4-column grid structure and cannot be explained by the image-target similarity of the CLIP feature vectors. Manual checking revealed that the color distribution of the top rows was dissimilar to the target image. To assess color similarity, we used the LAB color space [14], generating histograms with a bin width of 12.5. This approach yielded 16 units for LAB's 'a' and 'b' color dimensions and eight for the luminosity dimension, resulting in 2048 bins. Finally, we utilize the Quadratic Form Distance [6] based on the L2 distance to retrieve the color similarities.
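The color-similarity computation can be sketched as follows; the exact LAB bin ranges, the bin-similarity matrix, and the final conversion from distance to similarity are assumptions built around the histogram and Quadratic Form Distance described above.

```python
# Sketch: 2048-bin LAB histograms (8 luminosity x 16 x 16 chroma bins, width 12.5)
# compared with a Quadratic Form Distance whose bin-similarity matrix is derived
# from the L2 distance between bin centers.
import numpy as np
from scipy.spatial.distance import cdist
from skimage.color import rgb2lab

L_BINS, AB_BINS = 8, 16

def lab_histogram(rgb_image: np.ndarray) -> np.ndarray:
    """rgb_image: HxWx3 floats in [0, 1]. Returns a normalized 2048-bin histogram."""
    lab = rgb2lab(rgb_image).reshape(-1, 3)
    hist, _ = np.histogramdd(lab, bins=(L_BINS, AB_BINS, AB_BINS),
                             range=((0, 100), (-100, 100), (-100, 100)))  # assumed ranges
    hist = hist.ravel()
    return hist / hist.sum()

def bin_similarity_matrix() -> np.ndarray:
    """A_ij = 1 - ||c_i - c_j|| / d_max for LAB bin centers (L2-based similarity)."""
    ls = np.linspace(6.25, 93.75, L_BINS)
    abs_ = np.linspace(-93.75, 93.75, AB_BINS)
    centers = np.array([(l, a, b) for l in ls for a in abs_ for b in abs_])
    d = cdist(centers, centers)
    return 1.0 - d / d.max()

def qfd(h1: np.ndarray, h2: np.ndarray, A: np.ndarray) -> float:
    """Quadratic Form Distance; lower means more color-similar. A conversion such
    as 1 - d / d_max would be needed to report it as a similarity (assumption)."""
    diff = h1 - h2
    val = float(diff @ A @ diff)
    return float(np.sqrt(max(val, 0.0)))
```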
Figure 8 shows that participants exclusively scroll through blocks with low image-target color similarities. For instance, in Task 63 (b), the first seven rows were quickly dismissed, while participants examined the grid more closely once the image-target similarity increased. We calculated the mean image-target color similarities for the respective top rows in the three image sets as 0.244, 0.580, and 0.453. These scores represent the 95th, 78th, and 86th percentiles regarding target similarity. Considering CLIP feature similarities, they only rank at the 90th, 60th, and 24th percentiles.
Lastly, we check if the color similarity generally explains whether images are ignored or fixated. We find that images more similar to the target are more likely to be fixated at least once. Comparable to the CLIP similarity, the differences between the mean values of images with and without fixation points are moderately substantial, corresponding to the 57th and 70th percentile. We suspect that the scrolling pattern is fostered not only by a low image-target similarity but also by a high intra-block image-image similarity at the beginning of the page. For the three tasks in Figure 8, the first seven rows have a mean pairwise image-image similarity of 0.985, 0.956, and 0.955. We believe that this pattern of image arrangement and its influence on the viewers' attention is a possible area for future research.

Qualitative Results
Based on the post-study questionnaire and the participants' comments, we also obtained qualitative results. Regarding task difficulty, eleven participants rated the tasks as very easy or rather easy, six as moderate, and three as rather difficult (see Figure 3). When asked which layout they found more convenient to use, six people answered with definitely 4 columns, seven with rather 4 columns, one with indifferent, four with rather 8 columns, and two with definitely 8 columns. Thus, most participants were in favor of the 4-column layout. When voting for the more efficient layout, the 4-column layout was also more popular, but with fewer votes: five people answered with definitely 4 columns, six with rather 4 columns, one with indifferent, six with rather 8 columns, and two with definitely 8 columns. One participant commented that the preferred grid structure depends on the monitor size, as images would become larger on wider monitors. Similarly, five participants argued that the preferred layout depends on the similarity between the target image and the query images, but also within the query images. When the target image and the majority of the query images differed significantly (e.g., significant color differences), the 8-column grid was considered to be more efficient, as participants did not have to focus on details and could just scroll over the images. For very similar grid images that were also similar to the target image, participants had to examine the images in more detail, making the 4-column layout with larger images more suitable for multiple participants. Moreover, some comments suggested that participants found the 4-column layout less overwhelming, as fewer images and, thus, less information were displayed at a time. In addition, two participants suggested sorting the grid images by color, as checking for color similarity (followed by shape) was their primary strategy to retrieve candidate images, on which they then focused for a detailed comparison. In general, the participants were able to work with the application and to solve the tasks with the intended workflow. The option to view the query image by holding the space bar was highly appreciated and frequently used. Four participants suggested adding a side-by-side or overlay comparison functionality for the selected image and the query image. While we agree that this feature might help to avoid selecting a similar, but not the exact, target image, we did not integrate it on purpose, as it contradicts the conditions of real-world video retrieval scenarios. The only issues mentioned by participants were that performing the image retrieval for some time can be exhausting (six participants) and that the size difference between the target image and the query images in the grid made the task more difficult (two participants).

Discussion and Future Research Directions
The study field offers promising future research directions: Firstly, the use of a more precise eye tracker (in terms of frequency, spatial resolution, and accuracy), as suggested by the uncertainties described in Table 1, is imperative. Such an upgrade could determine whether the recorded tracking discrepancies are attributable to device limitations or to actual user behavior (the "amount" of peripheral vision). Secondly, the study's methodology could be broadened. Altering the submission form, either by omitting the pre-selected image or by juxtaposing the selected and target video frames, would be interesting and relevant to explore. This modification could lead to more fixation data, albeit with a potential trade-off in realism, and offer insights into user discernment of near-duplicates. Moreover, future research may investigate the effects of the user's experience level (novice vs. expert) and explore how the results of our work transfer to comparable user interfaces in other domains.

CONCLUSION
In this work, we provide a detailed analysis of known-item search tasks within the domain of video retrieval, employing eye tracking methodologies. The framework of the study is described in detail, including a thorough discussion of the limitations inherent in the utilized data. Our findings cover several orthogonal aspects of visual known-item search: Notably, the outcomes demonstrate a marked enhancement in search efficacy when utilizing 4-column grids, specifically in scenarios involving directly located items, that is, excluding instances of overlooked items. The analysis of fixation statistics lends empirical support to the hypothesis posited in preceding literature regarding a predilection for the central area of the screen. However, it is noteworthy that, on average, all columns garnered a degree of attention. The data also revealed a positive correlation between the frequency of fixations on items and the similarity between the items and the target image. The results also comprise the opinions of participating users, who likewise preferred the 4-column setting. This research lays the groundwork for future directions, ranging from analyzing the detected patterns with higher-resolution eye trackers to comparative studies involving ranked sets in varied configurations, as well as investigations into innovative image presentation approaches.

Figure 1: Visualization of eye tracker data. Left: 4-column grid with collected eye gazes for one user and task. Right: eye gazes chronologically connected with lines for 15 tested tasks of one user.

Figure 2: The visual interface used in the study. Controls are at the top, while the image grid (here: 4 columns) fills the main area.

Figure 4: Searched target frames (top row) and submitted incorrect frames in the corresponding task (bottom row).

Figure 5: Eye fixation heat maps. (a) and (b) depict page-wise heat maps, while (c) and (d) depict screen-wise heat maps, i.e., ignoring scrolling. Black areas correspond to image margins.

Figure 6: Examples of task-wise eye tracking heat maps. Heat maps were aggregated for all users solving task 15 using a 4-column grid (a) and task 65 using an 8-column grid (b). The background color indicates the semantic similarity to the target image (lighter: more similar). Note the abruptly stopped gaze around the target image row, indicating a find. The same behavior is visible for both fixations (a) and raw eye gaze data (b).

Figure 7: Probability of existing fixation on items w.r.t. the column (left) and row (right) on which they are displayed.

Figure 8: Examples of scrolling behavior. Fixation heat maps aggregated for all users solving tasks 22 (a), 63 (b), and 67 (c) using the 4-column layout. Scrolling behavior was detected for the first 8, 7, and 17 rows, respectively. In all cases, fast skimming over the top rows with low LAB image-target similarity (indicated via the background color) is apparent. The bottom section is cut for clarity.

Table 2: Statistics for the search results and answer times for four (4C) and eight (8C) column grids. Search time is measured from when a browsable grid first appeared in a task until submission. Blue color highlights noteworthy differences and bold font indicates statistical significance.

Table 3: The number of tasks that ended with the "not in list" option and had more than k consecutive eye gazes detected in the target frame & detection margin area. Blue color indicates that all targets were detected.

Table 4: Probability estimate P(solved | target in column i) that the target was solved in a given column during the first pass FP (without soft overlook) and overall O. Values represent %, bold font indicates the highest values, blue color the lowest values.

Table 5: Pearson's correlation for item-target similarity vs. an item's eye gaze quantity and the duration of fixations.