Predicting Representations of Information Needs from Digital Activity Context

Information retrieval systems often consider the search session and immediately preceding web-browsing history as the context for predicting users' present information needs. However, such context is only available when a user's information needs originate from Web context or when the user has issued preceding queries in the search session. Here, we study the effect of more extensive context information recorded from users' everyday digital activities by monitoring all information interacted with and communicated using personal computers. Twenty individuals were recruited for 14 days of 24/7 continuous monitoring of their digital activities, including screen contents, clicks, and operating system logs on Web and non-Web applications. Using this data, a transformer architecture is applied to model the digital activity context and predict representations of personalized information needs. Subsequently, the representations of information needs are used for query prediction, query auto-completion, selected search result prediction, and Web search re-ranking. The predictions of the models are evaluated against the ground-truth data obtained from the activity recordings. The results reveal that the models accurately predict representations of information needs, improving over the conventional search-session and web-browsing contexts. The results indicate that the present practice of utilizing users' contextual information is limited and can be significantly extended to achieve improved search interaction support and performance.


INTRODUCTION
Predicting representations of users' information needs is an important challenge for search and recommendation systems, as users' context and interactions are increasingly used for personalization, query suggestion, query auto-completion, and predicting instant or proactive search results (e.g., instant entity cards or recommendations provided by digital assistants) [18,24,35,44,46,49,57,60].
95:2 T. Vuong and T. Ruotsalo

These representations are based on previously collected data about users' conversations, queries, and other interactions. However, existing research has mainly focused on developing predictive models that utilize interaction data from a single service, such as queries and clicks from a search engine, but not the broader context of a user's digital activities and interactions over a more extended period of time. While not all information needs can be deduced from interactions, such as ad hoc needs arising from a knowledge gap that a user encounters, many information needs are associated with broader digital activities that users engage with, thus being part of a more extensive sequence of digital activity.
The challenge in such scenarios is that representations of users' information needs are often embedded within a digital activity context beyond the interactions immediately visible to the retrieval system. Consequently, current approaches for modeling information needs have limitations in accessing such data [6]. For example, a user can be engaged with a task using word processing software or completing a financial transaction that defines the context for their information needs [68]. These tasks may require a user to search for information to support producing or validating the claims in the text. To this end, information needs often do not appear out of the blue but depend on the user's task context [39]. However, current search systems only provide limited support for contextualizing search beyond the present search session and have limited capability of modeling users' broader digital context [6].
Previous works on predicting representations of information needs are limited to estimating user context occurring within search applications or implicitly through Web browsing behavior [68], immediate pre-search context [34], partially relevant session context [55], or focused on query suggestion models for predicting the next query in the session given the history of previous queries [57]. While these models have been shown to be successful, they are limited to search-session context or the immediate Web context preceding searching. As a consequence, many models have been proposed, but our understanding of what contextual information, and from which sources, is useful for predicting users' real-world information needs remains elusive.
Here, we report an experiment with 20 individuals in which a digital activity monitoring and screen monitoring system was installed on their laptops 24/7 for 14 days (refer to Figure 1). Data comprising the screen content of Web browsing and non-Web application activities, Operating System (OS) logs, and clicked documents were used to model users' search contexts. Transformer models were fine-tuned on the context data for predicting representations of information needs, and such representations were used for: (1) query prediction: predicting future queries that would be submitted to the search system; (2) Query Auto-Completion (QAC): given a prefix, predicting the completion of the query; (3) selected search result prediction: predicting the first clicked page from the search results; and (4) Web search re-ranking.
The results indicated that the performance of models trained with richer digital activity contexts, including Web browsing and non-Web activity context, was superior in predicting information needs for single-session queries and on par with models leveraging search session context when longer query sequences were available.
More specifically, our contributions are:
- We contribute an in-the-wild digital activity recording experiment with 20 users and train a transformer model on each user's data to predict contextualized representations of information needs.
- We show that digital activity beyond the search-session context improves the estimation of representations of information needs for query prediction, selected search result prediction, and Web search re-ranking over conventional pre-search and search-session context data.
Fig. 1. The experiment procedure: (1) 24/7 continuous digital activity monitoring of participants' laptops for 14 days; (2) context was extracted from the digital activity monitoring data; (3) the context data was used to build models for the predictions, and the main results were reported; and (4) further ablation analysis was conducted to assess the impact of different variables on the accuracy of the predictions.

RELATED WORK

Predicting Representations of Information Needs
Theories of information needs emerge from two interleaved research areas: information retrieval research developing methods for information- or answer-finding systems, as well as information science research that focuses on user-oriented studies of information needs [14]. Both views acknowledge contextual factors triggering information needs and how information needs occur as part of broader user tasks [5,8]. Researchers have also recognized the importance of developing models to capture and make predictions of representations of information needs based on user context. Methods have been developed to predict how and when information needs arise and for predicting user needs to improve ranking performance and interaction with information retrieval systems [41,55]. For instance, by anticipating user needs, search engines can provide tailored results and personalized content [58,72] or provide search results before users explicitly search for them, enabling users to save time and effort [35]. A key finding from an information-seeking point of view has been the association between information needs and users' broader knowledge formation contexts, such as work tasks [32,33], conversations [55], or a problem-solving effort [7,66], which all highlight the importance of users' digital activities that are connected and interleaved with the actual search activity [14]. These can be observed, for example, by analyzing the text content the user is consuming before and after searching via documents and language processing, or behavior, such as Web browsing behavior, or query analysis from log data recorded during information seeking. Language usage analysis can also involve examining conversations that reveal broader communication context during information-intensive tasks [3]. Such an approach can help determine the user's search intent, focusing on a particular topic in a conversation and allowing for more targeted content to be delivered. Studying Web browsing behavior often involves, for example, monitoring users' activities on the Web prior to searching, such as user clicks [38,68] or page views [11,69], to determine what information the users are selecting and consuming [22]. For instance, Chi et al. [11] identified users' information needs based on their Web browsing history to predict future navigation patterns. These types of models have allowed the development of downstream tasks that support search activity, such as personalized recommendations and more advanced interaction options to direct search [68]. Modeling information needs using query analysis has involved methods that track query sequence patterns [55]. Given an initial query, the aim is to determine the subsequent queries the user is most likely to enter. Many query analysis studies have relied on different kinds of context information, or parts of it, to build predictive models. However, the recorded data is usually limited to a single application, such as a Web browser, or even data from a single service, such as search engine logs. Here, we consider all combined digital contexts available from users' computers. We do not restrict monitoring to any specific application or service but capture user behavior and language use across different types of applications, including searching, Web browsing, emailing, and other desktop interactions (e.g., file access). This approach allows us to obtain a holistic view of user behavior and context to understand the importance of different types of contextual information and create more accurate predictions of user information needs.

Digital Activity Monitoring
Digital activity monitoring records user activity, including content delivered, interacted with, and perceived through computing devices [61,63,64]. It has been used as a proxy for understanding user behaviors and needs [30,65] more precisely than what is possible through simple browsing behavior or click analysis. Digital activity monitoring involves gathering large-scale data from digital activity such as website visits [1], online searches [30], and social media interactions [73]. This data can then be analyzed to uncover user preferences, such as the type of search results that users find the most helpful. For example, researchers have used digital activity logs to filter search results by relevance [26,28]. Teevan et al. [58] considered a user's prior interactions with a wide variety of content (e.g., Web visits, documents, and email the user has read and created) to personalize that user's current Web search. Search logs can also be used to identify user trends, such as the topics they are most interested in [30]. By analyzing digital activity, search systems can better understand user needs and tailor results to the individual. However, deploying digital activity monitoring at scale could invade privacy, as companies can access detailed data about an individual's activities [9]. In contrast, other lines of research have started to investigate technologies that empower individuals to monitor their personal data, allowing individuals to track and analyze their digital footprints [56].
Most of the approaches mentioned so far have focused on monitoring specific applications or other limited information sources. However, it is also possible to monitor the entire visual content of the computer screen for predicting user information needs in various contexts, such as educational settings, office applications, and Web browsing. This monitoring approach has been shown to be useful for proactive information retrieval [63]. Such an approach can capture digital activity contexts that other logging systems cannot access, including all Web activity, application usage, social media posts, and any other digital activity that could impact the prediction.
This technology could also be integrated into existing services, giving users more control over their data. Despite the ethical and privacy concerns and developments to empower users to take control over their data, academic research has had limited focus on how services could be improved with such data and, more importantly, how the practical privacy implications that digital activity monitoring may entail could be mitigated. For example, what kind of data is valuable, how extensively it should be recorded and stored, and what performance tradeoffs the personalization of search and recommender services may face if they are allowed or prevented from using digital activity data.
Our work aims to fill some of these gaps by studying long-term digital activity monitoring. This work employs state-of-the-art transformer architectures that model user activity sequences by capturing contextual data from a user's daily computer usage. The model takes a sequence of text as input and generates a contextual output sequence to represent the user's information need. The output is then used to anticipate search queries, completions of queries, and the contents of selected search results.

Contextualizing Search
Contextualized search takes into account the preceding context of the search query to provide more accurate search results. Search context has been considered primarily using search logs as a source for contextual information [68]. Search logs have been suggested to contain information about the user's needs and can subsequently be used to predict what the user will search for in the future. This information has been used mainly to suggest related queries [44] or topics the user might be interested in [68]. Research has also shown that leveraging search contexts [68] and interaction history [50] could help to predict user interests better and improve the accuracy of search result rankings. Most previous works focused on leveraging context for prediction within a search session. For instance, Jones and Klinkner [29] proposed a model built on Web search logs. User sessions were extracted by grouping the queries that shared the same search goal. Then, previous queries within a session were considered as context for predicting the subsequent query. However, the results revealed that predicting queries without a context derived from the last submitted query was difficult, suggesting that such an approach is dependent on the present search session and immediately preceding queries.
Another line of work has focused on the task of QAC [25,27,43]. Given a few characters inputted by the user, the goal is to identify the query that the user intended to write by considering the user's past search history [54]. To achieve this, search logs were used to extract query co-occurrences [52,54], and this data was used for query completion prediction. Instead of relying on search logs, other works considered a variety of context sources, such as users' click information [42], interaction patterns during search [51], or task information [23,41]. Jain et al. [27] proposed an end-to-end system to generate synthetic suggestions using query logs. The prediction model took the current query and the previous queries in the same session as input and output a set of reformulated queries. The assumption was that leveraging search-session context could capture user information needs and be used for query prediction. By analyzing the user's past search queries, the system could better understand what the user was looking for and suggest more relevant queries. However, such log-based methods suffer from data sparsity and are not effective for rare or unseen queries [57].
Prior research works were successful in modeling the user's information needs in a search session. However, information about the previous queries is not always available, such as in the case of cold-start users or the first query in a session. These works did not consider the full context surrounding searches that could be part of everyday digital tasks. Instead of trying to predict a query directly from the search-session context, it is possible to learn how user information needs could be inferred from other sources of context. For instance, Zamani et al. [72] considered situational context (time and location); Andolina et al. [2] and Vuong et al. [62] considered spoken conversations among people; and Kangassalo et al. [31] focused on users' cognition. Unlike those previous works, in this article we consider several sources of context: Web browsing context, non-Web browsing context, prior page context, and full combined context (both Web browsing and non-Web applications).

Transformers for Information Need Prediction
Recent advances in sequential modeling have also contributed to the understanding of the effectiveness of contextual data. In particular, transformer architectures have shown performance improvements over conventional models in sequential predictions in information retrieval tasks [44]. Transformers have been used to capture dependencies between queries and terms by refining each token representation based on its context [17]. Query terms are often repeated throughout a session, and user information needs can be captured in this way [20].
Transformer architectures have also been successful in direct query prediction. For instance, Nogueira et al. [45] and Han et al. [21] applied transformers to infer a query from a document. Han et al. [21] used pre-trained transformers and showed that expanding the document with the predicted query improved ad hoc retrieval results. In contrast, Nogueira et al. [45] presented a more complex seq2seq architecture: the encoder included a Graph Convolutional Network and a Recurrent Neural Network (RNN), and the decoder was a transformer. Transformers have also been used for ad hoc retrieval [15,40,48,71]. MacAvaney et al. [40] used Bidirectional Encoder Representations from Transformers (BERT) features in existing neural ranking models and outperformed state-of-the-art ad hoc ranking scores. Some works have explored the use of the long-term search history of users [10], using an RNN-based architecture, to rank query suggestions. In this work, we did not restrict the input data to queries' contexts in search sessions; other sources of context could also be added to a predictive model.
Garg et al. [20] applied a Hierarchical Transformer to the query suggestion task and outperformed an RNN-based model. Their model consisted of two encoders: a token-level one and a query-level one. The former gave a contextualized representation of each token that depended on the other tokens of the query, while the latter output a contextualized representation of each query, depending on the other queries of the session.
However, research on approaches predicting user needs in real-life digital activities beyond immediate Web browsing or query sequences has been limited. Therefore, it is unclear whether the models themselves or simply richer data sources play a more critical role in advancing the use of context information for information retrieval. Our work shows that richer data, indeed, is critical for the performance of the models. We show that different types of data and sequential context have significant effects on prediction accuracy, the subsequent reduction of user effort, and system performance.

DATA COLLECTION
We conducted an in-the-wild study to collect users' everyday digital activities. We equipped volunteers with a digital monitoring and screen recording system that had access to all interactions and content visually presented on a computer screen, including Web pages, emails, word processing documents, and other application windows. The purpose of the in-the-wild data collection was to capture content and digital activity that is invisible to conventional loggers, which often monitor only search activity or Web activity.

Participants
Twenty individuals participated in our study. Of the 20 participants, 10 were male and 10 were female. The average age of the participants was 44 (SD = 18.3). All participants had completed a bachelor's degree, and their working language was English.
The participants were recruited via a posting distributed to mailing lists. A questionnaire was attached to the recruiting message to collect background information on potential candidates. Only respondents who used laptops as their main devices for everyday digital activities were considered eligible to join. All participants had laptops of recent models with Windows 10 OS installed.
The participants were informed of our privacy guidelines prior to joining the experiment. They were told that the monitored digital activity data would be stored on their computers during the monitoring phase. Afterward, the data would be transferred to a secured server and used only for research purposes. After the experiment, the participants were compensated with 150 Euros. A consent form was obtained from the participants regarding data usage, privacy, and the experiment procedure.

Apparatus
The digital activity monitoring and screen recording system continuously recorded screenshots at 2-second intervals and OS log information associated with the screenshots, including the titles of active windows, the names of active applications, the Uniform Resource Locators (URLs) of Web pages in active applications, and timestamps. In addition, the system collected keystrokes and mouse behaviors, including clicks and scrolls, with associated timestamps. The digital activity monitoring and screen monitoring system was developed for the Microsoft Windows OS. We used a desktop app UI to implement the monitoring system. The system performed two monitoring functions: saving active windows as images and collecting the aforementioned OS log information. In addition, a stopping function was implemented to allow the users to stop the monitoring. This feature was necessary for ethical data collection to ensure the participants' privacy and to allow them to opt out of the data collection process. However, we observed that the screen monitoring was operational well over 95% of the time and was turned off for less than 30 minutes per participant, on average. Consequently, we believe that the data used for the analysis include representative information about the participants' digital activity.

Digital Activity Monitoring
Upon a participant's agreeing to join the experiment, the digital activity monitoring and screen recording system was installed on the participant's laptop and set to run continuously in background mode for 14 days. Participants were told that the monitoring system would launch automatically whenever the laptop was turned on. The participants were advised to use their laptops as usual and to avoid stopping the monitoring unless necessary during the monitoring phase. After 14 days, the participants visited our lab, and the digital activity monitoring system was uninstalled from their laptops. The monitoring data was then transferred to our secured system.

Screen Recordings Pre-processing
Screenshots were converted to text units using Tesseract 4.0, an accurate Optical Character Recognition (OCR) engine [12]. Based on a pilot study conducted on manually selected samples of screenshots, the pre-processing pipeline was designed to improve the accuracy of the OCR: First, a screenshot image was enhanced to make the text more visible using textcleaner and scaled to 500% using convert. Then, text was extracted using the OCR engine. OCR-processed text units and associated metadata (OS information) were stored chronologically in a sequence. We merged text units that belonged to the same document using window titles and URLs. This resulted in a timestamped sequence of documents for each participant. A document describes a tab switch or file change and contains text data representing the information within an associated application window that has been interacted with or examined by a user. This can include text data from a textual document, an email, a folder, a file, an instant message, a Web page, or an application window with a unique title.
To extract information on the screen only once and avoid attaching duplicate information to a document, we considered information changes on the screen. For this process, we utilized a frame difference technique in which two temporally adjacent screenshots (of a single document) were compared and the differences in pixel values were determined. That is, words that appeared in the same pixels in the two adjacent screenshots were excluded from the document. In addition, we excluded headings, menus, sidebars, and footnotes in the OCR process. To do this, for each application window, the frame difference technique also produced information about the pixel areas of those elements. Appendix A shows an example of the pre-processing of screen recordings and how the OCR engine processed a screenshot.
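The frame-difference step above can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the pixel grids, bounding-box format, and function names are assumptions made for the example.

```python
def frame_diff_regions(prev, curr):
    """Return the set of (row, col) pixels that differ between two
    temporally adjacent frames, each given as a list of rows of
    grayscale pixel values."""
    return {(r, c)
            for r, row in enumerate(curr)
            for c, v in enumerate(row)
            if prev[r][c] != v}

def novel_words(words, diff):
    """Keep only OCR words whose bounding box overlaps a changed pixel
    region; words rendered at identical pixels in both frames are
    treated as duplicates and dropped. `words` is a list of
    (text, (top, left, bottom, right)) tuples."""
    def overlaps(box):
        top, left, bottom, right = box
        return any(top <= r < bottom and left <= c < right
                   for r, c in diff)
    return [text for text, box in words if overlaps(box)]
```

Static screen chrome (menus, sidebars, footers) naturally produces no frame difference, which is consistent with the paper's use of the same technique to locate and exclude those areas.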

Query, Selected Search Result, and Search Session Extraction
We extracted the participants' Web search queries and associated selected search results from the digital activity logs for evaluation. We ran a script programmed to automatically identify all Web searches and queries entered into commercial search engines, including Google Search, DuckDuckGo, and Yahoo Search. Search engine usage was identified from the Web URLs in the collected logs. The queries were then extracted directly from the URLs. The corresponding selected search results from the SERPs following the queries were also extracted from the logs. From the query data, we determined search sessions using a session extraction methodology similar to previous work [69]. We delimited sessions with a 30-minute timeout between two queries; that is, a search session begins with a query and terminates following 30 minutes of user inactivity.
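The query and session extraction can be sketched as follows; the per-engine query-parameter mapping and the function names are illustrative assumptions rather than the exact script used in the study.

```python
from datetime import timedelta
from urllib.parse import urlparse, parse_qs

# Hypothetical mapping from search-engine host to its query parameter.
SEARCH_HOSTS = {"www.google.com": "q",
                "duckduckgo.com": "q",
                "search.yahoo.com": "p"}

def extract_query(url):
    """Pull the query text out of a search-engine URL, or None if the
    URL does not belong to a known search engine."""
    parts = urlparse(url)
    param = SEARCH_HOSTS.get(parts.netloc)
    if param is None:
        return None
    values = parse_qs(parts.query).get(param)
    return values[0] if values else None

def split_sessions(timestamped_queries, timeout=timedelta(minutes=30)):
    """Group (timestamp, query) pairs into sessions delimited by a
    30-minute gap of inactivity, as described above."""
    sessions, current = [], []
    for ts, q in timestamped_queries:
        if current and ts - current[-1][0] > timeout:
            sessions.append(current)
            current = []
        current.append((ts, q))
    if current:
        sessions.append(current)
    return sessions
```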

Ethics
We are very aware of the privacy implications of using screen recording data for research and have taken active steps to protect the participants. In particular, to safeguard participant privacy during the experiment, all digital activity monitoring and screen monitoring data were encrypted, stored locally on participant laptops during the recording, and never exposed to anyone except the participants themselves. We followed the ethical guidelines and principles of data anonymization and minimization at every stage of data processing. The logs were archived and stored on a server protected by authentication mechanisms and firewalls. The research was approved by the University of Helsinki Ethical Review Board in the Humanities and Social and Behavioural Sciences and complied with the declaration for managing data obtained from human participants. The participants were also informed that their data would be destroyed upon completion of the research.

MODEL
In this section, we describe the approach for predicting representations of information needs, which is then applied in predicting queries and selected search result documents.

Predicting Representations of Information Needs
The main notation used is described in Table 1. We denote a user's digital activity $D$ as a sequence of documents $d_1, \ldots, d_{|D|}$.
Given $D$, we divided the user activity into sequence slices with a fixed-size sliding window of size $n$, each sequence slice being formed of $\{d_1, \ldots, d_n\}$. The transformer model is trained to extract representations of user information needs within the observed sequence and predict the next document, denoted as $d_{n+1}$ (see Figure 2).
We built on the Bidirectional and Auto-Regressive Transformer (BART) model [37,59] and fine-tuned it using the digital activity data. We used the weights of BART trained on CNN/DM, a news summarization dataset.
Given a document sequence $(d_n, d_{n+1})$, the aim was to optimize the parameters $\theta$ that maximize the log probability of observing the dataset:

$$\mathcal{L}(\theta) = \sum_{i=1}^{|d_{n+1}|} \log P(w_i \mid w_1, \ldots, w_{i-1}, d_n; \theta),$$

where $(w_1, \ldots, w_{|d_{n+1}|})$ are the tokens of $d_{n+1}$.
An activity sequence slice, the input of the transformer, was simply the concatenation of all the words of all the documents separated by a token [SEP], such that [SEP] marked the beginning and the end of the sequence slice:

$$x = \mathrm{[SEP]}\ d_1\ \mathrm{[SEP]}\ d_2\ \mathrm{[SEP]} \cdots \mathrm{[SEP]}\ d_n\ \mathrm{[SEP]}.$$

This sequence slice was then transformed using token embeddings added to positional embeddings (one per distinct position), such that each token $w_i$ of the input sequence $\{w_1, \ldots, w_m\}$ was mapped to an embedding $e_i$. A contextualized representation $h_i$ for each token embedding $e_i$ in the sequence was obtained with the encoder $E$:

$$(h_1, \ldots, h_m) = E(e_1, \ldots, e_m).$$

The decoder then generated an output sequence of words representing a user's future information need. At each step, the model is auto-regressive, consuming the previously generated words as additional input when generating the next word.
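The sliding-window slicing and [SEP]-concatenation steps described above can be sketched as follows; the helper names are hypothetical, and in practice the tokenization itself is handled by BART's tokenizer.

```python
def make_slices(documents, n=4):
    """Slide a fixed-size window over the document sequence; each slice
    of n documents is paired with the next document as the prediction
    target d_{n+1}."""
    return [(documents[i:i + n], documents[i + n])
            for i in range(len(documents) - n)]

def to_input_text(slice_docs, sep="[SEP]"):
    """Concatenate the documents of one slice, with the separator token
    marking document boundaries and the start/end of the slice."""
    return f" {sep} ".join([""] + slice_docs + [""]).strip()
```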

Query Prediction
We applied the model for predicting representations of information needs described above to query prediction. The goal of query prediction was to suggest a ranked list of queries. Given a search activity $s$, we extracted an activity context $d_n$ such that $s$ followed the activity context. Here, a search activity $s$ was represented by a SERP, and we excluded empty search pages from the data and modeling. Then, the information need prediction model was used to predict a SERP ($s = d_{n+1}$) that followed the observed sequence slice $d_n$. The generated $s$ was then used for generating the query suggestions.
Given the generated $s$, word-level n-grams were extracted and considered as query suggestion candidates $c_i$. Then, a word-embedding approach was utilized to rank the query suggestion candidates. The idea was that suggestion candidates semantically related to the query context $d_n$ (the activity sequence prior to the SERP) would be ranked higher. First, $c_i$ and $d_n$ were transformed into word embeddings using the pre-trained Google News embeddings. Then, $c_i$ was ranked based on cosine similarity with $d_n$. We used the pre-trained Google News embeddings for ranking so that the ranking would be independent of the activity data.
Top-k query suggestions were generated by sorting the candidates in descending order of similarity. That is, the query suggestions most consistent with the future information needs were retrieved.
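The candidate extraction and embedding-based ranking can be sketched as follows. This is a simplified illustration under stated assumptions: `embed` stands in for a lookup into the pre-trained Google News embeddings, and the function names are hypothetical.

```python
import math

def word_ngrams(text, max_n=3):
    """Extract word-level n-grams (n = 1..max_n) from generated text as
    query suggestion candidates."""
    words = text.split()
    return {" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_candidates(candidates, context_vec, embed, k=5):
    """Rank candidates by cosine similarity between each candidate's
    embedding and the embedding of the activity context, returning the
    top-k. `embed` maps a phrase to a vector."""
    scored = sorted(candidates,
                    key=lambda c: cosine(embed(c), context_vec),
                    reverse=True)
    return scored[:k]
```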

Query Auto Completion
The aim of QAC was to predict the intended query of a user given a prefix. We assumed that the intended query was the actual query entered by the user. We simulated the query auto-completion scenario by providing characters one by one, as a user would type a query. Each query was therefore decomposed into a set of prefixes $R = \{r_1, r_2, \ldots, r_j\}$ of length $j$, where $j$ was equal to the number of characters required to enter the entire query if no query completion interface was available.
Given each prefix $r_j$, we retrieved query auto-completions matching the prefix from the original query suggestion pool $c_i$ of the generated $s$. Given the query context $d_n$, we used the same word-embedding approach to rank the query auto-completions. Last, the top-k query auto-completions were retrieved for the evaluation.
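The prefix decomposition and prefix-filtered ranking can be sketched as follows; the function names and the precomputed `scores` mapping (standing in for the context-similarity ranking above) are illustrative assumptions.

```python
def prefixes(query):
    """Decompose a query into all prefixes a user would type, one
    character at a time."""
    return [query[:j] for j in range(1, len(query) + 1)]

def auto_complete(prefix, candidates, scores, k=5):
    """Filter the query-suggestion pool to candidates that match the
    typed prefix, then rank them by their context-similarity scores and
    return the top-k."""
    matching = [c for c in candidates if c.startswith(prefix)]
    return sorted(matching, key=lambda c: scores.get(c, 0.0),
                  reverse=True)[:k]
```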

Selected Search Result Prediction
The goal of selected search result prediction was to predict the first clicked document on the SERP. Similar to the query prediction task, we applied predicted representations of information needs to this task. A difference, however, was in how the data was constructed. Here, we considered a search activity $s$ to be a clicked document and applied a constraint that excluded all the SERPs from the data for training/validation/testing. This means that the step of entering a query into a search engine and retrieving search results was omitted in the data used for this prediction task. This simulates a situation in which a user's contextual information is proactively employed to retrieve information that aligns with the user's specific information needs, much like the process of obtaining instant search results. Queries without subsequent clicks on documents in SERPs were excluded from the prediction and evaluation processes.
Therefore, given a clicked document s, we also extracted an activity context d_n such that s followed the activity context. The general model was trained on the newly constructed data and used to predict the future selected search result (s = d_{n+1}).

Web Search Re-ranking
The purpose of Web search re-ranking was to re-rank the selected search result documents given an activity context. The top-10 search result documents retrieved by a user in response to each query were re-ranked using the various activity context models. To do this, we first scraped the content of those search result documents using the links recorded by the monitoring system. The content and comment extractors of Dragnet [47] were used for the content extraction. To produce document rankings based on the different context models, the embedding of a predicted clicked document d_{n+1} produced by the transformer models was used to re-rank the corresponding retrieved search results. BM25 was used as the retrieval model, with the predicted embedding vector as the input.
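To make the re-ranking step concrete, the following is a minimal BM25 re-ranker sketch that scores each retrieved document against the text of the predicted clicked document. It is an illustration only: the paper feeds a predicted embedding into its retrieval pipeline, whereas here the predicted document's text serves as the query, and k1/b are the common BM25 defaults, not necessarily the paper's settings.

```python
import math

def bm25_rerank(predicted_text, docs, k1=1.2, b=0.75):
    """Re-rank `docs` by BM25 score against the predicted document text."""
    corpus = [d.lower().split() for d in docs]
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    n = len(corpus)
    # Document frequencies over the retrieved top-10 list.
    df = {}
    for d in corpus:
        for t in set(d):
            df[t] = df.get(t, 0) + 1

    def score(doc):
        s = 0.0
        for t in set(predicted_text.lower().split()):
            if t not in df:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            tf = doc.count(t)
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        return s

    order = sorted(range(n), key=lambda i: score(corpus[i]), reverse=True)
    return [docs[i] for i in order]
```

The re-ranked list can then be evaluated against the document the user actually clicked.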

Model Training
For each participant, the data was split into training, validation, and test sets. The model was trained on the sequence of documents. The training set consisted of the data of the first 8 days, the validation set of the data of the 2 subsequent days, and the test set of the data of the remaining 4 days. This approach ensured that the queries and the sessions in the training and validation sets were not seen in the test set. The data used for each context model is detailed below.

For the evaluation, we set the sequence length to 4 documents for the Full context, Non-Web context, and Web context models. We used a 4-document sequence slice because, in the pilot study, we found that having more than 4 documents in the sequence did not significantly improve the prediction performance. This setting also has the advantage of being faster and easier for fine-tuning the model (an analysis of sequence length is reported in Section 8.2). We fine-tuned BART on the context data for 20 epochs and used gradual unfreezing for the text generation.
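The chronological 8/2/4-day split can be sketched as follows (the day indexing and helper name are illustrative, not from the paper's codebase):

```python
def split_by_day(documents):
    """Split (day_index, doc) pairs chronologically, day_index in 1..14:
    days 1-8 train, days 9-10 validation, days 11-14 test."""
    train = [d for day, d in documents if day <= 8]
    val = [d for day, d in documents if 9 <= day <= 10]
    test = [d for day, d in documents if day >= 11]
    return train, val, test
```

Splitting by day rather than at random keeps test-set queries and sessions temporally after the training data, which is what prevents leakage between the sets.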

Baselines
We reproduced an encoder-decoder model based on the LSTM-RNN architecture proposed by Dehghani et al. [16] and used it as a baseline. Originally, this model was designed for translation tasks [4] and was adapted for query prediction by Dehghani et al. [16]. The model architecture consisted of an encoder that learned the representation of the source sequence and a decoder that generated the target sequence. In this research, the source sequence was the text concatenated from the 3 previous documents, and the target sequence was the text of the target document. This modeling setup was the same as for our transformer model using full context, in which the model leveraged a 4-document window, with the text of the 3 prior documents considered as the source sequence.

Query prediction & QAC. Prior page context, search session context, random context, and LSTM-RNN were considered comparison baselines. This way, we could examine the effect on prediction performance of a full context source that goes beyond the context considered in the prior page context and search session context.

Selected search result prediction. Prior page context, random context, and LSTM-RNN were considered comparison baselines.

Web search re-ranking. Prior page context, random context, and LSTM-RNN were considered comparison baselines. In addition, we considered non-contextual ranking as a baseline. To produce the non-contextualized ranking, the actual query issued by a participant was used, and BM25 was applied to re-rank the top-10 search result documents using the content of the actual query.

Evaluation Measures
We used the actual queries and actual clicked documents in the test sessions as the ground truth for the evaluation.

Query prediction and QAC. We first used Mean Reciprocal Rank (MRR) and Partial-matching MRR (PMRR), which are often-used metrics to evaluate query prediction and QAC [46]. These measures are considered useful in information retrieval research [53] despite some discussion regarding their effectiveness [19]:

MRR = (1/|Q|) Σ_{q ∈ Q} 1/r_q,    PMRR = (1/|Q|) Σ_{q ∈ Q} 1/pr_q,

where |Q| is the number of all queries, r_q is the rank of the original query among the candidates, and pr_q is the rank of the first candidate that partially matches the original query [46]. Because MRR is strict (it credits only an exact match of the original query), we also used the classical BLEU metric, which corresponds to the rate of generated n-grams present in the target query. We refer to BLEU1, BLEU2, and BLEU3 for 1-grams, 2-grams, and 3-grams, respectively.
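As an illustration, MRR and PMRR can be computed from ranked candidate lists as follows. This is a minimal sketch; in particular, the partial-match criterion used here (word overlap with the target) is our own simplifying assumption.

```python
def mrr(ranked_lists, targets):
    """Mean reciprocal rank of the exact target query among the candidates."""
    total = 0.0
    for cands, target in zip(ranked_lists, targets):
        if target in cands:
            total += 1.0 / (cands.index(target) + 1)
    return total / len(targets)

def pmrr(ranked_lists, targets):
    """Reciprocal rank of the first candidate partially matching the target
    (partial match approximated here as any shared word)."""
    total = 0.0
    for cands, target in zip(ranked_lists, targets):
        for rank, cand in enumerate(cands, start=1):
            if set(cand.split()) & set(target.split()):
                total += 1.0 / rank
                break
    return total / len(targets)
```

Queries whose exact target never appears among the candidates contribute zero to MRR, which is what makes the metric harsh and motivates the additional BLEU and similarity measures.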
Sim Extrema, which computes the cosine similarity between the representation of the candidate query and that of the target query, was also used. The representation of a query is the component-wise maximum of the representations of the words making up the query (we used GoogleNews embeddings). The extrema-vector method has the advantage of emphasizing words carrying information over the common words of the queries.
We also computed Sim Pairwise as the mean value of the maximum cosine similarity between each term of the target query and all the terms of the generated one.
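The two similarity measures can be sketched as follows over toy word vectors. The paper uses GoogleNews embeddings; the tiny two-dimensional vectors here are stand-ins, and taking the plain component-wise maximum is a common simplified variant of the extrema vector.

```python
import math

def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def extrema(words, emb):
    """Component-wise maximum of the word vectors of a query."""
    vecs = [emb[w] for w in words if w in emb]
    return [max(col) for col in zip(*vecs)]

def sim_extrema(query_a, query_b, emb):
    return cos(extrema(query_a.split(), emb), extrema(query_b.split(), emb))

def sim_pairwise(target, generated, emb):
    """Mean over target terms of the max cosine similarity to generated terms."""
    sims = []
    for t in target.split():
        if t not in emb:
            continue
        best = max(cos(emb[t], emb[g]) for g in generated.split() if g in emb)
        sims.append(best)
    return sum(sims) / len(sims)
```

Sim Pairwise rewards a generated query that covers each target term with at least one semantically close term, while Sim Extrema compares the queries as whole vectors dominated by their most distinctive components.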
Last, for each metric, we averaged over all prefixes for the performance of QAC and query prediction. In this article, we considered 0-8-character prefixes.

Selected search result prediction. We considered Sim Pairwise, Sim Extrema, BLEU1, BLEU2, and BLEU3. We evaluated the generated documents against the original clicked search result document. For BLEU1, BLEU2, and BLEU3, we compared the generated content to the title of the original clicked search result document.
For each model, we first generated (through a beam search with K = 20) 10 documents to suggest to the user, given the context sequence. The reported value for each metric is then the maximum score over the top-10 generated queries or selected search result documents. This approach has been used in early work for assessing the performance of a probabilistic model [36] and corresponds to a fair evaluation of models that try to find a good balance between quality and diversity.

Web search re-ranking. We considered MRR and Hitrate@k, based on the ranks of the selected search result documents in the re-ranked list. Hitrate@k denotes the average percentage of selected search result documents found in the top-k ranked documents. Here, we considered Hitrate@1, @2, and @3.
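Hitrate@k as defined above can be computed directly from the re-ranked lists (a minimal sketch; the function name is ours):

```python
def hitrate_at_k(reranked_lists, selected, k):
    """Fraction of cases where the selected result appears in the top-k
    of its re-ranked list."""
    hits = sum(1 for lst, s in zip(reranked_lists, selected) if s in lst[:k])
    return hits / len(selected)
```

Hitrate@1 is the strictest case, requiring the clicked document to be ranked first; larger k relaxes the criterion.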

Significance Testing Procedure
We performed significance tests on the results by comparing our richer context models against the baselines: the LSTM-RNN model and the random context model. Paired t-tests were applied for comparisons between the richer context models and the baselines. To test the significance levels, we used MRR, PMRR, BLEU, Sim Pairwise, Sim Extrema, and Hitrate@k as dependent variables and the models as independent variables. The p-values were adjusted using Bonferroni correction [67] for multiple comparisons.
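The Bonferroni adjustment used here is simple enough to state in a few lines: each p-value is multiplied by the number of comparisons and capped at 1 (a minimal sketch of the standard correction).

```python
def bonferroni(p_values):
    """Bonferroni-adjust a list of p-values for multiple comparisons."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

A comparison is then deemed significant only if its adjusted p-value falls below the chosen alpha level, which controls the family-wise error rate across all metric/model pairs.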

Predicting Query and QAC
Table 2 shows the results obtained by all the models for query prediction and QAC. The full context model improved the results on the test sessions over the conventional search session-based model and the LSTM-RNN model on all metrics. MRR for the full context model was .038 and PMRR was .454, and there were significant differences in MRR between the full context model and all the baselines (LSTM-RNN, search session, and random) with p < .02. For PMRR, significant differences were found between the full context model and all the baselines (LSTM-RNN, prior page context, search session context, and random context) with p < .0004.
Sim Pairwise and Sim Extrema for the full context transformer model were .492 and .418, respectively. While there were significant differences between the full context model and two baselines (prior page context and search session context, both p < .002), there was no significant difference in comparison to LSTM-RNN. Likewise, significant differences in BLEU1 were found for the full context model in comparison to two baselines (prior page context and search session context, both p < .004). However, no significant difference was found when comparing the full context transformer model with LSTM-RNN for BLEU1. For BLEU2 and BLEU3, the full context model outperformed all the baselines with p < .001.

Fig. 3. Results for query prediction and QAC. The x-axis presents the number of characters in prefixes.
Figure 3 shows the results for character-level prefixes obtained by all the models for query prediction and QAC. The results showed that transformer models utilizing the full context performed better than the other models. The longer the query prefix given as input, the better the prediction results.
Overall, the results suggest that the transformer models were significantly better than the LSTM-RNN model when the full context was used as input. However, when the context was restricted to a prior page or search session, the transformer models did not perform better than the LSTM-RNN model in terms of Sim scores and BLEU1.

Predicting Selected Search Result Documents
Table 3 shows the results obtained by all the models for selected search result document prediction; in the table, values in boldface denote significant differences from all the baselines and the non-contextualized BM25 model. The results show that the full context predicted user clicks most accurately on all metrics. Sim Pairwise for the full context model was .521, Sim Extrema was .339, BLEU1 was .255, BLEU2 was .063, and BLEU3 was .031. There were significant differences in Sim Pairwise and BLEU1 between the full context model and the prior page context model with p < .002. However, no significant differences were found in Sim Extrema, BLEU2, and BLEU3. The results suggested that prior page context is a good context source for predicting clicked pages. Furthermore, the transformer model utilizing the full context outperformed LSTM-RNN in terms of Sim Pairwise, Sim Extrema, and BLEU scores (p < .01).

Web Search Re-ranking
Table 4 presents the Web search re-ranking results, measured in terms of MRR and Hitrate@1, Hitrate@2, and Hitrate@3. The results show significant improvements in all measures when the full context model was used, compared to the BM25 model without contextualization (p < 0.001). MRR was 0.464, Hitrate@1 was 0.325, Hitrate@2 was 0.442, and Hitrate@3 was 0.503 for the transformer model using full context information. Specifically, both the LSTM-RNN and prior page context models outperformed the non-contextual BM25 model in terms of MRR, Hitrate@2, and Hitrate@3 (p < 0.02). Moreover, the model trained on random context resulted in the worst retrieval effectiveness. Interestingly, the full context transformer model consistently outperformed the LSTM-RNN model by a large margin across all measures (p < 0.01). This suggests that models trained with the self-attention mechanism are better suited to capture long-term dependencies in user digital activities than LSTM-RNN.

ABLATION ANALYSIS
We performed additional experiments by ablating the transformer model to analyze components that affect the performance of query and document prediction tasks.

Effect of Context Sources
Table 5 shows the results of the comparison between models using different sources of context. The results show that the transformer model that used full context information obtained the best performance compared to models that utilized individual context sources (e.g., Web or non-Web).
For query prediction and QAC, the Web context model performed better than the non-Web context model in terms of MRR. MRR for the Web context model was .037. However, for metrics such as Sim Pairwise/Extrema and BLEU, the non-Web context model performed best and outperformed the Web context model: Sim Pairwise was .495, Sim Extrema was .421, BLEU1 was .177, BLEU2 was .045, and BLEU3 was .013. Significant differences were found between the model that considered non-Web context and the model that considered Web context in terms of Sim and BLEU scores (p < .02).
Furthermore, for selected search result prediction and Web search re-ranking, the non-Web context-based transformer model obtained higher performance than the Web context model. The full context model outperformed both the Web and non-Web context-based models for Web search re-ranking. This suggests that combining Web and non-Web context sources leads to improved model performance.

Effect of Sequence Information
To analyze the sequence information, we permuted the order of activity sequences in the full context model for training. With thematic continuity broken and noisy information added in this way, we tested the models on query prediction, QAC, and selected search result prediction to examine the benefit of sequence dynamics in user behavior.
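One plausible reading of this ablation can be sketched as shuffling the document order inside each context slice before training (an illustrative sketch; the paper does not specify the exact permutation procedure, and the helper name is ours):

```python
import random

def permute_context(slices, seed=0):
    """Return copies of the context slices with document order shuffled.
    A fixed seed keeps the ablation reproducible."""
    rng = random.Random(seed)
    permuted = []
    for s in slices:
        s = list(s)  # copy so the original slice is untouched
        rng.shuffle(s)
        permuted.append(s)
    return permuted
```

The content of each slice is preserved; only its ordering is destroyed, so any performance drop can be attributed to the loss of sequence information.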
Table 6 and Figure 4 present the performance loss on every metric for the models using permuted sequences. The results showed that the corruption of the sequences considerably affected performance. The permuted sequence model performed significantly worse than the full context model on most metrics with p < .03. The results indicated that sequence information in user behavior impacted the prediction and might correspond to users' information need dynamics.

Effect of Context Length
To examine the impact of training the model with different context lengths, we varied the sliding window, i.e., the number of documents in the sequence slices used for the training/validation/test sets. Here, we set the sequence lengths to 4-document, 6-document, 8-document, and 10-document slices. Then, we evaluated the models using the different context lengths on query prediction and QAC performance against the best baseline, the prior page context model.
Table 7 and Figure 5 present the results for query prediction and QAC with the models using different context lengths. The use of longer sequences affected the performance of query prediction and QAC. For instance, using the 10-document slice improved the model in terms of Sim Pairwise (.503) and BLEU3 (.022) and outperformed the baseline (the prior page context model) with p < .04. However, using a 4-document slice was already sufficient for prediction, as PMRR, Sim Extrema, and BLEU1 for the 4-document slice model were highest at .454, .418, and .171, respectively. There were significant differences between the 4-document slice model and the baseline (the prior page context model) (p < .03).

Effect of Session Length
To understand the effect of digital activity context on the query prediction and QAC with different lengths (number of queries in sessions), we performed experiments by splitting the test set into three groups: single-query sessions (with one single query forming 42% of the test set), short sessions (with 2-4 queries forming 39% of the test set), and long sessions (with 5+ queries forming 19% of the test set).
Figure 6 presents the results of the models' query prediction and QAC performance for different session lengths. For single-query sessions, the model performed best when the whole context was considered, highlighting the importance of full context information. There were significant differences in BLEU1, BLEU2, and BLEU3 between the full context model and the prior page context model (p < .01). Significant differences were found in all measures between the full context model and the random context model (p < .001). For short sessions, the full context model outperformed the two baselines, search session context and prior page context, in terms of PMRR, Sim, and BLEU1. Significant differences were found between the full context model and the baselines (p < .02). For longer sessions, the search session context model performed as well as the full context model, even though only the search context was available in this case. This was because longer sessions contained sufficient context information within the search session itself that could be leveraged for prediction.
Generally, the results showed that as the session length increased, the performance of query prediction and QAC improved. Models for single-query sessions had the lowest performance. This is to be expected, as single-query sessions may not contain any useful context for prediction: the information needs could be intrinsically triggered or triggered by external factors, such as arising from a user's memory or from spoken interaction with another person, and hence not be available in the digital activity context.

Fig. 7. Results for query position. q_1, q_2, q_3, q_4, q_5 correspond to the first, second, third, fourth, and fifth queries in the session. The search session context model has no context information for the first query in a session; therefore, all measures are zero in this case.

Effect of Query Position
We studied how the modeled context helps query prediction and QAC as a search session progresses. We compared the performances of the predictive models at individual query positions in all search sessions. Figure 7 presents the results for the impact of the models at each query position. It is noticeable that prediction performance improves steadily as a search session progresses; that is, the more search context becomes available for predicting the next query, the better the performance.
Compared to the baselines, Web and non-Web context models benefit from richer digital activity context, especially for predicting the first query in the session.There were significant differences in MRR, PMRR, and BLEU between the full context model and the baselines (p < .001).Search session context and prior page context models improve faster as the search progresses, because more context evidence within the search session becomes available.These models also exploit the context better for prediction.
An interesting finding is that when the search sessions got longer, with more than four queries, the prediction using non-Web context performed better. There was a significant difference in PMRR, Sim, and BLEU1 between the non-Web context model and the search session context model (p < .03). By manually inspecting the test data, we found that when users engaged in longer search sessions with multiple queries, the tasks were complex, involving topically broad information needs or changes in search topics. In such cases, the search context may be too coarse for inferring users' needs. However, the context could be inferred from the use of non-Web documents. For example, word-processing software can be an important contextual information source that reveals a broader task context and improves prediction. Investigating further details of search behavior and model performance for longer search sessions is left for future work.

Summary of Findings
We set out to study the effect of holistic digital context for predicting information needs, and to use the representations of these needs to predict future queries, determine the content of search result documents that users would click, and improve Web search ranking. In contrast to prior work, which modeled context from the search session [44], our approach was based on digital activity recorded 24/7 for the duration of 14 days via screen monitoring. The results indicate that it was possible to create robust user models from simple input just by monitoring what information an individual interacted with and examined on the computer screen. Incorporating the full context of an individual's digital activity, including interactions with applications beyond the search history, improved Web ranking and query prediction compared to the session-based context model. This enables the development of general user models that can be learned across applications, leveraging data from multiple applications to solve issues such as cold-start problems [70].
Transformer models such as BART were found to be effective at capturing representations of user information needs from digital monitoring data. Compared to the LSTM-RNN-based model, BART consistently performed better in the prediction tasks and Web search re-ranking on all metrics. These results are consistent with previous studies by Mustar et al. [44] and Sordoni et al. [57], which also found that transformer models outperformed RNN-based models for query prediction. Given that digital activities can be easily recorded and accessed on the user's side, this opens a new opportunity to learn user models directly on the user's own device. This method does not require access to the data structures of the service provider or application developer; instead, the personal data could be owned and used by the actual users for modeling, given the appropriate platforms and computing resources. This further aligns with recent developments in data privacy towards user-centered personal data management and processing [13], empowering users to have more control and ownership over their data and to avoid sending it to external services and potentially compromising their privacy.
Although the model cannot predict information needs that are not tied to the digital context (such as conversation), the results showed that the predictions were better than those resulting from the previous page, search session, or random context. This indicates that the users' context carries information about their information needs. Our goal was not to improve existing ranking or achieve practical system performance; instead, we show that contextual information is under-used and can have much higher utility than data from previous queries, search sessions, and Web context alone, which are already extensively used in commercial systems.
Different sources of contextual information were studied. Our results show that the full context models were most effective, suggesting that users' long-term context information is beneficial for predicting information needs. We found that, generally, for single-query sessions and the first query within a session, rich context information yielded the best predictive performance. However, as users progressed through search sessions, the immediate search session context contained more information about the user's needs and showed comparable performance. Moreover, our results indicated that transformer models leveraging more extensive context sources, such as combining Web and non-Web data, generally achieved better performance. Nonetheless, it is important to acknowledge that models trained on narrower Web contexts can still achieve commendable performance. Therefore, if the goal is to preserve privacy and keep user data private, then using only Web context may be sufficient and still produce reasonable results based on our experiments. However, the demonstrated performance of models trained using Web context may be constrained to specific tasks, and further research is essential to explore various task categories and determine the requisite extent of contextual data to enhance their performance.

Limitations and Future Work
The data recording was restricted to two weeks of digital activity monitoring on personal computers. Therefore, we did not account for long-term user interests and needs that might span weeks or months of usage, nor did we gather data on mobile devices. We also did not separately analyze user preferences and needs that might be specific to certain tasks, contexts, or situations. For example, users might have specific information needs when they are in a particular location or engaging in a particular physical activity. Additionally, the data collection was intended to extend to all digital activity, which required equipping computers with specific software and personally instructing the participants. This setup inherently meant that the data was collected from a limited sample, which may not be representative of the larger population.
Although our observations have provided us with valuable insights, the possibility of improving the prediction accuracy could have been enhanced by more extensive data.In addition, if we had access to more detailed information about the participants, such as their age, gender, and educational background, then we could have used this data to create more accurate prediction models.Our models were investigated especially on the dataset that was acquired from mostly knowledge workers.However, the collected data itself was not limited to knowledge work context and the results should generalize, to some extent, to other types of computer users as well.
The monitoring system introduced in this work also captures highly sensitive information. However, this is a common issue with most personalization systems. Therefore, we ensured that the collected data was kept secure and accessible only to authorized personnel through encryption and other security measures. We adhered to the ethical guidelines regarding the collection and storage of personal data. This included obtaining consent from users to collect and store their data, as well as ensuring the data was used only for the purpose for which it was collected. We also informed participants that their data would be destroyed after the research was completed or after a certain period of time. However, we observed that some participants still chose to disable the monitoring temporarily during some activities. Therefore, future research could reveal whether such concealed data could assist in automating the process of setting the privacy boundaries that users expect.
Our findings demonstrate the importance of considering privacy when designing search systems.For instance, while some users may be willing to share their data to improve search results accuracy, others may not want their personal information to be tracked or stored.Future work on search systems may consider providing users with privacy settings that allow them to balance their need for privacy with the accuracy of personalization.By giving the user more control, search systems could help ensure that users receive results that are tailored to their specific needs while still protecting their personal data.Studies could also explore whether the personalization services could remain at the user's device or edge servers rather than requiring privacy-sensitive data to be processed at the service provider's cloud.
Another limitation was that we did not conduct an experimental study in controlled lab settings; instead, we examined in-the-wild data collection and real-world tasks. Because of this, it was difficult to obtain relevance assessments for every real-world search situation, as doing so might interfere with the task users were performing. However, by using a single data source (the users' screens), we were able to create a rich user model without requiring any human supervision. In addition, by using actual user queries and clicks as ground truth, we were able to evaluate our proposed method in a realistic and practical way.
As the goal of this research was to study whether the use of digital activity context could predict user information need and improve the prediction performance over conventional search-session and pre-search context models, we kept the model architecture and hyperparameter tuning the same for all the models.This allowed studying the effect of different contextual sources, keeping all other factors invariant.However, we cannot exclude the possibility that some other model architecture would lead to better performance, and we leave this investigation for future work.

Conclusions
We presented a systematic study of the effectiveness of contextual information for predicting representations of information needs. The representations were subsequently used in various information retrieval tasks: query prediction, query auto-completion, selected search result prediction, and Web search re-ranking. Our findings reveal significant differences across the types of contextual information used for training and, subsequently, in the downstream task effectiveness.
Our findings emphasize the significance of adopting a more holistic approach to digital activity context. Our results unequivocally establish that models trained exclusively on immediate pre-search or search-session context exhibit notably diminished performance in comparison to models that utilize the entirety of contextual information. This observation implies that advancements in the fields of information retrieval and recommender systems stand to benefit not only from the development of novel models but also from a more comprehensive integration of extensive contextual information sources. Furthermore, this revelation prompts further scholarly exploration into equitable user data utilization and the intricate tradeoffs between predictive accuracy, user privacy, and the ethical treatment of user data.

Fig. 2. Example of user digital activity context and the approach of applying the transformer networks to the data. Contexts were determined from different sources (Web, non-Web activities, prior page context). Transformer models were applied to the context data to predict what queries the user might submit and which search result documents they might select.
- Full context model: 26,225 documents in the training set, 6,565 in the validation set, and 14,055 in the test set.
- Non-Web context model: 13,265 documents in the training set, 3,325 in the validation set, and 7,115 in the test set.
- Web context model: 15,705 documents in the training set, 3,943 in the validation set, and 8,412 in the test set.
- Prior page context model: 4,301 documents in the training set, 1,081 in the validation set, and 2,260 in the test set.
- Search session context model: 540 sessions in the training set, 140 in the validation set, and 260 in the test set.

Fig. 6. Query prediction and QAC performance in single-query, short, and long sessions. The search session context model has no context information for the first query in a session or in single-query sessions; therefore, all measures are zero in those cases.

Table 1 .
Main Notation Used in the Article

- D: a user activity sequence comprising d_1, . . ., d_|D|.
- d_n: a context sequence slice (d_1, . . ., d_n) extracted from D.
- d_{n+1}: the document following the context sequence slice d_n.
- s: a search activity; s = d_{n+1}. It can be a SERP or a clicked page.
- w_{d_n}: the words (w_1, . . ., w_|d_n|) of d_n.
- e_{d_n}: the encoded tokens of the words (w_1, . . ., w_|d_n|) of d_n.
- h_{d_n}: the contextualized representations of the encoded tokens (e_1, . . ., e_|d_n|).

Table 2 .
Results for Query Prediction and QAC

Table 3 .
Results for Selected Search Result Document Prediction

Table 4 .
Results for Web Search Re-ranking

Table 5 .
Effect of Context Sources. Boldfaced values denote the highest values.

Table 6 .
Results for Sequence Analysis. Values in boldface indicate significant differences (p < .05).