Take a Fresh Look at Recommender Systems from an Evaluation Standpoint

Recommendation has become a prominent area of research in the field of Information Retrieval (IR). Evaluation is also a traditional research topic in this community. Motivated by a few counter-intuitive observations reported in recent studies, this perspectives paper takes a fresh look at recommender systems from an evaluation standpoint. Rather than examining metrics like recall, hit rate, or NDCG, or perspectives like novelty and diversity, the key focus here is on how these metrics are calculated when evaluating a recommender algorithm. Specifically, the commonly used train/test data splits and their consequences are re-examined. We begin by examining common data splitting methods, such as random split or leave-one-out, and discuss why the popularity baseline is poorly defined under such splits. We then move on to explore the two implications of neglecting a global timeline during evaluation: data leakage and oversimplification of user preference modeling. Afterwards, we present new perspectives on recommender systems, including techniques for evaluating algorithm performance that more accurately reflect real-world scenarios, and possible approaches to consider decision contexts in user preference modeling.


INTRODUCTION
Of all the papers published in SIGIR 2022, 27.5% have titles that include the word "recommender" or "recommendation". This is a strong indication of research interest in Recommender Systems (RecSys) in the Information Retrieval (IR) community. As evaluation is also a traditional research topic in IR, it is interesting to study how recommendation algorithms are evaluated in general. More interestingly, a few recent papers report counter-intuitive observations made from experiments on recommender systems, in both offline and online settings [18,26,37,38,40].
Here are some examples of counter-intuitive observations. Ji et al. [18] report that both users who spend more time and users who have many interactions with a recommendation system receive poorer recommendations, compared to users who spend less time or who have relatively fewer interactions with the system. This observation holds on recommendation results by multiple models (i.e., BPR [33], Neural MF [14], LightGCN [13], SASRec [20] and TiSASRec [25]) on multiple datasets including MovieLens-25M, Yelp, Amazon-music, and Amazon-electronic. On a large Internet footwear vendor, through online experiments, Sysko-Romańczuk et al. [37] observe that "experience with the vendor showed a negative correlation with recommendation performance". The factors considered under "experience" include the number of days since account creation, number of days since the first shopping transaction, and the number and the value of purchase transactions made in the past year. Another study reports that "using only the more recent parts of a dataset can drastically improve the performance of a recommendation system" [40].
We interpret the reported counter-intuitive observations from two perspectives. First, these observations are made with respect to the time dimension, more specifically, the global timeline of user-item interactions. Here, we are not considering time as an additional feature or context in the algorithm modeling. Rather, we consider the arrangement of the user-item interactions by their timestamps in chronological order during evaluation. Hence, we have "number of days since the first transaction" and "recent parts of a dataset". The reported counter-intuitive observations call for a revisit of the importance of observing the global timeline in evaluating recommender models. Findings from the revisit may impact our way of conducting evaluation, in turn the model design, and more importantly, our understanding of recommender system. Second, these observations are considered counter-intuitive because they contradict our expectation (or an implicit assumption) on a recommender. That is, the more interactions a user has with a system, the higher chance that the recommender better learns the user's preference. However, these observations show otherwise.

With global timeline in mind, we conduct a case study to find out: to what extent the global timeline is observed in offline evaluation in academic papers (Section 2). Our case study is based on the full and industry papers published in the ACM Recommender Systems conference in the past three years (2020, 2021, 2022). Based on the findings, we revisit Popularity, the simplest recommendation model, in Section 3 to justify why this commonly used baseline is ill-defined. Then we move on to the discussion on the consequences of ignoring global timeline in evaluation: data leakage (Section 4) and simplification of user preference modeling (Section 5). In Section 6, we propose a fresh look at recommender system from the evaluation perspective. In Section 7, we present a summary of the key messages and contributions of this work, after which the paper is concluded in Section 8.

Table 1: The 5 offline evaluation settings described in [12], from the ideal and most close simulation of online process (Setting 1) to the least (Setting 5). The last column indicates whether the data split observes global timeline in user actions.

Setting | Train/test data split scheme | Global timeline
1 | Step through user actions in temporal order and make predictions for each user action along the way, based on the known user actions at the prediction time. Before the testing time point, every user action serves as a test instance, and subsequently becomes a training instance. | Yes
2 | Following Setting 1, instead of evaluating all user actions along time, only evaluate sampled user actions as test instances. The only difference to Setting 1 is the reduced number of test instances along the way. | Yes
3 | Sample a set of test users, then sample a single test time, and hide all items of test users after that time point. That is, the data is partitioned to train/test sets based on a single time point. | Partially
4 | Sample a test time for each test user (e.g., right before user's last action), and do not observe global timeline across all users. Leave-one-out is an example data split scheme under this setting. | No
5 | Completely ignore time as in the case that timestamps of user actions are unknown. Data is randomly partitioned into train and test sets. | No

CASE STUDY: DATA SPLIT SCHEMES
Most academic researchers do not have access to an online platform to directly evaluate their models by real user-item interactions. Evaluation on an offline dataset is the only choice in most cases. It is also well known that there are many more factors that may affect user behaviour online and the prediction power collected from offline evaluations may or may not be observed online. Hence, "the goal of the offline experiments is to filter out inappropriate approaches, leaving a relatively small set of candidate algorithms to be tested" online, as stated in the evaluation chapter of the recommender systems handbook [12,34]. However, to conduct offline evaluation, "it is necessary to simulate the online process where the system makes predictions or recommendations" [12]. Apparently, a close simulation of online process would make the results obtained from offline evaluation more indicative, better serving the purpose of algorithm selection.
Table 1 summarizes the five settings described in Gunawardana et al. [12], from the ideal setting (Setting 1) of simulating the online process to the least close simulation (Setting 5). We remark that the last two settings (Settings 4 and 5) do not maintain or observe the global timeline across all users. Hence, these two settings are not considered close simulations of the online recommendation process. As for Setting 3, the partition of train/test sets is based on a single time point along the global timeline. However, within the train or test sets, the data instances may not maintain their temporal order.
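To make the difference between Settings 3, 4, and 5 concrete, the following is a minimal sketch of the three split schemes on a toy interaction log. The data, the function names, and the (user, item, timestamp) tuple layout are all assumptions made for illustration; they do not come from [12] or from any particular toolkit.

```python
import random

# Toy interaction log of (user, item, timestamp) tuples; purely hypothetical.
data = [
    ("u1", "a", 1), ("u1", "b", 2), ("u1", "c", 3), ("u1", "d", 5),
    ("u2", "a", 4), ("u2", "d", 6), ("u2", "e", 7), ("u2", "f", 9),
    ("u3", "d", 6), ("u3", "e", 8), ("u3", "f", 10),
]

def random_split(data, test_ratio=0.2, seed=0):
    """Setting 5: ignore timestamps and partition at random."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

def leave_one_out_split(data):
    """Setting 4: each user's last interaction is that user's test instance."""
    last = {}
    for rec in data:
        user, _, ts = rec
        if user not in last or ts > last[user][2]:
            last[user] = rec
    test = set(last.values())
    train = [rec for rec in data if rec not in test]
    return train, sorted(test, key=lambda r: r[2])

def single_time_point_split(data, t_split):
    """Setting 3: train on interactions before t_split, test on the rest."""
    train = [rec for rec in data if rec[2] < t_split]
    test = [rec for rec in data if rec[2] >= t_split]
    return train, test

rnd_train, rnd_test = random_split(data)
loo_train, loo_test = leave_one_out_split(data)
stp_train, stp_test = single_time_point_split(data, t_split=6)

# Under leave-one-out, some training instances are newer than some test instances.
loo_overlaps = max(r[2] for r in loo_train) > min(r[2] for r in loo_test)
# Under a single time point split, every training instance precedes every test instance.
stp_overlaps = max(r[2] for r in stp_train) > min(r[2] for r in stp_test)
```

On this toy log, the leave-one-out training set contains interactions that occur after the earliest test instance, whereas the single time point split keeps the two sets separated on the timeline.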
To understand which settings are more widely used in evaluating recommender systems, we conducted a case study to collect the data split schemes used in the papers published in the last three years (2020-2022) of the ACM Recommender Systems conference. The ACM RecSys conference is considered here for its strong relevance to the topic and reasonable size. We considered all full papers and industry papers. However, a good number of papers study recommenders from a system perspective, like training efficiency, distributed and/or federated RecSys. Some others focus on user studies and user preference analysis. Hence, we did not include these papers in the case study. After filtering, we had 82 full and 9 industry papers which had clear descriptions of experiment settings. Among them, we further excluded another 3 papers. Two of them design experiments dedicated to the cold-start setting, and one paper is on news recommendation and the data is split by news topic in their experiment. Finally, our case study included 88 papers.

Summarized in Table 2, out of the 88 papers, 30 papers adopt random split (i.e., Setting 5). The random split can be either global-based or user-based. For the latter, a percentage of actions from each user are randomly sampled as test instances; then the remaining are training instances. The next most popular split scheme, with 22 papers, is leave-one-out (i.e., Setting 4), also known as leave-last-one-out, where the last action of each user is a test instance. We also include the cases where the last few actions (e.g., based on a pre-defined ratio) of each user are test instances. Then, 17 papers split data by a single time point (i.e., Setting 3), and 15 papers utilize simulation-based online testing. We note that the latter 15 papers mainly focus on bandits and reinforcement learning for recommendation. Lastly, there are 4 papers using a sliding window for evaluating incremental learning or session-based learning.
From this case study, we understand that 59.1% of the collected papers follow Settings 4 and 5 (see Table 1). That is, their offline evaluations do not simulate the online process well, and the global timeline across users is not maintained. We also observe that, although 17% of the papers utilize a simulation-based online setting, the adoption is not driven by the requirements of the recommendation research problem, but rather by the algorithms employed in their solutions, e.g., reinforcement learning. Discussion on the evaluation of reinforcement learning-based recommenders [6] is beyond the scope of this paper.
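The percentages quoted in this section follow directly from the paper counts reported in the case study. As a quick check, the short sketch below recomputes them; the dictionary keys are labels chosen here for readability, while the counts are the ones stated in the text.

```python
# Paper counts per data split scheme, as reported in the case study (88 papers in total).
counts = {
    "random split (Setting 5)": 30,
    "leave-one-out (Setting 4)": 22,
    "single time point (Setting 3)": 17,
    "simulation-based online testing": 15,
    "sliding window": 4,
}

total = sum(counts.values())

def share(n):
    """Percentage of the collected papers, rounded to one decimal place."""
    return round(100 * n / total, 1)

# Settings 4 and 5 together: the papers that do not observe the global timeline.
no_global_timeline = share(
    counts["random split (Setting 5)"] + counts["leave-one-out (Setting 4)"]
)
simulation_based = share(counts["simulation-based online testing"])
single_time_point = share(counts["single time point (Setting 3)"])
```

With these counts, Settings 4 and 5 account for 52 of 88 papers (59.1%), simulation-based testing for 15 of 88 (about 17%), and the single time point split for 17 of 88 (19.3%).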
We also note that there are other data split schemes for evaluating recommenders [29,36,44]. Our focus here is whether the global timeline is maintained during the offline evaluation. Next, we revisit the Popularity baseline to illustrate why maintaining the global timeline is critical.

ILL-DEFINED POPULARITY
As a non-personalized method, Popularity is often considered the simplest baseline and is widely used for comparison purposes in evaluation. Specifically, in our case study, 26% of the papers use Popularity as one of their baselines. In academic papers and the toolkits for RecSys, the popularity of an item is mostly defined as the number of interactions it receives in the training set. Recall that in our case study, 59.1% of the papers do not observe the global timeline in their evaluation, hence they cannot define the popularity of items along time. An interesting question then arises: To what degree does the popularity, as determined by the frequency of an item in the training set, reflect its true popularity in a real-world scenario?
To answer this question, we first show how a recommender works in general. With the help of a global timeline, Figure 1 gives absolute time points of historical user-item interactions for three example users u_1, u_2, and u_3. In the illustration, t_c indicates the current time point. If the users visit the website at the current time t_c, then the system will make recommendations to these users. Users then may choose to interact or not to interact with the recommended items. In the illustration, let us assume that all users interact with the recommended items, and these interacted items become the latest interactions for all three users. In practice, a recommender can learn from all or a subset of the historical interactions that occurred before time t_c, and makes recommendations if users visit the site at time t_c.

Now, let us look at some real-world examples of popularity ranking. The New York Times best sellers is a well-known and influential list of best-selling books in the United States. The list has been updated on a weekly basis since 1931, and Figure 2 shows a screen capture of the top few books as of the week of 19 Feb 2023. There is also an option to display monthly lists of different book genres, as shown on the left-hand side of Figure 2. The best sellers on Amazon is an hourly updated list. In short, it is typical for websites to feature a popularity ranking of items that is updated on an hourly, daily, weekly, or monthly schedule.
The real-world popularity rankings have two important properties. First, the rankings are dynamically updated along the timeline. For instance, the best sellers on the New York Times in Week 1 of Year 2020 shall be quite different from those in Week 3 of Year 2023. Second, the popularity ranking only considers item frequency in a predefined time range (e.g., an hour, a week, or a month), and does not necessarily use all historical data. The current weekly ranking only needs to use the interactions that occurred in the past week, and the current hourly ranking only uses interactions in the past hour. In other words, popularity has a very strong transience effect and often refers to what is trending during a (relatively short) time period.
The popularity baseline widely used in RecSys offline evaluations is very different. There is no time window (e.g., a week or a month) defined, and the popularity ranking is not updated along the timeline. In fact, as a significant portion of papers (e.g., 59.1% in Table 2) do not maintain the global timeline of their user actions, it is not possible to define and update such a ranking along the timeline. Hence, the popularity baseline is "forced" to use all interactions in the training set. As a result, the frequency-based ranking is a static ranking, covering the entire duration of the training data. This duration is determined by the dataset, and also by the adopted data partition scheme. For instance, the duration of popularity for the leave-one-out scheme will be the duration of the entire dataset, and the duration for the single time point scheme would be defined by the data before the time point.
If the data points in a dataset indeed cover a short time period, then popularity remains indicative. However, if the dataset covers user interactions collected from a long time period, then a single static ranking completely ignores the transience of popularity. This ranking becomes less meaningful. In particular, many datasets like MovieLens, Yelp, and Amazon reviews cover data points in a very long time span, e.g., more than 10 years [19]. In such datasets, a single static popularity ranking will be very different from the kind of ranking we see in Figure 2. Ji et al. [17] report that if we follow the popularity definition in the real world, the performance of the popularity method increases by at least 70%, compared to the ill-defined popularity on the MovieLens dataset.
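The contrast between the static ranking and a time-windowed ranking can be sketched in a few lines. The example below is only an illustration under stated assumptions: the function names, the integer day timestamps, and the toy items "old" and "new" are all hypothetical, and the windowed ranking is a simple stand-in for the weekly or hourly lists described above, not the method of [17].

```python
from collections import Counter

def rank_by_frequency(interactions, k):
    """Top-k items by interaction count; ties are broken by item id."""
    counts = Counter(item for _, item, _ in interactions)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ordered[:k]]

def static_popularity(train, k):
    """Frequency over the entire training set, with no time window."""
    return rank_by_frequency(train, k)

def windowed_popularity(interactions, now, window, k):
    """Frequency over [now - window, now) only, e.g. the past week."""
    recent = [r for r in interactions if now - window <= r[2] < now]
    return rank_by_frequency(recent, k)

# Hypothetical log with timestamps in days: an item that was popular long ago
# and an item that is trending right now.
log = [(f"u{d}", "old", d) for d in range(1, 6)]
log += [(f"u{d}", "new", d) for d in range(98, 101)]

static_top = static_popularity(log, k=1)
weekly_top = windowed_popularity(log, now=101, window=7, k=1)
```

Here the static ranking still places "old" first because it has more interactions overall, while the ranking over the past seven days places "new" first, which is closer to what a weekly best-seller list would show.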

DATA LEAKAGE
Ignoring the transience of popularity is not the only issue caused by not observing the global timeline. Another major consequence is data leakage, or more specifically, accessing future data that is impossible to access in reality.
In the following, we again use Popularity to illustrate the issue by mapping the data instances onto a timeline. Recall that the two most widely adopted data split schemes are random split and leave-one-out (see Table 2). In our following discussion, we use leave-one-out in our examples, to avoid potential confusion caused by different random partitions. Leave-one-out, or leave-last-one-out, is to sample the last interaction of each user as a test instance. The user's remaining interactions are in the training set.
Figure 3 provides an illustration of the leave-one-out data partition for three example users, with respect to the global timeline. Observe that the test instance for u_1 occurs at time t_1. If we consider time t_1 to be the current time t_c, as if the offline evaluation were online, then all the historical interactions a recommender can learn from at t_1 should be the three interactions by u_1 and the one interaction by u_2. A recommender would never have access to interactions that happen in the future with respect to time point t_1. At t_1, the future items include two interactions by u_2 and all interactions by user u_3, which occur after t_1. By forcing the popularity baseline to use all training data, the popularity method may recommend to u_1 some items that are very popular in the future, with respect to time t_1. Clearly, recommending items that are popular in the future, by using frequency counts of interactions that happen in the future, is unrealistic.

Figure 4: Number of users whose last interaction occurred, and number of movies which receive their very first rating (i.e., item release), in each week, over the 10-year (or 520-week) period of the MovieLens-25M dataset (reproduced from [19]). Plotted series: #item releases in a week; #user's last interactions in a week.
While Figure 3 is an illustration, we are more concerned about the scenarios that occur in reality. In reality, users may interact with a system at any time; new items may become available in the system at any time; outdated items are removed from the system at any time. For example, an iPhone model is usually discontinued about two years after its release, and the phone has changed from its first generation in 2007 to iPhone 14 in 2022. Many widely used datasets in RecSys research cover user-item interactions collected over a long time period, e.g., more than 10 years for MovieLens, Yelp, and Amazon. Figure 4 plots the number of users whose last interaction occurred in each week, and the number of new movie releases in each week, over the 10-year (or 520-week) time period in the MovieLens-25M dataset [19]. Recall that a user's last interaction also indicates the test time point, like t_1 and t_2 in Figure 3.
This leads to the next question: Why is the popularity baseline in academic research evaluated in this way? The reason is simple. We want to ensure a "fair comparison", where all models are expected to learn from the same training set, and to be evaluated on the same test set. Basically, our machine learning- or deep learning-based models are trained on the training set and evaluated on the test set. The popularity baseline is treated as a trivial machine learning model. It takes in all instances in the training set and produces a ranking by the ill-defined popularity for the purpose of "fair comparison" with others. In fact, it is not difficult to simulate a popularity ranking along the timeline with scheduled updates, as a non-personalized recommendation method. However, due to the presumed requirement of "fair comparison", all the training and test instances are processed in the same way in an experimental comparison.
Unfortunately, due to the leave-one-out data partition scheme, all machine learning-based or deep learning-based models suffer from the same issue: accessing future data that is impossible to access in reality. Formally, this is known as data leakage in machine learning. Ji et al. [19] offer a detailed study on this topic in the context of RecSys, and Zhao et al. [44] also observe data leakage in their evaluation. If a dataset covers a long time period, the items (e.g., movie, product, restaurant) are not all available for interaction at the very beginning of the entire time period, and users' last interactions may occur at any time point. Hence, data leakage is unavoidable if random or leave-one-out data splits are adopted on such datasets. From this perspective, both a simple model like popularity and the more complex models may not be evaluated in a practical manner because they access future data, unless the data partition scheme leads to no (or at least minimum) data leakage. For example, if the dataset is partitioned by an absolute time point (i.e., all interactions that occurred before that time point are training data, and those after it are test data), then there is no data leakage. Table 2 shows that fewer than 20% of papers use the single time point split.
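The leakage described above can be made tangible by counting, for a given test instance, how many training interactions lie in its future. The following is a minimal sketch on a toy leave-one-out partition that mirrors the Figure 3 narrative (three users, with the first user's test instance preceding much of the training data). The items, timestamps, and function name are hypothetical and chosen only for illustration.

```python
# Toy leave-one-out partition of (user, item, timestamp) tuples; hypothetical values.
train = [
    ("u1", "a", 1), ("u1", "b", 2), ("u1", "c", 3),
    ("u2", "a", 4), ("u2", "d", 6), ("u2", "e", 7),
    ("u3", "d", 6), ("u3", "e", 8),
]
test = [("u1", "d", 5), ("u2", "f", 9), ("u3", "f", 10)]

def visible_and_leaked(train, test_instance):
    """Separate training data known at test time from training data in its future."""
    t_test = test_instance[2]
    visible = [r for r in train if r[2] < t_test]
    leaked = [r for r in train if r[2] > t_test]
    return visible, leaked

# For each test user: (number of visible, number of leaked) training interactions.
report = {}
for instance in test:
    visible, leaked = visible_and_leaked(train, instance)
    report[instance[0]] = (len(visible), len(leaked))
```

For the first user's test instance at time 5, four training interactions are legitimately known (three by the same user and one by the second user), while four others occur later and would be leaked to any model trained on the full training set. For the last test instance, nothing is leaked.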
In summary, evaluations conducted without observing the global timeline in an offline setting may suffer from data leakage, rendering the results incomparable [19]. Experiments conducted in this way also contribute to the difficulty of reproducibility [44]. In fact, a simple way of partitioning data into train/test without considering the timeline is also a form of simplification of the RecSys research problem. With time taken into consideration in recent evaluations [18,22,24,40], we start to obtain the counter-intuitive observations listed in the Introduction.

SIMPLIFIED USER PREFERENCE MODELING
The discussion so far on ignoring the global timeline does not explain (i) why models trained using only the more recent parts of the data demonstrate better performance, and (ii) why more interactions from users lead to poorer recommendations. A hypothetical answer to both questions is the simplification in learning user preference in current models. The absence of the global timeline could be a contributing factor as well.
To better understand user preference modeling, let us go back 30 years to learn how collaborative filtering was first defined in the Tapestry system [10]. The way that the Tapestry system supports collaborative filtering is to let users read documents recommended by other users. The authors give an example in their original paper [10]. A user u wants to read the interesting documents, but not all documents, from a newsgroup. She knows that some users read all of these documents and mark the interesting ones. She can then simply choose to read only the documents that are marked interesting by these users. This kind of filtering is conceptually similar to reading only the tweets written or retweeted by the users one follows on Twitter.
In Tapestry, a user's preference is directly reflected by the other users he/she follows. A hypothetical extension of this understanding is that if user u_1 follows u_2, then u_1 prefers u_2's decision making in judging interesting documents (or retweeting) given the context at that time, e.g., when a document is received in the newsgroup.
Different from Tapestry, where users are empowered to choose who to follow, user preference in mainstream RecSys research is inferred from user-item interactions. The main underlying assumption is that a user u would prefer the items that are chosen by other users who share similar preferences with u. Preference similarity between users is reflected by similar user-item interactions in the past. If users u_1 and u_2 both purchased the same mobile phone, then we would consider that u_1 and u_2 share a similar preference, at least on this particular item. However, purchasing the same phone may not necessarily reflect that the two users share a similar decision making process, if we consider the context changes in a system from time to time.

Figure 5 shows an example scenario where the three users u_1, u_2, and u_3 purchased the same phone at different time points, t_1, t_2, and t_3 respectively. We may further consider that t_1 is the first day when this phone was released, t_3 is among the last few days before this phone was discontinued, and t_2 is in the middle between them. We may also consider that an upgraded version of this phone was released between t_2 and t_3. In this scenario, the three decision makings could be very different, because the alternative phone models to choose from at the three time points t_1, t_2, and t_3 will be very different, as well as the popularity of the alternative models at these time points. The same applies to other products that have a relatively short life span on sale (from release to discontinuation). In short, even if two users interact with the same item, if the two interactions occur at very different time points, the contexts for the two decision makings could be very different. The context here is reflected by the candidate items and their properties (e.g., their popularity ranking) at the "decision making" time.
The context here could be just one of many factors that may affect user decision making. Modeling decision making or user preference from user behavior (e.g., user-item interactions) is a complex issue. We refer readers to [15] for a comprehensive discussion of human decision making for recommender systems. Furthermore, Kleinberg et al. [21] also examine the inconsistency between user behavior and the user's true preference, and state that the inconsistency could be a reason for poor recommendations because the platforms are not optimizing for user happiness. Specifically, the authors formalize two decision-making agents for a user to model the inconsistencies of user behavior. One decision-making agent is "impulsive and myopic" (e.g., enjoys watching short-form videos for now) while the other is "forward-looking and thoughtful" (e.g., shall not spend too much time watching videos).
In our discussion, we solely focus on the context changes at the system side, more specifically, the changes that can be fully observed in an offline dataset. For example, new items are released from time to time, and outdated items no longer receive interactions from any user after some time. Other than the availability and unavailability of items, different items accumulate different numbers of interactions at different time points.
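The system-side context described here can be derived from an offline dataset alone. The sketch below is one possible way to do so, under loudly stated assumptions: an item's first and last observed interactions are used as proxies for its release and retirement, and the toy log (a phone and its upgraded version, loosely following the scenario of Figure 5) is hypothetical. Because the last interaction of an item is only known in hindsight, this is a descriptive analysis of context, not a feature available to a model at prediction time.

```python
from collections import Counter

def item_lifespans(interactions):
    """First and last observed interaction time per item (proxy for release and retirement)."""
    spans = {}
    for _, item, ts in interactions:
        first, last = spans.get(item, (ts, ts))
        spans[item] = (min(first, ts), max(last, ts))
    return spans

def context_at(interactions, t):
    """Items observable at time t, each with its number of interactions accumulated before t."""
    spans = item_lifespans(interactions)
    counts = Counter(item for _, item, ts in interactions if ts < t)
    return {
        item: counts[item]
        for item, (first, last) in spans.items()
        if first <= t <= last
    }

# Hypothetical purchases of a phone and its upgraded version, with timestamps in days.
log = [
    ("u1", "phone_v1", 1),
    ("u2", "phone_v1", 50),
    ("u5", "phone_v2", 80),
    ("u3", "phone_v1", 100),
    ("u6", "phone_v2", 120),
]

context_t1 = context_at(log, 1)    # release day of phone_v1
context_t2 = context_at(log, 50)   # mid-life of phone_v1
context_t3 = context_at(log, 100)  # phone_v2 is already available
```

In this toy log, the three purchases of the same phone happen under three different contexts: no alternative and no purchase history at day 1, one earlier purchase at day 50, and a competing upgraded model with its own purchase history at day 100.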
Back to the illustration in Figure 5, if user u_4 purchases the same phone within a few days of u_3's purchase, then the contexts for these two decision makings are likely very similar. Although it is hard to directly model the context changes between any two user-item interactions, it is reasonable to assume that if two interactions occur within a short time period, the context change at the system side is not significant. The length of a reasonable "short time period" may vary from one system to another depending on the characteristics of the items (e.g., news, movie, music, book, restaurant, and consumer electronics) [40]. In other words, if a user has many interactions that occurred not too long before the test time, the contexts of the past interactions and the context of this test instance are similar. In this case, the test interaction may well align with the user preference learned from the past few interactions. This could be an explanation of why recommender models that are trained with recent parts of the dataset deliver better accuracy [40], and of why the users who have recent transactions enjoy better recommendations [18].
Because the timeline is ignored in our modeling, the possible context changes cannot be considered in mainstream RecSys models. Two user-item interactions that occurred 10 years apart are modeled in the same way as if the two interactions occurred within the same day. This could be an implication of ignoring the global timeline in our evaluation.
Here comes another question. If the global timeline is so important, then why are industry players not highlighting this factor in modeling? One possible reason is that industry players often need to process a large volume of data. In their model evaluation (regardless of online or offline), the recent parts of data are sufficient for training. We can take some recently released datasets as examples. The MIND dataset for news recommendation from Microsoft contains interactions of one million users randomly sampled in 6 weeks [42]. The user behavior data from Taobao for recommendation contains interactions of one million users randomly sampled in about one week. In terms of dataset size, these two datasets are not small; in terms of the time duration they cover, the system-side contexts may not change much. In industrial-scale systems, Anil et al. [1] state that "limiting training data to more recent periods is intuitive." The authors further comment that if the date range is extended further back in time, "the data becomes less relevant to future problems". In an industry setting, recommenders are often periodically retrained/updated with the recent data. For example, in the implementation of the Wide & Deep learning which powers the Google Play recommendation, Cheng et al. [4] state that "user and app impression data within a period of time are used to generate training data", then "every time a new set of training data arrives, the model needs to be re-trained". A more recent model is Monolith [28], a BytePlus Recommend product. The training of Monolith is designed to have a batch training stage and an online training stage. For online training, the model parameters can be updated at minute level; hence the model is able to "interactively adapt itself according to a user's feedback in realtime" [28]. In this sense, practical recommenders naturally follow the timeline and also consider data recency along the timeline.
Recall that "the goal of the offline experiments is to filter out inappropriate approaches, leaving a relatively small set of candidate algorithms to be tested" online [12]. Evaluating our models in a manner that closely mimics their online usage remains of utmost importance. Doing so will help to narrow the divide between the algorithms described in academic papers and the ones that drive various platforms.

A FRESH LOOK AT RECSYS
In academic research, we often abstract similar real-world problems (e.g., the various types of recommendation problems in different domains) to a formal research problem. Accordingly, we propose evaluation metrics to quantify to what extent a proposed solution has addressed this problem on different datasets. Very often, a solution is designed largely for the purpose of achieving better evaluation scores.
Unfortunately, the abstraction process to reach a formal problem definition comes along with simplification of real-world problems. In the mainstream RecSys problem setting, the "time" factor has been largely ignored due to the simplification process. Recall the five settings for offline evaluation in Table 1. Setting 1 is the closest to the real-world scenario while Setting 5 completely ignores timestamps. Our case study shows that 59.1% of papers adopt Settings 4 and 5, which is a strong indication of problem simplification.
To give a fresh look at the RecSys problem, the global timeline has to be part of the problem definition, to truly reflect the problem of learning from past interactions and then recommending unseen items. Accordingly, the evaluation methodology has to factor in the global timeline.

More Practical Evaluation
We are not short of papers on large-scale empirical evaluations [9,35,36,44,45]. However, the results reported in these papers may not be comparable to each other because the timeline is ignored. A recent benchmark [44] verifies that the current evaluation methodology leads to recommending "future items" which would never occur in reality, a finding consistent with what was reported earlier in [19]. Despite the existence of many large-scale empirical studies, there remain questions on reproducibility, and technical and theoretical flaws [5,8]. On the other hand, it is a challenging problem to evaluate recommender systems, because the evaluation metrics can be defined from multiple perspectives [2,16,43] and there remain many challenging issues even if these metrics are well defined [39], and even if the evaluation is conducted online [3].
We probably want to begin with something simple: Can we fairly compare our model with Popularity, the simplest baseline? In other words, our recommenders shall be compared with a Popularity ranking that is used in practice, i.e., the ranking of items on an hourly, daily, or weekly basis.
In Figure 6, the vertical line in purple is an illustration of the Popularity ranking of items at a time point slightly before the current time t_c. When users interact with the system at t_c, the popularity ranking provides the most popular items within a recent time duration. This duration can be one hour, one day, one month, or the entire history, i.e., it is a parameter depending on the characteristics of the items. This popularity method can be more precisely named Recent Popularity, where the recency duration is configurable. If the duration is set to cover the entire duration of all existing training data, then the ranking gives the most popular items in history. Note that the most popular in history remains different from the ill-defined Popularity baseline, because of the observation of the global timeline in Figure 6.
6.1.1 The Timeline Scheme. We now apply a similar concept to evaluate any RecSys model in an offline setting. The first step in the evaluation is to split the data into training and test sets. We may follow any existing data split scheme. In the following, the leave-one-out scheme is used as an example. With a leave-one-out split, the last interaction of every user is in the test set and the remaining interactions of the user are in the training set.
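The leave-one-out split described above can be sketched as follows. The function name and the `(user, item, timestamp)` tuple layout are illustrative assumptions. The toy log is chosen so that one training instance is later than a test instance of another user, which is the situation the global timeline is meant to expose.

```python
def leave_one_out_split(interactions):
    """Put each user's latest interaction in test and the rest in train.

    `interactions` is a list of (user, item, timestamp) tuples. On ties
    for a user's latest timestamp, the row appearing last in the list
    is held out.
    """
    last_index = {}
    for idx, (user, _, ts) in enumerate(interactions):
        if user not in last_index or ts >= interactions[last_index[user]][2]:
            last_index[user] = idx
    held_out = set(last_index.values())
    train = [row for i, row in enumerate(interactions) if i not in held_out]
    test = [interactions[i] for i in sorted(held_out)]
    return train, test


# Hypothetical toy log with two users.
log = [("u1", "a", 1), ("u1", "b", 3), ("u2", "b", 5), ("u2", "c", 9)]
train, test = leave_one_out_split(log)

# Pairs where a training instance is later than a test instance:
# without a global timeline, a model could learn from such "future" data.
future_pairs = [(tr, te) for tr in train for te in test if tr[2] > te[2]]
```

Here the training instance of `u2` at time 5 is later than the test instance of `u1` at time 3, so a model trained on the whole training set would see data from the future of that test instance.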
To observe the global timeline, all user-item interactions (in both train and test) are arranged in chronological order by their timestamps. Then the entire timeline is split into time windows of a fixed size, as shown in Figure 7. We evaluate a model on all test instances within one time window at a time, window by window. Suppose the current window for evaluation contains two test instances (see Figure 7). The model may learn from all or a subset of the following data instances: (i) all training instances in the current window, and (ii) all training and test instances in all the windows before the current window. For each test instance, we compute the evaluation measures (e.g., precision/recall), and aggregate the results. The aggregation may happen within each window, or across all test instances in all windows.
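The window-by-window procedure can be sketched as below. This is only one possible reading of the scheme: the helper names (`window_of`, `timeline_evaluate`, `fit_most_popular`, `hit_at_1`), the tuple layout, and the use of a top-1 popularity model with a hit-or-miss measure are all assumptions for illustration.

```python
from collections import Counter, defaultdict


def window_of(ts, start, size):
    """Index of the fixed-size time window that contains timestamp `ts`."""
    return int((ts - start) // size)


def timeline_evaluate(train, test, size, fit, hit):
    """Evaluate window by window; returns {window index: mean hit rate}."""
    start = min(ts for _, _, ts in train + test)
    test_by_window = defaultdict(list)
    for row in test:
        test_by_window[window_of(row[2], start, size)].append(row)

    results = {}
    for w in sorted(test_by_window):
        # (i) training instances up to and including the current window
        visible = [r for r in train if window_of(r[2], start, size) <= w]
        # (ii) test instances from strictly earlier windows
        visible += [r for r in test if window_of(r[2], start, size) < w]
        model = fit(visible)  # the model is refit for every window
        hits = [hit(model, r) for r in test_by_window[w]]
        results[w] = sum(hits) / len(hits)
    return results


def fit_most_popular(visible):
    """Toy model: the single most frequent item among visible instances."""
    counts = Counter(item for _, item, _ in visible)
    return max(sorted(counts), key=counts.get) if counts else None


def hit_at_1(model, instance):
    """1 if the recommended item equals the held-out item, else 0."""
    return int(model == instance[1])


# Hypothetical data: item "b" only appears from time 4 onwards.
train = [("u1", "a", 0), ("u2", "a", 1), ("u3", "b", 4),
         ("u1", "b", 5), ("u2", "b", 6), ("u5", "b", 6)]
test = [("u3", "a", 2), ("u4", "b", 7)]

per_window = timeline_evaluate(train, test, size=4,
                               fit=fit_most_popular, hit=hit_at_1)
static_model = fit_most_popular(train)  # ignores the timeline
static_hit = hit_at_1(static_model, test[0])
```

In this toy case the timeline-aware model recommends "a" for the test instance at time 2, whereas the static model fitted on all training data recommends "b", an item that does not yet exist at that time.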
The proposed timeline scheme can be considered as sitting between Setting 2 and Setting 3 when mapped to the five settings in Table 1. It offers evaluations at multiple time points (through time windows) and maintains the global timeline. Note that Setting 3 in Table 1 (a single time split) evaluates a model at a single time point only.
6.1.2 Discussion on the Timeline Scheme. We note the following points for the timeline evaluation scheme described above.
First, this evaluation scheme observes the global timeline by design. Second, this evaluation scheme still suffers from data leakage. However, the amount of data leakage is controlled by the size of the time window. If the time window is reasonably small with respect to the entire timeline (e.g., one month vs. ten years), then data leakage may not significantly affect the results. The amount of data leakage in each time window trades off against the total number of windows in the entire evaluation. We may consider the current mainstream offline evaluation as a special case of the timeline scheme: there is only one big time window covering the entire timeline. To completely avoid data leakage, the evaluation scheme can be changed to only allow the model to learn from previous windows when evaluating on the current time window. Third, along the timeline, the model needs to be retrained or updated for evaluation on each time window. The design allows a model to learn from different sets of training instances. For example, Recent Popularity may derive the popularity ranking from only one or a few recent windows. Other machine learning based models may choose to use training instances from recent windows as well, i.e., only the recent parts of the data. In this sense, the number of recent windows/interactions becomes a hyper-parameter in model training. This is a different understanding of "fair comparison" from the current mainstream setting, where all models use the same set of training data.
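The two options discussed above, i.e., learning only from previous windows to avoid leakage and restricting training to a few recent windows, can be expressed as a choice of which window indices a model may use. The function below is a hypothetical helper; the parameter names `strict` and `n_recent` are assumptions introduced for illustration.

```python
def visible_windows(current, strict=False, n_recent=None):
    """Indices of the windows a model may learn from for window `current`.

    strict=False: the current window is included (only its training
                  instances would be used, so some leakage remains).
    strict=True:  only earlier windows are included (no leakage).
    n_recent:     if given, keep only the most recent `n_recent` windows,
                  which turns the amount of history into a hyper-parameter.
    """
    last = current - 1 if strict else current
    first = 0 if n_recent is None else max(0, last - n_recent + 1)
    return list(range(first, last + 1))


default_view = visible_windows(5)                          # windows 0..5
leak_free_view = visible_windows(5, strict=True)           # windows 0..4
recent_view = visible_windows(5, strict=True, n_recent=2)  # windows 3..4
```

Under this view, a Recent Popularity baseline and a learned model may each pick their own `n_recent`, as long as both are drawn from the same set of permitted windows.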
Lastly, the timeline scheme is just one possible way to evaluate RecSys in a more practical manner, and the scheme might have already been used in some previous studies. In particular, Lathia et al. [23] conducted experiments on Netflix data with a 7-day window and evaluated two recommenders with dynamic model updating over time. There are other data partition schemes that do not lead to data leakage, for example, partition by time point. Further, there is also a line of research in RecSys known as incremental learning. In this setting, a model learns from past data and predicts future interactions along the timeline, by definition. The discussion here offers a revisit of the batch-oriented train/test offline evaluations that do not consider the timeline.
The key consideration here is to observe the global timeline in the evaluation process. The key assumption is that the system context does not change much within a time window. As mentioned above, this evaluation scheme is not problem free, e.g., data leakage remains possible. However, exactly mimicking the real-time setting (i.e., Setting 1 in Table 1) would lead to a very complex evaluation process, which is less practical in academic research. The timeline scheme described above is a relaxed version of Settings 1 and 2.
A potential challenge of observing the timeline in evaluation is data sparsity in RecSys. With the timeline in consideration, the cold-start issue becomes a common problem for almost every user. Basically, with time in the picture, the user-item interactions are no longer projected to a two-dimensional space (i.e., user and item), but to a three-dimensional space (i.e., user, item, and time). This would make the RecSys problem much more complicated and challenging, but could make it more interesting as well. At the same time, data sparsity could be partially eased if additional information (e.g., attributes) is available for users and/or items in the released dataset. This also calls for industry to revisit what information can be released in a dataset in addition to the user-item interactions, without compromising user privacy.

Meaningful User Preference Modeling
A recommender is expected to answer a user's latent (information) needs. The user-item interactions are the observed results, or the answers to the earlier information needs. In the original design of collaborative filtering, users choose whom to follow as the expected "information filters". In the current RecSys model design, users are not empowered to proactively choose the information filters. Rather, the information filters are modeled based on users' past interactions. However, such simplification in user preference modeling only captures the results of the decisions, and does not capture the "contexts" in which the decisions are made. Even if two users interact with the same item, they may not make their decisions in a similar context, particularly when the two decisions occur a long time apart. However, it is hard to model the decision context, given the limited data available in academic research. In the following, we discuss a few possible ways to model the context similarity between two decisions. Nevertheless, more and deeper research is expected in this area, considering that decision making is a complex issue [15,21].
One possible way of evaluating the similarity of decision contexts is through impressions. In simple words, an impression is the list of items presented to a user when she/he makes an interaction. For example, if two users interact with the same item, but each of them was presented with a different impression (i.e., a different list of items containing that item), then the two decisions are made in different contexts, although the final decisions are the same. Recently, a few datasets have been made available with impressions, including the ContentWise Impressions [31], MIND [42], and FINN.No Slates [7] datasets. Such datasets are of great value for exploring new ways of user preference modeling. There is also a study on the evaluation of recommender systems with impressions [30].
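One simple way to quantify how similar two impressions are is the Jaccard overlap of the two item lists. This is an assumed measure chosen for illustration; the text above does not prescribe a specific similarity function, and the item identifiers below are hypothetical.

```python
def impression_similarity(impression_a, impression_b):
    """Jaccard similarity between two impressions (collections of item ids).

    Returns a value in [0, 1]; two empty impressions are treated as
    having no evidence of a shared context and score 0.0.
    """
    a, b = set(impression_a), set(impression_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Two users choose the same item "i1", but from different impressions.
impression_u1 = ["i1", "i2", "i3", "i4"]
impression_u2 = ["i1", "i5", "i6", "i7"]
different_context = impression_similarity(impression_u1, impression_u2)
same_context = impression_similarity(impression_u1, impression_u1)
```

In this example the two impressions share only the chosen item, so the similarity is 1/7, even though the final decision is identical.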
Large-scale recommender systems typically consist of multiple steps, such as matching (i.e., candidate generation) and ranking [27,45]. In the matching step, candidate items are identified from all available items. In the absence of impressions, the decision context might be reflected by these candidate items, provided that the matching step observes the global timeline and retrieves only the items available at the corresponding time point (i.e., a test instance's timestamp). Here, the similarity of candidate items is a proxy for the similarity of two decision contexts. The reason for using candidate items instead of the final ranked items is that candidate items are less dependent on a particular ranking algorithm. However, generating candidate items for every interaction is very expensive. A simplification could be to assume that if two interactions happen within a very short time period, then their decision contexts are similar. Then, a function of the time duration could be used to model the context changes between two interactions.
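As one hedged example of such a function of time duration, an exponential decay with a half-life maps the gap between two interactions to a similarity in (0, 1]. The exponential form and the `half_life` parameter are assumptions for illustration; other decay shapes would serve equally well.

```python
def context_similarity(ts_a, ts_b, half_life):
    """Time-based proxy for the similarity of two decision contexts.

    Returns 1.0 when the two interactions are simultaneous and halves
    for every `half_life` units of time between them. `half_life` must
    be positive and in the same unit as the timestamps.
    """
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    gap = abs(ts_a - ts_b)
    return 0.5 ** (gap / half_life)


# Timestamps in days, with an assumed half-life of one week.
same_day = context_similarity(100, 100, half_life=7)
week_apart = context_similarity(100, 107, half_life=7)
year_apart = context_similarity(0, 365, half_life=7)
```

With this choice, two interactions one week apart have a similarity of 0.5, while interactions one year apart are treated as having almost unrelated contexts.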
Modeling user preference with the consideration of decision context offers us a fresh look at many specific problems in RecSys. One of them is sequential recommendation. In sequential recommendation, the interactions (or actions) of a user form a sequence and the task is to predict the user's next interaction. A user sequence preserves the relative sequential order of her interactions, but does not record the timestamps of these interactions. Petrov and Macdonald [32] show that recent training interactions (in terms of sequential order) in a sequence better indicate the user's interest. The recency by sequential order can be an approximation of the recency by timestamps. Nevertheless, without recording the timestamps of the interactions in a sequence, it remains difficult to precisely model the context differences in decision making. For one user, her first and last interactions could be one year apart. For another user, all interactions in her sequence may occur within a week. The decision-making contexts in these two sequences will be quite different. Hence, it would be more meaningful to model interactions with timestamps in sequences. By considering timestamps, we are also in a better position to evaluate whether some datasets are indeed suitable for sequential recommendation. For example, the MovieLens dataset is not suitable for sequential recommendation, because there is no meaningful sequence in the dataset [41].

DISCUSSION
First, considering the temporal factor in offline evaluation is not new, as evidenced by the five settings listed in Table 1, which originate from [12] and its earlier edition [11]. An example evaluation with a sliding window was also reported in 2009 [23]. Nevertheless, as shown in Table 2, the mainstream offline evaluations do not consider the global timeline. Simulation-based online settings are being increasingly adopted in recent studies, primarily because of the algorithms they use, such as reinforcement learning, rather than because of a general need of recommendation models. Researchers who are new to this field may follow the bulk of existing literature and simply borrow the widely adopted offline evaluation settings without much further consideration. This paper offers an alternative view for researchers to reconsider how to conduct offline evaluations to better serve the purpose of selecting the best candidate algorithms.
Second, this paper tries to provide an alternative view of "fair comparison". If two methods are given access to the same set of training data, then it is each method's own choice which portion of the training data to use and how to use it. For example, if a Popularity method achieves its best performance by using a ranking that is updated weekly, then it should not be forced to use all historical data to produce a static ranking. Similarly, if a machine learning model performs best by learning from only the recent portion of the training data, then it is fair to compare it with another model that learns from more historical data, as long as both models have access to the same training set. In this case, the amount of training data to use becomes a model hyperparameter, as studied in [40].
Third, as a new look at recommender systems, this paper proposes a timeline scheme for offline evaluation as a better way to simulate a model's practical setting. Again, the general concept of the timeline scheme is not new. However, the key message here is to maintain the global timeline in evaluation when the dataset covers a long time period. The scheme is designed to balance the evaluation complexity (e.g., the number of models that need to be trained or updated over time) and the potential issues (e.g., data leakage). On the other hand, a potential risk is that once "time" becomes a parameter in the evaluation process (e.g., the size of the time window in the timeline scheme), it may become yet another parameter to be optimized for through system design. More research is expected to come up with a more effective and accurate mechanism for offline RecSys evaluation.
Fourth, this paper also briefly touches on the concept of modeling decision-making contexts, rather than only the results of decision making (i.e., user-item interactions), for user preference modeling. To our understanding, the decision-making context is time-dependent. With the timeline scheme in evaluation and models retrained or updated along the timeline, the modeling of decision contexts could lead to more interesting findings in recommender systems.
Lastly, it should be noted that the criticisms raised in this paper regarding poorly executed evaluations are primarily directed towards recommender systems in academic research. While industry practitioners typically use recent data to train and evaluate their models and regularly update them, we believe that there is a pressing need to improve the quality of offline evaluations in academic research in order to bridge the gap between academic and industry practices. On the other hand, the extent to which our conclusions about model effectiveness would be altered by taking the global timeline into account during evaluation has not been thoroughly investigated.

CONCLUSION
We start with a few counter-intuitive observations made in recent studies; then we explain the reasons behind them. One key reason is that the global timeline is ignored in model evaluation, which leads to a poor implementation of the simplest baseline, Popularity. Interestingly, because industry players often have to limit themselves to recent data due to large data size, they may not highlight the importance of the timeline. However, in academic research, many widely used datasets cover interactions recorded over a long period. Following a similar problem understanding as industry players but evaluating on datasets of different characteristics is the main contradiction here. The absence of a timeline leads to improper evaluation of our models, as the offline settings are far from a good simulation of the online scenario. Hence, the models developed in academic research are rarely transferable to practical systems. In this paper, we highlight the importance of the timeline for a better simulation of the online setting, and hence for results that better indicate which algorithms are worth the expensive online testing. On top of that, the consideration of the timeline also provides us with new insights in modeling user preference. We should not only focus on the user-item interactions, which are the results of decisions, but also focus on the contexts in which the decisions are made. After all, we aim to model user preference in making decisions.
This perspectives paper calls for research in the following directions. One is the adaptation and standardization of the timeline evaluation scheme, where the global timeline is built into the evaluation process. This direction includes two subtasks: (i) the study of the extent to which our existing conclusions about model effectiveness would be altered by taking the global timeline into account, and (ii) the complexity of introducing time as another factor in the evaluation process. Another direction is a deep understanding of the relationship between user behavior and recommendation. This direction also includes two subtasks: (i) a better way of interpreting user-item interactions as the results of decision making for effective recommendation, and (ii) a better way to understand and model user decision making.

Figure 1: Train/test in practical systems, with the current time point marked.

Figure 2: The New York Times best sellers, 19 Feb 2023.

Figure 5: Context for the decision making behind an interaction, e.g., a phone purchase.

Figure 6: Illustration of the evaluation of (recent) popularity along the timeline. The popularity of an item is derived from the interactions received within the configurable recent time duration.

Table 2: Number and percentage of papers by their adopted data split scheme in the ACM RecSys conference (2020-2022).
process as close as possible, to the most simplified setting (Setting 5). For simplicity, in our discussion we only consider training and test instances, and do not consider validation or development sets.