ForeSeer: Product Aspect Forecasting Using Temporal Graph Embedding

Developing text mining approaches to extract aspects from customer reviews has been well studied due to its importance in understanding customer needs and product attributes. In contrast, it remains unclear how to predict the future emerging aspects of a new product that currently has little review information. This task, which we name product aspect forecasting, is critical for recommending new products, but is also challenging because of the missing reviews. Here, we propose ForeSeer, a novel textual mining and product embedding approach progressively trained on temporal product graphs for this novel task. ForeSeer transfers reviews from similar products on a large product graph and exploits these reviews to predict aspects that might emerge in future reviews. A key novelty of our method is to jointly provide review, product, and aspect embeddings that are both time-sensitive and less affected by extremely imbalanced aspect frequencies. We evaluated ForeSeer on a real-world product review system containing 11,536,382 reviews and 11,000 products over 3 years. ForeSeer substantially outperformed existing approaches, with at least a 49.1% AUPRC improvement in the realistic setting where aspect associations are not given. ForeSeer further improves future link prediction on the product graph and review-aspect association prediction. Collectively, ForeSeer offers a novel framework for review forecasting by effectively integrating review text, product network, and temporal information, opening up new avenues for online shopping recommendation and e-commerce applications.


INTRODUCTION
Customer reviews reflect the properties of a product and the needs of customers on online shopping systems [24,88]. As a result, one widely studied task is to extract descriptive keywords from customer reviews, such as "not greasy" and "soft inside". These descriptive keywords, referred to as aspects in the literature, are later used as key features for various applications that benefit buyers making decisions and sellers improving the market, such as customer behavior prediction [48], sentiment analysis [42], opinion mining [26,71], product rating [34], and marketing refinement [62]. To tackle this problem, many text mining methods have been proposed to mine such aspects from customer reviews [4,7,18,25,53,73,79,83,84].
Despite their usefulness, customer reviews are also subjective and sometimes biased, so mining effective aspects for a product requires an adequate number of reviews. For a new product with limited customer reviews, aspect mining methods suffer from a cold start and mine inaccurate, noisy aspects. In this paper, we target a novel product aspect forecasting task: forecasting the future top aspects of products with inadequate reviews (Fig. 1), i.e., what aspects will customers use to describe a new product after six months or three years? To the best of our knowledge, product aspect forecasting has not been studied in the literature and goes one step beyond traditional aspect extraction tasks. The most related task is zero-shot aspect-based sentiment analysis [9,17,54], but those frameworks are restricted to sentiment analysis or still require multilingual supervision. Recent progress in large language models such as ChatGPT [44] or GPT-4 [2] may also open a door to improved zero-shot aspect extraction performance, but besides their heavy computational costs, the limited availability of reviews for new products still prevents such models from forecasting their future top aspects. In this paper, we formulate the product aspect forecasting task as a multi-future top-K aspect retrieval problem. An intuitive solution is to find similar products with longer histories and infer from their received feedback, just as a customer would search for reviews of similar older products with the same functionality, brand, or style before making a decision. Motivated by this, we propose a three-stage framework, ForeSeer, to address the problem. ForeSeer is a flexible textual mining and product embedding framework progressively trained on temporal product graphs. The key idea is to find similar older products from the temporal product graph and propagate potential aspects mined from
their reviews to the new item. This enables forecasting currently under-estimated aspects that might be frequently mentioned in the future (Fig. 2). ForeSeer has three core steps: contrastive review-aspect association learning, progressive temporal graph-based product embedding, and aspect-guided product embedding refinement with temporal information. The first step efficiently captures the static semantic relationship between review texts and aspects, and the second step captures the evolving graphical product similarity information and how products aggregate reviews over time. In the third step, we design a novel product-aspect temporal attention (PATA) module to adjust the product embeddings, guided by aspect embeddings, for multi-length future forecasting.
We evaluated ForeSeer on a real-world customer review dataset, which contains 11,000 products, 11,536,382 reviews, and a product similarity graph constructed using user queries and clicks. Our dataset contains the timestamp of each review and product over three years. We observed substantial improvements over comparison approaches on product aspect forecasting in both settings, with accessible estimated or ground-truth associations. We found that the product embeddings derived from ForeSeer show more visible patterns as the timestamp increases, suggesting an accurate incorporation of temporal information. In addition to aspect forecasting, ForeSeer achieved prominent performance on multi-time future link prediction and review-aspect association prediction, demonstrating its wide applicability in modeling temporal review information. ForeSeer is, to our knowledge, the first approach to product aspect forecasting and can be broadly applied to other temporal graph mining tasks. Its key idea is to dynamically find and embed similar products and propagate their top aspects mined from reviews to the new item to forecast future aspects.

PRODUCT ASPECT FORECASTING
2.1 Problem Definition: A new task
In principle, products receive and aggregate reviews, and aspects are mined from the review text pieces, enabling products to extract and aggregate aspect associations. We now describe the relationships among the three core components, products, aspects, and reviews, in detail, and formally define the product aspect mining and forecasting problems as top-K aspect retrieval problems.
2.1.1 Product aspect mining from review texts. A descriptive aspect a_i is a textual span extracted from a review text sequence r that summarizes a particular attribute or feature of p, the product r belongs to (examples and illustration in Table 1). Given an aspect list A = {a_1, a_2, ..., a_|A|} and r, we define the review-aspect association vector v_r ∈ {0, 1}^|A|, where v_{r,i} = 1 if r and a_i are associated. The aspect list can be obtained by supervised or unsupervised mining from reviews [77,84]. The longer the list, the more descriptive and diverse the aspects it can include.
A product network G = {P, E} is given, with a set of product nodes P and an edge set E representing the similarities between products; A is the weighted adjacency matrix of E. Each product p_j receives reviews and maintains a review list R_j = {r_{j,1}, ..., r_{j,|R_j|}}. Let v_{j,k} be the review-aspect association vector of r_{j,k}, and {v_{j,k}}_{k=1}^{|R_j|} be the collected association list of p_j. The product-aspect association vector u_j ∈ R^|A| can then be obtained by aggregating all the review-aspect association vectors in the review list through AGG(·), a permutation-invariant aggregation function:

u_j = AGG({v_{j,k}}_{k=1}^{|R_j|}).   (1)

The binarized top-K aspect label vector y_j can then be constructed by ranking u_j and marking the top-K product-aspect associations of p_j as positive. Let x_j ∈ R^{d_x} be the node feature of p_j; the product aspect mining task is to use x_j and graphical information in G, such as A, to predict y_j. Note that the number of positive labels can be less than K, resulting in the multi-label nature of y_j.
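The label construction just described, aggregating a product's review-aspect vectors and then taking the top-K, can be sketched minimally. All values below are toy data; the 0/1 association vectors, the sum aggregator for AGG(·), and K = 2 are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

# Review-aspect association vectors v_{j,k} in {0,1}^|A| for one product's
# three reviews over an aspect list of length 5 (toy data).
V = np.array([[1, 0, 1, 0, 0],
              [1, 0, 0, 0, 1],
              [0, 0, 1, 0, 0]], dtype=float)

# Eqn (1): a permutation-invariant aggregator; a plain sum is used here.
u = V.sum(axis=0)                 # product-aspect association vector u_j

# Binarized top-K label: rank u_j and mark the top-K aspects as positive.
K = 2
y = np.zeros_like(u)
y[np.argsort(-u)[:K]] = 1.0       # aspects 0 and 2 (two mentions each)
```

Because several aspects can tie at zero, the number of truly positive labels can fall below K in practice, which is why the task is multi-label rather than single-label.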
Table 1: Three review-aspect association examples are presented to show the different difficulty levels of mining these aspects. The mining difficulty for an aspect depends on its relationship to the related corpus (more challenging from top to bottom).

2.1.2 Product aspect forecasting. At time t, product p_j is observed with reviews up to t, R_{j,t}, and the product-aspect association vector becomes u_{j,t} by replacing R_j with R_{j,t} in Eqn (1). Let the above t be the present time, and x_{j,t} be the node feature of p_j in G_t. Suppose at a future time t' = t + Δ, the product aspect vector and the top-K aspect label vector are u_{j,t'} and y_{j,t'}. The Δ-future product aspect forecasting problem at time t is to predict y_{j,t'} using x_{j,t} and G_t. Moreover, we are interested in forecasting top product aspects y_{j,t+Δ_m} at multiple futures with n_Δ different time lengths Δ ≜ {Δ_m}_{m=1}^{n_Δ} simultaneously, using information up to t. In practice, the accessible information is up to the present t (referred to as 'now'). Setting 'now' to an earlier stage will therefore encounter older products that were then 'new', together with a smaller product network that has fewer products with shorter review lists.

2.2 Motivations and challenges of the problem
2.2.1 Motivation. We illustrate the motivation of the product aspect forecasting problem. In a large e-commerce system, existing products keep receiving reviews from buyers, and sellers continue to release new products (Fig. 1). The high-level goal of this problem is to predict customer feedback for products far into the future: e.g., what will be the top aspects of a new product six months or three years later? This question is easier to answer for products with long histories: they have many customer reviews, so simply extracting all aspects to date and counting the most frequent ones already yields satisfying results. However, for the larger and more important group of newly released products with limited reviews, such a strategy gives much less reliable aspect forecasts, as aspects that currently have low frequency might be frequently mentioned in future reviews. On one hand, customers cannot trust products with inaccurate aspect predictions; on the other hand, sellers worry about long-history products monopolizing the market. A robust and reliable solution to this non-trivial problem therefore relieves the dilemma of limited reviews for new products and benefits both sellers and customers.
2.2.2 Challenges. Forecasting aspects for 'new' products cannot be effectively achieved by trivially aggregating results from aspect mining methods: beyond the subjectiveness of reviews, the number of reviews is limited. Also, the scale of reviews can be very large and the aspect list length |A| can also be large in practice, leading to high sparsity of positive associations and extremely imbalanced aspect frequencies (Table 2). The top-K aspect retrieval formulation therefore inherits this association sparsity and aspect imbalance, making it more challenging.
Another challenge is that the 'new' products are released at different time points, resulting in unaligned review-list evolution processes: reviews keep arriving once a product is released, and different products receive different reviews at different times. Since forecasting aspect trends for 'new' products with limited reviews is the more important and challenging case, trivially training a graph neural network on a single product graph is not effective. The evolving product-review associations and product similarities need to be efficiently extracted, aggregated, and embedded across the whole time range to avoid over-fitting.
The third challenge is that aspect evolution trends differ between shorter and longer time periods: y_{j,t'} reflects the genuinely converged customer feedback when Δ → ∞ but the recently popular opinions of customers when Δ is small. Predicting top aspect trends at different Δ-futures therefore requires capturing different stages of a product's evolution from entirely 'new' (with limited reviews) to 'old' (with adequate reviews). Trivially using a single product embedding at the current time t without further future-length-aware adjustment results in unsatisfactory performance.

METHODS
We propose ForeSeer, a textual mining and product embedding framework progressively trained on temporal product graphs to predict the dynamic product embeddings at a future time point t + Δ using all information up to the current time point t. ForeSeer then exploits the product embeddings to predict the top aspects at that future time. ForeSeer is a three-stage framework (Fig. 3). In the first stage, we exploit contrastive learning to co-embed reviews and aspects, leveraging the aspect semantic information to acquire better review embeddings. We resample review-aspect associations to increase the presence of rare aspects so that the review embeddings are less biased toward frequently appearing aspects. In the second stage, we progressively train the product graph embedding network on the temporal product graph. This lets ForeSeer see more new products and the new reviews they receive, as well as evolving product similarity information, resulting in an efficient product embedding alignment. In the third stage, we develop a novel Product-Aspect Temporal Attention (PATA) module to further adjust product embeddings with the help of the learned aspect embeddings. The PATA module helps forecast future product-aspect associations with explicit consideration of temporal information, thus providing time-sensitive forecasting.

3.1 A base model
We first introduce a base model that extends aspect mining methods to aspect forecasting, and later contrast this base model with ForeSeer to clarify ForeSeer's key technical ideas. Let R = {r_i}_{i=1}^{|R|} be the list of reviews collected from all products. The base model builds a BERT-based textual encoder that is pre-trained on all reviews with the masked language modeling (MLM) objective [10]. The model can then be used to embed r_i as h_i = BERT(r_i). At present time t ('now'), the learned review embeddings can be aggregated to the product p_j they belong to as f_j via an aggregator function. Let R_j be the review list of p_j at time t and r_{j,k} ∈ R_j be one of the reviews of p_j; the aggregation process is:

f_j = AGG({h_{j,k} | r_{j,k} ∈ R_j}).   (2)

The base model then uses x_j = [f_j; SE_j] as the node feature of p_j and trains a graph neural network on the last observed product graph G_t to obtain a product embedding z_j. Here, SE_j is the static product-specific embedding that encodes other product-related information, such as seller descriptions, which does not change over time. Let A_t be the weighted adjacency matrix of G_t, and X be the product node feature matrix stacking x_j for all p_j ∈ G_t. Let σ be an activation function and H^(0) = X; for H^(l), the output of the l-th GNN layer, we have:

H^(l) = σ(A_t H^(l-1) W^(l)).   (3)

Let Z = H^(L) be the resulting embedding matrix of all products after the final layer; its j-th row z_j is the embedding of p_j and is used to predict the top aspects in the future. Suppose we are interested in multiple futures Δ ≜ {Δ_m}_{m=1}^{n_Δ}; we use n_Δ classifier heads {CLS_{z→y,m}}_{m=1}^{n_Δ} to map z_j to the prediction ŷ_{j,t+Δ_m} for each future t + Δ_m. They are finally optimized with a multi-time multi-label cross-entropy objective:

L = Σ_{m=1}^{n_Δ} Σ_j MultiLabelCE(ŷ_{j,t+Δ_m}, y_{j,t+Δ_m}).   (4)
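The base model's forward pass can be sketched end to end in NumPy. Everything here is a toy stand-in: random vectors replace the BERT review embeddings and the static embeddings SE_j, the mean is an assumed choice of aggregator, and all sizes and weights are illustrative rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_aspects, n_futures = 8, 6, 3

# Stand-ins for BERT review embeddings: 4 products with 3/5/2/4 reviews each.
reviews_per_product = [rng.normal(size=(n, d)) for n in (3, 5, 2, 4)]

# Eqn (2): permutation-invariant aggregation (mean) of review embeddings.
F = np.stack([r.mean(axis=0) for r in reviews_per_product])   # (4, d)

# Static product-specific embedding SE_j, concatenated into node features x_j.
SE = rng.normal(size=(4, d))
X = np.concatenate([F, SE], axis=1)                           # (4, 2d)

# Eqn (3): one GNN layer H = sigma(A_hat X W) on a weighted product graph.
A = np.array([[0, .9, .1, 0], [.9, 0, .2, 0],
              [.1, .2, 0, .7], [0, 0, .7, 0]], dtype=float)
A_hat = A + np.eye(4)                                         # add self-loops
A_hat /= A_hat.sum(axis=1, keepdims=True)                     # row-normalize
W = rng.normal(size=(2 * d, d))
Z = np.tanh(A_hat @ X @ W)                                    # (4, d) product embeddings

# Eqn (4): one classifier head per future horizon, sigmoid for multi-label output.
heads = [rng.normal(size=(d, n_aspects)) for _ in range(n_futures)]
preds = [1 / (1 + np.exp(-(Z @ Wm))) for Wm in heads]
```

Each entry of `preds` is a per-product, per-aspect probability matrix for one future horizon; training would minimize the multi-label cross-entropy of these against the future top-K labels.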

3.2 Mining aspects to enhance review embeddings
The major limitation of the base model is that it only pre-trains the language model on review texts instead of explicitly capturing the review-aspect associations. Because reviews are usually long and subjective (Table 1), it is very inefficient to capture signals of potential aspects from the embedding of the entire review text. Also, to have more descriptive and diverse aspects, the aspect list can be very long. In this situation, the positive aspect associations in review texts are very sparse, and the aspect distribution can be highly imbalanced and long-tailed. We thus propose to refine the review embedding by fine-tuning the review encoder to mine aspect associations. The key intuition is to embed the review with a focus on representing sparse aspect associations. Specifically, given a review text r_i, we fine-tune its embedding h_i to capture the review-aspect association vector v_i with a classifier head CLS_{h→v}(·) and a multi-label cross-entropy loss objective:

v̂_i = CLS_{h→v}(h_i),  L_assoc = MultiLabelCE(v̂_i, v_i).   (5)

Note that the association between a review text and its aspects does not change with time; it is therefore compatible to combine this objective with the MLM objective [10] in the BERT pre-training stage. The resulting prediction v̂_i of the review-aspect association can be viewed as part of the refined review embedding. For instance, the updated representation h'_i = [h_i; v̂_i] can be the concatenation of the BERT embedding and the predicted association vector, which better captures sparse aspect association information during review aggregation using Eqn (2).
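The fine-tuning objective and the concatenated refinement can be sketched as follows. The linear-plus-sigmoid head, the random stand-in for the BERT embedding, and all sizes are illustrative assumptions; a real implementation would backpropagate this loss into the encoder.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_aspects = 16, 10

# Stand-in for a BERT review embedding h_i and its sparse aspect vector v_i.
h = rng.normal(size=d)
v = np.zeros(n_aspects)
v[[2, 7]] = 1.0                                   # only 2 of 10 aspects positive

# Classifier head CLS_{h->v}: linear layer + sigmoid -> per-aspect probabilities.
W, b = rng.normal(size=(d, n_aspects)), np.zeros(n_aspects)
v_hat = 1 / (1 + np.exp(-(h @ W + b)))

# Multi-label (binary) cross-entropy over all aspects, as in the objective above.
eps = 1e-9
bce = -(v * np.log(v_hat + eps) + (1 - v) * np.log(1 - v_hat + eps)).mean()

# Refined review representation: concatenate embedding and predicted associations.
h_refined = np.concatenate([h, v_hat])            # shape (d + n_aspects,)
```

The concatenation keeps the dense semantic signal from the encoder while making the sparse aspect evidence directly visible to the downstream aggregation step.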

3.3 Aspect-guided review embedding matching
However, the limitation of the above refinement is that it is unaware that some aspects may carry similar meanings. It thus lacks the ability to efficiently capture tail aspects that have meanings similar to top frequent aspects but are rarer and more diverse. Incorporating class names is useful when classes have their own descriptions [40]. Therefore, our next key improvement is to jointly embed reviews and aspects through cross-domain contrastive learning. The key intuition is that by also embedding aspects, tail aspects can move closer to popular head aspects when they have similar semantic meanings. The resulting aspect embeddings can then guide two review embeddings to be closer if they point to aspects with similar meanings. Moreover, if we treat the base model as a review feature extractor, training a cascaded contrastive learner enables an effective way to train multiple review feature extractors and to incorporate resampling and ensemble techniques.
Specifically, we train multiple feature extractors by instantiating B base aspect extraction replicas with different reweighting factors β_1, ..., β_B in addition to the standard aspect extraction instance (β_0 = 0). We use h_i^(b) to denote the trained features output by the b-th replica following Eqn (5). We then downsample frequent aspects to encourage the model to focus on review-aspect associations from infrequent aspects. For replica b, we first construct a bin for each aspect by assigning to it the reviews with positive associations to that aspect; a review might thus be allocated to multiple aspect bins. We apply the resampling strategy hierarchically, choosing the l-th aspect bin with probability p_l obtained by reweighting and then normalizing the importance of the aspect based on the reciprocal of the β_b-reweighted frequency n_l of aspect a_l:

p_l = n_l^{-β_b} / Σ_{l'} n_{l'}^{-β_b}.   (6)

The reweighting factor β_b leads the model to focus more on tail aspects (large β_b) or head aspects (small β_b); β_b = 0 corresponds to the standard aspect extractor that conducts uniform sampling. A review r_i is then sampled from the chosen aspect bin with an importance weight based on its number of review-aspect associations and the reweighted aspect importance. This resampling strategy makes it possible to ensemble the features from all B feature extractors and to perform mixture-of-experts (MoE) learning in the contrastive learner. We can now exploit contrastive learning to co-embed reviews and aspects so that the review embeddings pay more attention to rare aspects. Specifically, we first build review and aspect embedding networks E_rev and E_asp with the same output dimension d. E_rev simultaneously encodes the embedding of review r_i and the predicted associations from all aspect extraction replicas, and matches them into a calibrated review embedding z_i^rev:

z_i^rev = E_rev([h_i^(0); ...; h_i^(B)]).   (7)

For aspect a_l, we use SpanBERT [22] to get pre-trained features and feed them into E_asp to get the aspect embedding:

z_l^asp = E_asp(SpanBERT(a_l)).   (8)

We then utilize the aspect embeddings z_l^asp to guide the optimization of the calibrated review embedding z_i^rev, using the inner product of z_i^rev and z_l^asp as the association score:

s_{i,l} = ⟨z_i^rev, z_l^asp⟩.   (9)

We finally employ a multi-label cross-domain contrastive loss similar to the NT-Xent loss [56] to maximize the positive associations and minimize the negative associations denoted by v_i:

L_con = -Σ_{l: v_{i,l}=1} log( exp(s_{i,l}/τ) / Σ_{l'} exp(s_{i,l'}/τ) ).   (10)

By jointly learning an informative, low-dimensional embedding space for reviews and aspects, the resulting s_i has two advantages.
First, it efficiently ensembles what is learned by the multiple aspect extraction replicas, thus capturing tail aspects more effectively. Second, by minimizing Eqn (10), s_i encodes the ranking information of the aspects for r_i. This enables a more flexible way to retrieve positive associations: by simply tuning the number of retrieved associations, more (recall-inclined) or fewer (precision-inclined) positive associations can be returned, resulting in a more flexible aspect extraction module.
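The NT-Xent-style scoring and retrieval described above can be sketched with random stand-ins for the learned review and aspect embedding networks; the temperature, K, and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_aspects, tau = 8, 6, 0.1

# Calibrated review embedding and aspect embeddings (random stand-ins for
# the learned networks E_rev and E_asp).
z_rev = rng.normal(size=d)
z_asp = rng.normal(size=(n_aspects, d))
v = np.zeros(n_aspects)
v[[1, 4]] = 1.0                          # positive review-aspect associations

# Inner-product scores between the review and every aspect.
s = z_asp @ z_rev                        # shape (n_aspects,)

# NT-Xent-style multi-label loss: pull positive aspects up, push the rest down.
logits = s / tau
m = logits.max()
lse = m + np.log(np.exp(logits - m).sum())   # stable log-sum-exp
log_softmax = logits - lse
loss = -(log_softmax[v == 1]).mean()

# Retrieval view: rank aspects by score and keep the top-K as predicted positives.
K = 2
top_k = np.argsort(-s)[:K]
```

Raising K retrieves more associations (recall-inclined); lowering it keeps only the most confident ones (precision-inclined), mirroring the flexibility discussed above.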

3.4 Progressive temporal graph-based product embedding
The major limitation of the graph-based product embedding module of the base model is that it is only trained on the last observed product graph snapshot with the product features at that time. In fact, only a small ratio of products are 'new' in the last observed snapshot, which leads the model to overfit severely. We thus propose to progressively train the model on the temporal product graph snapshot series rather than only on the last observed graph. The key intuition is that the model can not only see more 'new' products during training but also become aware of other temporally evolving information, including product similarities and how a product turns from 'new' to 'old' by receiving more and more reviews. Without loss of generality, let τ ∈ T be a time point at which we want to train our model, and let the graph be the snapshot G_τ at that time. To enable progressive training, several fixes are needed. First, the feature vectors of product nodes need to be aware of the evolving temporal information, such as the progressive product-aspect associations until time τ. Given a review r_{j,k} ∈ R_{j,τ}, the review list of p_j at time τ, we aggregate the learned embedding z_{j,k}^rev and the predicted review-aspect association s_{j,k} over all reviews in R_{j,τ} to acquire the product feature f_{j,τ} of p_j at time τ:

f_{j,τ} = AGG({[z_{j,k}^rev; s_{j,k}] | r_{j,k} ∈ R_{j,τ}}).   (11)

We then incorporate the related temporal information of p_j at time τ. Similar to the positional embedding [61], the 2d'-dimensional temporal embedding of a discrete time index t_0 is:

TE(t_0)_{2q} = sin(t_0 / 10000^{q/d'}),  TE(t_0)_{2q+1} = cos(t_0 / 10000^{q/d'}).   (12)

For p_j at time τ, three pieces of temporal information need to be incorporated: its proposed time t_j^p, the training time τ, and the gap between these two times t_j^g = τ - t_j^p. The resulting temporal embedding therefore has three sub-parts:

TE_{j,τ} = [TE(t_j^p); TE(τ); TE(t_j^g)].   (13)

We then compose the overall node feature vector x_{j,τ} for p_j at training time τ from the three parts of features mentioned before:

x_{j,τ} = [f_{j,τ}; SE_j; TE_{j,τ}].   (14)

Let A_τ be the normalized weighted adjacency matrix of G_τ. To enable progressive training, we sequentially train the graph neural network on a subset of graph snapshots {G_{τ_1}, ..., G_{τ_s} | τ_i ∈ T}. We further refine the GNN layer with an extra multi-layer perceptron (MLP) module that focuses on learning product-specific information. Specifically, let W_{τ_{i-1}}^(l) and M_{τ_{i-1}}^(l) be the weight matrices of the l-th GNN layer and the extra MLP module after training the network on G_{τ_{i-1}}, and let H_{τ_i}^(0) = X_{τ_i}; the product embedding process in Eqn (3) is then updated at τ_i as:

H_{τ_i}^(l) = σ(A_{τ_i} H_{τ_i}^(l-1) W_{τ_{i-1}}^(l) + MLP(H_{τ_i}^(l-1); M_{τ_{i-1}}^(l))).   (15)

At present time t, we take Z_t = H_t^(L) obtained on G_t as the resulting product embedding matrix. By progressively updating the network, ForeSeer sees more new products and how they 'grow' from 'new' to 'old', which helps to efficiently align the product embeddings.
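The three-part sinusoidal temporal embedding (proposed time, training time, and their gap) can be sketched as follows. The base constant 10000 follows the standard positional-embedding recipe, and the dimensions and example time indices are illustrative assumptions.

```python
import numpy as np

def temporal_embedding(t0: int, d: int) -> np.ndarray:
    """Sinusoidal embedding of a discrete time index, in the style of
    positional embeddings; returns 2*d dimensions (a sketch)."""
    q = np.arange(d)
    freqs = t0 / (10000.0 ** (q / d))
    emb = np.empty(2 * d)
    emb[0::2] = np.sin(freqs)   # even slots: sine components
    emb[1::2] = np.cos(freqs)   # odd slots: cosine components
    return emb

# Three temporal sub-parts for one product: its proposed (release) time,
# the current training time, and the gap between the two.
t_release, t_train = 77, 400
te = np.concatenate([temporal_embedding(t_release, 8),
                     temporal_embedding(t_train, 8),
                     temporal_embedding(t_train - t_release, 8)])
# te is then concatenated with the review-derived and static features
# to form the overall node feature vector.
```

Encoding the gap explicitly lets the network distinguish a product that is genuinely new from one that is merely observed early in the training window.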

3.5 PATA: Aspect-guided product embedding adjustment for multi-future forecasting
One limitation of the base model is that it uses a single product embedding for all Δ_m future predictions, without any refinement when capturing future product-aspect associations. We thus develop a novel multi-head product-aspect temporal attention (PATA) module to adjust the product embeddings based on the learned aspect embeddings Z^asp. Our intuition is to adjust the product embeddings with a designated attention module guided by the aspect embeddings. This design empowers ForeSeer to predict both short and long future time ranges at lower computational cost. Specifically, given n_Δ target futures, for each Δ_m we set up a query network Q_{Δ_m} and a key network K_{Δ_m} to get the product query matrix from Z_t and the aspect key matrix from Z^asp:

Q_{t+Δ_m} = Q_{Δ_m}(Z_t),  K_{t+Δ_m} = K_{Δ_m}(Z^asp).   (16)

We then calculate the cross-attention score α_{t+Δ_m} between product queries and aspect keys for the Δ_m future:

α_{t+Δ_m} = softmax(Q_{t+Δ_m} K_{t+Δ_m}^⊤ / √d).   (17)

We then set up a shared value network V that takes Z^asp as input and outputs the value matrix V^asp. Let v_l and α_{j,t+Δ_m}, the rows of V^asp and α_{t+Δ_m}, be the value vectors and the attention score vector for p_j; we obtain the final product embedding for the Δ_m future as:

z_{j,t+Δ_m} = z_{j,t} + Σ_l α_{j,t+Δ_m,l} v_l.   (18)

We next exploit the final product embedding to get the final prediction using Eqn (4). With the above design, we acquire a different product embedding for each Δ_m future prediction task. Using a shared value network is especially efficient when n_Δ is large and d ≪ |A|. It also regularizes the model by sharing weights across the multiple imbalanced top-K aspect retrieval tasks.
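The PATA cross-attention can be sketched with per-future query/key projections and one shared value projection. All weight matrices are random stand-ins for the learned networks, the residual-style update is an assumption about how the attention output adjusts the product embedding, and the sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(3)
n_products, n_aspects, d, n_futures = 4, 6, 8, 3

Z_prod = rng.normal(size=(n_products, d))   # product embeddings Z_t
Z_asp = rng.normal(size=(n_aspects, d))     # learned aspect embeddings

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# One shared value projection across all futures.
W_v = rng.normal(size=(d, d))
V = Z_asp @ W_v                                      # shared value matrix

adjusted = []
for m in range(n_futures):
    # A distinct query/key pair per future horizon Delta_m.
    W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
    Q, K = Z_prod @ W_q, Z_asp @ W_k
    attn = softmax(Q @ K.T / np.sqrt(d))             # (n_products, n_aspects)
    adjusted.append(Z_prod + attn @ V)               # future-specific embeddings
```

Only the small query/key projections are duplicated per future; the value network is computed once, which is where the savings come from when the number of horizons grows.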

EXPERIMENT
In this section, we evaluated ForeSeer on a large-scale e-commerce dataset with a variety of tasks, aiming to answer the following research questions (RQs):
• RQ1: How does ForeSeer perform on product aspect forecasting without or with genuinely annotated aspects for reviews?
• RQ2: Is ForeSeer sensitive to aspects with long trends?
• RQ3: How do different components contribute to ForeSeer's success in capturing review-aspect associations?
• RQ4: Is the PATA module in ForeSeer able to capture temporal information for different lengths of future?
• RQ5: How does the product embedding captured by ForeSeer evolve over time?
4.1 Experimental setup
4.1.1 Dataset. We evaluated our ForeSeer framework on a large-scale e-commerce dataset that contains a dynamic product network with 11,536,382 time-stamped review events over a three-year period.
The product network is a series of homogeneous dynamic product similarity graphs with 1,096 daily time-stamped snapshots. The final graph has 11,000 product nodes from 418 different product types based on their names, high-level properties, and usage, such as "Lamp", "TABLE", and "Caddy". Unpublished product nodes are not added to the graph until their proposed times are reached.
The average start date of products is day 77. The first snapshot has 999,614 edges and the final snapshot has 2,400,404 edges. Each edge represents a similarity between 0 and 1 between the two products. We calculate the similarities based on multiple factors, such as user click information. We obtained an aspect list with 10,000 popular aspects extracted via sequential extraction tools [76,77] and 30,217,638 review-aspect associations from these reviews. We constructed the top-K (K = 10) aspect labels for every product node at all timesteps by aggregating all of its review-aspect associations and picking the top-ranked aspects after normalization. Example head and tail aspects with their frequencies over three years are shown in Table 2. While the top 3 aspects each have more than 1,000,000 counts, the tail aspects have only 150 counts among more than 11 million review texts, demonstrating extreme aspect imbalance.
4.1.2 Tasks. We studied ForeSeer with three challenging tasks: single review-aspect association prediction, multi-time aspect forecasting, and multi-time link prediction.
Single review-aspect association prediction: We model the single review-aspect association prediction as a multi-label classification problem.We split 80% of the reviews as the training set and leave the remaining 20% as the test set.

Multi-time aspect forecasting:
To further assess the effectiveness of the learned review associations, we tested the multi-time aspect forecasting task in two settings: (i) ground-truth associations accessible; (ii) only learned approximate associations accessible. For this task, we chose to predict Δ = 3 months, 6 months, and 3 years (the latter always predicting the last observed label) at every test timestep t. We use the label at the final time if the target future time is out of range. We aimed to test inductive performance by randomly splitting out 10% of the nodes as the test set.
Multi-time link prediction: For multi-time link prediction, we chose to predict the link at the current time, 180 days later, and at the final time, at every test time t. The positive edge ratio is 50%.

4.1.3 Comparison approaches.
There is no existing framework that directly handles the multi-time aspect forecasting problem. We therefore designed and compared our method against four types of baselines. The frequency baseline directly aggregates the associations learned in Section 3.2 at time t and uses them as the prediction. It cannot incorporate other information, and its performance depends heavily on the quality of the aspect associations learned from reviews; we use it to assess the effectiveness of the multi-time forecasting PATA module. The MLP baseline directly predicts the multi-time future from product features without graphical information. Recurrent baselines (GRU, LSTM) predict the multi-time future step by step using an autoregressive network structure and also ignore graphical similarity information; with these, we assess the importance of using graphical information. The graph-based baseline (GNN) leverages only the final graph G_t; with it, we assess the effectiveness of our progressive training process at multiple Δ-futures. The weakly supervised baseline (OA-Mine) directly mines candidate aspects from reviews and aggregates the candidates as the final predictions; with it, we assess the effectiveness of the learned review-aspect associations and the importance of learning-based models, as this baseline does not leverage review-aspect association information.
For multi-time link prediction at different Δ-futures, we only compare our method with graph-based baselines, since the other baselines cannot handle this task. We built an extra lightweight GNN for multi-task link prediction for our method, and a GNN baseline with the same network structure. For our method, we used features learned from our multi-head product-aspect temporal attention module for the different futures as the input for the corresponding future time predictions. For the graph-based baseline, we used the same features for all future time predictions.
4.1.4 Implementation details. For the aggregation function AGG(·) used to construct future aspect labels, we used summation with a per-aspect z-score normalization operation. Specifically, for each aspect, we computed the sum of its observed associations for every product at the end of the time period and calculated the mean and standard deviation over products. We then performed z-score normalization on the sum of the associations of this aspect for every product at time t. We used AGG(·) as the aggregator to obtain the aggregated feature f_{j,t} of product p_j at time t. We maintained both the predicted logits and the binarized association predictions from s_{j,k}. We incorporated a pre-trained product embedding obtained from a multi-modality neural model that encodes information from the product title and descriptions to construct the auxiliary product embedding SE_j.
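The per-aspect z-score normalization used in label construction can be sketched as follows; the counts are toy data and the small epsilon guard is an implementation assumption.

```python
import numpy as np

# Toy product-aspect association counts: rows are products, columns are aspects.
# Without normalization, the head aspect (first column) would dominate every
# top-K label.
counts = np.array([[120., 3., 0.],
                   [ 90., 1., 2.],
                   [200., 0., 1.]])

# Per-aspect z-score: for each aspect, standardize its summed associations
# across products so head and tail aspects become comparable.
mean = counts.mean(axis=0)
std = counts.std(axis=0) + 1e-9          # guard against zero variance
normalized = (counts - mean) / std

# Top-K label construction: rank the normalized scores per product.
K = 2
top_k = np.argsort(-normalized, axis=1)[:, :K]
```

After normalization, an aspect is labeled positive when a product mentions it unusually often relative to other products, rather than merely because the aspect is globally frequent.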
For simplicity, we only trained one resampling replica with β = 0.5 and matched its embedding with the embedding learned by the standard aspect extraction replica (β = 0). We used AdamW as the optimizer with a learning rate of 10^-4 to train the aspect-guided embedding matching model for 10 epochs on four 16GB Nvidia V100 GPUs, and set the number of retrieved positive associations to 5 for better precision. We used a two-layer MLP with ReLU for all the embedding networks and a linear projection layer for all query and key blocks. All embedding, query, and key output dimensions are 100. To avoid over-smoothing, we used a GNN with one designed graph neural layer and two cascaded MLP layers, and the same network structure for the GNN baseline. For the MLP baseline, we used a two-layer MLP with a hidden size of 128; for the GRU and LSTM baselines, we used 2 layers with a hidden size of 128. For the lightweight GNN in the multi-task link prediction task, we used a standard GraphSAGE layer and one cascaded MLP layer. We used Adam as the optimizer with a learning rate of 10^-3 and trained all aspect forecasting and link prediction tasks for 200 epochs on a 16GB Nvidia V100 GPU.
4.2.1 Improvement on aspect forecasting without annotated aspects (RQ1). We first investigated the performance of ForeSeer on aspect forecasting when aspects are not annotated in the reviews (Table 3). Our method achieved the best performance on all three time gaps (3 months, 6 months, and 3 years), indicating that our temporal graph embedding framework can effectively model the dynamics of aspects and products. For instance, ForeSeer obtains 63.1% and 49.1% improvements on 90-day-gap and 3-year-gap forecasting, respectively. The frequency baseline did not perform well on this task, especially at larger time gaps, further confirming the importance of aligning products temporally. The GNN baseline also performed poorly, demonstrating the effectiveness of progressive
training on avoiding over-fitting.We found that the MLP, GRU, and LSTM baselines performed badly on all future forecasting, necessitating the importance of graphical information.We also found that two temporal baselines outperform the MLP baseline, suggesting the importance to adjust product embedding for multi-time forecasting.We found that OA-Mine, one of the state-of-the-art approaches in aspect mining, obtained a less prominent performance, indicating that aspect mining methods cannot be applied to this novel aspect forecasting task.Finally, we noticed that graph-based approaches GNN, in general, performed better than methods that do not consider graphs, necessitating the consideration of graph dynamics in this task.
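The embedding networks described in the implementation details above (a two-layer MLP with ReLU, plus plain linear projections for the query and key blocks, all with output dimension 100) can be sketched in NumPy. The input dimension, initialization scale, and all names here are our own illustrative assumptions, not the paper's released code:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_embed(x, w1, b1, w2, b2):
    """Two-layer MLP with ReLU, the shape used for all embedding networks."""
    h = np.maximum(x @ w1 + b1, 0.0)  # ReLU hidden layer
    return h @ w2 + b2

d_in, d_hid, d_out = 768, 100, 100  # d_in is an assumed input feature size
w1, b1 = 0.02 * rng.normal(size=(d_in, d_hid)), np.zeros(d_hid)
w2, b2 = 0.02 * rng.normal(size=(d_hid, d_out)), np.zeros(d_out)

review_feat = rng.normal(size=(4, d_in))      # a toy batch of 4 review features
emb = mlp_embed(review_feat, w1, b1, w2, b2)  # (4, 100) embeddings

# Query/key blocks are plain linear projections of such embeddings.
wq = 0.02 * rng.normal(size=(d_out, 100))
query = emb @ wq
```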
4.2.2 Improvement on aspect forecasting with annotated aspects (RQ1, RQ2). We next evaluated an easier setting where aspects are annotated in each review. In this setting, the frequency baseline achieved very good performance, since it simply counts the aspects in current reviews and uses them to predict future aspects. For aspects that are not time-sensitive, the frequency baseline can be regarded as an upper bound. Nevertheless, ForeSeer still achieved good performance in this setting: it showed comparable performance with the frequency baseline on a 6-month gap and even outperformed it on a 3-year gap. All other comparison approaches obtained much less promising performance. This indicates that ForeSeer can match the frequency baseline on aspects that are not time-sensitive and substantially outperform it on aspects that are time-sensitive. To further examine this (RQ2), we illustrate a case study showing how ForeSeer can effectively recognize long-term aspects (Table 4). ForeSeer correctly forecasted time-sensitive aspects such as "a little goes a long way" and "sensitive skin", which the frequency baseline failed to identify, confirming ForeSeer's ability to forecast long-term future aspects.

4.2.3 Improvement on predicting review-aspect association with embedding matching (RQ3). Next, we examined the importance of the aspect-guided embedding matching strategy by performing an ablation study on review-aspect association prediction (Fig. 5).
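The frequency baseline discussed above, counting the aspects observed in current reviews and ranking them as the prediction for future reviews, is simple enough to sketch in full (a hedged illustration; the function name and top-k cutoff are our choices):

```python
from collections import Counter

def frequency_baseline(current_review_aspects, top_k=3):
    """Predict future aspects by ranking aspects seen so far by raw count."""
    counts = Counter(a for review in current_review_aspects for a in review)
    return [aspect for aspect, _ in counts.most_common(top_k)]

# Toy input: each inner list is the aspect set of one existing review.
reviews = [["soft", "comfy"], ["soft", "warm"], ["soft", "comfy"], ["cheap"]]
pred = frequency_baseline(reviews)  # "soft" ranks first, then "comfy"
```

Because the ranking never changes until new reviews arrive, this baseline is strong for stable aspects but, as the results show, degrades as the forecasting gap grows and time-sensitive aspects emerge.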
We observed limited precision when using either the with-resampling or the without-resampling multi-label classification instance alone, as each either focuses too heavily on tail aspect associations or largely fails to capture them. In contrast, we observed substantial improvements from our aspect-guided embedding matching framework, which successfully exploits the advantages of both multi-label classification instances. For example, the precision of our precision-inclined embedding matching model (retrieving 5 positive associations) is 22% and 34% higher than that of the with- and without-resampling instances, respectively. This superior precision suggests that the learned review embeddings, guided by aspects, capture less-biased features for both head and tail aspects, which is critical to their success.

Figure 5: Bar plot showing the performance of our embedding matching approach and comparison approaches on review-aspect association prediction, evaluated using precision, F1 score, accuracy, and recall.
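One way to picture the evaluation in this ablation: score review-aspect pairs by embedding similarity, keep the top-N aspects per review, and measure precision against the gold associations. The sketch below is our illustration of that retrieval-style scoring, not the exact model:

```python
import numpy as np

def retrieve_top_aspects(review_emb, aspect_embs, n=5):
    """Return ids of the n aspects most cosine-similar to the review embedding."""
    sims = aspect_embs @ review_emb / (
        np.linalg.norm(aspect_embs, axis=1) * np.linalg.norm(review_emb) + 1e-9)
    return [int(i) for i in np.argsort(-sims)[:n]]

def precision_at_n(pred_ids, gold_ids):
    """Fraction of retrieved aspect ids that are true associations."""
    return len(set(pred_ids) & set(gold_ids)) / len(pred_ids)

# Toy data: 4 aspect embeddings in 3-d; the review is closest to aspects 0 and 1.
aspects = np.array([[1.0, 0.0, 0.0],
                    [0.9, 0.1, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])
review = np.array([1.0, 0.0, 0.0])
pred = retrieve_top_aspects(review, aspects, n=2)
p = precision_at_n(pred, gold_ids={0, 2})  # one of two retrieved is gold
```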

4.2.4 Improvement on predicting future links (RQ4). After confirming the superior performance of ForeSeer on aspect forecasting, we next investigated whether ForeSeer's PATA module can capture temporal information for multiple futures by letting it predict product-product edges that might emerge at multiple future times (Table 5). We adopted a multi-task prediction setting for efficient training and inference, where predictions for different time gaps are treated as different tasks. We observed a consistent improvement of our method over the GNN baseline across different time gaps. In particular, the GNN baseline cannot generalize well on the intermediate-to-long future range (6 months) because it does not consider the dynamics when simultaneously predicting multiple time gaps. In contrast, our model addresses this by explicitly modeling the temporal dynamics of products, confirming that ForeSeer can effectively capture both product and aspect dynamics.

4.2.5 Contrasting static and temporal embeddings (RQ5). We further visualized the dynamic product embedding space, colored by product type at different time steps, to assess how ForeSeer captures temporal patterns (Fig. 4). As the time step grows, we observed increasingly visible clustering patterns among products, suggesting that the quality of the product embeddings improves as data accumulates through modeling the temporal dynamics. We also noticed that product embedding similarity reflects edges in the product-product graph ("shelf" and "caddy"; "lamp", "light fixture", and "string light"), indicating that these product embeddings successfully encode graph information.
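The multi-task setting just described, one link-scoring head per time gap on top of node embeddings (shared features for the baseline, per-future PATA features for our method), can be sketched as follows. The head form (element-wise product plus a per-gap weight vector and sigmoid) and all dimensions are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def link_score(z_u, z_v, w_gap):
    """Score a future edge (u, v) for one time gap with a per-gap weight vector."""
    return sigmoid(np.dot(z_u * z_v, w_gap))

rng = np.random.default_rng(1)
dim, gaps = 16, ["90d", "180d", "3y"]
heads = {g: rng.normal(size=dim) for g in gaps}  # one prediction head per gap
z = rng.normal(size=(5, dim))                    # embeddings for 5 products

# Score one candidate product-product edge for every future time gap at once.
scores = {g: link_score(z[0], z[1], heads[g]) for g in gaps}
```

With shared input features, only the heads distinguish the gaps; feeding each head gap-specific temporal features, as ForeSeer does, is what lets the intermediate horizons generalize.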

RELATED WORKS
Product aspect mining. Few works try to forecast aspects for products; most previous works aim only at product aspect mining. Mining product aspects from large-scale commercial data has been a well-studied problem [7,15,25,53,77]. Previous works target the problem using rule-based methods [7,25,53], supervised-learning-based extractors [4,18], or propagation-based methods [35,46]. These works are less generalizable to product aspect forecasting, as they require either domain-specific features or supervision signals from downstream tasks such as sentiment or opinion labels. More recently, other works extract descriptive product attributes from seller-provided product profiles using sequential supervised textual span labeling [29,64,70,76,93] or distant supervision [65,83]. OA-Mine [84] leverages a weakly supervised seed set to mine aspects from product titles. These methods need either costly, massive sequentially labeled datasets or a pre-designed hierarchical attribute taxonomy, which is hard to obtain for customer-centered aspects. Zero-shot aspect extraction is another recently emerged related topic that aims to extract aspects in new domains without annotated data by leveraging transfer learning [17], natural language inference [54], document sentiment classification [9], and the recently emerged large-scale ChatGPT [44] and GPT-4 [2]. However, "zero-shot" here means that no prior knowledge of the new domain is needed, not that the number of reviews per product is limited; these methods are thus not directly applicable to forecasting tasks. In summary, all the methods above focus on text-sequence-level extraction and do not efficiently aggregate learned attributes at the product level. They are thus useful for constructing the aspect list for product aspect forecasting, but are not able to forecast future aspect trends. Unlike these methods, ours is the first framework that can forecast aspects that might be mentioned in future reviews for a new
product with a limited number of reviews.
Label-guided classification. Label-guided text classification has been studied in fields such as social recommendation and document classification [6,32] and has been proven beneficial for classification performance [37,68]. While these methods focus on classification, our method provides an efficient cross-domain contrastive learning framework that can easily ensemble such models as expert multi-label classification instances. Compared to [63], which also proposes to mix multiple encoder instances, our representation learning scheme allows the model to also produce informative review and aspect representations that help downstream tasks.
Temporal graph learning. Temporal graph representation learning has been intensively studied [14,49,74]. Previous works focus either on graph structure [28,43,52,87,91] or on temporal dynamics [21,38,41,59,60,72]. Dynamic graph structure is captured via graph adjacency dynamics [12,30], skip-gram-based modeling [11,39,95], clique information [16,81,90], or step-by-step embedding updating [3,13,31,33,47,55,92]. These methods focus only on capturing implicit or short-range graph dynamics and cannot efficiently model large-scale review-aspect associations and evolving product-review associations. Some methods are designed specifically for heterogeneous networks [19,66,82,85,86]. Temporal-dynamics-focused methods leverage recurrent models [1,8,23,67,94], continuous point process modeling [5,50], and self-supervised graph representation learning [57,58]. These methods focus on modeling interaction events between nodes, but are not suitable for modeling dynamic product similarities and massive product-review associations. Some methods leverage self-attention, hierarchical attention, or temporal attention [20,36,51,69,75,78,89], but only for temporal information aggregation, whereas our PATA module leverages attention to predict multiple future time ranges simultaneously. Other recent works such as EvolveGCN [45] and Roland [80] incorporate recurrent modules and efficient graph learning. These frameworks can be easily adapted to our progressive training framework, while we aim to clearly show that progressive training is the key to success in aspect forecasting with a simple yet clear formulation. JODIE [27] is the closest work in that it can also predict multiple future time ranges with a time projection module; however, it is based on interaction events and is designed for heterogeneous user-item networks, making it impractical for our use case. In summary, most existing works cannot be directly applied to the product aspect forecasting task because of training efficiency issues and their inability to handle large-scale reviews.

CONCLUSION
In this paper, we have studied a novel task, product aspect forecasting, which aims to predict aspects that users might mention in future reviews. We have proposed ForeSeer, a novel framework that solves this task by dynamically embedding products, reviews, and aspects. We have evaluated our method on a large-scale real-world dataset and observed ForeSeer's superior performance on aspect forecasting and link prediction. In the future, we are interested in boosting ForeSeer with more advanced progressive training strategies, in applying ForeSeer to other temporal graph embedding problems such as modeling biological signaling pathways, and in exploring how ForeSeer can assist the classic task of aspect mining, allowing us to apply ForeSeer to a larger number of e-commerce applications, such as integration with recommendation systems to enhance product suggestions based on anticipated aspect preferences.

Figure 1 :
Figure 1: Problem setting of aspect forecasting. The product graph is evolving because new product nodes and edges are added to the graph. The edge weights also change over time.

Figure 2 :
Figure 2: The key idea of ForeSeer is to find and embed similar products dynamically and propagate their top aspects, mined from reviews, to the new item to forecast future aspects.

Figure 3 :
Figure 3: ForeSeer first resamples review-aspect associations and co-embeds aspects and reviews. It then progressively embeds products leveraging temporal product graphs. Finally, it exploits temporal information to adjust product and aspect embeddings to provide multi-time future forecasting.

Figure 4 :
Figure 4: UMAP plots visualizing product embeddings at day 0 (a), day 30 (b), and day 300 (c), and an aggregation of the product embeddings at all three time points (d).

Table 2 :
Head and tail aspect examples with frequency information over a three-year period.
Predicting future aspect trends for products. In practice, products are released at different times and start receiving reviews over time. Consider a discrete time period T = {0, . . . , T}: the product network is evolving, and G_t = {P_t, E_t} is the product network snapshot at time t ∈ T. The products p_i ∈ P_t and their reviews are also expanding over time. Any review r_{i,j} in the review set R_{i,t} = {r_{i,1}, . . . , r_{i,|R_{i,t}|}} of product p_i also has a posting time.

Aspect examples: 1. Wearing this soft shirt is relaxing... 2. Quality is good. The inside of it is really soft... 3. It's soft, comfy but not warm...

Table 3 :
Performance of ForeSeer and comparison approaches on aspect forecasting at three time gaps (3 months, 6 months, and 3 years), under both the with-annotated-aspects and without-annotated-aspects settings.

Table 4 :
Case study showing the aspects we predicted for a skin moisturizer. Our method identified time-sensitive phrases (red) while the frequency baseline failed to identify them.