Research Article · Open Access

Investment and Risk Management with Online News and Heterogeneous Networks

Published: 27 March 2023


Abstract

Stock price movements in financial markets are influenced by large volumes of news from diverse sources on the web, e.g., online news outlets, blogs, and social media. Extracting useful information from online news for financial tasks, e.g., forecasting stock returns or risks, is, however, challenging due to the low signal-to-noise ratios of such online information. Assessing the relevance of each news article to the price movements of individual stocks is also difficult, even for human experts. In this article, we propose the Guided Global-Local Attention-based Multimodal Heterogeneous Network (GLAM) model, which comprises novel attention-based mechanisms for multimodal sequential and graph encoding, a guided learning strategy, and a multitask training objective. GLAM uses multimodal information and heterogeneous relationships between companies, and leverages the significant local responses of individual stock prices to online news, to extract useful information from diverse global online news relevant to individual stocks for multiple forecasting tasks. Our extensive experiments with multiple datasets show that GLAM outperforms other state-of-the-art models on multiple forecasting tasks and on investment and risk management case-studies.


1 INTRODUCTION

Financial forecasting tasks, i.e., forecasting financial time-series such as stock returns or volatilities at the next timestep or some time horizon in the future, are more challenging than other forecasting tasks due to the low signal-to-noise ratios and the non-stationary nature of financial time-series distributions and inter-series relationships [19]. Financial time-series are also influenced by diverse sources of information, such as local (company-specific) stock prices, global (non-company-specific) news, and network effects due to inter-company relationships, i.e., networks comprising company nodes with inter-company relationships forming edges between the company nodes. In this article, we address these challenges and propose a model designed for multiple financial forecasting tasks.

While structured numerical information has traditionally been used as information for forecasting stock price movements, the influence of unstructured textual news information on stock price movements has also been studied in recent works [1, 11, 23, 43, 55]. These works show that both traditional and online textual news contain valuable information that can be used to forecast stock price movements. Extracting useful information from online news for financial tasks is, however, more challenging than from structured numerical information or traditional textual news information for a number of reasons.

First, sources of online news are more diverse, ranging from online news outlets to blogs and social media, with vastly different writing styles. Second, the signal-to-noise ratios of online news are lower. A larger proportion of online news is of low quality or false, as most online news is not subject to the same editorial processes and controls as traditional news. Finally, the relevance of each online news article to individual stock prices is difficult to determine, even for human experts, due to the high volume, velocity, and variety of online news.

To address the above challenges, this article introduces several key ideas premised on the following observations relating to online news information, financial time-series, and inter-company networks:

  • First, unlike a piece of structured information such as stock price and trading volume histories, which is explicitly associated with specific companies and thus local in nature, a piece of unstructured online news may not be associated with specific companies but may be relevant to multiple companies, an entire industry sector, or the entire market. Hence, we consider such online news to be global. For example, a published news article on disruptions in the operations of a major semiconductor supplier can affect not only other companies in the semiconductor sector but also companies in downstream industries such as automobiles and computing equipment. Hence, global information could influence local information. The degree of this influence differs with the relevance and quality of the global information, e.g., high-quality news on a company that is verified to be true will have a more sustained effect on the company’s stock prices than low-quality news that turns out to be false. Local information may also influence global information, e.g., a sustained decline in the stock prices or trading volumes of a key company in a property sector could lead to news articles discussing the vulnerabilities of the entire sector. Such mutual effects of local and global information could be leveraged for the extraction of relevant information.

  • Second, global and local information often come from different modalities, e.g., local numerical price-related information and global textual news information, as mentioned earlier. As the evolution of local stock-specific information from one modality could provide important signals for extracting relevant global information from another modality, a multimodal approach to modeling them is necessary.

  • Third, the ex-post effects of local information can provide valuable signals that can be utilized to extract relevant global information and address the low signal-to-noise ratio. For example, a personal scandal involving the CEO of Company A might not be related to its core business, but could spark an online-led boycott of Company A’s products and lead to a significant decline in its stock price. Such significant ex-post responses of local information, i.e., stock prices, therefore provide an important set of signals that can be isolated to learn the relevance of different global information to each company.

  • Fourth, different types of inter-company relationships and linkages capture different influences among companies. For example, a negative online news article about Company B could have negative effects on the stock prices of all its suppliers and customers, but positive effects for its known competitors; a regulatory crackdown on Company C in its home country due to a shift in government policy could negatively affect companies of the same industry in that country but positively affect companies of the same industry in other countries. Such heterogeneous relationships between companies, both direct and indirect, can provide structure for learning the relevance of different online news to individual stock prices.

We illustrate these observations with an example in Figure 1. News of the suspension of Alibaba’s Ant Financial’s initial public offering (IPO) in October 2020 led to a significant decline in the stock price of Alibaba (green line) thereafter. Stock prices of its competitors in the technology sector—Baidu (blue line), Amazon (orange line), and Google (red line)—rose, but to varying degrees after the event. Baidu, being a competitor to Alibaba in the same home country, i.e., China, appeared to increase the most (see the different types of relationships in Figure 1). The longer-term effect of this news on the stock prices of the different companies also varied: the decline in Alibaba’s stock price and the rise in Baidu’s stock price were more sustained, whereas the changes in the stock prices of Google and Amazon leveled off after November 2020. The stock price of Pfizer (purple line), a pharmaceutical company without direct links to Alibaba, was relatively unaffected by this news.

Fig. 1. Motivating example: Effects of news relating to Alibaba’s Ant Financial’s initial public offering (IPO) being suspended on the immediate stock prices of different companies varies based on heterogeneous relationships. The significance and duration of such effects also varies based on the news content and such heterogeneous relationships.

In our literature survey, we have noticed several limitations of existing works. Most existing works in this area model financial information of a single modality [13, 23, 43] and do not model the mutual relevance of information from different modalities, or the effects of heterogeneous inter-company relationships. Some works [17, 55] model both unimodal financial information and the effects of inter-company relationships, but not multimodal information. Ang and Lim [1] utilize multimodal numerical and global textual information as well as inter-company relationships but do not capture the heterogeneity of inter-company relationships. None of these works explicitly isolates the more significant ex-post responses of stock prices to learn the relevance of different global information to each company.

Most existing works also focus on forecasting stock prices or returns for trading decisions at the next timestep and do not study the effect of online news on the dynamics of stock prices over a longer future horizon covering multiple timesteps [23, 43]. Hence, another set of key ideas motivating our article relates to the need for a multivariate multitask setting for forecasting the dynamics of stock prices over a longer future horizon. Such a setting is important for investment and risk management applications such as portfolio management and risk forecasting. Investment and risk managers are interested not only in investment returns but also in investment risks; because they make decisions over longer-term horizons, they are interested in the expected returns and risks (volatilities) of stocks over such horizons, rather than just stock price movements at the next timestep. They also manage large numbers of stocks in portfolios and are therefore interested in how changes in correlations between stocks affect overall portfolio returns. Financial forecasting for investment and risk management thus naturally involves a multivariate multitask setting: managing the returns and risks of portfolios comprising many stocks requires forecasting stock mean returns and risks over a future horizon to balance potential returns and risks when making investment decisions, as well as forecasting correlations between stocks in portfolios over that horizon. Designing a model for a multivariate multitask setting has further potential advantages: complementary information from other variables and related tasks can improve overall forecasting performance, and the risk of over-fitting on any one task is lowered. Such forecasts can also be utilized in portfolio allocation optimization [35] and Value-at-Risk (VaR) [32] forecasting applications.
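To make the multivariate multitask targets concrete, the following sketch (with illustrative array shapes and synthetic data, not the paper's implementation) computes mean, volatility, and pairwise-correlation targets over a future horizon from a matrix of stock returns:

```python
import numpy as np

def multitask_targets(returns, t, horizon):
    """Compute mean, volatility, and correlation targets over [t, t+horizon].

    returns: (num_stocks, num_timesteps) array of percentage returns.
    """
    window = returns[:, t:t + horizon + 1]     # future horizon [t, t+K']
    means = window.mean(axis=1)                # expected return per stock
    vols = window.std(axis=1, ddof=1)          # risk (volatility) per stock
    corr = np.corrcoef(window)                 # pairwise inter-stock correlations
    return means, vols, corr

# Toy example: 3 stocks, 10 timesteps of synthetic returns
rng = np.random.default_rng(0)
r = rng.normal(0.0, 0.01, size=(3, 10))
means, vols, corr = multitask_targets(r, t=4, horizon=5)
assert means.shape == (3,) and vols.shape == (3,) and corr.shape == (3, 3)
```

A model trained in the multivariate multitask setting described above would forecast all three of these quantities jointly rather than predicting a single next-step price.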

To address the above-mentioned challenges in utilizing global online news information in a multivariate multitask setting for investment and risk management based on these key ideas, we propose the Guided Global-Local Attention-based Multimodal Heterogeneous Network (GLAM) model. GLAM incorporates several important components: (i) a time-sensitive global-local transformer to learn relevant global online text information and sequentially encode multimodal information (i.e., time-series stock prices and time-stamped news articles); (ii) an attention-based heterogeneous network encoder to leverage heterogeneous inter-company relationships and future correlations; coupled with (iii) auxiliary channels for guided learning from significant ex-post effects of online news. GLAM is trained in a multivariate setting on multiple tasks—forecasting means, volatilities, and correlations over a future horizon. We also demonstrate how such forecasts could be used for portfolio allocation optimization and risk management applications in case-studies. Our key contributions are as follows:

  • To our knowledge, this is the first work to propose a model for capturing global and local information from multiple modalities and heterogeneous networks for multivariate multitask financial forecasting tasks for investment and risk management applications.

  • We propose a time-sensitive global-local transformer module that encodes sequences of global textual information, local numerical information, and associated time features jointly and extracts the relevant global information.

  • To improve extraction of relevant global information, we couple the global-local transformer module with auxiliary channels that enable the significant changes in stock dynamics to be isolated for guiding the learning of global information relevant to each company.

  • We design a heterogeneous network encoding module that uses different types of inter-company relationships to propagate multimodal sequential information across companies. The heterogeneous network encoding module also leverages correlation forecasts to improve parameter learning.

  • We train the model on multiple forecasting tasks to lower the risk of over-fitting and demonstrate the effectiveness of GLAM on forecasting tasks and real-world applications against state-of-the-art baselines on real-world datasets.


2 RELATED WORK

As this work involves time-series forecasting and network learning, we review key related works in these areas.

2.1 Stock Price Forecasting Using Time Series Modeling

Classical methods, including the univariate Autoregressive Integrated Moving Average (ARIMA) [47] and Generalized AutoRegressive Conditional Heteroskedastic (GARCH) [5] models, and the multivariate Vector Auto-Regressive (Vector AR) [34] and Dynamic Conditional Correlation (DCC)-GARCH [14] models, are commonly applied to time-series forecasting. However, such classical methods are designed for numerical data, not unstructured textual information.

To learn time-series information in a data-driven manner and to capture other types of information, deep learning models have been increasingly applied to time-series forecasting. They include feed-forward networks [8, 10, 11, 36, 58], convolutional neural networks [2, 6, 38, 51], recurrent neural networks [18, 31, 33, 41, 42], and transformers [53, 60]. A detailed review of these works can be found in Faloutsos et al. [16], Jiang [24], Lim and Zohren [30], Özen et al. [37], Petropoulos et al. [40], Torres et al. [46]. TST [60] is a recent model based on the transformer encoder architecture. It is, however, designed for local numerical information. A number of recent works have studied the use of textual information from traditional and online news [1, 11, 13, 23, 43, 55] for financial forecasting. Most of these works [23, 43] utilize news articles that have been manually tagged to specific companies as inputs, i.e., local textual information. FAST [43] is a recent model that uses Time-aware LSTMs [3] to encode sequences of local textual news information. HAN [23] utilizes attention mechanisms to learn the importance of each local news article and each timestep. SE [13] is a recent model that does not manually assign company tags to news articles but instead uses the dot product of stock embeddings and news representations to extract relevant global news via a data-driven approach. SE utilizes bidirectional GRUs to encode unimodal textual information but does not capture numerical information or inter-company relationships.

2.2 Heterogeneous Network Learning and Stock Price Forecasting

Graph Neural Networks (GNN) compose messages based on network features and propagate them to update the embeddings of nodes and/or edges over multiple neural network layers [20]. In particular, Graph Convolutional Network (GCN) [27] aggregates features of neighboring nodes and normalizes aggregated representations by node degrees. Graph Attention Network (GAT) [50] assigns neighboring nodes with different importance weights during aggregation. Such GNNs are designed for homogeneous networks with static node attributes and cannot be directly applied to heterogeneous networks where attributes are evolving time series.

GNNs have also been applied to heterogeneous networks. Relational Graph Convolutional Networks (RGCN) [44] and Graph Convolutional Matrix Completion [48] use multiple GCNs to encode embeddings of multiple adjacency matrices, one for each edge type, before aggregating them. Heterogeneous Graph Attention Network [52] and General Attributed Multiplex Heterogeneous Network [7] use multiple GNN-based layers to encode networks formed from different metapaths [12] before using an attention mechanism to aggregate the embeddings. HGT [22] uses attention mechanisms to encode sub-graphs with different node and edge-types iteratively. Similarly, such GNNs are designed for heterogeneous networks with static node attributes and cannot be directly applied to heterogeneous networks where attributes are evolving time series. GNN models have been designed for spatio-temporal networks where the nodes have time-varying attributes [9, 28, 59, 61]. However, these models are designed for traffic-related numerical information and not suitable for networks where the node attributes are multimodal financial time series.

A few recent works extend different GNNs to prediction tasks on financial time-series data. RSR [17] uses an LSTM to generate representations for local numerical time-series stock prices before feeding these representations into a GCN-based model to learn stock embeddings in a heterogeneous network, but it does not consider global textual information. Reference [55] captures heterogeneous relationships using an RGCN-based model but is designed for forecasting with different types of company announcements rather than news. Our earlier work KECE [1] captures numerical and global textual information and uses a GAT-based model to capture homogeneous inter-company relationships but does not capture the heterogeneity of inter-company relationships. KECE also uses the dot product of stock embeddings and news representations to extract relevant global news information, rather than the transformer-based attention mechanisms proposed in this work.

In general, these related forecasting and network learning works are also not designed for the multitask setting of forecasting means, volatilities, and correlations of stock returns that are important for investment and risk management applications. They are designed for single tasks, and either predict prices or returns or price movement directions for trading applications. Further, they do not isolate the significant ex-post changes in local stock dynamics to improve the learning of global information relevant to each company.


3 GUIDED GLOBAL-LOCAL ATTENTION-BASED MULTIMODAL HETEROGENEOUS NETWORK MODEL

GLAM represents companies in a network \(G=(V,E,X)\), where V represents a set of company nodes, E consists of edges of R different relationship types, i.e., \(E= E_{1} \cup \cdots \cup E_{R}\), and X represents sequences of multimodal numerical and textual attributes. In this article, we utilize heterogeneous relationships between companies extracted from Wikidata knowledge graphs. Other inter-company relationships, e.g., based on domain knowledge, can also be used, but will be explored in future work. Given a timestep t, we define the local numerical features of a company \(v_j\), \(X^{num,local}_{j}(t)\), to be the sequence of numerical price-related data associated with \(v_j\) over a window of K timesteps up to \(t-1\), i.e., \(X^{num,local}_{j}(t) = [x^{num,local}_{j}(t-K), \ldots ,x^{num,local}_{j}(t-1)]\). We use \(X^{num,local}(t)\) to represent the \(|V| \times K\) matrix containing the \(X^{num,local}_{j}(t)\) of all companies. The pre-encoded textual news features that are global in nature and not associated with any company are denoted as \(X^{txt,global}(t) = [x^{txt,global}(t-K), \ldots ,x^{txt,global}(t-1)]\) over the same window period \([t-K,t-1]\), with a varying number of news articles \(|N_{t-k}|\) at each timestep \(t-k\) in \([t-K,t-1]\). Alternative local and global inputs are also possible within our framework, e.g., local social media information such as tweets from a company’s social media account, and global economic indicators such as the gross domestic product of the countries of a company’s key markets; however, we focus on global online news and local stock-price-related numerical information in this article and will explore other inputs in future work.

As shown in Figure 2, the Guided Global-Local Transformer (GLT) module in GLAM first utilizes local numerical information \(X^{num,local}(t)\) (over the time window \([t-K,t-1]\)) to extract and sequentially encode relevant global textual information \(X^{txt,global}(t)\) in the same time window. The resultant sequence of representations for each company (\(H_i(t)[t-K], \ldots , H_i(t)[t-1]\)) is then used as input to a Heterogeneous Network Encoding (HNE) module that captures the heterogeneous relationships between companies. GLAM finally generates forecasts of means, volatilities, and correlations of financial returns of each company \(v_i\) over a selected future horizon \([t,t+K^{\prime }]\). These financial returns are denoted as \( \begin{equation*} Y^{returns}_i(t) = [y^{returns}_i(t), \ldots ,y^{returns}_i(t+K^{\prime })], \end{equation*} \) where \(y^{returns}_i(t)=(price_i(t) - price_i(t-1))/price_i(t-1)\) and \(price_i(t)\) denote the percentage return and stock price of \(v_i\) at timestep t, respectively. We also use \(Y^{returns}(t)\) and \(y^{returns}(t)\) to denote the percentage returns of all companies over the time window [\(t,t+K^{\prime }\)] and at timestep t, respectively. To facilitate GLT and HNE in learning the relevant global textual information and important heterogeneous relationships, we add auxiliary channels for intermediate guided learning to the GLAM model. The auxiliary channels associated with GLT utilize intermediate forecasts of the most significant means and volatilities of \(Y^{returns}(t)\) across all stocks to guide the learning of GLT parameters. HNE is guided by the learning of an inner weight \(W_{att}\) that is utilized in both the heterogeneous network encoding and the forecasts of the correlations of \(Y^{returns}(t)\). We further elaborate on the GLAM modules, auxiliary channels, and training objectives below.
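As a concrete illustration of the return definition above, a minimal helper (hypothetical, not from the paper's code) computes \(y^{returns}(t)\) from a price series:

```python
import numpy as np

def pct_returns(prices):
    """y(t) = (price(t) - price(t-1)) / price(t-1), per the definition above."""
    prices = np.asarray(prices, dtype=float)
    return (prices[1:] - prices[:-1]) / prices[:-1]

# A price move from 100 to 110 is a +10% return; 110 to 99 is a -10% return.
assert np.allclose(pct_returns([100.0, 110.0, 99.0]), [0.10, -0.10])
```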

Fig. 2. Architecture of GLAM. Detailed Architecture of Guided Global-Local Transformer (GLT) and Heterogeneous Network Encoder (HNE) modules are depicted in Figures 3 and 4, respectively.

3.1 Guided Global-Local Transformer

Transformers [49] were originally proposed for natural language applications but have since been extended to time-series forecasting [29, 53, 60]. Such time-series transformer works [29, 53, 60] use local information (usually numerical), i.e., information directly associated with specific companies or other variables. To extract and learn global textual information relevant to each company, we design the Guided Global-Local Transformer (GLT), which differs from these past transformer-related works in a number of important aspects, elaborated on below.

As shown in Figure 3, the GLT module in Figure 3(i) comprises multiple GLT layers shown in detail in Figure 3(ii). We start from the GLT layer as shown in Figure 3(ii).

Fig. 3. (i) The Guided Global-Local Transformer (GLT) comprises multiple GLT layers, which repeatedly extract relevant global representations across all companies with residual local backcasts as inputs and sum them up to obtain the final relevant global representations \(H(t)\) for all timesteps in the window \([t-K,t-1]\). (ii) Each GLT layer iteratively extracts relevant global representations across the window from the \(t-1\) to the \(t-K\) step in a time-sensitive manner and generates backcasts. Panel (ii) shows the process for the \(t-k\) step, which is repeated from the \(t-1\) to the \(t-K\) step.

Time vectorization and projection. First, we utilize learned time matrices [21, 25] to enable GLT to generate time-sensitive representations. Unlike the usual positional encodings used in transformers, the time matrix \(P(t)\) is learned from the set of timestamps \(T(t)\) corresponding to the day of the week and the week and month of the year for timesteps [\(t-K,t-1\)], as these are most relevant to the respective inputs. The time matrix \(P(t) \in \mathbb {R}^{K \times d}\) is learned by combining functional forms and learnable weights. For GLAM, the empirically chosen functional components are \(\Phi _{1}=sigmoid(Linear(T(t)))\) and \(\Phi _{2}=cos(Linear(T(t)))\), which enable the model to extract non-linear and seasonality-based temporal patterns in \(T(t)\). We then concatenate the matrices \(\Phi _{1}\) and \(\Phi _{2}\) and project them via a linear layer to obtain the time matrix: \(P(t) = Linear([\Phi _{1} || \Phi _{2}])\).
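The time-matrix construction can be sketched as follows. The weights here are randomly initialised stand-ins for the learned Linear layers, so this illustrates only the shapes and functional forms, not trained behavior:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, n_time_feats = 20, 16, 3   # window length, model dim, (day-of-week, week, month)

# T(t): timestamp features for each of the K timesteps (illustrative values)
T = rng.integers(0, 12, size=(K, n_time_feats)).astype(float)

# Stand-in weights for the three Linear layers (learned in the real model)
W1 = rng.normal(size=(n_time_feats, d))
W2 = rng.normal(size=(n_time_feats, d))
Wp = rng.normal(size=(2 * d, d))

phi1 = 1.0 / (1.0 + np.exp(-(T @ W1)))             # sigmoid(Linear(T)): non-linear patterns
phi2 = np.cos(T @ W2)                              # cos(Linear(T)): seasonal patterns
P = np.concatenate([phi1, phi2], axis=1) @ Wp      # P(t) in R^{K x d}
assert P.shape == (K, d)
```

In the full model, P would then be broadcast to the local and global representations as described in the next step.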

Concurrently, we project either (i) the original local numerical features \(X^{num,local}(t)\) with dimension \(d^{num}\) corresponding to stock market price-related information, say, opening, closing, low, high prices, and trading volumes; or (ii) the residual local numerical features, i.e., the difference between \(\hat{X}^{num,local,(l-1)}(t)\) and \(\hat{X}^{num,local,(l-2)}(t)\) of the prior layers (which we will elaborate on later in this section) to representations \(H^{num,local}(t)\) with dimension d. We also project \(X^{txt,global}(t)\) with dimension \(d^{txt}\) corresponding to the dimensions of average word embeddings of each news article generated with a pre-trained encoder to \(H^{txt,global}(t)\) with dimension d again but of different sequence lengths. That is, \(H^{num,local}(t) \in \mathbb {R}^{|V| \times K \times d}\), \(H^{txt,global}(t)\in \mathbb {R}^{\sum _{k=1}^K |N_{t-k}| \times d}\). \(H^{txt,global}(t)\) is not specific to any company, and there are varying \(|N_{t-k}|\) number of global news representations for each timestep. We broadcast the time matrix \(P(t) \in \mathbb {R}^{K \times d}\) by repeating it \(|V|\) times, resulting in a \(|V| \times K \times d\) matrix, and add it to the local numerical representations to obtain: \(\tilde{H}^{num,local}(t)=H^{num,local}(t) + P(t)\). We similarly broadcast the time matrix to match the dimensions of the global news representations, i.e., \(\sum _{k=1}^K |N_{t-k}| \times d\), and add them to the global news representations to obtain: \(\tilde{H}^{txt,global}(t)=H^{txt,global}(t) + P(t)\).

Cross attention. Second, in contrast to transformers that apply self-attention to the same sequence of representations that are of equal length, we apply cross-attention between the local numerical representation and \(|N_{t-k}|\) global news representations at each timestep \(t-k\) in time window \([t-K, t-1]\). We use the local representations \(\tilde{H}^{num,local}(t)\) to guide the learning of global textual features from \(\tilde{H}^{txt,global}(t)\) relevant to each company \(v_i\). As shown in Figure 3(ii), for each timestep in window \([t-K,t-1]\), say, \(t-k\), we generate: \( \begin{equation*} Q^{num,local}(t)[t-k]=Linear_{Q}(\tilde{H}^{num,local}(t)[t-k]) \end{equation*} \) \( \begin{equation*} K^{txt,global}(t)[t-k]=Linear_{K}(\tilde{H}^{txt,global}(t)[t-k]) \end{equation*} \) \( \begin{equation*} ~V^{txt,global}(t)[t-k]=Linear_{V}(\tilde{H}^{txt,global}(t)[t-k]). \end{equation*} \) We apply a scaled dot-product attention weighted aggregation step that is different from standard transformers [49]:

(1) \(\begin{equation} \acute{H}^{txt,global}(t)[t-k] = \left(softmax\left(\frac{K^{txt,global}(t)[t-k] \; W^{gatt} \; Q^{num,local}(t)[t-k]^{\top }}{\sqrt {d}}\right)\right)^{\top } V^{txt,global}(t)[t-k], \end{equation}\) where \(W^{gatt}\in \mathbb {R}^{d \times d}\) is an inner weight shared across all timesteps in window \([t-K,t-1]\) to improve attention extraction of global textual information. The matrix multiplication between \(K^{txt,global}(t)[t-k]\), \(W^{gatt}\), and \(Q^{num,local}(t)[t-k]^{\top }\), after the scaling by \(\sqrt {d}\) and the \(softmax\) step, gives us attention weights of dimensions \(|N_{t-k}| \times |V|\). We then use the transpose of these attention weights to map the \(|N_{t-k}| \times d\) matrix \(V^{txt,global}(t)[t-k]\) to \(\acute{H}^{txt,global}(t)[t-k],\) which is of \(|V| \times d\) dimensions. Across all timesteps \(t-k\) in window \([t-K,t-1]\), we get \(\acute{H}^{txt,global}(t) \in \mathbb {R}^{|V| \times K \times d}\). We then apply a series of addition steps, layer normalization, and feed-forward networks as per conventional transformers [49] to \(\acute{H}^{txt,global}(t)\) as follows: \(H^{\prime }(t) = LayerNorm(\acute{H}^{txt,global}(t) + \tilde{H}^{num,local}(t))\); followed by \(H^{\prime \prime }(t) = LayerNorm(FFN(H^{\prime }(t))+H^{\prime }(t))\).
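The cross-attention step at a single timestep can be sketched numerically as below. All weights are random stand-ins, and the softmax normalization axis (over the \(|N_{t-k}|\) news articles for each company) is our reading of the text:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
num_companies, num_news, d = 4, 7, 8     # |V|, |N_{t-k}|, model dim

Q = rng.normal(size=(num_companies, d))  # queries from local numerical reps
K = rng.normal(size=(num_news, d))       # keys from global news reps
V = rng.normal(size=(num_news, d))       # values from global news reps
W_gatt = rng.normal(size=(d, d))         # shared inner weight

# (|N| x d) @ (d x d) @ (d x |V|) -> |N| x |V| attention scores
scores = K @ W_gatt @ Q.T / np.sqrt(d)
attn = softmax(scores, axis=0)           # normalise over the news articles
H_global = attn.T @ V                    # |V| x d: relevant global news per company
assert H_global.shape == (num_companies, d)
```

Each company thus receives a weighted mixture of the news representations, with weights reflecting how relevant each article is to that company's local dynamics.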

Backcast. Third, in contrast to transformer layers that only generate representations for subsequent layers to encode or decode, the GLT layer not only generates \(H^{(l)}(t) = Dense(H^{\prime \prime }(t))\) with \(H^{(l)}(t) \in \mathbb {R}^{|V| \times K \times d}\) but also generates a backcast of the local numerical information \(\hat{X}^{num,local,(l)}(t) = BC(H^{\prime \prime }(t))\), where \(\hat{X}^{num,local,(l)}(t) \in \mathbb {R}^{|V| \times K \times d^{num}}\).

Residual stacking. Fourth, as shown in Figure 3(i), instead of the typical encoder-decoder architecture in transformers, we stack multiple GLT layers with a residual connection between the lth GLT layer’s backcast of the local numerical information: \(\hat{X}^{num,local,(l)}(t)\) and the local numerical information used as inputs to the lth GLT layer: \(\hat{X}^{num,local,(l-1)}(t)\). The difference between the backcast of the local numerical information and the prior local numerical information \(\hat{X}^{num,local,(l-1)}(t)\) is used as inputs to the subsequent \(l+1\) GLT layer; while the representations of relevant global information \(H^{(l)}(t)\) for each of the GLT layers are added to obtain the final \(H(t)=\sum _{l=1}^{L}H^{(l)}(t) (\in \mathbb {R}^{|V| \times K \times d})\). This residual stacking architecture is inspired by Reference [36], but to our best knowledge has thus far not been applied for the extraction of relevant global information using transformers. The multiple GLT layers can be viewed as multi-stage extraction of global information that are relevant to each company. The residual connection removes the part of the local information that has already been utilized to extract part of the global information relevant to each company and facilitates the extraction task of the subsequent GLT layers.
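The residual stacking logic can be sketched structurally as follows; the toy layers below stand in for real GLT layers and exist only to show how the backcast residuals and the summed representations interact:

```python
import numpy as np

def glt_stack(x_local, layers):
    """Residual stacking across L GLT layers (structural sketch).

    Each layer maps its input to (H_l, backcast); the residual
    x - backcast feeds the next layer, and the H_l's are summed.
    """
    H_final = None
    for layer in layers:
        H_l, backcast = layer(x_local)
        x_local = x_local - backcast          # remove already-explained local signal
        H_final = H_l if H_final is None else H_final + H_l
    return H_final

# Toy layer: returns its input as H_l and "explains" half of it as backcast
toy_layer = lambda x: (x.copy(), 0.5 * x)
x = np.ones((2, 3))
H = glt_stack(x, [toy_layer, toy_layer, toy_layer])
# Successive layer inputs: 1, 0.5, 0.25 -> H = 1 + 0.5 + 0.25 = 1.75
assert np.allclose(H, 1.75)
```

Each stage works on the part of the local signal that earlier stages have not yet explained, mirroring the multi-stage extraction described above.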

Intermediate forecasts. Finally, to enable the GLT module to effectively extract global information relevant to each company, we add fully connected layers \(FC^{\prime }_M\) and \(FC^{\prime }_V\) and use the representation of the final step \(t-1\) in the window \([t-K,t-1]\) to make intermediate forecasts of the most significant means and volatilities, respectively, of \(Y^{returns}(t)\) across all stocks as follows: \( \begin{equation*} \hat{Y}^{\prime returns}_{mean}(t) = FC^{\prime }_M(H(t)[t-1]) \end{equation*} \) \( \begin{equation*} \hat{Y}^{\prime returns}_{vol}(t)=FC^{\prime }_V(H(t)[t-1]). \end{equation*} \) The training loss functions for these intermediate forecasts will be elaborated in Section 3.3. The intermediate forecasts are also designed to alleviate the over-smoothing in the subsequent network encoding step [54], which causes representations for all nodes in a network to become very similar to one another and can lead to poorer performance.

3.2 Heterogeneous Network Encoder

The heterogeneous network encoder utilizes the heterogeneous relationships between companies to propagate representations between companies based on different relationship types \(\lbrace 1, \ldots , R\rbrace\). Inspired by References [22, 57], we adapt the scaled dot-product attention module commonly used in transformers [49] for the GNN message-passing framework to design a Heterogeneous Network Encoder (HNE), as shown in Figure 4.

Fig. 4. Heterogeneous Network Encoder (HNE).

When applied to networks, the scaled dot-product attention mechanism from transformers uses the network structure to compute attention scores between neighboring nodes. These attention scores weigh the messages propagated from source to target nodes for aggregation. Scaled dot-product attention is more effective than the usual message-passing framework employed in most GNNs, as it allows the model to perform the message composition, propagation, and update steps of the GNN message-passing framework based on the self-discovered relative importance of each neighboring source node and relationship-type.

HNE extracts edges linking neighboring source company nodes \(v_s\)’s to a target company node \(v_x\) as canonical triplets \(\langle v_s, r, v_x \rangle\)’s (i.e., \(v_s,v_x \in V\), \(r \in \lbrace 1, \ldots , R\rbrace\)) from the heterogeneous network. For each canonical triplet, we first utilize linear layers to encode \(H(t)\) of the company nodes in time window [\(t-K,t-1\)] from the prior GLT module as queries, keys, and values: \( \begin{equation*} Q_{v_x}(t)=Linear_{Q-HNE}(H_{v_x}(t)) \end{equation*} \) \( \begin{equation*} K_{v_s}(t)=Linear_{K-HNE}(H_{v_s}(t)) \end{equation*} \) \( \begin{equation*} V_{v_s}(t)=Linear_{V-HNE}(H_{v_s}(t)). \end{equation*} \)

We then compute the attention score \(AttScore\) between a target node \(v_x\) and each neighboring source node \(v_s \in N(v_x)\) as:

(2) \(\begin{equation} AttScore_{\langle v_s, r, v_x \rangle }(t) = \mathop{softmax}_{v_s \in N(v_x)}\left(\frac{Q_{v_x}(t)\,W_{att,r}\,K_{v_s}(t)^{\intercal }}{\sqrt {d}}\right), \end{equation}\)
where \(N(v_x)\) denotes the neighboring nodes of \(v_x\), and \(W_{att,r}\) is a \(d \times d\) learnable weight matrix. Next, we use the attention score \(AttScore\) to compute the weighted average of features from all source nodes and use it to update the triplet-specific representation of the target node \(v_x\). (3) \(\begin{equation} H_{\langle v_s, r, v_x \rangle }(t) = \sum _{v_s \in N(v_x)} AttScore_{\langle v_s, r, v_x \rangle }(t)\cdot V_{v_s}(t) \end{equation}\)

At this point, we have the embeddings of the target node \(v_x\) for each of the canonical triplets or edges connected to neighboring nodes \(N(v_x)\). To learn the importance of different edges, we use attention-based fusion. A non-linear transformation is applied to the representations to obtain scalars \(s(v_s, r, v_x,t) = W_{\omega }^{(1)} tanh(W_{\omega }^{(0)} H_{\langle v_s, r, v_x \rangle }(t) + b_{\omega })\), where \(W_{\omega }^{(0)}\) and \(W_{\omega }^{(1)}\) are learnable weight matrices and \(b_{\omega }\) is the bias vector. Parameters are shared across modalities. We normalize the scalars with a softmax function to obtain the weights \(\beta _{(v_s, r, v_x)}(t)\)’s, which are used to fuse representations across the edges into node \(v_x\) as \(z_{x}(t)\)’s. (4) \(\begin{equation} \beta _{(v_s, r, v_x)}(t) = \frac{exp(s(v_s, r, v_x,t))}{\sum _{(v_s, r, v_x)} exp(s(v_s, r, v_x,t))} \end{equation}\) (5) \(\begin{equation} z_{x}(t) = \sum _{(v_s, r, v_x)} \beta _{(v_s, r, v_x)}(t) H_{\langle v_s, r, v_x \rangle }(t) \end{equation}\) Repeating these steps across all company nodes and edges results in \(Z(t) \in \mathbb {R}^{|V| \times K \times d}\). We use the representation of the final step \(t-1\) in the window \([t-K,t-1]\), i.e., \(Z(t)[t-1]\in \mathbb {R}^{|V| \times d}\) in the subsequent steps. Similarly, across all company nodes and edges, we learn R attention weights \(W_{att,r}\), each of dimension \(d \times d\), and stack them to get \(W_{att}\) of dimension \(d \times d \times |R|\).
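The HNE update for a single target node can be sketched as follows. This is a simplified reading of Equations (2)–(5) in which each triplet message is the attention-weighted value of its source node before fusion; `lin_q`, `lin_k`, `lin_v`, `W_att`, `W_w0`, and `W_w1` are illustrative stand-ins for the corresponding learnable parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 100
lin_q, lin_k, lin_v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
W_att = nn.Parameter(torch.randn(3, d, d))      # one d x d matrix per relation type (R=3)
W_w0, W_w1 = nn.Linear(d, d), nn.Linear(d, 1)   # fusion scoring (W_omega^(0), W_omega^(1))

def hne_target_update(h_x, h_neighbors, rel_types):
    """h_x: (d,) target repr; h_neighbors: (n, d) source reprs;
    rel_types: (n,) relation ids. Returns the fused target repr z_x."""
    q = lin_q(h_x)            # query from the target node
    k = lin_k(h_neighbors)    # keys from neighboring source nodes
    v = lin_v(h_neighbors)    # values from neighboring source nodes
    # relation-specific bilinear attention, scaled by sqrt(d) and
    # normalized over the neighborhood (Eq. 2)
    scores = torch.stack([q @ W_att[r] @ k[i]
                          for i, r in enumerate(rel_types)]) / d ** 0.5
    att = F.softmax(scores, dim=0)
    h_triplets = att.unsqueeze(-1) * v            # triplet messages (Eq. 3)
    # attention-based fusion across triplets (Eqs. 4-5)
    s = W_w1(torch.tanh(W_w0(h_triplets))).squeeze(-1)
    beta = F.softmax(s, dim=0)
    return (beta.unsqueeze(-1) * h_triplets).sum(dim=0)  # z_x

z = hne_target_update(torch.randn(d), torch.randn(4, d),
                      torch.tensor([0, 1, 1, 2]))
```

The real model applies this update across all company nodes and all K window steps to produce \(Z(t)\); the sketch shows one target node at one step.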

3.3 Forecasting and Loss Functions

We use fully connected layers to generate final forecasts of means and volatilities of stock returns over the selected horizon period \([t,t+K^{\prime }]\): \( \begin{equation*} \hat{Y}^{returns}_{mean}(t) = FC_{M}(z_{t-1}) \end{equation*} \) \( \begin{equation*} \hat{Y}^{returns}_{vol}(t) = FC_{V}(z_{t-1}), \end{equation*} \) where \(z_{t-1}=Z(t)[t-1]\). As described in Section 3.2, we stack R attention weights \(W_{att,r}\), each of dimension \(d \times d\), from HNE to get \(W_{att}\) of dimension \(d \times d \times |R|\) and forecast correlations of asset returns over the horizon period \([t,t+K^{\prime }]\) as:

(6) \(\begin{equation} \hat{Y}^{returns}_{corr}(t) = z_{t-1}\, FC_{C}(W_{att})\, z_{t-1}^{\intercal }, \end{equation}\)
where \(FC_{C}\) is a fully connected layer that projects \(W_{att}\) from dimensions \(d \times d \times |R|\) to dimensions \(d \times d\).

The final forecasts \(\hat{Y}^{returns}_{mean}(t)\), \(\hat{Y}^{returns}_{vol}(t)\), and \(\hat{Y}^{returns}_{corr}(t)\) are utilized alongside the intermediate forecasts \(\hat{Y}^{\prime returns}_{mean}(t)\) and \(\hat{Y}^{\prime returns}_{vol}(t)\) from GLT, as described in Section 3.1, to learn GLAM’s parameters. Doing so enables GLT to learn to extract the relevant global news information well and hence improve the representations that are propagated between companies based on different relationship types in HNE.

The respective ground-truths, i.e., actual means, volatilities, and correlations over the horizon \([t,t+K^{\prime }]\) for each stock, i.e., \(v_i\), are computed from the observed stock returns as follows: (7) \(\begin{equation} y^{returns}_{mean,i}(t) = \frac{1}{K^{\prime }}\sum ^{K^{\prime }}_{k^{\prime }=0}y^{returns}_i(t+k^{\prime }), \end{equation}\) (8) \(\begin{equation} y^{returns}_{vol,i}(t) = \sqrt {\frac{1}{K^{\prime }}\sum ^{K^{\prime }}_{k^{\prime }=0}(y_i^{returns}(t+k^{\prime })-\mu _i)^2}, \end{equation}\) where \(\mu _i = y^{returns}_{mean,i}(t)\). For correlations between any two companies i and j: (9) \(\begin{equation} y^{returns}_{corr,i,j}(t) = \frac{\sum ^{K^{\prime }}_{k^{\prime }=0}(y^{returns}_i(t+k^{\prime })-\mu _i)(y^{returns}_j(t+k^{\prime })-\mu _j)}{\sqrt {\sum ^{K^{\prime }}_{k^{\prime }=0}(y^{returns}_i(t+k^{\prime })-\mu _i)^2}\sqrt {\sum ^{K^{\prime }}_{k^{\prime }=0}(y^{returns}_j(t+k^{\prime })-\mu _j)^2}}. \end{equation}\)
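The ground-truth computations in Equations (7)–(9) translate directly into NumPy; note that the sums run over the \(K^{\prime }+1\) steps \(k^{\prime }=0,\ldots ,K^{\prime }\). The array shapes below are illustrative.

```python
import numpy as np

def returns_targets(returns, i, j):
    """Ground-truth mean, volatility, and pairwise correlation over the horizon.
    returns: (num_stocks, horizon_len) array of observed returns y_i(t+k')."""
    mean = returns.mean(axis=1)                                   # Eq. 7
    vol = np.sqrt(((returns - mean[:, None]) ** 2).mean(axis=1))  # Eq. 8
    dev_i = returns[i] - mean[i]
    dev_j = returns[j] - mean[j]
    corr = (dev_i * dev_j).sum() / np.sqrt(
        (dev_i ** 2).sum() * (dev_j ** 2).sum())                  # Eq. 9
    return mean, vol, corr

rng = np.random.default_rng(0)
r = rng.normal(size=(5, 11))  # 5 stocks, K'+1 = 11 horizon steps
mean, vol, corr = returns_targets(r, 0, 1)
```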

For the main training losses, we compute losses between the forecasts above and the respective ground-truths, i.e., actual means, volatilities, and correlations over the horizon \([t,t+K^{\prime }]\), with a root mean squared error (RMSE) loss across all companies \(v_i \in V\): (10) \(\begin{equation} \begin{split} \mathcal {L}_{main} = & \, \mathcal {L}_{RMSE}\left(Y^{returns}_{mean}(t), \hat{Y}^{returns}_{mean}(t)\right) \\ & + \mathcal {L}_{RMSE}\left(Y^{returns}_{vol}(t), \hat{Y}^{returns}_{vol}(t)\right) \\ & + \mathcal {L}_{RMSE}\left(Y^{returns}_{corr}(t), \hat{Y}^{returns}_{corr}(t)\right). \end{split} \end{equation}\)

For the intermediate forecasts, we select subsets of the stocks for each training iteration, one for forecasts of means and one for forecasts of volatilities, i.e., \(V^{\prime }_{mean} \subseteq V\) and \(V^{\prime }_{vol} \subseteq V\), with the most significant ex-post response in the horizon period [\(t, t+K^{\prime }\)] to online news in the window period [\(t-K,t-1\)]. This enables GLT to capture the most significant ex-post effects of online news and improve its extraction and learning of global textual information relevant to each company. We further design an adaptive learning process similar to curriculum learning [45] but undertaken at a much earlier stage of the GLAM model. Curriculum learning is a training strategy that imitates the way humans learn by gradually increasing the difficulty of the data samples used to train a model [4]. In the context of this article, more significant ex-post responses to online news, i.e., higher-than-average means and volatilities of returns in the horizon period [\(t,t+K^{\prime }\)], represent easier training data samples for the extraction of global textual information relevant to specific companies. We use thresholds \(\tau _{mean}\) and \(\tau _{vol}\) to select companies for \(V^{\prime }_{mean}\) and \(V^{\prime }_{vol}\), respectively. \(\tau _{mean}\) and \(\tau _{vol}\) are computed as follows: (11) \(\begin{equation} \tau _{mean} = M_{mean} - \gamma \times (M_{mean} - \mu _{mean}), \end{equation}\) (12) \(\begin{equation} \tau _{vol} = M_{vol} - \gamma \times (M_{vol} - \mu _{vol}), \end{equation}\) where \( \begin{equation*} \mu _{mean}=\frac{1}{|V|}\sum _{v_i \in V} y_{mean,i}^{returns}(t), \end{equation*} \) \( \begin{equation*} \mu _{vol}=\frac{1}{|V|}\sum _{v_i \in V} y_{vol,i}^{returns}(t). \end{equation*} \) The maxima of return means and volatilities, \(M_{mean}\) and \(M_{vol}\), are defined by \(M_{mean}=\max _{v_i\in V} y_{mean,i}^{returns}(t)\) and \(M_{vol}=\max _{v_i\in V} y_{vol,i}^{returns}(t)\). The parameter \(\gamma = \frac{e}{E} + \eta\) is updated based on the training epoch \(e \in \lbrace 1, \ldots , E\rbrace\), where E refers to the total number of training epochs. The \(\eta\) hyper-parameter is set to a fraction, say, 0.5, which represents a base proportion of the company nodes to be utilized at the start of training.

The subset of company nodes for the most significant ex-post returns means and the subset of company nodes for the most significant ex-post returns volatilities in the horizon period \([t,t+K^{\prime }]\) are hence: \( \begin{equation*} V^{\prime }_{mean}=\left\lbrace v_i|y_{mean,i}^{returns}(t) \gt \tau _{mean}\right\rbrace \end{equation*} \) \( \begin{equation*} V^{\prime }_{vol}=\left\lbrace v_i|y_{vol,i}^{returns}(t) \gt \tau _{vol}\right\rbrace . \end{equation*} \)
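The threshold and subset selection of Equations (11) and (12) can be sketched as follows; as training progresses, \(\gamma\) grows, the threshold falls, and the selected subset expands. The numbers are illustrative.

```python
import numpy as np

def significant_subset(y, epoch, total_epochs, eta=0.5):
    """Select stocks with the most significant ex-post response:
    threshold tau = M - gamma * (M - mu), with gamma = epoch/total_epochs + eta
    (Eqs. 11-12). Returns indices of stocks with y above the threshold."""
    mu, M = y.mean(), y.max()
    gamma = epoch / total_epochs + eta
    tau = M - gamma * (M - mu)
    return np.where(y > tau)[0]

y_vol = np.array([0.1, 0.5, 0.9, 0.2, 0.7])  # ex-post volatilities of 5 stocks
early = significant_subset(y_vol, epoch=1, total_epochs=100)   # small gamma: few stocks
late = significant_subset(y_vol, epoch=100, total_epochs=100)  # large gamma: more stocks
```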

Hence, for the auxiliary training losses, we compute the auxiliary losses for mean and volatilities between the intermediate forecasts from GLT, i.e., \(\hat{Y}^{\prime returns}_{mean}(t)\) and \(\hat{Y}^{\prime returns}_{vol}(t)\) and the ground-truths with RMSE for the subsets of company nodes \(v_i \in V^{\prime }_{mean}\) and \(v_i \in V^{\prime }_{vol}\), respectively: (13) \(\begin{equation} \begin{split}\mathcal {L}_{aux} & = \mathcal {L}_{RMSE}\left(Y^{\prime returns}_{mean}(t), \hat{Y}^{\prime returns}_{mean}(t)\right) + \mathcal {L}_{RMSE}\left(Y^{\prime returns}_{vol}(t), \hat{Y}^{\prime returns}_{vol}(t)\right). \end{split} \end{equation}\)

GLAM can be trained with the objective of minimizing total main and auxiliary training losses, i.e., a simple addition of \(\mathcal {L}_{main}\) and \(\mathcal {L}_{aux}\). However, to adaptively balance between these losses, we introduce the \(\alpha\) hyper-parameter. The \(\alpha\) hyper-parameter balances between the main and auxiliary training objectives with respect to the means and volatilities of returns. A higher weight is placed on the auxiliary training losses \(\mathcal {L}_{aux}\) during the initial training epochs to enable the GLT to extract relevant global information well first, before higher weights are placed on the main training losses for forecasts of means and volatilities \(\mathcal {L}_{RMSE}(Y^{returns}_{mean}(t), \hat{Y}^{returns}_{mean}(t)) + \mathcal {L}_{RMSE}(Y^{returns}_{vol}(t), \hat{Y}^{returns}_{vol}(t))\) during the later training epochs. The total loss is defined as: (14) \(\begin{equation} \begin{split}\mathcal {L}_{total} &= \alpha \left(\mathcal {L}_{RMSE}\left(Y^{returns}_{mean}(t), \hat{Y}^{returns}_{mean}(t)\right) + \mathcal {L}_{RMSE}\left(Y^{returns}_{vol}(t), \hat{Y}^{returns}_{vol}(t)\right)\right) \\ & \quad +\mathcal {L}_{RMSE}\left(Y^{returns}_{corr}(t), \hat{Y}^{returns}_{corr}(t)\right) + (1-\alpha) \mathcal {L}_{aux} \end{split} \end{equation}\) where \(\alpha = \frac{e}{E}\).
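Equation (14) can be sketched as a plain loss function; the `rmse` helper and the tensor arguments are illustrative stand-ins for the model's forecast and ground-truth tensors.

```python
import torch

def rmse(y, y_hat):
    """Root mean squared error loss."""
    return torch.sqrt(torch.mean((y - y_hat) ** 2))

def total_loss(epoch, total_epochs,
               y_mean, yh_mean, y_vol, yh_vol, y_corr, yh_corr,
               y_mean_sub, yh_mean_sub, y_vol_sub, yh_vol_sub):
    """Eq. 14: alpha = e/E shifts weight from the auxiliary (intermediate GLT)
    losses early in training to the main mean/volatility losses later; the
    correlation loss is always fully weighted."""
    alpha = epoch / total_epochs
    l_aux = rmse(y_mean_sub, yh_mean_sub) + rmse(y_vol_sub, yh_vol_sub)  # Eq. 13
    l_main_mv = rmse(y_mean, yh_mean) + rmse(y_vol, yh_vol)
    return alpha * l_main_mv + rmse(y_corr, yh_corr) + (1 - alpha) * l_aux

torch.manual_seed(0)
y, yh = torch.randn(10), torch.randn(10)
loss = total_loss(10, 100, y, yh, y, yh, y, yh, y[:3], yh[:3], y[:3], yh[:3])
```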


4 EXPERIMENTS

4.1 Datasets

We conduct experiments with four datasets, combining global textual information, i.e., online news articles from two popular financial news portals on the web, with local numerical information, i.e., daily stock market price-related information from two stock markets—NYSE and NASDAQ—from 2015 to 2019.

The two online news article sources are: (i) the Investing news datasets (IN); and (ii) the Benzinga news datasets (BE). The datasets contain news articles and commentaries collected from the Investing and Benzinga investment news portals, which are drawn from a wide range of mainstream providers, analysts, and blogs, such as Seeking Alpha and Zacks.

For the local numerical information, we collected daily stock market price-related information—opening, closing, low and high prices, and trading volumes—of the two stock markets—NYSE (NY) and NASDAQ (NA)—from the Center for Research in Security Prices. We filter out stocks from NYSE and NASDAQ that are not traded in the respective time periods and whose stock symbols are not mentioned in any articles for the respective news article sources. We could have included articles not covering these stocks, as GLAM is able to extract relevant global textual information, but we restrict the choice of stocks and articles for a fair comparison with previous models (e.g., FAST [43] and HAN [52]) that are designed to only capture local textual news information, i.e., news information associated with specific companies in the target company list.

Following the earlier works [1, 17], we extract inter-company relationships for the company network \(G=(V,E,X)\) from Wikidata, using dumps dated January 7, 2019. Wikidata is chosen, as it is one of the largest and most active collaboratively constructed knowledge graphs (KGs). Companies such as Google, Apple, and Microsoft are present within the Wikidata KG as entities, and relationships between them, e.g., Alphabet being the parent company of Google, or Apple and Microsoft belonging to the same industry sector, can be extracted from it. We adopt the five first-order and 52 second-order relationship-types identified by Feng et al. [17]. A first-order relationship-type is extracted directly from a single knowledge graph relation, e.g., the parent organization relation in Wikidata, where company A is the parent of company B. A second-order relationship-type combines two Wikidata relations into one inter-company relationship. For example, a second-order relationship of company A sharing key management with company B can be constructed from two Wikidata relations: A having a board member M and B having M as its chief executive officer. Similarly, a second-order relationship of companies A and B belonging to the same industry sector can be constructed from A being in industry I and B being in industry I, too. The earliest Wikidata dumps date from 2014, but we found that knowledge graphs extracted from dumps earlier than our chosen one were too sparse to be useful for our experiments. We did not use more recent dumps to avoid overlap with the time window of the testing datasets.
Following Reference [1], we also use a pre-trained Wikipedia2Vec [56] embedding model to pre-encode textual news and capture the rich knowledge present within the Wikipedia knowledge base. Wikipedia2Vec generates representations of words and entities based on their corresponding Wikipedia pages, placing similar words and entities close to one another in the representational space. It offers a relatively compact representation with a dimension of 100 while giving reasonably good performance compared to other pre-trained encoders, as it captures knowledge-based semantics. The representation of each news article is the average of the word embeddings generated for that article with the pre-trained Wikipedia2Vec embedding model.

The coverage of these datasets—across five years, with more than 1.5M articles, 2,000 companies, and more than 50 types of inter-company relationships—is extensive and provides strong support for our experimental findings. We combine them into four datasets (across two news article sources and two stock markets), each covering a different number of companies, relationship-types, and news sources, as depicted in Table 1. The number of companies included in these datasets is relatively large or comparable to most other related works: e.g., References [13, 43] cover fewer than 100 companies, References [1, 17] cover around 1,000–2,000 companies, while Reference [23] similarly covers more than 2,000 companies.

Table 1.

                          IN-NY    IN-NA    BE-NY    BE-NA
No. articles              221,513 (IN)     1,377,098 (BE)
No. company nodes           374      402    2,240    2,514
No. relationship-types       58       36       46       34
No. relationships         3,255    1,511    6,436    4,986

Table 1. Overview of Datasets

To obtain the labelled data samples, we adopt a sliding window approach [53] to extract the numerical and textual input features in the window \([t-K,t-1]\) and returns-related labels, i.e., the ground-truth means, volatilities, and correlations of returns in the horizon \([t,t+K^{\prime }]\). For each of the four datasets, we obtain a total of 1,257 data samples and divide these samples into non-overlapping training/validation/testing sets in the ratios 0.7/0.15/0.15 for all experiments, as shown in Figure 5.
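A minimal sketch of the sliding-window sample construction and chronological split; the exact feature extraction is abstracted away, and the array shapes are illustrative.

```python
import numpy as np

def sliding_samples(series, K=20, K_prime=10):
    """Slide over a (T, num_stocks) series: inputs come from the window
    [t-K, t-1], labels are computed over the horizon [t, t+K']."""
    samples = []
    for t in range(K, series.shape[0] - K_prime):
        x = series[t - K:t]             # input features: window [t-K, t-1]
        y = series[t:t + K_prime + 1]   # label horizon: [t, t+K']
        samples.append((x, y))
    return samples

series = np.random.randn(100, 5)  # T=100 steps, 5 stocks
samples = sliding_samples(series)
n = len(samples)  # 100 - 20 - 10 = 70 samples
# chronological, non-overlapping 0.7/0.15/0.15 split
train = samples[:int(0.7 * n)]
val = samples[int(0.7 * n):int(0.85 * n)]
test = samples[int(0.85 * n):]
```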

Fig. 5. Using a fixed sliding window to extract input features in the window \([t-K,t-1]\) and labels in the horizon \([t,t+K^{\prime }]\) to obtain labelled data samples and splitting into training, validation, and testing sets.

4.2 Tasks and Metrics

We compare GLAM with state-of-the-art baselines on three predictive tasks: forecasting of (i) means, (ii) volatilities, and (iii) correlations of stock price percentage returns. We use RMSE, mean absolute error (MAE), and symmetric mean absolute percentage error (SMAPE) as metrics. RMSE and MAE are common scale-dependent metrics used to evaluate forecasting performance, with RMSE being more sensitive to outliers than MAE. SMAPE is a commonly used scale-independent metric defined as: (15) \(\begin{equation} SMAPE = \frac{100\%}{n} \sum ^{n}_{i=1} \frac{|Y^{returns}_i(t) - \hat{Y}^{returns}_i(t)|}{(|Y^{returns}_i(t)| + |\hat{Y}^{returns}_i(t)|)/2}, \end{equation}\) where n is the number of observations. We choose SMAPE instead of mean absolute percentage error (MAPE), as SMAPE gives equal importance to both under- and over-forecasts, as required in this evaluation context, while MAPE favors under-forecasts.
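SMAPE as defined in Equation (15) is straightforward to implement; note that the ratio is undefined when both the ground truth and the forecast are exactly zero.

```python
import numpy as np

def smape(y, y_hat):
    """Symmetric MAPE (Eq. 15): equal weight to under- and over-forecasts.
    Undefined where both y and y_hat are exactly zero."""
    return 100.0 * np.mean(np.abs(y - y_hat) / ((np.abs(y) + np.abs(y_hat)) / 2))

y = np.array([1.0, 2.0, -1.0])
```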

4.3 Baselines and Settings

We compare GLAM against the classical Vector AR [34] model, which captures numerical information and makes forecasts in an auto-regressive manner, as well as state-of-the-art baselines (see Section 2): TST [60], which captures numerical information with a conventional transformer encoder; HAN [52], which captures local textual information with two sets of attention mechanisms; SE [13], which captures global textual information with bidirectional GRUs; FAST [43], which captures local textual information with Time-Aware LSTMs [3]; RSR [17], which captures numerical information and heterogeneous inter-company relationships with a GCN-based model; and KECE [1], which captures numerical and global textual information and homogeneous inter-company relationships with a GAT-based model. We also adapt HGT [22] for time-series attributes by adding a GRU at the start to first encode numerical time-series information before network encoding with HGT. This model is referred to as GRU-HGT. Table 2 provides an overview of the data captured by GLAM and the baselines. For every deep learning baseline, we add a fully connected layer to forecast means, volatilities, and correlations of percentage stock returns. For the classical Vector AR model, we compute the return forecasts in the horizon in an auto-regressive manner and compute the means, volatilities, and correlations of these forecasts.

Table 2.

Model                  Vector AR   TST   HAN   SE   FAST   RSR   KECE   GRU-HGT   GLAM
Homogeneous Network        ×        ×     ×     ×     ×     ✓      ✓       ✓        ✓
Heterogeneous Network      ×        ×     ×     ×     ×     ✓      ×       ✓        ✓
Local information          ✓        ✓     ✓     ×     ✓     ✓      ✓       ✓        ✓
Global information         ×        ×     ×     ✓     ×     ×      ✓       ×        ✓

  • Homogeneous networks are a special case of heterogeneous networks; hence, GLAM and baseline models that can capture heterogeneous networks can also be utilized to capture homogeneous networks.

Table 2. Overview of Data that Can Be Captured by GLAM Model and Baselines

We set the default window and horizon periods to \(K=20\) and \(K^{\prime }=10\) days based on experiments with different periods \(K, K^{\prime } \in \lbrace 5,10,20,60\rbrace\), which correspond to a trading week, fortnight, month, and quarter. Differences in performance between GLAM and baselines were generally consistent across all window and horizon periods. \(K=20\) corresponds to a trading month, and \(K^{\prime }=10\) days corresponds to a global regulatory requirement for VaR computations, which we examine in a subsequent set of case-studies (see Section 7). Across all models, the dimensions of hidden representations are fixed at 100, and two layers (L = 2) are utilized, where applicable. \(\eta\) for GLAM is set to 0.5 based on experiments with different \(\eta \in \lbrace 0.25,0.5,0.75\rbrace\). An Adam [26] optimizer with a learning rate of 1e-3 and a cosine annealing scheduler is used. Models are implemented in PyTorch [39] and trained for 100 epochs on a 3.60 GHz AMD Ryzen 7 Windows desktop with an NVIDIA RTX 3090 GPU and 64 GB RAM. Training GLAM, which has around 6e5 to 8e5 parameters (depending on the dataset), takes around 12 to 16 hours.
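The optimizer and scheduler setup can be sketched as follows, with a stand-in model and placeholder loss in place of GLAM and its training objective.

```python
import torch

model = torch.nn.Linear(100, 3)  # stand-in for the GLAM model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 100)).pow(2).mean()  # placeholder training loss
    loss.backward()
    optimizer.step()
    scheduler.step()  # anneal the learning rate toward zero over 100 epochs
```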


5 FORECASTING RESULTS

Table 3 sets out the results of the forecasting experiments. For each metric, we indicate the best results in boldface and underline the second best results.

Table 3.

                 IN-NY                  IN-NA                  BE-NY                  BE-NA
          RMSE   MAE    SMAPE    RMSE   MAE    SMAPE    RMSE   MAE    SMAPE    RMSE   MAE    SMAPE
Means
Vector AR 0.1509 0.0413 1.7718   0.0618 0.0372 1.4357   0.0765 0.0201 1.6669   0.1526 0.0282 1.5673
TST       0.0689 0.0133 1.4860   0.0323 0.0148 1.3349   0.0662 0.0142 1.5216   0.1521 0.0288 1.5511
HAN       0.0689 0.0134 1.6905   0.0322 0.0144 1.3704   0.0664 0.0158 1.5238   0.1499 0.0271 1.5560
SE        0.0689 0.0134 1.4646   0.0323 0.0144 1.3919   0.0676 0.0158 1.5952   0.1666 0.0339 1.5445
FAST      0.0689 0.0134 1.4442   0.0329 0.0162 1.2921   0.0663 0.0144 1.5285   0.1496 0.0276 1.5599
RSR       0.0690 0.0135 1.3785   0.0327 0.0156 1.3128   0.0664 0.0163 1.5050   0.1499 0.0300 1.5581
KECE      0.0688 0.0134 1.4014   0.0324 0.0152 1.2965   0.0662 0.0152 1.6411   0.1465 0.0301 1.6610
GRU-HGT   0.0689 0.0134 1.3853   0.0322 0.0147 1.4299   0.0663 0.0145 1.5311   0.1498 0.0287 1.5540
GLAM      0.0491 0.0107 1.2031   0.0223 0.0116 1.2408   0.0487 0.0118 1.4449   0.1125 0.0208 1.4959
Volatilities
Vector AR 0.3605 0.0871 0.7569   0.1163 0.0542 0.6956   0.2287 0.0662 1.0481   0.4775 0.1183 1.1401
TST       0.2177 0.0482 0.6225   0.1155 0.0587 0.6773   0.2202 0.0627 1.0181   0.4827 0.1200 1.1521
HAN       0.2180 0.0486 0.6226   0.1154 0.0579 0.6719   0.2223 0.0663 1.0361   0.4814 0.1174 1.1682
SE        0.2175 0.0485 0.6319   0.1148 0.0558 0.6547   0.2245 0.0688 1.0209   0.4795 0.1143 1.1286
FAST      0.2179 0.0485 0.6228   0.1145 0.0561 0.6638   0.2217 0.0633 1.0260   0.4789 0.1155 1.1594
RSR       0.2181 0.0487 0.6232   0.1161 0.0590 0.6830   0.2240 0.0724 1.0488   0.4818 0.1253 1.1748
KECE      0.2177 0.0483 0.6239   0.1193 0.0651 0.7167   0.2186 0.0591 1.0486   0.4619 0.1005 1.1545
GRU-HGT   0.2176 0.0483 0.6270   0.1145 0.0577 0.6699   0.2232 0.0684 1.0343   0.4807 0.1174 1.1374
GLAM      0.1435 0.0414 0.6117   0.0835 0.0507 0.6556   0.1632 0.0601 1.0170   0.3578 0.0946 1.0836
Correlations
Vector AR 0.7443 0.5903 1.5869   0.7122 0.5704 1.6241   0.5306 0.3265 1.7658   0.4540 0.2534 1.8834
TST       0.4953 0.4222 1.5009   0.4913 0.4184 1.5402   0.3899 0.2768 1.7220   0.3379 0.2177 1.8082
HAN       0.4943 0.4222 1.4861   0.4914 0.4185 1.5356   0.3902 0.2777 1.7243   0.3386 0.2174 1.7986
SE        0.5090 0.4308 1.5456   0.4980 0.4208 1.5167   0.4023 0.2844 1.7224   0.3395 0.2212 1.7854
FAST      0.4958 0.4223 1.5035   0.4917 0.4176 1.5056   0.3882 0.2752 1.7198   0.3371 0.2167 1.7996
RSR       0.4927 0.4200 1.4299   0.4940 0.4201 1.5145   0.3903 0.2780 1.7233   0.3398 0.2206 1.7943
KECE      0.4958 0.4227 1.5165   0.4916 0.4184 1.5268   0.3891 0.2617 1.7070   0.3381 0.2186 1.8005
GRU-HGT   0.4965 0.4234 1.5287   0.4933 0.4194 1.5193   0.3872 0.2770 1.7290   0.3395 0.2231 1.7874
GLAM      0.4025 0.3248 1.1221   0.4169 0.3437 1.2332   0.3355 0.2382 1.5844   0.3060 0.1979 1.7018

  • Lower better for all metrics. Best model(s) in bold; second-best model(s) underlined.

Table 3. Forecast Results

On the task of forecasting means, GLAM clearly outperforms all baselines. The dispersion in model performances is narrower for the IN datasets than for the BE datasets. TST, RSR, and KECE, which utilize numerical information, generally tend to perform better than the models that only utilize textual information.

On the task of forecasting volatilities, GLAM again outperforms the baselines on most metrics. Compared to the task of forecasting means, the dispersion in model performance is more significant, which could be because forecasting volatilities is a more difficult task that requires textual information to be captured more effectively. Accordingly, we observe that baselines that capture textual information, such as SE, FAST, and KECE, perform better on this task. This could be because textual news contains information that is indicative of subsequent periods of increased volatility.

On the task of forecasting correlations, GLAM outperforms the baselines most significantly, which is likely due to its utilization of heterogeneous network information. Similarly, RSR and KECE, baselines that utilize network information, perform better than the other baselines here. GRU-HGT, which also utilizes network information, does not perform as well, but this could be due to its inability to capture sequential information effectively with the simple GRU extension.

In general, GLAM outperforms all baselines by a significant margin on all tasks. Different baselines perform better on different tasks based on the nature of information that they capture. Performance differences between GLAM and baselines are more significant for the larger BE datasets than for the IN datasets due to the larger volume of news textual information. The differences in performances between GLAM and baselines are more pronounced for volatilities and correlations forecasting than means forecasting, as these are harder tasks that require the model to capture global textual news information and the propagation of news effects between companies based on heterogeneous relationships, which are key features of the GLAM model.


6 ABLATION STUDIES

To further substantiate the earlier findings, we conduct ablation studies on GLAM. Table 4 shows the results of ablation studies for GLAM on IN-NY and IN-NA for forecasting means, volatilities, and correlations. We choose the IN-NY and IN-NA datasets for the ablation studies to represent the two markets, as the difference in the number of relationship-types and the number of relationships is the most distinct between these two datasets using the same source of news. The impact of different changes to the GLAM model varies for different tasks but is generally consistent across both datasets.

Table 4.

                                         IN-NY                    IN-NA
                                  RMSE   MAE    SMAPE      RMSE   MAE    SMAPE
Means
GLT only (L = 2)                  0.0508 0.0109 1.2046     0.0244 0.0135 1.3423
GLT only (L = 1)                  0.0516 0.0125 1.2102     0.0256 0.0145 1.3446
GLT only (L = 3)                  0.0508 0.0108 1.2093     0.0241 0.0134 1.3413
w/o. inner \(W_{att}\)            0.0491 0.0107 1.2042     0.0225 0.0118 1.2415
w/o. GLT guided learning loss     0.0518 0.0126 1.2054     0.0243 0.0126 1.2568
w. subset final forecasts         0.0510 0.0117 1.2091     0.0342 0.0203 1.3221
w. \(\eta\) = 0.25                0.0525 0.0107 1.2034     0.0226 0.0116 1.2494
w. \(\eta\) = 0.75                0.0527 0.0107 1.2045     0.0223 0.0117 1.2447
w/o. adaptive \(\alpha\)          0.0511 0.0117 1.2082     0.0232 0.0128 1.2457
w. mean loss                      0.0505 0.0113 1.2039     0.0226 0.0123 1.2463
w. vol. loss                      0.1420 0.0479 1.5869     0.1594 0.1327 1.8501
w. corr. loss                     1.4864 1.4801 1.9908     0.1154 0.0790 1.7290
GLAM                              0.0491 0.0107 1.2031     0.0223 0.0116 1.2408
Volatilities
GLT only (L = 2)                  0.1492 0.0418 0.6196     0.0854 0.0528 0.6616
GLT only (L = 1)                  0.1495 0.0420 0.6194     0.0863 0.0534 0.6626
GLT only (L = 3)                  0.1482 0.0413 0.6199     0.0854 0.0523 0.6613
w/o. inner \(W_{att}\)            0.1435 0.0414 0.6117     0.0836 0.0509 0.6566
w/o. GLT guided learning loss     0.1450 0.0425 0.6199     0.0844 0.0520 0.6567
w. subset final forecasts         0.1446 0.0421 0.6125     0.1042 0.0887 0.8508
w. \(\eta\) = 0.25                0.1450 0.0414 0.6123     0.0835 0.0508 0.6569
w. \(\eta\) = 0.75                0.1450 0.0414 0.6123     0.0835 0.0508 0.6559
w/o. adaptive \(\alpha\)          0.1445 0.0424 0.6127     0.0843 0.0524 0.6643
w. mean loss                      0.3040 0.0692 0.8391     0.1423 0.0750 0.8220
w. vol. loss                      0.1436 0.0414 0.6197     0.0836 0.0508 0.6558
w. corr. loss                     1.0126 0.9547 1.9990     0.1426 0.0719 0.8651
GLAM                              0.1435 0.0414 0.6117     0.0835 0.0507 0.6556
Correlations
GLT only (L = 2)                  0.5036 0.4264 1.5320     0.4914 0.4185 1.5392
GLT only (L = 1)                  0.5033 0.4265 1.5376     0.4919 0.4190 1.5392
GLT only (L = 3)                  0.5046 0.4266 1.5382     0.4912 0.4188 1.5381
w/o. inner \(W_{att}\)            0.4050 0.3275 1.1371     0.4199 0.3555 1.2612
w/o. GLT guided learning loss     0.4050 0.3271 1.1337     0.4193 0.3461 1.2382
w. subset final forecasts         0.4030 0.3257 1.1235     0.4452 0.3780 1.3376
w. \(\eta\) = 0.25                0.4038 0.3260 1.1496     0.4198 0.3466 1.2359
w. \(\eta\) = 0.75                0.4037 0.3259 1.1491     0.4169 0.3437 1.2336
w/o. adaptive \(\alpha\)          0.4035 0.3268 1.1231     0.4173 0.3452 1.3307
w. mean loss                      0.5223 0.4437 1.9759     0.5110 0.4350 1.9537
w. vol. loss                      0.5224 0.4437 1.9797     0.5108 0.4348 1.9576
w. corr. loss                     1.2713 1.1697 1.6481     0.4175 0.3442 1.2418
GLAM                              0.4025 0.3248 1.1221     0.4169 0.3437 1.2332

  • Lower better for all metrics. Best model(s) in bold; second-best model(s) underlined.

Table 4. Ablation Studies

When we do not capture heterogeneous network information and exclude the HNE module, i.e., (GLT only (L = 2)), the drop in performance for the correlation forecasting task is the most significant, which highlights the importance of capturing heterogeneous network information effectively. Nonetheless, we can also see that the GLT module alone (with L = 2 as set in the GLAM model) is already able to outperform most of the baselines on the tasks of forecasting means and volatilities. We observe similar changes in performance when we do not utilize \(W_{att}\) (w/o. inner \(W_{att}\)) as the inner weight when forecasting correlations. While the impact of not utilizing \(W_{att}\) is not significant for the tasks of forecasting means and volatilities, there is a material drop in performance on the task of forecasting correlations.

We also explore the effects of changing the number of layers in GLT, i.e., (GLT only (L = 1) and GLT only (L = 3)). In general, increasing the number of layers leads to better performance. However, the selected hyper-parameter of L = 2 for GLT in GLAM achieves a good balance between model complexity and performance.

The drop in performance when we exclude the guided learning losses (w/o. GLT guided learning loss), i.e., excluding \(\mathcal {L}_{aux}\) from the training objective, is more apparent for the tasks of forecasting means and volatilities. This demonstrates the importance of the proposed approach of using intermediate forecasts for early guidance when learning GLT parameters, which enables GLT to focus on learning the relevance of global news information by utilizing significant ex-post responses of stock prices to news. An alternative approach would be to compute the losses for the final forecasts only on the subset of stocks with such significant ex-post responses (w. subset final forecasts). However, such an approach leads to poorer performance, particularly for the IN-NA dataset. This could be due to the generally more volatile price movements of stocks listed on the NASDAQ market.

We also explore alternative hyper-parameter settings for \(\eta\). While the differences in performance for different \(\eta\) (w. \(\eta\) = 0.25 and \(\eta\) = 0.75), are not significant, the chosen \(\eta\) = 0.5 for GLAM generally gives the best performance across most metrics. We also explore not using the \(\alpha\) parameter to adaptively balance between the different losses (w/o. adaptive \(\alpha\)) and find that it leads to worse performance across most metrics.

When we vary the multitask aspect of GLAM by training on mean, volatility, or correlation forecast losses only (i.e., w. mean loss only, w. vol. loss only, w. corr. loss only), we see significant drops in performance, even on tasks that correspond to the training loss, e.g., performance of mean forecasts when we train only on mean loss is poorer than when we train GLAM with multiple tasks.

In general, we see that the key features of GLAM work together to enable it to achieve the best performance on the multiple tasks. The GLT module and using intermediate forecasts of means and volatilities for early guidance when learning GLT parameters improves forecasts of means and volatilities, while the HNE module and utilizing \(W_{att}\) as an inner weight for correlation forecasts improves performance for the correlation forecasts.


7 APPLICATION CASE STUDIES

In this section, we use model forecasts for important investment and risk management applications to evaluate the quality of forecasts.

7.1 Portfolio Allocation

Investment portfolio allocation is an important task for many financial institutions. It aims to determine the proportion of capital invested in each stock (also known as an asset) in a portfolio by finding an optimal set of weights \(\mathbb {W}\), such that portfolio returns are maximized while portfolio risk is minimized. In this article, we adopt the risk aversion formulation [15] of the classical mean-variance risk minimization model by Markowitz [35], which expresses portfolio return and risk as the mean (\(\mu\)) and co-variances (\(\Sigma\)) of returns, respectively. Under the risk aversion formulation, the model is re-formulated to maximize the risk-adjusted portfolio return by optimizing the asset allocation \(\mathbb {W}\), a \(|V|\)-dimensional vector:

(16) \(\begin{equation} \max _{\mathbb {W}}\ \mathbb {W}^\intercal \tilde{\mu } - \frac{\lambda }{2}\, \mathbb {W}^\intercal \tilde{\Sigma }\, \mathbb {W}, \end{equation}\)
subject to \(\mathbb {W}^\intercal {\bf 1}=1\). \(\lambda\), known as the Arrow-Pratt risk aversion index, expresses an investor’s risk preferences and is typically in the range of 2 to 4 [15]. In our experiments, we set \(\lambda =2\). We observe that higher \(\lambda\) values reduce returns across all models, but the relative differences in returns between models generally remain consistent. In this article, we use the forecasted means of asset returns for \(\mu\) and compute \(\Sigma\) with the forecasted volatilities and correlations of asset returns for the selected horizon period \([t,t+K^{\prime }]\), defined as follows: (17) \(\begin{equation} \tilde{\mu }= \hat{Y}^{returns}_{mean}(t), \end{equation}\) (18) \(\begin{equation} \tilde{\Sigma } = D(t) \cdot \hat{Y}^{returns}_{corr}(t) \cdot D(t), \end{equation}\) where \(D(t)\) is the \(|V| \times |V|\) diagonal (and thus symmetric) matrix with \(\hat{Y}^{returns}_{vol}(t)\) along the diagonal and 0 elsewhere. We choose to forecast correlations of asset returns over the selected horizon period \([t,t+K^{\prime }]\) instead of directly forecasting co-variances, as the co-variance matrix needs to be positive semi-definite (PSD) and invertible [14], which is important for applications such as portfolio optimization, and forecasting co-variances directly does not guarantee PSD. We instead forecast volatilities and correlations separately and compute the co-variance matrix using the forecasted volatilities and correlations.
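To make the optimization concrete, the following is a minimal sketch of how the allocation in Equations (16)–(18) can be solved in closed form under the budget constraint. The numbers (a hypothetical 3-stock universe with illustrative forecasts) are our own assumptions and are not taken from the paper.

```python
import numpy as np

lam = 2.0                               # Arrow-Pratt risk aversion index
mu = np.array([0.010, 0.015, 0.008])    # forecasted mean returns (Eq. 17), hypothetical
vol = np.array([0.020, 0.030, 0.015])   # forecasted volatilities, hypothetical
corr = np.array([[1.0, 0.3, 0.1],
                 [0.3, 1.0, 0.2],
                 [0.1, 0.2, 1.0]])      # forecasted correlations, hypothetical

D = np.diag(vol)
Sigma = D @ corr @ D                    # co-variance matrix as in Eq. (18)

# Maximize W'mu - (lam/2) W'Sigma W subject to W'1 = 1.  Setting the
# Lagrangian gradient to zero gives W = (1/lam) Sigma^{-1} (mu - gamma*1),
# with gamma chosen so that the weights sum to one.
ones = np.ones_like(mu)
Sinv = np.linalg.inv(Sigma)
gamma = (ones @ Sinv @ mu - lam) / (ones @ Sinv @ ones)
W = Sinv @ (mu - gamma * ones) / lam
```

At the optimum, the gradient \(\mu - \lambda \Sigma \mathbb {W}\) is a constant vector (the Lagrange multiplier times \({\bf 1}\)), which offers a quick correctness check.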

This application can be viewed as a predictive task, as we use the multimodal and network information (as applicable) from the window period \([t-K,t-1]\) to make forecasts of the mean (\(\tilde{\mu }\)) and co-variance (\(\tilde{\Sigma }\)) of asset returns over the future horizon \([t,t+K^{\prime }]\), which are in turn used to determine the asset allocation weights \(\mathbb {W}\). \(\mathbb {W}\) represents an investment portfolio with returns realized in this future horizon, defined as \(E^{real}=\mathbb {W}^\intercal R^{real}\), where \(R^{real}\) is a vector of realized percentage stock returns over the future horizon.

Given that the aim is to maximize portfolio returns while minimizing portfolio risk (volatility), we choose risk-adjusted realized portfolio returns over the future horizon \([t,t+K^{\prime }]\) as the evaluation metric, defined as: \(\tilde{E} = \frac{E^{real}}{\sigma ^{real}}\), where \(\sigma ^{real}\) is portfolio return volatility in the future horizon \([t,t+K^{\prime }]\). Portfolio return volatility is defined as one standard deviation of the portfolio returns over the future horizon \([t,t+K^{\prime }]\) and is computed as \(\sigma ^{real} = \sqrt {\mathbb {W}^\intercal \Sigma ^{real}\ \mathbb {W}}\), where \(\Sigma ^{real}\) are the co-variances of realized percentage stock returns over the same future horizon.
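As a small illustration of the evaluation metric, the following sketch computes \(\tilde{E}\) from hypothetical weights, realized returns, and realized co-variances (none of these values come from the paper):

```python
import numpy as np

W = np.array([0.5, 0.3, 0.2])                # allocation weights, hypothetical
R_real = np.array([0.012, -0.004, 0.009])    # realized % returns over [t, t+K'], hypothetical
Sigma_real = np.array([[4e-4, 1e-4, 5e-5],
                       [1e-4, 9e-4, 8e-5],
                       [5e-5, 8e-5, 2e-4]])  # realized co-variances, hypothetical

E_real = W @ R_real                          # realized portfolio return
sigma_real = np.sqrt(W @ Sigma_real @ W)     # realized portfolio volatility
E_tilde = E_real / sigma_real                # risk-adjusted realized return
```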

For this application, the datasets are similarly divided into non-overlapping training/validation/testing sets in the ratios 0.7/0.15/0.15, as described in Section 4.1, and we evaluate performance based on the average of the risk-adjusted realized portfolio returns (\(\tilde{E}\)) across future horizon periods in the testing set.

Table 5 depicts results for the IN-NY and IN-NA datasets for the portfolio allocation application. Portfolios constructed using GLAM’s forecasts achieve the highest average risk-adjusted returns on both datasets: 1.67% on IN-NY, 15% better than the second-highest result from KECE; and 2.32% on IN-NA, 56% better than the second-highest result from GRU-HGT. Baselines utilizing textual information or inter-company relationships (FAST, RSR, KECE, and GRU-HGT) generally perform better on this application, demonstrating the value of capturing textual and relational information for selecting optimal portfolios.

Table 5. Portfolio Allocation and VaR

| Model | \(\tilde{E}\) (IN-NY) | VaR Breaches (IN-NY) | % VaR Breaches (IN-NY) | \(\tilde{E}\) (IN-NA) | VaR Breaches (IN-NA) | % VaR Breaches (IN-NA) |
| --- | --- | --- | --- | --- | --- | --- |
| Vector AR | 0.61% | 22 | 11.6% | 0.40% | 19 | 10.1% |
| TST | 1.37% | 40 | 21.2% | 0.48% | 32 | 16.9% |
| HAN | 0.66% | 34 | 18.0% | 0.10% | 35 | 18.5% |
| SE | 1.32% | 28 | 14.8% | 0.95% | 28 | 14.8% |
| FAST | 0.64% | 36 | 19.1% | 1.26% | 7 | 3.7% |
| RSR | 1.42% | 46 | 24.3% | 1.21% | 8 | 4.2% |
| KECE | 1.45% | 59 | 31.2% | 1.21% | 12 | 6.4% |
| GRU-HGT | 1.10% | 30 | 15.9% | 1.49% | 36 | 19.1% |
| GLAM | 1.67% | 7 | 3.7% | 2.32% | 1 | 0.5% |

• Higher is better for average risk-adjusted percentage returns \(\tilde{E}\); lower is better for the number of VaR breaches (VaR Breaches) and the percentage of VaR breaches (% VaR Breaches). % VaR Breaches is computed by dividing the number of VaR breaches by the number of data samples in the testing dataset.

7.2 Value-at-Risk (VaR)

VaR [32] is a key measure of risk used in financial institutions for the measurement, monitoring, and management of financial risk. Financial regulators require important financial institutions such as banks to measure and monitor their VaR over a \(K^{\prime }\) = 10 day horizon and to maintain capital based on this VaR as a loss buffer. VaR measures the loss that an institution may face over the pre-defined horizon with a probability of \(p\%\). For example, if the 10-day 95% VaR is $1,000,000, there is a \(p\) = 5% probability of losses exceeding $1,000,000 over the 10-day horizon.

VaR can be computed as a multiple of the portfolio’s volatility: (19) \(\begin{equation} VaR(p) = - \phi ^{-1}(p) \times \sigma , \end{equation}\) where \(\sigma\) is the portfolio volatility, and \(\phi ^{-1}\) is the inverse cumulative distribution function of the standard normal distribution; for example, if \(p=5\%,\) then \(\phi ^{-1}(p)=-1.645\) and \(VaR(p)=1.645\,\sigma\). Whenever the realized portfolio loss (i.e., the negative of the realized portfolio return \(E^{real}\)) is greater than the forecasted VaR, i.e., \(-E^{real} \ge VaR(p)\), it is regarded as a VaR breach.
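The parametric VaR of Equation (19) and the breach test can be sketched as follows, using the standard library's NormalDist for the inverse CDF; the volatility value is hypothetical.

```python
from statistics import NormalDist

p = 0.05
sigma = 0.015                     # forecasted portfolio volatility, hypothetical
z = NormalDist().inv_cdf(p)       # phi^{-1}(0.05), approximately -1.645
var_p = -z * sigma                # VaR(p) = -phi^{-1}(p) * sigma, a positive loss level

def is_breach(realized_return, var):
    # A breach occurs when the realized loss exceeds the forecasted VaR.
    return -realized_return >= var
```

Note that \(\phi^{-1}(0.05)\) is negative, so the leading minus sign in Equation (19) makes \(VaR(p)\) a positive loss threshold.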

For this application, the portfolio is constructed at each timestep based on the approach described for the portfolio allocation application. This mimics a real-world scenario where financial institutions continually update their portfolios based on market conditions. To evaluate the models, we use the forecasted portfolio volatility \(\tilde{\sigma } = \sqrt {\mathbb {W}^\intercal \tilde{\Sigma }\ \mathbb {W}}\), where \(\tilde{\Sigma }\) is computed using the forecasted volatilities and correlations of asset returns as defined in Equation (18). Similar to the portfolio allocation application, this can also be viewed as a predictive task, as we use multimodal and network information (as applicable) from the window period \([t-K,t-1]\) to make forecasts over the future horizon \([t,t+K^{\prime }]\) and use these forecasts to determine the VaR over that horizon. We evaluate model performance by counting the total number of 95% VaR breaches, i.e., instances where the realized portfolio loss is greater than the forecasted VaR in the testing dataset (using the same training/validation/testing sets as described in Section 4.1). We choose the 95% VaR for our experiments, as it is a common confidence level used by banks to monitor their risks. Models that make accurate forecasts of VaR should have fewer VaR breaches.

Table 5 depicts results for the IN-NY and IN-NA datasets for the VaR application. We see that GLAM outperforms the baselines with significantly fewer VaR breaches. Similar to the portfolio allocation application, baselines that utilize textual information or inter-company relationships (SE, FAST, RSR, and KECE) generally perform better on this application.


8 DISCUSSION

Our experimental results demonstrate the value of the proposed time-sensitive guided global-local transformer in extracting relevant global information for forecasting on multiple tasks. The proposed GLAM model outperforms baselines such as SE and KECE, which also extract and utilize relevant global textual information. In the ablation studies, we observe that even without the heterogeneous network information and the HNE module, the GLT module alone already performs better than the baselines on the tasks of forecasting means and volatilities. Further, we show that isolating and utilizing significant ex-post stock price responses to global textual information in the window period improves the extraction of relevant global textual information. We also demonstrate the value of capturing heterogeneous network relationships: using a learned set of attention weights \(W_{att}\) enables heterogeneous inter-company relationships to be captured and improves performance on the task of forecasting correlations.

We show that the proposed model features are valuable in investment and risk management applications, which differ from the more common and simpler task of forecasting stock prices or returns for trading decisions. GLAM forecasts the price dynamics of a portfolio of multiple stocks over a longer future horizon, i.e., expected returns, volatilities, and correlations of stocks, which enable investment and risk managers to make effective longer-term decisions. Importantly, designing a model for a multivariate multitask setting has further advantages: complementary information from other variables and related tasks can be used to improve overall forecasting performance, and the risk of over-fitting on any one task is lowered.

The framework proposed in this article could potentially be extended to capture other information sources, such as other types of global and local information, e.g., local social media information such as tweets from a company’s social media account, and global economic indicators, e.g., gross domestic product of the countries of a company’s key markets; as well as other static and dynamic inter-company relationships (i.e., inter-company relationships captured at different timestamps), e.g., from domain experts, DBPedia, or GDELT.


9 CONCLUSION AND FUTURE WORK

In this article, we designed GLAM, a model that comprises a time-sensitive global-local transformer to learn relevant global online text information with local numerical information and sequentially encode such multimodal information, and an attention-based heterogeneous network encoder to leverage heterogeneous inter-company relationships. Auxiliary channels and an adaptive learning strategy are also utilized in GLAM to facilitate intermediate guided learning of the parameters of the time-sensitive global-local transformer and heterogeneous network encoder modules. The model performs strongly on three forecasting tasks and two real-world applications, demonstrating the value of the proposed model features and learning strategies. The datasets used are extensive and provide strong assurance of the validity of the results across different companies and textual information. Future work could extend GLAM to different types of global and local information, as well as other static and dynamic inter-company relationships.


REFERENCES

[1] Gary Ang and Ee-Peng Lim. 2021. Learning knowledge-enriched company embeddings for investment management. In ACM International Conference on AI in Finance.
[2] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. 2018. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. CoRR abs/1803.01271.
[3] Inci M. Baytas, Cao Xiao, Xi Zhang, Fei Wang, Anil K. Jain, and Jiayu Zhou. 2017. Patient subtyping via time-aware LSTM networks. In ACM International Conference on Knowledge Discovery and Data Mining (KDD'17).
[4] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In International Conference on Machine Learning (ICML'09).
[5] Tim Bollerslev. 1986. Generalized autoregressive conditional heteroskedasticity. J. Economet. 31, 3 (Apr. 1986), 307–327.
[6] Anastasia Borovykh, Sander Bohte, and Cornelis W. Oosterlee. 2017. Conditional time series forecasting with convolutional neural networks. In Lecture Notes in Computer Science/Lecture Notes in Artificial Intelligence, 729–730.
[7] Yukuo Cen, Xu Zou, Jianwei Zhang, Hongxia Yang, Jingren Zhou, and Jie Tang. 2019. Representation learning for attributed multiplex heterogeneous network. In ACM International Conference on Knowledge Discovery and Data Mining (KDD'19).
[8] Hao Chen, Keli Xiao, Jinwen Sun, and Song Wu. 2017. A double-layer neural network framework for high-frequency forecasting. ACM Trans. Manag. Inf. Syst. 7, 4.
[9] Jinyin Chen, Xuanheng Xu, Yangyang Wu, and Haibin Zheng. 2022. GC-LSTM: Graph convolution embedded LSTM for dynamic link prediction. Appl. Intell. 52, 7, 7513–7528.
[10] Eunsuk Chong, Chulwoo Han, and Frank C. Park. 2017. Deep learning networks for stock market analysis and prediction: Methodology, data representations, and case studies. Expert Syst. Applic. 83, 187–205.
[11] Xiao Ding, Yue Zhang, Ting Liu, and Junwen Duan. 2015. Deep learning for event-driven stock prediction. In International Joint Conference on AI (IJCAI'15).
[12] Yuxiao Dong, Nitesh V. Chawla, and Ananthram Swami. 2017. MetaPath2Vec: Scalable representation learning for heterogeneous networks. In ACM International Conference on Knowledge Discovery and Data Mining (KDD'17).
[13] Xin Du and Kumiko Tanaka-Ishii. 2020. Stock embeddings acquired from news articles and price history, and an application to portfolio optimization. In Annual Meeting of the Association for Computational Linguistics (ACL'20).
[14] Robert Engle. 2002. Dynamic conditional correlation: A simple class of multivariate generalized autoregressive conditional heteroskedasticity models. J. Bus. Econ. Statist. 20, 3, 339–350.
[15] F. J. Fabozzi, P. N. Kolm, D. A. Pachamanova, and S. M. Focardi. 2007. Robust Portfolio Optimization and Management. Wiley.
[16] Christos Faloutsos, Valentin Flunkert, Jan Gasthaus, Tim Januschowski, and Yuyang Wang. 2020. Forecasting big time series: Theory and practice. In The Web Conference.
[17] Fuli Feng, Xiangnan He, Xiang Wang, Cheng Luo, Yiqun Liu, and Tat-Seng Chua. 2019. Temporal relational ranking for stock prediction. ACM Trans. Inf. Syst. 37, 2, 27:1–27:30.
[18] Valentin Flunkert, David Salinas, and Jan Gasthaus. 2020. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. Int. J. Forecast. 36, 3, 1181–1191.
[19] C. Lee Giles, Steve Lawrence, and Ah Chung Tsoi. 2001. Noisy time series prediction using recurrent neural networks and grammatical inference. Mach. Learn. 44, 1/2, 161–183.
[20] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural message passing for quantum chemistry. In International Conference on Machine Learning (ICML'17).
[21] Luke B. Godfrey and Michael S. Gashler. 2018. Neural decomposition of time-series data for effective generalization. IEEE Trans. Neural Netw. Learn. Syst. 29, 7, 2973–2985.
[22] Ziniu Hu, Yuxiao Dong, Kuansan Wang, and Yizhou Sun. 2020. Heterogeneous graph transformer. In The World Wide Web Conference (WWW'20).
[23] Ziniu Hu, Weiqing Liu, Jiang Bian, Xuanzhe Liu, and Tie-Yan Liu. 2018. Listening to chaotic whispers: A deep learning framework for news-oriented stock trend prediction. In ACM International Conference on Web Search and Data Mining (WSDM'18).
[24] Weiwei Jiang. 2021. Applications of deep learning in stock market prediction: Recent progress. Expert Syst. Applic. 184, 115537.
[25] Seyed Mehran Kazemi, Rishab Goel, Sepehr Eghbali, Janahan Ramanan, Jaspreet Sahota, Sanjay Thakur, Stella Wu, Cathal Smyth, Pascal Poupart, and Marcus A. Brubaker. 2019. Time2Vec: Learning a vector representation of time. CoRR abs/1907.05321.
[26] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR'15).
[27] Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR'17).
[28] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2018. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In International Conference on Learning Representations (ICLR'18).
[29] Bryan Lim, Sercan Ömer Arik, Nicolas Loeff, and Tomas Pfister. 2021. Temporal fusion transformers for interpretable multi-horizon time series forecasting. Int. J. Forecast. 37, 4, 1748–1764.
[30] Bryan Lim and Stefan Zohren. 2021. Time series forecasting with deep learning: A survey. Philos. Trans. Roy. Soc. A 379, 2194, 20200209.
[31] Bryan Lim, Stefan Zohren, and Stephen Roberts. 2020. Recurrent neural filters: Learning independent Bayesian filtering steps for time series prediction. In International Joint Conference on Neural Networks.
[32] Thomas J. Linsmeier and Neil D. Pearson. 2000. Value at risk. Finan. Anal. J. 56, 2, 47–67.
[33] Yeqi Liu, Chuanyang Gong, Ling Yang, and Yingyi Chen. 2020. DSTP-RNN: A dual-stage two-phase attention-based recurrent neural network for long-term and multivariate time series prediction. Expert Syst. Appl. 143.
[34] Helmut Lütkepohl. 2011. Vector Autoregressive Models. Springer Berlin.
[35] Harry Markowitz. 1952. Portfolio selection. J. Finance 7, 1, 77–91.
[36] Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. 2020. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations (ICLR'20).
[37] Serkan Özen, Volkan Atalay, and Adnan Yazici. 2019. Comparison of predictive models for forecasting time-series data. In International Conference on Big Data Research.
[38] Leonardos Pantiskas, Kees Verstoep, and Henri E. Bal. 2020. Interpretable multivariate time series forecasting with temporal attention convolutional neural networks. In IEEE Symposium Series on Computational Intelligence.
[39] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Annual Conference on Neural Information Processing Systems (NIPS'19).
[40] Fotios Petropoulos et al. 2022. Forecasting: Theory and practice. Int. J. Forecast. (Jan. 2022).
[41] Yao Qin, Dongjin Song, Haifeng Chen, Wei Cheng, Guofei Jiang, and Garrison W. Cottrell. 2017. A dual-stage attention-based recurrent neural network for time series prediction. In International Joint Conference on AI (IJCAI'17).
[42] Syama Sundar Rangapuram, Matthias W. Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, and Tim Januschowski. 2018. Deep state space models for time series forecasting. In Annual Conference on Neural Information Processing Systems (NIPS'18).
[43] Ramit Sawhney, Arnav Wadhwa, Shivam Agarwal, and Rajiv Ratn Shah. 2021. FAST: Financial news and tweet based time aware network for stock trading. In Conference of the European Chapter of the Association for Computational Linguistics (EACL'21).
[44] Michael Sejr Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In The Extended Semantic Web Conference (ESWC'18).
[45] Petru Soviany, Radu Tudor Ionescu, Paolo Rota, and Nicu Sebe. 2022. Curriculum learning: A survey. Int. J. Comput. Vis. 130, 6, 1526–1565.
[46] José F. Torres, Dalil Hadjout, Abderrazak Sebaa, Francisco Martínez-Álvarez, and Alicia Troncoso. 2021. Deep learning for time series forecasting: A survey. Big Data 9, 1, 3–21.
[47] Granville Tunnicliffe Wilson. 2016. Time series analysis: Forecasting and control. J. Time Series Anal. 37.
[48] Rianne van den Berg, Thomas N. Kipf, and Max Welling. 2017. Graph convolutional matrix completion. CoRR abs/1706.02263.
[49] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Annual Conference on Neural Information Processing Systems (NIPS'17).
[50] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In International Conference on Learning Representations (ICLR'18).
[51] Renzhuo Wan, Shuping Mei, Jun Wang, Min Liu, and Fan Yang. 2019. Multivariate temporal convolutional network: A deep neural networks approach for multivariate time series forecasting. Electronics 8, 8.
[52] Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S. Yu. 2019. Heterogeneous graph attention network. In The World Wide Web Conference (WWW'19).
[53] Neo Wu, Bradley Green, Xue Ben, and Shawn O'Banion. 2020. Deep transformer models for time series forecasting: The influenza prevalence case. CoRR abs/2001.08317.
[54] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. 2018. Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning (ICML'18).
[55] Wentao Xu, Weiqing Liu, Chang Xu, Jiang Bian, Jian Yin, and Tie-Yan Liu. 2021. REST: Relational event-driven stock trend forecasting. In The World Wide Web Conference (WWW'21).
[56] Ikuya Yamada, Akari Asai, Jin Sakuma, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, and Yuji Matsumoto. 2020. Wikipedia2Vec: An efficient toolkit for learning and visualizing the embeddings of words and entities from Wikipedia. In Conference on Empirical Methods in NLP.
[57] Shaowei Yao, Tianming Wang, and Xiaojun Wan. 2020. Heterogeneous graph transformer for graph-to-sequence learning. In Annual Meeting of the Association for Computational Linguistics (ACL'20).
[58] Yoojeong Song, Jae Won Lee, and Jongwoo Lee. 2019. A study on novel filtering and relationship between input-features and target-vectors in a deep learning model for stock price prediction. Appl. Intell. 49, 3, 897–911.
[59] Bing Yu, Haoteng Yin, and Zhanxing Zhu. 2018. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In International Joint Conference on AI (IJCAI'18).
[60] George Zerveas, Srideepika Jayaraman, Dhaval Patel, Anuradha Bhamidipaty, and Carsten Eickhoff. 2021. A transformer-based framework for multivariate time series representation learning. In ACM International Conference on Knowledge Discovery and Data Mining (KDD'21).
[61] Ling Zhao, Yujiao Song, Chao Zhang, Yu Liu, Pu Wang, Tao Lin, Min Deng, and Haifeng Li. 2020. T-GCN: A temporal graph convolutional network for traffic prediction. IEEE Trans. Intell. Transport. Syst. 21, 9, 3848–3858.


Published in ACM Transactions on the Web, Volume 17, Issue 2 (May 2023), 170 pages. ISSN: 1559-1131, EISSN: 1559-114X. DOI: 10.1145/3589222.

Publisher: Association for Computing Machinery, New York, NY, United States.

Publication history: Received 15 December 2021; revised 28 May 2022; accepted 27 December 2022; online AM 4 January 2023; published 27 March 2023.
