Generating Synergistic Formulaic Alpha Collections via Reinforcement Learning

In the field of quantitative trading, it is common practice to transform raw historical stock data into indicative signals for the market trend. Such signals are called alpha factors. Alphas in formulaic form are more interpretable and thus favored by practitioners concerned with risk. In practice, a set of formulaic alphas is often used together for better modeling precision, so we need to find synergistic formulaic alpha sets that work well together. However, most traditional alpha generators mine alphas one by one separately, overlooking the fact that the alphas would be combined later. In this paper, we propose a new alpha-mining framework that prioritizes mining a synergistic set of alphas, i.e., it directly uses the performance of the downstream combination model to optimize the alpha generator. Our framework also leverages the strong exploratory capabilities of reinforcement learning (RL) to better explore the vast search space of formulaic alphas. The contribution to the combination model's performance is assigned to be the return used in the RL process, driving the alpha generator to find better alphas that improve upon the current set. Experimental evaluations on real-world stock market data demonstrate both the effectiveness and the efficiency of our framework for stock trend forecasting. The investment simulation results show that our framework is able to achieve higher returns compared to previous approaches.


INTRODUCTION
Currently, it is almost a standard paradigm to transform raw historical stock data into indicative signals for the market trend in the field of quantitative trading [14]. These signal patterns are called alpha factors, or alphas for short [19]. Discovering alphas with high returns has been a popular topic among investors and researchers due to the close relationship between alphas and investment revenues.
The prevailing methods of discovering alphas can in general be divided into two groups, namely machine learning-based and formulaic alphas. Most recent research has focused on the former. These more sophisticated alphas are often obtained via deep learning models, e.g., using sequential models like LSTM [5], or more complex ones integrating non-standard data like HIST [23] and REST [24]. On the other end of the spectrum, we have the alphas that can be represented in simple formula forms. Such formulaic alphas are traditionally constructed by human experts using their domain knowledge and experience, often expressing clear economic principles. To name one example, [7] demonstrates 101 alpha factors tested on the US stock market. Recently, research has also been conducted on frameworks that generate such formulaic alphas automatically [3,9,10,27]. These approaches are able to find large numbers of new alphas rapidly without human supervision, while still maintaining relatively high interpretability compared to the more sophisticated machine learning-based alphas.
Although the existing approaches have achieved remarkable success, they still have disadvantages in different aspects. Machine learning-based alpha factors are inherently complex and sometimes require more complex data other than the price/volume features. In addition, although they are often more expressive, they often suffer from relatively low explainability and interpretability. As a result, when the performance of these "black box" models unexpectedly deteriorates, it is hard for human experts to tune the models accordingly. These algorithms are thus not favored under some circumstances due to concerns about risks. On the other hand, while formulaic alphas are more interpretable, previous research on this matter often focused on finding a single alpha factor that predicts well on its own. Nonetheless, it is often impossible to describe a complex and chaotic system such as the stock market with simple rules that human researchers can comprehend. As a compromise, a set of these alphas is often used together in practice, instead of using them individually. However, when several of these independently mined formulaic alphas are combined, the final prediction performance may not improve much because little consideration is given to the synergistic effect between factors (see Section 4.2.2 for detail). In addition, these alphas are often simple in their forms, and their underlying mechanisms are often quite understandable. Once they are released to the public and become well-known among practitioners, their performance may deteriorate rapidly [7].
Therefore, the question we are facing is: Are we able to find a way to automatically discover interpretable alpha factors, which work well with downstream predictive models, without suffering possible performance deterioration due to the alpha factors being widely known to the general public?
To solve the above challenge, we formulate a new research problem in this paper, which is to find synergistic formulaic alpha factor sets. Using raw stock price/volume data as the input, we aim to search for a set of formulaic alpha factors instead of individual ones. Recall that finding a single well-performing alpha on given data is already a hard problem to resolve since the search space of valid formulas is vast and hard to navigate. The search space for alpha mining is often even larger than that of a typical symbolic regression problem [13].
The most intuitive approach to this problem would be using genetic programming (GP), performing mutations on expression trees to generate new alphas. In fact, most previous work on this matter is based on GP [3,9,10,27], which is of course not a serendipitous choice since GP methods generally excel at such problems with large search spaces. However, GP algorithms often scale poorly due to the complexity of maintaining and mutating a huge population [13]. In addition, the main challenge remains that mining a set of synergistic alphas all at once is an even harder problem with a much larger search space, the scale of which makes it infeasible for most existing frameworks to solve.
Hence, previous works mostly tried to find ways to simplify the problem of alpha set mining, by mining alphas one by one and filtering out a subset of them with respect to some similarity metric. The mutual information coefficient (IC) between the pairs of alphas in the set is often employed as the similarity "metric" [3,10,27]. However, as we will demonstrate below, adding a new alpha that has a high mutual IC with the ones in an existing pool of alphas may still bring a non-negligible boost of performance to the combined result, and vice versa. This phenomenon still exists even when the combination model is set to be a simple linear regressor. Therefore, the traditional approach to determining whether a set of alphas could be synergistic does not line up with the expected outcome.
To tackle the challenge that GP methods could be inefficient at exploring the vast search space of formulaic alphas, our framework utilizes reinforcement learning (RL) for achieving better results in exploration. Combined with the strong expressiveness of deep neural networks, RL with its excellent exploratory ability plays a predominant role in numerous areas, including game playing [16], natural language processing [11], symbolic optimization [13], and portfolio management [22]. We implement a sequence generator with constraints to ensure valid formulaic alpha generation and employ a policy gradient-based algorithm to train the generator in the absence of a direct gradient. Since traditional mutual-IC filtering methods do not align well with the target of optimizing the combination model's performance, we propose to use that performance directly as the optimization objective of our alpha generator. Under this new optimization scheme, our generator is able to produce a synergistic set of alphas that better fits the mine-and-combine procedure. To evaluate our alpha-mining framework, we conduct extensive experiments over real-world stock data. Our experimental results demonstrate that the formulaic alpha sets generated by our framework perform better than those generated with previous approaches, as shown by both the prediction metrics and the investment simulations.
Our contributions can be summarized as follows.
• We propose a new optimization scheme that produces a set of alphas that suits downstream tasks better, regardless of what actual form the combination model takes.

PROBLEM FORMULATION

Alpha Factor
We consider a stock market with $n$ stocks in a period of $T$ trading days in total. On each trading day $t \in \{1, 2, \cdots, T\}$, each stock $i$ corresponds to a feature vector $x_{ti} \in \mathbb{R}^{m\tau}$, comprised of $m$ raw features such as opening/closing price in the recent $\tau$ days. Finally, we define an alpha factor $f$ as a function mapping the feature vectors of all stocks on a trading day, $X_t \in \mathbb{R}^{n \times m\tau}$, into alpha values $z_t = f(X_t) \in \mathbb{R}^{n}$. We will use the word "alpha" for both an alpha factor and its corresponding values in the following sections.

Alpha Factor Mining
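To measure the effectiveness of an alpha, we calculate the information coefficient (IC) between the true stock trend it aims to predict, $y_t \in \mathbb{R}^{n}$, and the factor values $f(X_t)$. We denote the daily IC function as $\sigma : \mathbb{R}^{n} \times \mathbb{R}^{n} \to [-1, 1]$, which is defined as the Pearson's correlation coefficient:

$$\sigma(u_t, v_t) = \frac{\sum_{i=1}^{n} (u_{ti} - \bar{u}_t)(v_{ti} - \bar{v}_t)}{\sqrt{\sum_{i=1}^{n} (u_{ti} - \bar{u}_t)^2 \, \sum_{i=1}^{n} (v_{ti} - \bar{v}_t)^2}} \tag{1}$$

Such a value can be calculated on every trading day between an alpha and the prediction target. For convenience, we denote the IC values between two sets of vectors averaged over all trading days as $\bar{\sigma}(u, v) = \mathbb{E}_t[\sigma(u_t, v_t)]$.
We use the average IC between an alpha and the return to measure the effectiveness of an alpha factor on a stock trend series $y = \{y_1, y_2, \cdots, y_T\}$:

$$\bar{\sigma}_y(f) = \bar{\sigma}(f(X), y) \tag{2}$$

As mentioned above, the output of a combination model can be seen as a "mega-alpha", mapping raw inputs into alpha values. Therefore, we denote the combination model as $c(X; \mathcal{F}, \theta)$, where $\mathcal{F} = \{f_1, f_2, \cdots, f_k\}$ is a set of alphas to combine, and $\theta$ denotes the parameters of the combination model. We would like the combination model to be optimal w.r.t. a given alpha set $\mathcal{F}$ on the training dataset $\mathcal{D}$, that is:

$$c^{*}(X; \mathcal{F}) = c(X; \mathcal{F}, \theta^{*}), \quad \text{where } \theta^{*} = \operatorname*{argmax}_{\theta} \bar{\sigma}_y(c(\cdot; \mathcal{F}, \theta)) \tag{3}$$

In conclusion, the task of mining a set of alphas can be defined as the optimization problem $\operatorname*{argmax}_{\mathcal{F}} c^{*}(\cdot; \mathcal{F})$.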
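The IC definitions above can be sketched in a few lines of standard-library Python. This is only an illustration on made-up numbers; the function names `pearson` and `mean_ic` and the toy data are assumptions introduced here, not part of the paper's implementation.

```python
import math

def pearson(u, v):
    """Pearson correlation between two equal-length cross-sections."""
    n = len(u)
    mu_u = sum(u) / n
    mu_v = sum(v) / n
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

def mean_ic(alpha_values, targets):
    """Daily IC averaged over trading days; inputs are lists of per-day vectors."""
    daily = [pearson(z_t, y_t) for z_t, y_t in zip(alpha_values, targets)]
    return sum(daily) / len(daily)

# Toy data: 3 trading days, 4 stocks (hypothetical numbers).
alpha_values = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], [1.0, 3.0, 2.0, 4.0]]
targets = [[2.0, 4.0, 6.0, 8.0], [1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0]]
ic = mean_ic(alpha_values, targets)
```

In this toy case the alpha is perfectly aligned with the target on two days and perfectly reversed on one, so the averaged IC is much lower than the best single-day value.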

Formulaic Alpha
Formulaic alphas are expressed as mathematical expressions, consisting of various operators and the raw input features mentioned before. Some examples of the operators are the elementary functions (like "+" and "log") operated on one-day data, called cross-section operators, and operators that require data from a series of days, called time-series operators (e.g. "Min(close, 5)" gives the lowest closing price of a stock in the recent 5 days). A list of all the operators used in our framework is given in Appendix A.
Such formulas can be naturally represented by an expression tree, with each non-leaf node representing an operator, and children of a node representing the operands. To generate such an expression, our model represents the expression tree by its post-order traversal, with the children's order also defined by the traversal order. In other words, the model represents a formula as its reverse Polish notation (RPN). It is easy to see that such notation is unambiguous since the arities of the operators are all known constants. See Figure 1 for an example of a formulaic alpha expression together with its corresponding tree and RPN representations.
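To make the tree/RPN correspondence concrete, the following sketch converts a nested-tuple expression tree to RPN by post-order traversal and rebuilds it with a stack. The operator subset in `ARITY`, the tuple encoding, and the example expression are assumptions chosen for illustration; time-series operators and their time-delta arguments are left out.

```python
# Operator arities (hypothetical subset; the full operator list is in Appendix A).
ARITY = {"Add": 2, "Sub": 2, "Log": 1}

def to_rpn(node):
    """Post-order traversal: emit the operands first, then the operator."""
    if isinstance(node, tuple):
        op, *children = node
        tokens = []
        for child in children:
            tokens.extend(to_rpn(child))
        return tokens + [op]
    return [node]  # leaf: a feature name or a constant

def from_rpn(tokens):
    """Rebuild the tree with a stack; fixed arities make the parse unambiguous."""
    stack = []
    for tok in tokens:
        if tok in ARITY:
            k = ARITY[tok]
            if len(stack) < k:
                raise ValueError("operator is missing operands")
            args = stack[-k:]
            del stack[-k:]
            stack.append((tok, *args))
        else:
            stack.append(tok)
    if len(stack) != 1:
        raise ValueError("not a well-formed RPN sequence")
    return stack[0]

tree = ("Sub", ("Log", "$close"), ("Add", "$open", 0.5))
rpn = to_rpn(tree)
```

Round-tripping `from_rpn(to_rpn(tree))` returns the original tree, which is the sense in which the notation is unambiguous.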

METHODOLOGY
As illustrated in Figure 2, our alpha-mining framework consists of two main components: 1) the Alpha Combination Model, which combines multiple formulaic alphas to achieve optimal performance in prediction, and 2) the RL-based Alpha Generator, which generates formulaic alphas in the form of a token sequence. The performance of the Alpha Combination Model is used as the reward signal to train the RL policy in the Alpha Generator using policy gradient-based algorithms, such as PPO [17]. Repeating this process, the generator is continuously trained to generate alphas that boost the combination model, thereby enhancing the overall predictive power.

Alpha Combination Model
Considering the interpretability of the combined "mega-alpha", the combination model itself should also be interpretable. In this paper, we use a linear model to combine the alphas.
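The values evaluated from different alphas have drastically different scales, which might cause problems in the following optimization steps. To counter this effect, we centralize and normalize the alpha values with their average and standard deviation. Since Pearson's correlation coefficient is invariant up to linear transformation, this transformation does not affect the performance of the alphas when they are considered separately. Formally, we introduce a normalization operator $\mathcal{N}$ that transforms a vector such that its elements have a mean of 0 and the vector has a length of 1:

$$\mathcal{N}(u) = \frac{u - \bar{u}}{\lVert u - \bar{u} \rVert}, \tag{4}$$

where $\bar{u}$ is the mean of the elements of $u$, subtracted from every element. We will omit explicitly writing the $\mathcal{N}$ operator for simplicity. For the rest of this paper, we will assume that all the $f(X_t)$ evaluations and the targets $y_t$ are normalized to have a mean of 0 and a length of 1 before subsequent computations. In other words, treat $f$ as $\mathcal{N} \circ f$ and $y_t$ as $\mathcal{N}(y_t)$.
Given a set of $k$ alpha factors $\mathcal{F} = \{f_1, f_2, \cdots, f_k\}$ and their weights $w = (w_1, w_2, \cdots, w_k)$, the combination model is the weighted sum

$$z_t = c(X_t; \mathcal{F}, w) = \sum_{j=1}^{k} w_j f_j(X_t). \tag{5}$$

We define the loss of the combination model as the mean squared error (MSE) between model outputs and true stock trend values:

$$\mathcal{L}(w) = \frac{1}{nT} \sum_{t=1}^{T} \lVert z_t - y_t \rVert^2. \tag{6}$$

To simplify the calculation of alpha combination, we have:

Theorem 3.1. Let $\mathcal{F}$ be a set of $k$ alphas and $w$ be their respective weights. The MSE loss $\mathcal{L}(w)$ can be represented as:

$$\mathcal{L}(w) = \frac{1}{n} \left( 1 - 2 \sum_{i=1}^{k} w_i \bar{\sigma}_y(f_i) + \sum_{i=1}^{k} \sum_{j=1}^{k} w_i w_j \bar{\sigma}(f_i(X), f_j(X)) \right). \tag{7}$$

The proof of this theorem is provided in Appendix B. Notice that there is no $z_t$ term on the RHS of Equation 7. Once we have obtained $\bar{\sigma}_y(f)$ for each alpha $f$ and their pairwise mutual correlations $\bar{\sigma}(f_i(X), f_j(X))$, we can then calculate the loss $\mathcal{L}(w)$ solely using these terms, saving time on calculating the relatively large $z_t$ in each gradient descent step.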
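The identity in Theorem 3.1 can be checked numerically. The sketch below, using only the standard library, assumes random Gaussian toy data, arbitrary weights, and the symbols $n$, $T$, $k$, $w$ for the number of stocks, days, alphas, and the weights; it computes the MSE once directly and once from averaged ICs alone.

```python
import math
import random

def normalize(u):
    """Center u and scale it to unit Euclidean length (the operator N)."""
    mean = sum(u) / len(u)
    centered = [x - mean for x in u]
    norm = math.sqrt(sum(x * x for x in centered))
    return [x / norm for x in centered]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

random.seed(0)
n, T, k = 8, 5, 3                      # stocks, days, alphas (toy sizes)
w = [0.4, -0.2, 0.7]                   # arbitrary combination weights
# alphas[j][t] is the normalized cross-section of alpha j on day t.
alphas = [[normalize([random.gauss(0, 1) for _ in range(n)]) for _ in range(T)]
          for _ in range(k)]
y = [normalize([random.gauss(0, 1) for _ in range(n)]) for _ in range(T)]

# Direct MSE: average squared error over all stocks and days.
direct = 0.0
for t in range(T):
    z_t = [sum(w[j] * alphas[j][t][i] for j in range(k)) for i in range(n)]
    direct += sum((z - target) ** 2 for z, target in zip(z_t, y[t]))
direct /= n * T

# IC-only form: for normalized vectors the inner product is the Pearson correlation.
ic_y = [sum(dot(alphas[j][t], y[t]) for t in range(T)) / T for j in range(k)]
ic_mut = [[sum(dot(alphas[i][t], alphas[j][t]) for t in range(T)) / T
           for j in range(k)] for i in range(k)]
via_ic = (1 - 2 * sum(w[j] * ic_y[j] for j in range(k))
          + sum(w[i] * w[j] * ic_mut[i][j] for i in range(k) for j in range(k))) / n
```

The two values agree up to floating-point error, and the second form never touches the per-stock predictions once the ICs are cached.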
Considering time and space complexity, it is impractical to combine all generated alphas together, because calculating the mutual correlation for each pair of factors requires $O(k^2)$ evaluations of mutual IC. This quadratic growth makes it expensive to apply the current procedure to a large number of alphas. However, a few dozen alphas suffice for practical uses. Beyond a certain point, more alphas do not bring much further gain in performance, following the law of diminishing returns. We will demonstrate this effect in Section 4.2.2.
After the alpha generator outputs a new alpha, the alpha is first added to the candidate alpha set and assigned a random initial weight. Gradient descent is then performed to optimize the weights with respect to the extended alpha set. We also set a threshold to limit the size of the alpha set, leaving only the principal alphas with the largest absolute weight. If the number of alphas in the extended set exceeds this threshold, the least principal alpha is removed from the set together with its corresponding weight. The pseudocode of the training procedure is shown in Algorithm 1; in each gradient descent step, it obtains $\bar{\sigma}_y(f)$ and $\bar{\sigma}(f(X), f'(X))$ from calculation or cache and calculates $\mathcal{L}(w)$ according to Equation 7.
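The procedure described above might be sketched as follows. This is not Algorithm 1 itself: the `pool` dictionary, the function names, the learning rate, the step count, and the range of the random initial weight are all assumptions made for illustration, and the constant $1/n$ factor of the loss is dropped because it only rescales the step size.

```python
import random

def fit_weights(w, ic_y, mut, lr=0.05, steps=500):
    """Plain gradient descent on the IC-only loss (constant 1/n factor dropped)."""
    k = len(w)
    w = list(w)
    for _ in range(steps):
        grad = [-2 * ic_y[i] + 2 * sum(w[j] * mut[i][j] for j in range(k))
                for i in range(k)]
        w = [w[i] - lr * grad[i] for i in range(k)]
    return w

def add_alpha(pool, new_ic_y, new_mut_row, capacity, rng):
    """Add one alpha, refit weights, drop the least principal alpha if over capacity.

    pool: dict with keys 'w', 'ic_y', 'mut' (a hypothetical container).
    new_mut_row: cached mutual ICs between the new alpha and each pooled alpha.
    """
    for row, m in zip(pool["mut"], new_mut_row):
        row.append(m)
    pool["mut"].append(list(new_mut_row) + [1.0])
    pool["ic_y"].append(new_ic_y)
    pool["w"].append(rng.uniform(-0.1, 0.1))      # random initial weight
    pool["w"] = fit_weights(pool["w"], pool["ic_y"], pool["mut"])
    if len(pool["w"]) > capacity:
        # The least principal alpha has the smallest absolute weight.
        drop = min(range(len(pool["w"])), key=lambda i: abs(pool["w"][i]))
        del pool["w"][drop]
        del pool["ic_y"][drop]
        del pool["mut"][drop]
        for row in pool["mut"]:
            del row[drop]
    return pool

rng = random.Random(0)
pool = {"w": [], "ic_y": [], "mut": []}
add_alpha(pool, 0.05, [], capacity=2, rng=rng)
add_alpha(pool, 0.04, [0.9], capacity=2, rng=rng)
add_alpha(pool, 0.001, [0.1, 0.2], capacity=2, rng=rng)
```

Because only cached IC values enter the update, adding an alpha costs one new row of mutual ICs rather than a pass over the raw data in every gradient step.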

Alpha Generator
The alpha generator models a distribution of mathematical expressions. As each expression can be represented as a symbolic expression tree, we use the reverse Polish notation (RPN) to represent it as a linear sequence, since traditional auto-regressive generators can only deal with sequences. To control and evaluate the generation process of valid expressions, we model the generation process as a non-stationary Markov Decision Process (MDP). We describe the components of the MDP in the following paragraphs. An overview of the MDP-based alpha generator is shown in Figure 3.

Tokens.
The token is an important abstraction in our framework. A token can be any of the operators, the features, or constant values. Table 1 shows some examples of such tokens. For the full list of operators, please refer to Appendix A; for the full list of features we have chosen, please refer to Section 4.1.1. The initial state is always BEG, so a valid state always starts with BEG and is followed by previously chosen tokens. Since we aim for interpretability of the alphas, and an overly long formula is less interpretable, we cap the length of the formulas at 20 tokens.

Action Space.
An action is a token that follows the current state (the generated partial sequence). An arbitrarily generated sequence is not guaranteed to be the RPN of an expression, so we only allow a subset of actions to be taken at a specific state to guarantee the well-formedness of the RPN sequence. Please refer to Appendix C for more details.

Dynamics.
Given a state and an action, we can obtain the next state deterministically. The next state is generated by taking the current state's corresponding sequence and appending the action token at the end.

Rewards and Returns.
The MDP does not give immediate rewards for partially formed sequences. At the end of each episode, if the final state is valid, the state will be parsed to a formulaic function and evaluated in the combination model shown in Algorithm 1.
To encourage our generator to generate novel alphas, we will then evaluate the new combination model with the new alpha added, and use the model's performance as the return of this episode. Since the reward varies together with the components of the alpha pool, the MDP is non-stationary. Contrary to common RL task settings, for alpha expression generation we do not necessarily want to penalize longer episodes (longer expressions). In fact, longer alphas that perform well are harder to find than shorter ones, due to exponential explosion of the search space. Consequently, we set the discount factor as $\gamma = 1$ (no discount).

Reinforcement Algorithm.
Based on the MDP defined above, we use Proximal Policy Optimization (PPO) [17] to optimize a policy $\pi_\theta(a_t \mid s_t)$ that takes a state as input and outputs a distribution over actions. An actual action will be sampled from the output distribution.
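PPO is an on-policy RL algorithm based on the trust region method. It proposed a clipped objective $L^{CLIP}$ as follows:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta) \hat{A}_t, \ \operatorname{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \right) \right],$$

where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$, $\epsilon$ is the clipping range, and $\hat{A}_t$ is an estimator of the advantage function at timestep $t$. Using the importance sampling mechanism, PPO can effectively take the biggest possible improvement while keeping the policy in a trust region that avoids accidental performance collapse.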
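For a single sampled timestep, the clipped PPO term can be written as a small function. This is a generic sketch of the standard PPO objective rather than the paper's training code; the name `clipped_surrogate` and the default clipping range of 0.2 are assumptions (the hyperparameters actually used are listed in Appendix D).

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# A large ratio with a positive advantage is capped by the clip ...
gain_capped = clipped_surrogate(1.5, 1.0)
# ... while a large ratio with a negative advantage keeps its full penalty.
loss_kept = clipped_surrogate(1.5, -1.0)
```

The asymmetry is the point of the construction: the objective stops rewarding moves that push the probability ratio outside the trust region, but it never hides a move that makes things worse.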
Since our MDP has complicated rules for the legality of actions, an action sampled from the full discrete action distribution predicted by the learned policy is likely to be invalid as mentioned in Section 3.2.3.We adopt the Invalid Action Masking mechanism [6] to mask out invalid actions and just sample from the set of valid actions.
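Invalid action masking can be illustrated by setting the logits of illegal tokens to negative infinity before the softmax, so that they receive exactly zero probability. The sketch below is a minimal standard-library version; the function names and the example logits are assumptions, and a real implementation would apply the same idea to the policy network's output tensor.

```python
import math
import random

def masked_probs(logits, valid):
    """Softmax restricted to valid actions; invalid ones get probability 0."""
    if not any(valid):
        raise ValueError("at least one action must be valid")
    masked = [l if ok else -math.inf for l, ok in zip(logits, valid)]
    top = max(masked)
    exps = [math.exp(m - top) for m in masked]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(logits, valid, rng):
    """Draw one action index from the masked distribution."""
    probs = masked_probs(logits, valid)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = masked_probs([1.0, 2.0, 3.0], [True, False, True])
rng = random.Random(0)
samples = [sample_action([1.0, 2.0, 3.0], [True, False, True], rng)
           for _ in range(200)]
```

Since the masked entries have zero probability, the sampler cannot emit a token that would break the RPN sequence, and the remaining probabilities are renormalized over the legal tokens only.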

Network Architecture
The PPO algorithm requires the agent to have a value network and a policy network. Under our experiment settings, the two networks share a base LSTM feature extractor that converts token sequences into dense vector representations. Separate value and policy "heads" are attached after the LSTM. The values of hyperparameters are given in Appendix D.

Training with policy gradient-based methods
For the task of alpha mining, we do not require the agent to achieve relatively high average returns in each episode, but place more importance on the trajectories the agent takes in the whole training process. For this reason, we maintain a pool of alphas without resetting between episodes. We run the alpha generation procedure mentioned in Section 3.2 and optimize the alpha combination model according to Section 3.1 repeatedly. In this way, we train the policy to continuously generate novel alpha factors that bring improvement to the overall prediction performance. The proposed alpha mining process is shown in Algorithm 2. Our implementation is publicly available.

EXPERIMENTS
Our experiments are designed to investigate the following questions:
• Q1: How does our proposed framework compare to prior alpha mining methods?
• Q2: How well does our model scale as the alpha set size increases?
• Q3: Compared to the more commonly used mutual correlation, why is combination model IC a better metric?
• Q4: How does our framework perform under more realistic trading settings?
Since evaluating the frameworks using only the top-most alpha is extremely prone to overfitting on the training data, we also constructed alpha sets with the ones generated by the two single alpha generators. The same combination model is then applied to these alpha sets. Note that the generators still emit alphas in a one-by-one manner, and are agnostic to the combination model's performance. The first method to construct the set (top) is to simply select the top-$k$ alphas emitted by the generator with the highest IC on the training set. The second method (filter) is to select the top-$k$ performing alphas with a constraint that any pair of alphas from the set must not have a mutual IC higher than 0.7.
To better evaluate the model performance, we also compared our approach to several end-to-end machine learning models implemented in the open-source library Qlib [25]. The models receive 60 days' worth of raw features as the input, and are trained to predict the 20-day returns directly. Note that these models do not generate formulaic alphas. The hyperparameters of these models are set according to the benchmarks given by Qlib.
• XGBoost [2] is an efficient implementation of gradient boosting algorithms, which ensembles decision trees to predict stock trends directly.
• LightGBM [8] is another popular implementation of gradient boosting.
• MLP: A multilayer perceptron (MLP) is a type of fully-connected feedforward artificial neural network.
To demonstrate the effect caused by stochasticity in the training process, each experimental combination with a nondeterministic training process is evaluated with 10 different random seeds.

Evaluation Metrics.
We choose two metrics to measure the performance of our models as follows.
• IC, the Pearson's correlation coefficient shown in Eq. 1.
• Rank IC, the rank information coefficient. The rank IC tells how much the ranks of our alpha values are correlated with the ranks of future returns. Rank IC is defined by replacing Pearson's correlation coefficient with Spearman's correlation coefficient. The rank IC is just the IC of ranked data, defined as follows: $\sigma^{rank}(u, v) = \sigma(r(u), r(v))$, where $r(\cdot)$ is the ranking operator. Repeated values are assigned the average of the ranks they would otherwise have received. For both metrics, higher is better.
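The rank IC, including the average-rank treatment of ties, can be sketched with the standard library alone. The helper names below are assumptions; the computation is the usual Spearman correlation obtained by applying the Pearson correlation to ranked data.

```python
import math

def average_ranks(values):
    """Ranks from 1 to n; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1              # mean of positions i..j, 1-based
        for p in range(i, j + 1):
            ranks[order[p]] = avg
        i = j + 1
    return ranks

def pearson(u, v):
    n = len(u)
    mu_u, mu_v = sum(u) / n, sum(v) / n
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

def rank_ic(u, v):
    """IC of ranked data: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(u), average_ranks(v))

tied = average_ranks([10, 20, 20, 30])
monotone_rank_ic = rank_ic([1, 2, 3, 4], [1, 4, 9, 16])
monotone_ic = pearson([1, 2, 3, 4], [1, 4, 9, 16])
```

For a monotone but nonlinear relation such as the last example, the rank IC reaches its maximum while the plain IC stays below it, which is why the two metrics are reported side by side.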

Comparison across all alpha generators.
To answer Q1, we first compare our framework against several other alpha-mining methods and direct stock trend forecasting baselines, including PPO, GP, MLP, LightGBM, and XGBoost.Experiments are conducted on CSI300 and CSI500 stocks respectively.
The results are shown in Table 2. Our framework is able to achieve the highest IC and rank IC across all the methods we compare to.Note that the framework is only explicitly optimized against the IC metric.The non-formulaic alpha models come in the second tier.The baseline formulaic alpha generators perform poorly on the test set, especially the RL-based ones.The reinforcement learning agent, when optimized only against single-alpha IC, is prone to falling into local optima and thus overfitting on the training set, and basically stops searching for new alphas after a certain amount of steps.On the other hand, the GP-based methods maintaining a large population can avoid the same problem, but still cannot produce alphas that are synergistic when used together.The results

Comparison of formulaic generators with varying pool capacity.
To answer Q2, we study the four baseline formulaic alpha generators more extensively, and compare them to our proposed framework. The models are evaluated under pool sizes of $k \in \{1, 10, 20, 50, 100\}$. The results are shown in Figure 4.
Compared to the baseline method PPO_filter, our method directly uses the combination model's performance as the reward for newly generated alphas. This leads to a substantial improvement when the pool size increases, meaning that our method can produce alpha sets with great synergy. Our method scales with pool size: even when the pool is already large, it can still continuously find synergistic alphas that boost the performance over the existing pool. Conversely, the combined performance of the alphas generated by other approaches barely improves upon the case with just the top alpha, meaning that these alpha factors have poor synergy. Furthermore, the ability to control the reward of individual expressions under a certain alpha pool configuration is granted by the flexibility of the RL scheme. The GP scheme of maintaining a large population at the same time does not work well with fine-grained fitness value control.
Also, we can see that for the CSI500 dataset, GP_filter performs worse than GP_top on the IC metric when the pool size increases. This phenomenon demonstrates that the traditionally used mutual-IC filtering is not always effective, answering the question Q3.

Case Study
Table 3 shows an example combination of 10 alphas generated by our framework, evaluated on the CSI300 constituent stock set. Most of the alpha pairs in this specific set have mutual IC values over 0.7. Previous work [10, 27] considered this to be too high for the individual alphas to be regarded as "diverse", yet these alphas are able to work well in a synergistic manner. For example, the alphas #2 and #6 have a mutual IC of 0.9746, and are thus traditionally considered too similar to be useful cooperatively. However, the combination $0.09317 f_2 - 0.07163 f_6$ achieves an IC of 0.0458 on the test set, even higher than the sum of the respective ICs, showing the synergy effect. Also, although alpha #1 only has an IC of 0.0011, it still plays a vital role in the final combination. Once we remove alpha #1 from the combination and re-train the combination weights on the remaining set, the combination's IC drops to merely 0.0447. The two observations above show that neither the single alpha IC nor the mutual IC between alpha pairs is a good indicator of how well the combined alpha would perform, answering Q3.
One possible explanation for these phenomena is as follows. Although these alphas are traditionally considered similar due to the high mutual IC, some linear combinations of the alphas could point in a completely different direction from the original ones. Consider two unit vectors in a linear space. The more similar these two vectors are, the less similar either of them is to the difference between the two, since the difference vector approaches perpendicularity to either of the original vectors as the vectors get closer.
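The geometric argument can be checked with a small calculation. For two unit vectors $u$ and $v$ with cosine similarity $\rho$, the cosine between $u$ and $u - v$ is $(1 - \rho)/\sqrt{2 - 2\rho} = \sqrt{(1 - \rho)/2}$, which tends to 0 as $\rho$ tends to 1. The sketch below uses two-dimensional stand-in vectors and the mutual IC of 0.9746 quoted above purely as an illustrative value of $\rho$; it considers the plain difference $u - v$ rather than the fitted weights, and a single cross-section rather than a day-averaged IC.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

rho = 0.9746                               # illustrative similarity between u and v
u = [1.0, 0.0]
v = [rho, math.sqrt(1 - rho ** 2)]         # unit vector whose cosine with u is rho
diff = [a - b for a, b in zip(u, v)]

cos_u_diff = cosine(u, diff)               # similarity of u to the difference vector
closed_form = math.sqrt((1 - rho) / 2)     # the same quantity from the formula
```

Even though `u` and `v` are nearly parallel, the difference vector is close to orthogonal to `u`, so a signed combination of two highly correlated alphas can carry information that neither alpha exposes on its own.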

Investment Simulation
To demonstrate the effectiveness of our factors in more realistic investing settings, we use a simple investment strategy and conduct backtests in the testing period (2020/01/01 to 2021/12/31) on the CSI300 dataset. We use a simple top-$k$/drop-$n$ strategy to simulate the investment: On each trading day, we first sort the alpha values of the stocks, and then select the top $k$ stocks in that sorted list. We evenly invest across the $k$ stocks if possible, but restrict the strategy to only buy/sell at most $n$ stocks on each day to reduce excessive trading costs. In our experiment, $k$ is set to 50 and $n$ to 5.
We recorded the net worth of the respective strategies in the testing period, of which a line chart is shown in Figure 5. Although our framework does not explicitly optimize towards the absolute returns, the framework still performs well in the backtest. Our framework is able to gain the most profit compared to the other methods.
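One way to read the top-$k$/drop-$n$ rule is sketched below. This is an interpretation for illustration, not the backtest code used in the experiments: the function `rebalance`, the set-based holdings, and the tie to at most $n$ sells and $n$ buys per day are assumptions, and position sizing and transaction costs are ignored.

```python
def rebalance(holdings, scores, k, n):
    """One day's portfolio update under a top-k/drop-n rule (illustrative).

    holdings: set of stock ids currently held.
    scores: dict mapping stock id to today's alpha value.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    target = set(ranked[:k])
    # Sell the worst-scored holdings that fell out of the top k, at most n of them.
    to_sell = sorted((s for s in holdings if s not in target),
                     key=lambda s: scores.get(s, float("-inf")))[:n]
    kept = set(holdings) - set(to_sell)
    # Buy the best-scored top-k stocks not yet held, at most n, never exceeding k.
    room = max(min(n, k - len(kept)), 0)
    to_buy = [s for s in ranked[:k] if s not in kept][:room]
    return kept | set(to_buy)

scores = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.1, "E": 0.05}
day1 = rebalance({"A", "D", "E"}, scores, k=3, n=1)
day2 = rebalance(day1, scores, k=3, n=1)
```

With $n = 1$ the portfolio moves toward the top-3 set one swap per day, which is how the drop limit trades responsiveness for lower turnover.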

RELATED WORK
Formulaic alphas. The search space of formulaic alphas is enormous, due to the large number of possible operators and features to choose from. To the best of our knowledge, all notable former work uses genetic programming to explore this huge search space. [10] augmented the gplearn library with formulaic-alpha-specific time-series operators, upon which an alpha-mining framework is built. [9] further improved the framework to also mine alphas with nonlinear relations with the returns by using mutual information as the fitness measure. [27] used mutual IC to filter out alphas that are too similar to existing ones, improving the diversity of resulting alpha sets. PCA is carried out on the alpha values for reducing the algorithmic complexity of computing the mutual ICs, and various other tricks are also applied to aid the evolution process. AlphaEvolve [3] evolves new alphas upon existing ones. It allows combinations of much more complex operations (for example matrix-wise computations), and uses computation graphs instead of trees to represent the alphas. This leads to more sophisticated alphas and better prediction accuracy, although at the risk of lowering the alphas' interpretability. Mutual IC is also used as a measure of alpha synergy in this work.
Machine learning-based alphas. The development of deep learning in recent years has brought about various new ideas on how to accurately model stock trends. Early work on stock trend forecasting treats the movement of each stock as a separate time series, and applies time series models like LSTM [5] or Transformer [21] to the data. Specific network structures catered to stock forecasting, like the SFM [26] which uses a DFT-like mechanism, have also been developed. Recently, research has also been conducted on methods to integrate non-standard data with the time series. REST [24] fuses multi-granular time series data together with historical event data to model the market as a whole. HIST [23] utilizes concept graphs on top of the regular time series data to model shared commonness between future trends of various stock groups. One specific type of machine learning-based model is also worth mentioning. Decision tree models, notably XGBoost [2] and LightGBM [8], are often considered interpretable, and they could also achieve relatively good performance on stock trend forecasting tasks. However, whether a decision tree with extremely complex structure is considered "interpretable" is at least questionable. When these tree models are applied to raw stock data, the high dimensionality of input only exacerbates the aforementioned problem. Our formulaic alphas use operators that apply to the input data in a more structured manner, making them more easily interpretable by curious investors.
Symbolic regression. Symbolic regression (SR) concerns the problem of discovering relations between variables represented in closed-form mathematical formulas. SR problems differ from our problem setting in that there always exists a "ground-truth" formula that precisely describes the data points in an SR problem, while stock market trends are far too complex to be expressed in the space of formulaic alphas. Nevertheless, there remain similarities between the two fields since similar techniques can be used for the expression generator and the optimization procedure. [15] suggested using a custom neural network whose activation functions are symbolic operators to solve the SR problem. [13] proposed a novel symbolic regression framework based on an autoregressive expression generator. The generator is optimized using an augmented version of the policy gradient algorithm that values the top performance of the agent more than the average. [12] developed a method similar to [13], but also introduced GP into the optimization loop, seeding the GP population with RL outputs. [20] applied the language model pretraining scheme to symbolic regression, training a generative autoregressive "language model" of expressions on a large dataset of synthetic expressions.
Discussions. Although the term "formulaic alpha" is often tied to investing, the concept of simple and interpretable formulaic predictors that could be combined into more expressive models is not limited to quantitative trading scenarios. Our framework can be adapted to solve other time-series forecasting problems, for example, energy consumption prediction [4], anomaly detection [1], and biomedical settings [18]. In addition, we chose the linear combination model in this paper for its simplicity. Meanwhile, in theory, other types of interpretable combination models, for example, decision trees, can also be integrated into our framework. In that sense, providing these combination models with features expressed in relatively straightforward formulas might help provide investigators with more insights into how the models come to the final results.

CONCLUSION
In this paper, we proposed a new framework for generating interpretable formulaic alphas to aid investors in quantitative trading. We proposed to directly use the performance boost brought about by the newly added alpha to the existing alpha combination as the metric for alpha synergy. As a result, our framework can produce sets of alphas that cooperate satisfactorily with a combination model, regardless of the actual form of the combination model. For the model to explore the vast search space of formulaic alphas more effectively, we also formulated the alpha-searching procedure as an MDP and applied reinforcement learning techniques to optimize the alpha generator. Extensive experiments demonstrate that the performance of our framework surpasses that of all previous formulaic alpha-mining approaches, and that our method can also perform well under more realistic trading settings.

That is to say, the Pearson's correlation coefficient between two vectors equals the inner product of the two vectors after they are centralized and normalized.
Therefore the theorem can be proved as follows. Recall that   (  ) and   are normalized.
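The identity stated above, that Pearson's correlation coefficient equals the inner product of the centralized and normalized vectors, can be checked numerically. The following is a minimal sketch in plain Python; the helper names and the sample data are hypothetical, and "normalized" is read here as scaling to unit Euclidean norm, which is an assumption about the intended convention.

```python
import math

def pearson(u, v):
    """Textbook Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)

def centralize_and_normalize(u):
    """Subtract the mean, then scale to unit Euclidean norm (assumed convention)."""
    mean_u = sum(u) / len(u)
    centered = [a - mean_u for a in u]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]

def inner(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical sample data, used only to exercise the identity.
u = [1.0, 2.0, 4.0, 3.0, 7.0]
v = [2.0, 1.0, 5.0, 3.0, 6.0]

lhs = pearson(u, v)
rhs = inner(centralize_and_normalize(u), centralize_and_normalize(v))
assert math.isclose(lhs, rhs)
```

Under a different normalization (for example, unit variance instead of unit norm), the two sides would differ by a constant factor that depends only on the vector length.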

C.1 Formal Legality
An RPN can be built with a stack of expressions, constants, or raw features. The RPN building procedure follows the rules below, and actions that may violate these rules will be masked.
• TS (time-series) operators must take a time-delta (e.g., 10d for a time difference of 10 days) as their last parameter;
• Excluding the aforementioned time-delta, each operator must take enough expressions as operands, according to the arity of the operator (one for *-Unary, two for *-Binary);
• A multi-token expression should not be equivalent to a constant;
• The special SEP token (end of expression) is only allowed when the generated sequence is already a valid RPN.
For example, when the stack (state) is currently [$open, 0.5], we can choose the "Add" token (a binary operator), building the expression "Add($open, 0.5)". Meanwhile, the operator "Log" is not allowed here because "Log" would take "0.5" as its operand and "Log(0.5)" is a constant; similarly, the operator "TS-Mean" is also invalid because "Mean($open, 0.5)" is illegal, as 0.5 is not a time-delta.
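The masking rules above can be illustrated with a small stack-based check. This is only a sketch under stated assumptions, not the implementation used in the experiments: the stack is abstracted to the kind of each entry (`expr`, `const`, `delta`), the category names and the function `allowed_categories` are hypothetical, and the sketch additionally assumes that a time-delta may only be pushed when some TS operator could consume it next, and that SEP requires a single non-constant expression.

```python
EXPR, CONST, DELTA = "expr", "const", "delta"

def _unary_ok(operands):
    # One operand, and it must not be a bare constant (else the result is constant).
    return len(operands) >= 1 and operands[-1] == EXPR

def _binary_ok(operands):
    # Two operands, neither a time-delta, at least one of them non-constant.
    top2 = operands[-2:]
    return len(operands) >= 2 and DELTA not in top2 and EXPR in top2

def allowed_categories(stack):
    """Return the set of token categories that keep the partial RPN legal."""
    if stack and stack[-1] == DELTA:
        # A time-delta on top can only be consumed by a TS operator.
        below = stack[:-1]
        allowed = set()
        if _unary_ok(below):
            allowed.add("ts_unary")
        if _binary_ok(below):
            allowed.add("ts_binary")
        return allowed
    allowed = {"feature", "constant"}
    if _unary_ok(stack):
        allowed.add("unary")
    if _binary_ok(stack):
        allowed.add("binary")
    if _unary_ok(stack) or _binary_ok(stack):
        allowed.add("delta")  # assumption: push a delta only if a TS operator can follow
    if stack == [EXPR]:
        allowed.add("sep")  # the sequence is already one complete expression
    return allowed

# The state [$open, 0.5] from the example: "Add" (binary) is allowed,
# "Log" (unary) and "TS-Mean" (ts_unary) are masked.
print(sorted(allowed_categories([EXPR, CONST])))
# ['binary', 'constant', 'delta', 'feature']
```

Working on entry kinds rather than concrete tokens keeps the check independent of the operator list; a real generator would map each token in its vocabulary to one of these categories before applying the mask.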

C.2 Semantic Legality
Some expressions with correct forms might still fail to evaluate due to additional constraints imposed by the operators. For example, the logarithm operator cannot be applied to a non-positive value. This kind of semantic invalidity is not directly detected by the procedure mentioned in the last section. In our experiments, these expressions are given the reward of -1 (the minimum value of Pearson's correlation coefficient) to discourage the agent from generating these expressions.
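One way to realize this penalty is to wrap the evaluation step and fall back to -1 whenever it fails. The sketch below is an assumption about how such a wrapper could look: `safe_reward`, `log_alpha`, and the scoring callback are hypothetical names, and failure is detected through Python exceptions and non-finite values, which the text does not specify.

```python
import math

INVALID_REWARD = -1.0  # minimum value of Pearson's correlation coefficient

def safe_reward(evaluate_alpha, score):
    """Evaluate an alpha and score it, or return -1 if evaluation is invalid."""
    try:
        values = evaluate_alpha()
    except (ValueError, ZeroDivisionError):
        # e.g. math.log raises ValueError on a non-positive input
        return INVALID_REWARD
    if any(math.isnan(v) or math.isinf(v) for v in values):
        return INVALID_REWARD
    return score(values)

def log_alpha(xs):
    """Hypothetical alpha: element-wise logarithm of a raw feature."""
    return [math.log(x) for x in xs]

# The scoring callback is a stand-in that returns a fixed value for illustration.
reward_bad = safe_reward(lambda: log_alpha([1.0, 0.0, 2.0]), score=lambda v: 0.5)
reward_ok = safe_reward(lambda: log_alpha([1.0, 3.0, 2.0]), score=lambda v: 0.5)
print(reward_bad, reward_ok)  # -1.0 0.5
```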

D HYPERPARAMETERS
The LSTM feature extractor used in the RL agent has a 2-layer structure with a hidden layer dimension of 128. A dropout rate of 0.1 is used in the LSTM network. The separate value and policy heads are MLPs with two hidden layers of 64 dimensions. The PPO clipping range ε is set to 0.2.
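For reference, the settings listed above can be collected into a single configuration object. This is a minimal sketch: the class name `AgentConfig` and all field names are hypothetical, and only the numeric values come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentConfig:
    """Hyperparameters of the RL agent (field names are illustrative)."""
    lstm_layers: int = 2          # 2-layer LSTM feature extractor
    lstm_hidden_dim: int = 128    # hidden layer dimension of the LSTM
    lstm_dropout: float = 0.1     # dropout rate inside the LSTM
    head_hidden_layers: int = 2   # hidden layers in each of the value/policy MLPs
    head_hidden_dim: int = 64     # width of each hidden layer in the heads
    ppo_clip_range: float = 0.2   # PPO clipping range

config = AgentConfig()
print(config)
```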

Figure 1: (A) An example of a formulaic alpha. (B) Its equivalent expression tree. (C) Its reverse Polish notation (RPN). Note that BEG and SEP are sequence indicators mentioned later in our framework. (D) Step-by-step computation of this alpha on an example time series.

Figure 2: An overview of our alpha-mining framework. (A) An alpha generator that generates expressions, optimized via a policy gradient algorithm. (B) A combination model that maintains a weighted combination of principal factors and, in the meantime, provides evaluative signals to guide the generator.

Figure 3: An illustration of our alpha generation framework.

3.2.2 State Space. Each state in the MDP corresponds to a sequence of tokens denoting the currently generated part of the expression.

Figure 4: The results of the ablation study. A pool size of 1 refers to settings that only evaluate the top-most alpha without using a combination model.

Figure 5: Backtest results on CSI 300. The lines track the net worth of simulated trading agents utilizing the various alpha-mining approaches.

Table 1: Tokens used in our framework.

Table 2: Main results on CSI 300 and CSI 500. Values outside parentheses are the means, and values inside parentheses are the standard deviations across 10 runs.

Table 3: An example combination of 10 alphas.

Table 4: All the operators used in our framework. CS: cross-section, TS: time-series, U: unary, B: binary.

Operator | Category | Description
Ref(x, t) | TS-U | The expression x evaluated at t days before the current day.
Mean(x, t), Med(x, t), Sum(x, t) | TS-U | The mean/median/sum value of the expression x evaluated on the recent t days.
Std(x, t), Var(x, t) | TS-U | The standard deviation/variance of the expression x evaluated on the recent t days.
Max(x, t), Min(x, t) | TS-U | The maximum/minimum value of the expression x evaluated on the recent t days.
Mad(x, t) | TS-U | The mean absolute deviation E[|x − E[x]|] of the expression x evaluated on the recent t days.
Delta(x, t) | TS-U | The relative difference of x compared to t days ago, x − Ref(x, t).
WMA(x, t), EMA(x, t) | TS-U | Weighted moving average and exponential moving average of the expression x evaluated on the recent t days.
Cov(x, y, t) | TS-B | The covariance between two time series x and y in the recent t days.
Corr(x, y, t) | TS-B | The Pearson's correlation coefficient between two time series x and y in the recent t days.
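To make the time-series operator descriptions concrete, a few of them can be sketched on a plain list of daily values for a single stock. This is an illustration only: the function names, the sample series, and the convention that the last list element is the current day and that "the recent t days" includes the current day are all assumptions, not details taken from the framework.

```python
def ref(x, t):
    """Value of x at t days before the current day (last element = today)."""
    return x[-1 - t]

def mean(x, t):
    """Mean of x over the recent t days (assumed to include today, t >= 1)."""
    window = x[-t:]
    return sum(window) / len(window)

def mad(x, t):
    """Mean absolute deviation E[|x - E[x]|] over the recent t days."""
    window = x[-t:]
    m = sum(window) / len(window)
    return sum(abs(v - m) for v in window) / len(window)

def delta(x, t):
    """Difference between today's value and the value t days ago."""
    return x[-1] - ref(x, t)

# Hypothetical closing prices, oldest first.
close = [10.0, 11.0, 13.0, 12.0, 16.0]

print(ref(close, 2))    # 13.0
print(delta(close, 2))  # 3.0
print(mean(close, 3))
print(mad(close, 3))
```

A full implementation would apply these per stock and per day across the whole dataset; the scalar form here only shows what each operator computes at a single evaluation point.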