
CICC | Machine Learning Series (1): Exploring Factor Construction Paradigms Using Deep Reinforcement Learning Models

Author: CICC Research
As an important branch of machine learning, reinforcement learning models are widely used across fields, from AlphaGo to ChatGPT. In finance, reinforcement learning has the additional advantage of not requiring input data to be independent and identically distributed. In this paper, the stock selection factors generated by combining a reinforcement learning model with a feature extraction module achieve good stock selection performance in multiple stock pools, and the model's performance shows low sensitivity to parameters and high out-of-sample stability.

Summary

Why try reinforcement learning models in quantitative investing

As one of the important branches of machine learning, reinforcement learning plays a role both in the LLM large language models widely discussed in recent years and in AlphaGo, which defeated the world champion at Go. Reinforcement learning has been shown to perform well on a variety of tasks across application scenarios. We believe reinforcement learning may work well in finance, especially in quantitative strategies, mainly because of four characteristics of reinforcement learning models: 1) it is suited to sequential decision-making tasks; 2) the input data does not need to follow the assumption of independent and identical distribution; 3) the current policy is continuously optimized through interactive exploration with the environment; and 4) the data does not need to be labeled.

The essence of factor construction: the organic combination of data and operators

Data + operator: Factor mining is essentially the search for combinations of data and operators, and the mining method can be divided into two types: manual mining or model mining. The factors we presented in our previous series of factor handbooks were all constructed manually according to a certain logic. Although manual construction is more deterministic than machine construction, it is, in theory, far less efficient than machine learning models.

Feature extraction module + reinforcement learning model: To obtain higher certainty in machine-mined factors, we combine a reinforcement learning model with a feature extraction module, construct a dataset containing 6 common daily price-volume features, and define a set of 22 operators and 19 constants. The feature extraction module extracts features from factor expressions, mainly through linear or nonlinear methods, while the reinforcement learning model is responsible for learning how to organically combine data features, operators, and constants to efficiently find reasonable factor paradigms.

TRPO has high out-of-sample stability

Under our testing framework, the reinforcement learning model performs significantly better than the genetic algorithm and traditional machine learning methods used as comparison benchmarks. Among them, the two combination schemes TRPO_LSTM and A2C_Linear perform well in the backtest on the CSI 1000 universe: the ICIR is about 0.90, the out-of-sample excess Sharpe ratio exceeds 1.1, and performance remains stable during the rapid market drawdown at the beginning of this year. In contrast, the net return curves of the two control methods showed a significant drawdown at the beginning of the year, with excess returns of less than 2%.

The stability of machine learning models has also been a key concern for investors. We fix the reinforcement learning model and the feature extraction module respectively, and compute the average ICIR and excess return of the synthetic factors in the out-of-sample backtest. The experimental results show that the synthetic factors obtained when the TRPO, A2C, and PPO models participate in factor paradigm mining have relatively stable ICIR performance, all exceeding 0.80. Among feature extraction modules, the factors output by models involving a Transformer have the best ICIR performance, reaching 0.79.

Why the TRPO model structure is relatively stable: 1) Compared with other reinforcement learning models, TRPO uses trust region optimization, limiting the step size of each policy update to ensure a smooth and stable policy improvement process. 2) TRPO adaptively adjusts the step size on each update to keep the policy within the trust region, so it is not particularly sensitive to the learning rate. 3) The objective function TRPO optimizes uses Generalized Advantage Estimation (GAE) to estimate the policy gradient and combines it with the value function estimate to reduce variance, which makes it less sensitive to noise and estimation error in the reward function.

Risks

The model is built on historical data, and there may be a risk of failure in the future.

Main text

Reinforcement learning in quantitative investing

Why choose reinforcement learning models

As one of the important branches of machine learning, reinforcement learning plays a role both in the LLM large language models widely discussed in recent years and in AlphaGo, which defeated the world champion at Go. Reinforcement learning has been shown to perform well on a variety of tasks across application scenarios. The specific principles of reinforcement learning are detailed in the original report.

One thing that is often overlooked when using traditional statistical and machine learning models is the assumptions about the data. For example, for models such as linear regression, logistic regression, naïve Bayes, and KNN, a fundamental assumption is that the input data are independent and identically distributed. For financial data, this is often an overly strict premise.

► Time correlation: Financial data is usually time series data, and there may be correlations between data at adjacent time points. For example, stock prices may exhibit some autocorrelation or correlation structure over a short period of time.

► Volatility clustering: Volatility in financial markets tends to cluster: periods of large price swings are grouped together rather than evenly distributed over time. This means that the volatility of financial data is not independent and identically distributed.

► Heteroscedasticity: Heteroscedasticity, which is common in financial data, manifests as different variances at different points in time. This violates the assumption of identical distribution, since the variance is not constant.

► Non-normal distribution: Much financial data does not follow a normal distribution, instead exhibiting skewed, fat-tailed, or other non-normal characteristics.

Because of these special properties, we need to carefully consider the suitability of financial data when using machine learning or deep learning models, rather than forcibly feeding the data directly into the model. Reinforcement learning, by contrast, does not require the input data to meet this requirement. In addition, reinforcement learning explores through trial-and-error interaction with the environment and continuously optimizes the current policy, a mode that has many similarities with the way quantitative strategies are updated and iterated.

We believe reinforcement learning may work well in finance, especially in quantitative strategies, mainly because of four characteristics of reinforcement learning models: 1) it is suited to sequential decision-making tasks; 2) the input data does not need to follow the assumption of independent and identical distribution; 3) the current policy is continuously optimized through interactive exploration with the environment; and 4) the data does not need to be labeled.

Figure 1: History of reinforcement learning development


Source: Wind, CICC Research

How to find factor construction paradigms

The essence of factor construction: the organic combination of data and operators

In this paper, we use reinforcement learning models to find effective factor construction paradigms, i.e., factor expressions built from price-volume data features and mathematical operators. In practice, we represent a factor paradigm as a tree: non-leaf nodes represent operators, their child nodes represent operands, and each node is called a token. Drawing on the idea of reverse Polish notation (RPN), the tree is stored as a post-order traversal sequence, which exploits RPN's advantages of being unambiguous and easy for computer programs to parse and evaluate. This task is also more interpretable than directly predicting returns.
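To make the post-order representation concrete, the following is a minimal, hypothetical sketch (not the report's implementation) of evaluating an RPN token sequence against price-volume features; the operator names, toy data, and example expression are illustrative only, not the report's 22-operator set.

```python
import numpy as np
import pandas as pd

# Hypothetical operator set for illustration; the report's actual 22 operators
# and 19 constants are not reproduced here.
BINARY_OPS = {
    "add": lambda x, y: x + y,
    "sub": lambda x, y: x - y,
    "div": lambda x, y: x / y,
}
UNARY_OPS = {
    "abs": lambda x: np.abs(x),
    "log": lambda x: np.log(x),
}

def eval_rpn(tokens, data):
    """Evaluate a factor expression stored as a post-order (RPN) token sequence.

    tokens: list of strings, e.g. ["close", "open", "sub", "volume", "div"]
    data:   dict mapping feature names to pandas Series (aligned by date/stock)
    """
    stack = []
    for tok in tokens:
        if tok in data:                        # data feature -> push onto the stack
            stack.append(data[tok])
        elif tok in UNARY_OPS:                 # unary operator -> pop one operand
            x = stack.pop()
            stack.append(UNARY_OPS[tok](x))
        elif tok in BINARY_OPS:                # binary operator -> pop two operands
            y, x = stack.pop(), stack.pop()
            stack.append(BINARY_OPS[tok](x, y))
        else:                                  # otherwise treat the token as a constant
            stack.append(float(tok))
    assert len(stack) == 1, "malformed expression"
    return stack[0]

# Example: (close - open) / volume on toy data
data = {name: pd.Series(np.random.rand(5) + 1.0) for name in ["close", "open", "volume"]}
factor = eval_rpn(["close", "open", "sub", "volume", "div"], data)
```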

Figure 2: How data and operators are combined - reverse Polish notation


Note: (a) an example of a factor paradigm; (b) the tree structure corresponding to the factor paradigm; (c) the result using reverse Polish notation (RPN), where BEG and SEP are sequence indicator tokens; (d) a step-by-step calculation of the alpha factor on an example time series

Source: "Generating Synergistic Formulaic Alpha Collections via Reinforcement Learning," Shuo Yu et al. (2023), CICC Research

At the model architecture level, this paper integrates the multi-factor mining task and the factor synthesis task in series into an end-to-end reinforcement learning framework, so as to exploit its powerful exploration ability. Specifically, the framework consists of two modules: an alpha generator based on reinforcement learning and an alpha combination model. The alpha generator mines factor paradigms and adds effective ones to the factor pool with random initial synthesis weights. The alpha combination model then linearly combines the factors in the pool and optimizes their weights via gradient descent. We backtest the factors output by the combination model and use the IC results as reward signals to train the reinforcement learning policy in the alpha generator with a policy gradient algorithm. This training structure and process prompts the alpha generator, over repeated rounds of training and optimization, to produce factors that improve the combination model, thereby enhancing overall predictive ability.
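The generate-combine-reward loop described above can be summarized with the rough sketch below. Every helper here is a hypothetical stand-in (random placeholders rather than the report's actual generator, operators, or loss); only the control flow mirrors the framework in Figure 3.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the report's components (illustration only).
def generate_expression(policy):
    """Alpha generator: sample an RPN token sequence from the current policy."""
    return ["close", "open", "sub"]                # placeholder expression

def evaluate_factor(expr, n_stocks=500):
    """Compute cross-sectional factor values (random placeholder here)."""
    return rng.normal(size=n_stocks)

def update_policy(policy, expr, reward):
    """Policy-gradient update of the generator (details omitted)."""
    return policy

future_returns = rng.normal(size=500)
policy, factor_pool, weights = None, [], []

for step in range(10):
    expr = generate_expression(policy)             # 1) mine a factor paradigm
    factor_pool.append(evaluate_factor(expr))
    weights.append(rng.normal() * 0.1)             # random initial synthesis weight

    # 2) Combination model: linear mix of the factor pool (gradient refinement
    #    of the weights is omitted here; a separate sketch appears later).
    composite = sum(w * f for w, f in zip(weights, factor_pool))

    # 3) The IC of the composite factor serves as the reward signal.
    reward = float(np.corrcoef(composite, future_returns)[0, 1])
    policy = update_policy(policy, expr, reward)
```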

Figure 3: Reinforcement learning factor mining framework


Note: (a) the alpha generator, which generates expressions and is optimized by a policy gradient algorithm; (b) the combination model, which maintains a weighted combination of the mined factors while providing the evaluation signal that guides the generator.

Source: "Generating Synergistic Formulaic Alpha Collections via Reinforcement Learning," Shuo Yu et al. (2023), CICC Research

Test Framework: Feature Extraction + Reinforcement Learning

The alpha generator consists of two core modules: the reinforcement learning module and the feature extraction module.

► Reinforcement learning module: Since the task in this paper is mining factor paradigms, the Markov decision process we model uses an action set defined in a discrete space, and each new action must be filtered against the current sequence so that only legal tokens can be chosen. We therefore mainly consider actor-critic reinforcement learning models together with an action-masking (Maskable) mechanism; a minimal sketch of the masking idea follows this list.

► Feature extraction module: The feature extraction module is responsible for converting the discrete token sequence, i.e., the factor expression, into a continuous abstract representation that serves as the input to the reinforcement learning network. The value network and the policy network of the reinforcement learning model share this input feature extraction module.
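As a minimal illustration of the action-masking idea in general (an assumption about how such a mechanism is commonly implemented, not the report's code), illegal tokens can be given effectively zero probability by pushing their logits to a large negative value before sampling:

```python
import numpy as np

def masked_sample(logits, legal_mask, rng=np.random.default_rng(0)):
    """Sample the next token with illegal tokens masked out.

    logits:     array of raw policy scores, one per token in the vocabulary
    legal_mask: boolean array, True where the token would keep the RPN
                sequence valid given the tokens generated so far
    """
    masked = np.where(legal_mask, logits, -1e9)    # illegal tokens -> ~zero probability
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

# Example: a 5-token vocabulary where only tokens 0 and 3 are currently legal
action = masked_sample(np.array([0.2, 1.5, -0.3, 0.8, 0.1]),
                       np.array([True, False, False, True, False]))
```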

Considering the interpretability of the synthetic factor, this paper adopts only the traditional linear combination scheme and does not introduce other machine learning/deep learning methods, so the technical details of this module are not repeated here; the meaning and derivation of the loss function are detailed in the appendix of the original report.
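For intuition only, here is a rough sketch of tuning linear combination weights by gradient descent so that the composite factor's IC with future returns increases; the report's actual loss function and its derivation are in the appendix of the original article, and this stand-in simply maximizes a Pearson IC.

```python
import torch

def optimize_weights(factors, returns, steps=200, lr=0.05):
    """Tune linear combination weights by gradient descent so that the
    composite factor's Pearson IC with future returns increases.

    factors: (n_samples, n_factors) tensor of standardized factor values
    returns: (n_samples,) tensor of next-period returns
    """
    w = torch.randn(factors.shape[1], requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        composite = factors @ w
        c = composite - composite.mean()
        r = returns - returns.mean()
        ic = (c * r).sum() / (c.norm() * r.norm() + 1e-8)
        loss = -ic                                 # maximize IC = minimize its negative
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()

# Toy usage: 500 samples, 5 candidate factors
# weights = optimize_weights(torch.randn(500, 5), torch.randn(500))
```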

Since the reinforcement learning model cannot directly read factor expressions in their discrete form, the feature extraction module extracts features from the expressions, mainly through linear or nonlinear methods, while the reinforcement learning model is responsible for learning how to organically combine data features, operators, and constants, i.e., for finding a reasonable policy for combining features and operators.
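A minimal sketch of one possible nonlinear feature extraction setup, assuming a hypothetical token vocabulary size and an LSTM encoder shared by the policy and value heads; the report's actual network choices and dimensions may differ.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 50   # illustrative: features + operators + constants + sequence markers

class SharedEncoder(nn.Module):
    """Encodes a discrete token sequence (a partial factor expression) into a vector."""
    def __init__(self, vocab_size=VOCAB_SIZE, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):                  # token_ids: (batch, seq_len)
        x = self.embed(token_ids)
        _, (h, _) = self.lstm(x)
        return h[-1]                               # (batch, hidden_dim)

encoder = SharedEncoder()                          # feature extractor shared by both heads
policy_head = nn.Linear(64, VOCAB_SIZE)            # logits over the token vocabulary
value_head = nn.Linear(64, 1)                      # state-value estimate

tokens = torch.randint(0, VOCAB_SIZE, (8, 10))     # a batch of partial expressions
state = encoder(tokens)
logits, value = policy_head(state), value_head(state)
```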

Figure 4: A combination of a feature extraction module and a reinforcement learning model, with the former extracting abstract features of factor expressions


Source: Wind, CICC Research

TRPO+LSTM: balancing return and stability

Backtest results: TRPO+LSTM performs well and is highly stable out of sample

The combination scheme that stands out in the out-of-sample backtest results on the CSI 1000 dataset is the TRPO_LSTM model. Its output synthetic factor has an average IC of 6.35%, a long-short return of 22.99%, an excess return of 7.83%, and an excess Sharpe ratio of 1.56. Compared with the A2C_Linear model, which performs better over the full sample, the TRPO_LSTM method, trained multiple times with different random-number initializations, has better average out-of-sample performance and stability. In addition, when computing correlation coefficients with common factors, the factor's cross-sectional correlation stays within 0.5.

Figure 5: Effectiveness test results of the composite factors from the reinforcement learning and feature extraction module combinations in the monthly backtest on the CSI 1000 universe


Note: 1) The sample interval is from 2021-03-01 to 2024-03-01; 2) the reinforcement learning and genetic algorithm models were each trained three times with different random-number initializations and the results averaged

Source: Wind, CICC Research

Reinforcement learning performance and transparency are better

Experimental comparison shows that the reinforcement learning model performs significantly better out of sample than the genetic algorithm and machine learning methods. Among them, the TRPO_LSTM and A2C_Linear combinations stand out in the average out-of-sample backtest results on the CSI 1000 universe: the ICIR is about 0.90, the excess Sharpe ratio exceeds 1.1, and the cumulative excess returns are 7.83% and 5.32% respectively. In contrast, the net return curves of the two control methods showed a significant drawdown at the beginning of the year, with excess returns of less than 2%.

Figure 6: Effectiveness test results of the composite factors of the reinforcement learning models (partial) and the control methods in the monthly out-of-sample backtest on the CSI 1000 universe


Note: 1) The sample interval is from 2021-03-01 to 2024-03-01; 2) the reinforcement learning and genetic algorithm models were each trained three times with different random-number initializations and the results averaged

Source: Wind, CICC Research

Parameter sensitivity analysis of the model

This section analyzes the sensitivity of factor performance to the combination of feature extraction module and reinforcement learning model. We fix the reinforcement learning model and the feature extraction module respectively and compute the average ICIR and excess return of the synthetic factors in the out-of-sample backtest. The experimental results show that the synthetic factors obtained when the TRPO, A2C, and PPO models participate in factor paradigm mining have relatively stable ICIR performance, all exceeding 0.80. Among feature extraction modules, the factors output by models involving a Transformer have the best ICIR performance, at 0.79. The specific test results are shown in the original report.

This section also explores parameter stability from the perspectives of the factor pool and the model hyperparameters of the TRPO_LSTM combination. Compared with other reinforcement learning models, TRPO's performance is more stable and less sensitive to parameter changes. We believe the main reasons, starting from the design principles of TRPO, are as follows:

► TRPO uses trust region optimization, limiting the step size of each policy update to ensure a smooth and stable policy improvement process and thereby reduce the risk of performance collapse caused by overly large updates.

► TRPO adaptively adjusts the step size (or learning rate) on each update to keep the policy within the trust region. Since it automatically adjusts the step size to satisfy the KL divergence constraint, the algorithm is not particularly sensitive to the learning rate.

► The objective function TRPO optimizes uses Generalized Advantage Estimation (GAE) to estimate the policy gradient and combines it with the value function estimate to reduce variance. This design makes it less sensitive to noise and estimation error in the reward function; the constrained objective and the GAE estimator are written out after this list.
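For reference, the trust region constraint and the GAE estimator mentioned above are usually written as follows (standard formulations from the literature, not reproduced from the original report):

\[
\max_{\theta}\ \mathbb{E}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\,\hat{A}_t\right]
\quad \text{s.t.} \quad
\mathbb{E}_t\!\left[D_{\mathrm{KL}}\!\left(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t)\,\middle\|\,\pi_\theta(\cdot \mid s_t)\right)\right] \le \delta_{\mathrm{KL}},
\]
\[
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\,\delta_{t+l},
\qquad
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),
\]

where \(\delta_{\mathrm{KL}}\) is the trust region radius, \(\gamma\) the discount factor, \(\lambda\) the GAE parameter, and \(V\) the value function whose estimate reduces the variance of the policy gradient.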

The reinforcement learning model used in this paper must always strike a balance between computational efficiency and storage overhead. While a larger hidden-layer dimension and a deeper network may indeed fit better, a complex network structure risks reduced computational efficiency and model overfitting. How to balance the two in practical applications is therefore a problem that cannot be ignored.

Figure 7: Main parameter settings and impact analysis of TRPO_LSTM model


Source: "Generating Synergistic Formulaic Alpha Collections via Reinforcement Learning," Shuo Yu et al. (2023), Wind, CICC Research

Article source:

This article is excerpted from: "Machine Learning Series (1): Exploring Factor Construction Paradigms Using Deep Reinforcement Learning Models" published on April 7, 2024

Xiaoxiao Zhou Analyst SAC License No.: S0080521010006 SFC CE Ref: BRA090

Zheng Wencai Analyst SAC License No.: S0080523110003 SFC CE Ref: BTF578

Chen Yiyun Contact SAC License No.: S0080122080368 SFC CE Ref: BTZ190

Junwei Liu Analyst SAC License No.: S0080520120002 SFC CE Ref: BQR365

