
Use machine learning models to build quantitative timing strategies (with full-process code)

Author: Prose Thinks with the Wind

0. A taste of machine learning

Humans generally discover patterns through induction: we summarize a large number of observed phenomena into a rule in the mind, and when a similar situation arises we can quickly make a prediction or judgment. Weather proverbs such as "when there's a morning glow, don't go out; when there's an evening glow, you can travel a thousand miles", "a timely snow promises a good harvest", and "when the dog sneezes, the sky will clear" are the crystallization of our ancestors' inductive reasoning, and they do not work every time.


Machine learning closely mimics this thinking process: historical data is fed into a model to train a mathematical model that can complete a specific task, and when new data appears, it is fed into the trained model, which outputs a prediction. In particular, machine learning has an advantage over the human brain in identifying nonlinear patterns.

Concretely, our task this time is to use a machine learning model to predict whether the CSI 300 index will rise or fall on the next trading day. The input data are the daily open/high/low/close quotes of the CSI 300 index, which we feed into a support vector machine (SVM) for training.

The dataset is the daily quotes of the CSI 300 index, which has traded for only about 4,000 days since its launch in 2005. In other words, there are only some 4,000 sample points, which is at best a "small sample" compared with millions of rows of "big data", and that makes it a very suitable application scenario for SVM.

Before deep learning (which generally needs big data to be "fed") fully emerged, SVM was one of the most fashionable machine learning methods of its time, thanks to its high prediction accuracy on small samples and its ability to solve nonlinear classification problems.

SVM is powerful yet not all that complicated in itself. Today I will skip the mathematical details and just explain what SVM is and what it is good for, so that nobody is scared off before we even start.

[Figure: maximum-margin hyperplane separating two classes, with the support vectors on the margin]

SVM was originally designed to solve binary classification problems and was later extended to multi-class problems. "Predicting whether the CSI 300 index rises or falls" is a typical binary classification problem: "up" is one class and "down" is the other.

SVM linearly separates the two classes of samples by finding a maximum-margin hyperplane (the black slanted line in the figure above) such that the distance from the nearest points of each class to this plane is as large as possible. Because the maximum-margin hyperplane depends only on the edge points of the two classes, such as the red and blue points crossed by the red and blue lines in the figure, these points are called support vectors, which is where the name "support vector machine" comes from.
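To make this concrete, here is a minimal sketch, on hypothetical toy data rather than the index data used later, that fits a linear SVM with scikit-learn and reads back the support vectors:

import numpy as np
from sklearn.svm import SVC

# hypothetical toy data: two linearly separable clusters
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 2.0, rng.randn(20, 2) + 2.0])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)
print(clf.support_vectors_)        # the edge points that pin down the margin
print(clf.coef_, clf.intercept_)   # with a linear kernel, the hyperplane w·x + b = 0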

But the real world is full of variety: some datasets are linearly separable and some are not. So what should we do when we encounter data that is not linearly separable?

[Figure: a kernel function mapping two-dimensional linearly inseparable data into three dimensions, where it becomes separable]

There is a way: SVM introduces kernel functions, which map low-dimensional inseparable data into a higher-dimensional space where it becomes linearly separable, as shown in the figure above, where two-dimensional inseparable data is mapped into three dimensions. Commonly used kernels are the linear kernel, the polynomial kernel, the Gaussian kernel (RBF kernel), and the sigmoid kernel.
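In scikit-learn these four kernels map one-to-one onto the kernel argument of SVC; a quick sketch of the options:

from sklearn.svm import SVC

svc_linear = SVC(kernel='linear')
svc_poly = SVC(kernel='poly', degree=3)     # polynomial kernel; degree is its order
svc_rbf = SVC(kernel='rbf', gamma='scale')  # Gaussian (RBF) kernel, the default
svc_sigmoid = SVC(kernel='sigmoid')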

Use machine learning models to build quantitative timing strategies (with full-process code)

In reality, however, because of noise and extreme sample points, a dataset may remain linearly inseparable in both low and high dimensions, so SVM introduces slack variables: the maximum-margin hyperplane is no longer required to separate the two classes perfectly, and some misclassification is allowed. SVM controls the tolerance for these misclassifications through the penalty coefficient C: the larger C is, the harder the model tries to classify every training point correctly, which easily leads to overfitting, while too small a C hurts accuracy.
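To illustrate the trade-off, here is a small sketch on hypothetical noisy toy data (make_moons is a stand-in dataset, not the index data): training accuracy climbs as C grows, which is exactly the overfitting risk just described.

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# hypothetical noisy, nonlinearly separable toy data
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)
for C in [0.01, 1.0, 100.0]:
    clf = SVC(C=C, kernel='rbf').fit(X, y)
    print('C=%-6s training accuracy=%.3f' % (C, clf.score(X, y)))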

With SVM covered, let's talk about the general machine learning modeling process. It is usually divided into 6 steps, in order: collecting data, preparing data, selecting/building a model, training the model, testing the model, and tuning the parameters. On the quant road no step is taken in vain; every step counts, so let's go through them one by one~


1. Data collection

The first step is to obtain the raw modeling data. For daily index data there are many free sources; all we need are the date (date), opening price (open), highest price (high), lowest price (low), and closing price (close) of the index.

Here we take Tushare as an example to fetch all daily quotes of the CSI 300 index since its launch on April 8, 2005. Tushare (http://tushare.org/) is a free, open-source Python financial data interface package.

After importing the tushare package, you can use its get_k_data function to fetch the historical K-line data of the CSI 300 index. The returned data is a DataFrame; we keep only the open, high, low, and close columns and set the date (a string) as the index.

import numpy as np
import pandas as pd
import talib
import warnings
warnings.filterwarnings('ignore')
import tushare as ts


# fetch daily (ktype='D') K-line data of the CSI 300 index
data = ts.get_k_data(code='hs300', start='2005-04-08', end='2022-11-08', ktype='D')
data = data.set_index('date')
data = data[['open', 'high', 'low', 'close']]
print('Number of samples: %d' % data.shape[0])
print(data.head(10))
print(45*'-')
print(data.tail(10))

2. Data preparation

Once we have the raw data, we need to process it further. The main work is variable selection, label construction, and data cleaning.

Variable selection is called "factor selection" in quantitative investment, i.e. deciding which factors are used for stock selection or timing; in machine learning it is generally called "feature selection", i.e. specifying which features serve as inputs to the model.

Here we use five factors: the EMA value (ema), price volatility (stddev), price slope (slope), RSI value (rsi), and Williams %R value (wr). We compute them with the talib package, a well-known and powerful third-party technical-indicator library in the quant community.

# exponential moving average, volatility, regression slope, RSI, and Williams %R
data['ema'] = talib.EMA(data['close'].values, timeperiod=20)
data['stddev'] = talib.STDDEV(data['close'].values, timeperiod=20, nbdev=1)
data['slope'] = talib.LINEARREG_SLOPE(data['close'].values, timeperiod=5)
data['rsi'] = talib.RSI(data['close'].values, timeperiod=14)
data['wr'] = talib.WILLR(data['high'].values, data['low'].values, data['close'].values, timeperiod=7)
data.tail(10)

Since we are predicting the next day's direction, we first compute each sample's next-day return (pct); if the index rises the next day, the label (rise) is set to 1, otherwise 0.

Index data contains few anomalies, so if there are null values (e.g. from the indicators' look-back windows), we simply drop them.

data['pct'] = data['close'].shift(-1) / data['close'] - 1.0
data['rise'] = data['pct'].apply(lambda x: 1 if x > 0 else 0)
# drop rows with missing values (indicator warm-up periods and the last row)
data = data.dropna()
data.tail(10)

3. Select/build the model

Selecting/building a model is about determining which machine learning model you want to use, whether it is a support vector machine (SVM), a neural network (NN), a random forest (RF), or something else.

As explained above, we will use the SVM model, so the reasons and principles will not be repeated. To build the model conveniently, we import it directly from Scikit-learn (sklearn for short), a very popular free Python machine learning library that ships a variety of classification, regression, and clustering algorithms and is generally used with numpy data.
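For reference, the other candidates mentioned above also live in sklearn, so swapping the model later is a one-line change (a sketch, not part of this strategy's code):

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# each of these exposes the same fit/predict interface
models = {
    'svm': SVC(),
    'rf': RandomForestClassifier(),
    'nn': MLPClassifier(),
}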

4. Train the model

Here we need to split the whole dataset into a training set and a test set, because besides training the model we must set aside part of the data to evaluate how good the trained model actually is.

Generally, 80% of the full dataset is used as the training set and the remaining 20% as the test set.

# split into training and test sets (chronological, no shuffling)
num_train = round(len(data) * 0.8)
data_train = data.iloc[:num_train, :]
data_test = data.iloc[num_train:, :]
# training-set features and labels
X_train = data_train[['ema', 'stddev', 'slope', 'rsi', 'wr']].values
y_train = data_train['rise']
# test-set features and labels
X_test = data_test[['ema', 'stddev', 'slope', 'rsi', 'wr']].values
y_test = data_test['rise']
print(X_train[:10])
print(45*'-')
print(X_test[:10])
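The same split can also be written with sklearn's train_test_split; the key detail is shuffle=False, which preserves chronological order so the test set lies strictly after the training set (a sketch equivalent to the manual slicing above):

from sklearn.model_selection import train_test_split

# rebuild the full feature matrix and labels, then split 80/20 without shuffling
X = data[['ema', 'stddev', 'slope', 'rsi', 'wr']].values
y = data['rise'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)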

After splitting the dataset there is a very important step: standardizing the data. The factors have wildly different scales; for example, the mean of the EMA is 2919.6 while the mean of the RSI is 52.7, and that would make the SVM "play favorites" among the factors.

Here the data are standardized as (raw value - mean) / standard deviation, after which each factor has mean 0 and standard deviation 1.0. Note that the scaler is fitted on the training set only and then applied to the test set, so that no test-set information leaks into training.

from sklearn.preprocessing import StandardScaler


print('--- before standardization ---')
print('training-set means:')
print(X_train.mean(axis=0))
print('training-set standard deviations:')
print(X_train.std(axis=0))


# fit the scaler on the training set, then apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


print('--- after standardization ---')
print('training-set means:')
print(X_train.mean(axis=0))
print('training-set standard deviations:')
print(X_train.std(axis=0))

Once the data are standardized, the training set can be fed to the SVM. The code is very simple: import the SVM classifier SVC from sklearn's svm module, create an instance, and pass the training-set factor data and labels to the fit function. The penalty coefficient uses the default value of 1.0 and the kernel uses the default RBF kernel; training is fast, and the classifier is ready in no time.

from sklearn.svm import SVC


# default penalty C=1.0 and default RBF kernel
classifier = SVC(C=1.0, kernel='rbf')
classifier.fit(X_train, y_train)
print(classifier)

5. Test the model

The SVM classifier is now trained. Passing factor data to the predict function outputs a prediction for each sample; we insert the predicted labels of the training and test sets back into the original DataFrames to compute the prediction accuracy.

y_train_pred = classifier.predict(X_train)
y_test_pred = classifier.predict(X_test)
data_train['pred'] = y_train_pred
data_test['pred'] = y_test_pred
# share of samples where the predicted label matches the true label
accuracy_train = 100 * data_train[data_train.rise == data_train.pred].shape[0] / data_train.shape[0]
accuracy_test = 100 * data_test[data_test.rise == data_test.pred].shape[0] / data_test.shape[0]
print('training-set accuracy: %.2f%%' % accuracy_train)
print('test-set accuracy: %.2f%%' % accuracy_test)

Output:

training-set accuracy: 57.52%
test-set accuracy: 52.35%

The training-set accuracy is clearly higher than the test-set accuracy, because the model was trained on the training data while the test data is still "unfamiliar" to it. It is like a college-entrance math exam written entirely by your own school's math teachers: on the whole, your school's average score would very likely be higher than that of comparable schools.

Accuracy alone is not intuitive enough, so let's also see how much profit a strategy that trades purely on this timing model's predictions would earn, using only the test set for the simulation.

Assume the index can be traded both long and short. If the model predicts 1 (up), the next day's strategy return is the index's return; if it predicts 0 (down), the next day's strategy return is the negative of the index's return. With the daily returns in hand, the DataFrame's built-in cumulative-product function cumprod gives the net value curves of the timing strategy and of the CSI 300 index. For simplicity (fine, laziness), transaction costs are ignored and trades are assumed to execute at the closing price.

import matplotlib.pyplot as plt


# daily strategy return: follow the index if pred=1, short it if pred=0
data_test['strategy_pct'] = data_test.apply(lambda x: x.pct if x.pred > 0 else -x.pct, axis=1)
# net value of the strategy and of the CSI 300
data_test['strategy'] = (1.0 + data_test['strategy_pct']).cumprod()
data_test['hs300'] = (1.0 + data_test['pct']).cumprod()
# rough annualized return (assuming 250 trading days per year)
annual_return = 100 * (pow(data_test['strategy'].iloc[-1], 250/data_test.shape[0]) - 1.0)
print('annualized return of the SVM CSI 300 timing strategy: %.2f%%' % annual_return)


# convert the index from strings to dates for plotting
data_test.index = pd.to_datetime(data_test.index)
ax = data_test[['strategy', 'hs300']].plot(figsize=(16, 9), color=['SteelBlue', 'Red'],
                                           title='SVM CSI 300 timing strategy net value  by 量化君')
plt.show()
[Figure: net value curves of the SVM timing strategy vs. the CSI 300 index]

6. Tune the parameters

The previous step showed prediction accuracies of only 57% on the training set and 52% on the test set, which is not ideal and indicates plenty of room for improvement; the model can be optimized along several lines.

For example, the five factors used here may not capture the nature of price fluctuations, and more or different factors could be tried.

For example, the penalty coefficient C in the SVM may be too small, making the tolerance for misclassified samples too high, or the RBF kernel may not be a suitable mapping for this dataset.

For example, even the choice of model is itself a "parameter" that can be changed, say by swapping in a different machine learning classifier.

In other words, when it comes to tuning, if the trained model's results are unsatisfactory, you can go back through the previous five steps and iterate; a parameter-search sketch follows below.
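As one possible way to automate the search over C and the kernel, here is a hedged sketch using sklearn's GridSearchCV with a time-series-aware splitter, assuming the X_train and y_train arrays from step 4:

from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.svm import SVC

param_grid = {
    'C': [0.1, 1.0, 10.0, 100.0],
    'kernel': ['linear', 'rbf', 'poly'],
}
# TimeSeriesSplit keeps each validation fold after its training fold
search = GridSearchCV(SVC(), param_grid, scoring='accuracy',
                      cv=TimeSeriesSplit(n_splits=5))
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)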
