
Inventory of 10 Commonly Used and Efficient Machine Learning Algorithms in the Field of Quantitative Trading (with Example Source Code)

Author: Prose thinks with the wind

In the field of financial investment and quantitative trading, machine learning algorithms are becoming more and more common. We humans can generally spot linear cause-and-effect relationships through observation, but we are poor at spotting nonlinear ones, so we rely on machine learning algorithms to help us make more informed investment and trading decisions.

In this article, let's walk through 10 machine learning algorithms that are commonly used and effective in financial investment and quantitative trading, roughly describing their basic working principles, typical uses, and code examples, mainly to serve as an "algorithm list" for quantitative newcomers.

1. Linear Regression

Linear regression is probably the most commonly used statistical model. The ordinary least squares (OLS) fitting you were exposed to in high school is linear regression, which, come to think of it, was also an early brush with machine learning.

Linear regression predicts the value of a dependent variable from one or more independent variables. It is called "linear" because it assumes the relationship between the dependent and independent variables is linear, of the form y = a0 + a1*x1 + a2*x2 + ... + an*xn, where the x's are the independent variables, y is the dependent variable, the a's are the regression coefficients, and a0 is also called the intercept.


In the field of financial investment and quantitative trading, linear regression is often used to model and predict financial time series/cross-sectional data, such as security prices, factor returns, exchange rates, and interest rates. It can be used to identify relationships between different variables/factors and make predictions about future values based on these relationships.

The main advantages of linear regression are its simplicity and interpretability; believe it or not, the familiar capital asset pricing model (CAPM), arbitrage pricing theory (APT), and Fama-French three-factor model are all expressed in this form.

Besides its wide range of applications, linear regression models are very easy to implement and relatively fast to train, even on large datasets, and they can handle missing data and categorical variables with the help of dummy variables.

Using these machine learning algorithms is now very convenient: they are all wrapped in libraries (such as scikit-learn) that you can call directly. Let's look at the example source code for linear regression.

import numpy as np
from sklearn.linear_model import LinearRegression

# Build the sample data
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])

# Create the linear regression model
model = LinearRegression()

# Train the model on the sample data
model.fit(X, y)

# Use the trained model to predict on the samples
predictions = model.predict(X)

# Print the output
print('Predictions:', predictions)
print('Coefficients:', model.coef_)
print('Intercept:', model.intercept_)

Output:

Predictions: [1. 2. 3. 4.]
Coefficients: [0.25 0.25]
Intercept: 0.24999999999999956

This code fits a linear regression model to the data in the X array, with the y array as the target (dependent) variable. X has 4 sample points with 2 features each, so the model automatically learns 2 regression coefficients and 1 intercept; the final trained model is y = 0.25 + 0.25*x1 + 0.25*x2.

Note that although the code above is simple, the workflow of using a machine learning library is basically always the same: first import the required model from the library (LinearRegression), then feed in the data for training (fit), and finally use the trained model to predict on new data (predict); if needed, you can inspect the model's key parameters (coef_ and intercept_). Every algorithm model below follows this same three-step routine, so it will not be repeated.

Incidentally, the dataset used to train the model is called the training set, and the dataset on which the trained model makes its predictions is called the test set.

2. Logistic Regression

Don't be fooled by the word "regression" in this algorithm's name: unlike linear regression, it is mainly used for classification tasks. It typically predicts the probability of a binary outcome based on one or more independent variables.

Its model structure is y = 1/(1 + exp(-(a0 + a1*x1 + a2*x2 + ... + an*xn))); that is, it stacks a sigmoid function y = 1/(1 + exp(-x)) on top of linear regression. Whatever range the linear part takes, the sigmoid squashes y into 0~1, which suits binary classification nicely.
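To make this concrete, here is a minimal sketch (plain NumPy, not tied to any trading data) showing how the sigmoid squashes arbitrary linear scores into the 0~1 range:

import numpy as np

# arbitrary linear scores a0 + a1*x1 + ... produced by the linear part
z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

# sigmoid: 1 / (1 + exp(-z)) maps any real number into (0, 1)
probs = 1 / (1 + np.exp(-z))
print(probs)  # approx [0.0000454, 0.269, 0.5, 0.731, 0.99995]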

In the field of financial investment and quantitative trading, logistic regression is often used for classification tasks, such as whether a security's price will rise or fall, whether a financial statement is fraudulent, or whether a company's net profit margin will rise sharply.


The main advantage of logistic regression, like linear regression, is that it is relatively simple to implement and interpret, and it trains very quickly even on large datasets. In addition, logistic regression is "probabilistic", meaning it can output predictions as probabilities rather than just hard binary labels, which is very useful in quantitative trading for measuring uncertainty or risk.

Let's look at a usage example of logistic regression. As shown below, the overall process is the same as for the linear regression model, except that for a classification task the y labels are changed to 0 or 1 to represent the classes.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Build the sample data
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([0, 1, 0, 1])

# Create the logistic regression model
model = LogisticRegression()

# Train the model
model.fit(X, y)

# Predict on the data
predictions = model.predict(X)

# Print the predictions
print(predictions)
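As mentioned above, logistic regression is probabilistic. Continuing the toy model just trained, a minimal sketch of getting class probabilities instead of hard 0/1 labels, via scikit-learn's predict_proba:

# each row gives [P(class 0), P(class 1)] for the corresponding sample
probabilities = model.predict_proba(X)
print(probabilities)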

3. Decision Trees

A decision tree is called a "decision" tree because, as the name suggests, it makes predictions based on the features of the dataset. It works by building a tree-shaped decision model from the data, with each branch representing a different decision or outcome.


In the field of financial investment and quantitative trading, decision trees are often used for classification tasks. Besides being relatively easy to understand and interpret, the main advantage of a decision tree is that it can handle complex datasets and identify nonlinear relationships between features and the target variable. However, if a decision tree is not properly pruned, it can overfit, reducing its ability to generalize.

A usage example of a decision tree is shown below. The process is the same as for the two algorithms above; just note that this is a classification task.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Build the sample data
X = [[0, 0], [1, 1]]
Y = [0, 1]

# Create the decision tree classifier
clf = DecisionTreeClassifier()

# Train the model
clf = clf.fit(X, Y)

# Print the prediction for a new sample
print(clf.predict([[2., 2.]]))

By the way, you can also specify the criterion used to split the tree, such as "gini" or "entropy", and set other parameters such as the maximum depth of the tree or the minimum number of samples required at a leaf node; the full details are in the scikit-learn documentation.

https://scikit-learn.org/stable/modules/tree.html
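For instance, continuing the block above, a minimal and purely illustrative sketch of passing those parameters (the specific values here are arbitrary):

# illustrative only: split on entropy, cap the depth, and require 2 samples per leaf
clf = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=2)
clf = clf.fit(X, Y)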

4. Random Forests

Random forests are an extension of the decision trees just described, based on the principles of ensemble learning, for making stronger and more reliable predictions. A random forest builds a set of decision trees and uses the average of the trees' predictions as the final prediction, much like a group of people voting in real life and the minority deferring to the majority.


Random forests are also generally used for classification tasks; they tend to be quite accurate and usually generalize better than individual decision trees. However, random forests are less interpretable than individual trees, since predictions are based on the average of many trees rather than a single one.

In use, after importing the RandomForestClassifier model from sklearn, you can set the number of trees in the forest (n_estimators), the maximum depth of the trees (max_depth), and the random seed (random_state); the rest of the process is the same familiar three-step routine.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Load the training and test set data
X_train, y_train, X_test, y_test = load_data()

# Create the random forest model
model = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)

# Train the model
model.fit(X_train, y_train)

# Predict on new data
predictions = model.predict(X_test)

# Print the predictions
print(predictions)

It should be noted that load_data() is an abstract function that you need to implement for your own application; its main purpose is to load the training data (X_train) and labels (y_train) and the test data (X_test) and labels (y_test), organized in the same form as in the three algorithms introduced above. For how to build factor data and attach the corresponding labels, see the data-preparation parts of the earlier articles "Teach you step-by-step, use machine learning models to build quantitative timing strategies" and "Investment managers must develop 4,000 quantitative factors in 3 weeks, teach you 4 lines of core code to easily deal with them".

If you don't want to organize your own data and want something usable right away, you can use the datasets module of the sklearn machine learning library, which ships with datasets for all kinds of machine learning tasks. First, you can use the make_regression or make_classification functions to generate custom regression/classification datasets; second, you can use ready-made datasets, such as the diabetes dataset commonly used for regression tasks (load_diabetes) or the iris dataset for classification tasks (load_iris).

Taking the random forest classification task as an example and using the iris dataset (load_iris), let's look at the concrete form of the data.

import numpy as np
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()  # load the iris dataset
data = iris['data']  # feature data
label = iris['target']  # corresponding labels
feature = iris['feature_names']  # feature names
df = pd.DataFrame(np.column_stack((data, label)), columns=np.append(feature, 'label'))
df.iloc[[0, 1, 60, 61, 120, 121], :]  # show samples of classes 0, 1, and 2

iris contains data on 150 iris flowers. Each sample has 4 features (sepal length, sepal width, petal length, and petal width) plus a class label; labels 0, 1, and 2 correspond to the varieties Iris Setosa, Iris Versicolour, and Iris Virginica, respectively.

So, to train and test a random forest model on this dataset, you only need to change this code from the example above:

# Load the training and test set data
X_train, y_train, X_test, y_test = load_data()

to:

from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the training and test set data
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)

For convenience of exposition and demonstration, the abstract function load_data() will continue to stand in for loading the training and test sets.
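For reference, here is a minimal sketch of what such a load_data() placeholder might look like, backed by the iris data. Note that it deliberately returns X_train, y_train, X_test, y_test in the order the examples use, which differs from train_test_split's own return order:

from sklearn import datasets
from sklearn.model_selection import train_test_split


def load_data():
    # hypothetical placeholder: swap in your own factor data and labels here
    iris = datasets.load_iris()
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)
    # reorder to match the X_train, y_train, X_test, y_test convention used in the examples
    return X_train, y_train, X_test, y_test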

5. K-Nearest Neighbors (KNN)

K-Nearest Neighbors (KNN) is a machine learning algorithm commonly used for classification and regression tasks. It works by finding the K data points closest to a given data point and using those points' classes or numeric values to make its prediction. Most people have heard the saying "your income is the average income of your 5 closest friends"; KNN is exactly this "birds of a feather flock together" idea, implemented.
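As a toy illustration of that "average of your 5 friends" saying, a minimal sketch with made-up income data (KNeighborsRegressor predicts by averaging the targets of the K nearest samples):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# made-up data: feature = age, target = income
X = np.array([[25], [28], [30], [35], [40], [45]])
y = np.array([30000, 35000, 40000, 50000, 60000, 70000])

# predict a 32-year-old's income as the average of the 5 nearest neighbors
model = KNeighborsRegressor(n_neighbors=5)
model.fit(X, y)
print(model.predict([[32]]))  # average income of the 5 closest ages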


In the field of financial investment and quantitative trading, KNN is usually used for classification, though it can also do regression. Its main advantages are that it is relatively easy to understand and implement and is not very sensitive to missing values; its biggest drawback is that it is quite sensitive to the choice of K. In addition, when the number of features/factors is large relative to the sample size, KNN prediction on large datasets can be computationally expensive.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Load the training and test set data
X_train, y_train, X_test, y_test = load_data()

# Create the KNN model
model = KNeighborsClassifier(n_neighbors=5)

# Train the model
model.fit(X_train, y_train)

# Predict on new data
predictions = model.predict(X_test)

# Print the predictions
print(predictions)
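Since KNN is sensitive to the choice of K, a common remedy is to scan a few K values with cross-validation and keep the best one. A minimal sketch, reusing the abstract load_data() data from above:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# scan a few candidate K values and compare their cross-validated accuracy
for k in [3, 5, 7, 9]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    print(k, scores.mean())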

6. K-Means Clustering (K-Means)

After KNN, let's talk about its relative, K-Means. The names look similar and the underlying distance-based idea is also similar, but they do different jobs.

K-Means is generally used for clustering tasks, while KNN is generally used for classification tasks. Let's post the example code first; you will notice it differs noticeably from the other algorithms' code.

import numpy as np
from sklearn.cluster import KMeans

# Build the sample data
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Create the KMeans model
model = KMeans(n_clusters=2, random_state=0, n_init='auto')

# Train the model
model.fit(X)

# Predict on new data
predictions = model.predict([[0, 0], [4, 4]])

# Print the predictions
print(predictions)

Other algorithms generally need both the feature/factor data X and the label data y when training, whereas K-Means only needs the feature/factor data X. The former is called supervised learning and the latter unsupervised learning: supervised learning is like self-study with both the exam questions and the reference answers, while unsupervised learning gives you only the questions, with no answers.

For example, there are more than 5,000 stocks in the A-share market, and Shenwan divides them into 31 industries. If you think this division is inappropriate, you can choose your own attributes/factors and decide how many industries/concepts/styles/sectors to divide into (in the example above, the number of clusters is 2), and the K-Means algorithm will partition those 5,000+ stocks according to your requirements. Within each cluster, the stocks are guaranteed to be extremely similar in terms of the attributes/factors you chose, forming your own custom industry/concept/style/sector. Quite often you will be amazed that stocks which seem completely unrelated turn out to be this similar in those respects.
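A minimal sketch of that idea with made-up factor data (rows are stocks, columns are whatever attributes/factors you chose; the factor values here are invented purely for illustration):

import numpy as np
from sklearn.cluster import KMeans

# made-up factor matrix: each row is a stock, columns are e.g. valuation and momentum factors
factors = np.array([[0.8, 1.2], [0.9, 1.1], [3.0, 0.2], [3.2, 0.3], [0.7, 1.3], [3.1, 0.1]])

# ask K-Means for 2 self-defined "sectors"
model = KMeans(n_clusters=2, random_state=0, n_init='auto')
labels = model.fit_predict(factors)
print(labels)  # cluster id of each stock, e.g. [0 0 1 1 0 1]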


7. Support Vector Machine (SVM)

Support Vector Machines (SVMs) were originally designed to solve binary classification problems and were later extended to multi-class problems. The market-timing strategy developed earlier, predicting the rise and fall of the CSI 300 index, is a typical binary classification problem: "up" is one class and "down" is the other.

An SVM separates the two classes of samples linearly by finding a maximum-margin hyperplane; that is, it ensures that the distance from each class's nearest boundary points to this plane is as large as possible. Because the maximum-margin hyperplane depends only on these boundary points of the two classes, they are called support vectors, which is where the name "support vector machine" comes from.


If the data points are not separable in the original space, the SVM maps the inseparable low-dimensional data into a higher-dimensional space where they become linearly separable, so one of its main advantages is that it can handle high-dimensional data and identify complex relationships between the features and the target variable.

However, compared with some other machine learning algorithms, SVMs can be computationally expensive to train, and they are less interpretable because of their complex optimization problem. By the way, since the data must be mapped from low to high dimensions, the choice of kernel function matters a great deal in SVMs; the commonly used kernels are the linear kernel, polynomial kernel, Gaussian kernel (RBF kernel), and sigmoid kernel.

import numpy as np
from sklearn.svm import SVC

# Load the training and test set data
X_train, y_train, X_test, y_test = load_data()

# Create the SVM model
model = SVC(kernel='linear', C=1.0)

# Train the model
model.fit(X_train, y_train)

# Predict on new data
predictions = model.predict(X_test)

# Print the predictions
print(predictions)
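If the data are not linearly separable, you can swap in another kernel. Continuing the example above, a minimal sketch (the C and gamma values are illustrative only):

# the Gaussian (RBF) kernel implicitly maps the data to a higher-dimensional space
model = SVC(kernel='rbf', C=1.0, gamma='scale')
model.fit(X_train, y_train)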

8. Naive Bayes

Naive Bayes is a machine learning algorithm that makes predictions based on Bayes' theorem. It is called "naive" because it assumes all features in the dataset are independent of each other, which is certainly not the case in real-world data.
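Concretely, for features x1, ..., xn and class y, the independence assumption lets Bayes' theorem factorize as P(y | x1, ..., xn) ∝ P(y) * P(x1|y) * P(x2|y) * ... * P(xn|y), so the predicted class is simply the y that maximizes this product.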


Despite this assumption, the naive Bayes algorithm generally works well and is often used in financial investment and quantitative trading, typically for classification tasks such as predicting whether an index will rise or fall the next day, whether the sentiment of financial news is positive or negative, or whether a financial statement is fraudulent.

One of the main advantages of naive Bayes is its simplicity and ease of understanding; many self-improvement articles and books even advocate thinking in a "Bayesian" way. The model is also very fast to train, even on large datasets. Moreover, when the underlying data is affected by some type of noise, or when the number of features is large relative to the number of samples, naive Bayes tends to hold up well.

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Load the training and test set data
X_train, y_train, X_test, y_test = load_data()

# Create the naive Bayes classifier
model = GaussianNB()

# Train the model
model.fit(X_train, y_train)

# Predict on new data
predictions = model.predict(X_test)

# Print the predictions
print(predictions)

9. Neural Networks

A neural network is a machine learning algorithm inspired by the structure and function of the human brain: the model consists of a large number of interconnected nodes (neurons) that mimic how the brain processes and transmits information. Neural networks are particularly suitable for tasks involving pattern recognition and prediction, and have been applied in a wide variety of fields and scenarios with remarkable results.


Neural network models handle both regression and classification tasks well. You only need to decide the model structure by hand, and the network then learns, iterates, and adapts to new data without explicit programming; it can also handle large, complex datasets and identify nonlinear relationships between the features and the target variable.

Compared with some other machine learning algorithms, however, neural networks are more computationally expensive to train and far less interpretable due to their complex network structure, and they are very prone to overfitting if not properly designed and trained.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Load the training and test set data
X_train, y_train, X_test, y_test = load_data()

# Create the neural network classifier
# (this structure assumes 4 input features and binary 0/1 labels)
model = Sequential()
model.add(Dense(8, input_dim=4, activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=20, batch_size=32)

# Predict on new data
predictions = model.predict(X_test)

# Print the predictions
print(predictions)
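Note that with a sigmoid output layer, model.predict returns probabilities in 0~1 rather than hard labels. A minimal sketch, continuing the block above, of thresholding them into 0/1 classes:

# threshold the predicted probabilities at 0.5 to get binary class labels
labels = (predictions > 0.5).astype(int)
print(labels)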

10. Ensemble Learning

Ensemble learning does not refer to one specific algorithm but to a broad class of algorithms, or rather an idea or method for training and combining models. In machine learning, a single model may not perform all that well on its own, but combine several models and, as the saying goes, "three cobblers together outdo a Zhuge Liang" (many heads are better than one): the combination becomes very powerful. This combining of multiple base models/algorithms is called ensemble learning.

Ensemble learning generally falls into two main categories: Bagging and Boosting. Both combine multiple weak models into one strong model, but the most obvious difference is that in Bagging the weak models are trained independently, in parallel, and then combined, whereas Boosting trains them serially: each new weak model is trained on the "residuals" of the previous one, patching the previous stage's weak spots, and all of them are finally combined (a small side-by-side sketch follows below). There are also more complex Stacking and Cascading methods of ensemble learning, which you can explore on your own if interested.
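For a feel of the two families side by side, a minimal sketch using scikit-learn's generic wrappers, with load_data() the same abstract placeholder as before and the parameters purely illustrative (both wrappers default to decision trees as the weak model):

from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

X_train, y_train, X_test, y_test = load_data()

# Bagging: 50 trees trained independently in parallel, combined by voting
bagging = BaggingClassifier(n_estimators=50).fit(X_train, y_train)

# Boosting: 50 weak trees trained serially, each focusing on its predecessor's mistakes
boosting = AdaBoostClassifier(n_estimators=50).fit(X_train, y_train)

print(bagging.score(X_test, y_test), boosting.score(X_test, y_test))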

Bagging model structure: [figure]

Boosting model structure: [figure]

We have in fact already met Bagging-style ensemble learning above: the random forest. In the field of financial investment and quantitative trading, random forest is the most widely used Bagging method.

There are many kinds of Boosting ensemble learning, including the gradient boosting decision tree (GBDT), adaptive boosting (AdaBoost), extreme gradient boosting (XGBoost), and the lightweight gradient boosting machine (LightGBM). XGBoost combined with multi-factor models used to be common; nowadays LightGBM is more popular, so let's take LightGBM as the example.

import numpy as np
import lightgbm as lgb

# Load the training and test set data
X_train, y_train, X_test, y_test = load_data()

# Train the model
gbm = lgb.train(params={'learning_rate': 0.05,
                        'lambda_l1': 0.1,
                        'lambda_l2': 0.2,
                        'max_depth': 3,
                        'objective': 'multiclass',
                        'num_class': 3},
                train_set=lgb.Dataset(X_train, label=y_train))

# Predict on new data (rows of class probabilities), then take the most probable class
predictions = gbm.predict(X_test)
predictions = [list(v).index(max(v)) for v in predictions]

# Print the predictions
print(predictions)
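Incidentally, LightGBM also ships a scikit-learn-style wrapper, so the same three-step fit/predict routine applies. A minimal sketch, continuing the block above:

# scikit-learn-style interface: the usual fit/predict, no manual Dataset or argmax needed
clf = lgb.LGBMClassifier(learning_rate=0.05, max_depth=3)
clf.fit(X_train, y_train)
print(clf.predict(X_test))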

That wraps up the 10 machine learning algorithms commonly used in the field of quantitative trading. You may have noticed that no matter how complex the algorithm, with a ready-made machine learning library (such as sklearn) and the "three-step routine", ten or twenty lines of code are enough to build a model, so you no longer need to worry about implementing the algorithms yourself.

And that is the purpose of this article: most of the time you only need to focus on decomposing the quantitative task, whether it is a classification task (such as predicting whether prices will rise or fall) or a regression task (such as predicting by how much), and on organizing the feature/factor data; then, using this "list of machine learning algorithms" as an index, pick the corresponding algorithm and adapt its code example as needed.

Content Sources/References:

Christophe Atten, 2022.12, "Top 10 Machine Learning Algorithms in Finance"

Chainika Thakar, 2023.01, "Top 10 Machine Learning Algorithms For Beginners"

Heart of the Machine, 2019.03, "Top 10 Algorithms for Machine Learning"

Kai Ge, 2020.10, "Ensemble Learning"

Zhihua Zhou, 2016.01, "Machine Learning"
