
Building a Modern Deep Learning Framework from Scratch (TinyDL-0.01)


Guide

From the perspective of a Java engineer, this article explains how to build a minimalist yet modern deep learning framework (think of it as an operating system for AI) from scratch, with no second- or third-party dependencies.

The background section mainly answers two soul-searching questions: 1) why build it at all, and 2) how is it different from what others have already done?

1. Why write a deep learning framework?

1.1. Learn to embrace AI

AI is a major trend and a long race, and this time it really is different. At this double inflection point, Java engineers face a dilemma and need to transform to adapt to the change. At the end of my previous article, "Some Practices on AIGC Interaction", I shared how I felt:

In recent years there has been a double inflection point in both the market and the technology environment. On one hand, the consumer internet has fully entered the era of competing over existing users: the incremental dividend is gone, and cost reduction and efficiency improvement have become the main theme. On the other hand, the latest developments in AI have produced major breakthroughs, with efficiency and cost in many areas clearly surpassing humans. The new era and new situation have forced Javaers like me to transform and embrace new technologies.

There are two new directions for this transformation: one is toward metaverse-related technologies represented by 3D, such as blockchain and XR; the other is toward AI, represented by large models. I recently started learning deep learning, and as a Java coder, in order to understand network models such as CNNs, DNNs, and RNNs, I am still catching up on linear algebra, calculus, and PyTorch. They say "life is short, I use Python", but for a Javaer trying to understand how deep learning frameworks are implemented by reading Python's free-wheeling source code, life really does feel too short (no language war intended here).

The great physicist Feynman said, "What I cannot create, I do not understand." I suspect he was right, so I wanted to give it a try.

1.2. Just for fun: a tribute to linux-0.01

On September 17, 1991, Linus Torvalds, a 21-year-old Finnish student, released the open-source operating system Linux-0.01 [1] on the internet, with clean code and a clear directory structure. In a knowingly clumsy imitation, I deliberately set the version number to 0.01: TinyDL-0.01 (Tiny Deep Learning) [2], as a tribute to linux-0.01. I share it with Javaers interested in AI; it lets you understand the principles and a simple implementation of deep learning from the perspective of low-level engineering, and interested readers are welcome to learn together.

2. How is it different from existing frameworks?

There are many deep learning frameworks, most of them written in Python (with C/C++ underneath), such as TensorFlow, PyTorch, and MXNet. Two well-known deep learning frameworks are implemented in Java: DeepLearning4J [3], maintained by the Eclipse open-source community, and DJL [4], an open-source deep learning Java framework from AWS.

DeepLearning4J is a full-stack implementation, but it is hard to learn from its code because it is too complex and huge (about 697,000 lines of Java, 65%; 24% C++; 3.4% CUDA; etc.) and depends on too many complex third-party scientific computing libraries.

DJL is just a high-level Java API for deep learning with no real implementation of its own; it ultimately runs on engines such as TensorFlow or PyTorch.

TinyDL-0.01 [5], as its name suggests, is a lightweight deep learning framework implemented in Java with a minimal implementation. Compared with the others it is: 1) minimalist, with essentially zero second- or third-party dependencies (I deliberately did not write a single test so as not to introduce JUnit); 2) full-stack, from the lowest-level tensor operations to the topmost application examples; 3) layered and easy to extend, with each layer containing its core concepts and principles and with clear layer boundaries. Of course, the shortcomings are just as obvious: limited functionality and poor performance.

Although TinyDL-0.01 is only a toy-level framework, it tries to have the basic characteristics of a modern deep learning framework (dynamic computational graph, automatic differentiation, multiple optimizers, multiple types of network layers, etc.). It aims to be simple and straightforward and is mainly intended for introductory learning; anyone who tries to understand deep learning in depth by reading the source code of frameworks like PyTorch will most likely be discouraged.

1. The overall architecture

A deep learning framework mainly solves the engineering problems of training and inference for deep networks, including the complexity of multi-layer neural networks, the computational efficiency of massive matrix operations and parallel computing, and scalability across multiple computing devices. Commonly used frameworks include TensorFlow (open-sourced by Google), PyTorch (open-sourced by Facebook), MXNet (open-sourced by Amazon), and many others. After years of development their architectures and features have slowly converged, so let's look at the common capabilities of a modern deep learning framework.

1. What does a generic deep learning architecture look like?

Let's take a look at how chatGPT answers this question:

[Figure: chatGPT's answer about the generic architecture of a deep learning framework]

For a concrete reference, the most popular deep learning framework, PyTorch, is roughly divided into four layers (diagram from Zhihu):

[Figure: PyTorch's rough four-layer architecture (from Zhihu)]

2. What is the overall architecture of TinyDL?

TinyDL adheres to the principle of simple, clear layering and follows the general layering logic above; the overall structure is as follows:

[Figure: The overall layered architecture of TinyDL]

From bottom to top, it maintains strict layering:

1. ndarr package: the core class is NdArray, a minimal implementation of the underlying linear algebra. Currently only the CPU version is implemented; a GPU version would require depending on huge third-party libraries.

2. func package: the core classes Function and Variable are the abstractions of mathematical functions and variables, respectively. They are used to build the computational graph automatically during forward propagation and to implement automatic differentiation; Variable corresponds to PyTorch's tensor.

3. nnet package: the core classes Layer and Block represent the layers and blocks of a neural network; any complex deep network is built by stacking these layers and blocks. It implements commonly used CNN layers, RNN layers, norm layers, and the Encoder/Decoder Seq2Seq architecture.

4. mlearning package: the general components of machine learning. Deep learning is only a branch of machine learning, and the broader field shares a set of general components, including datasets, loss functions, optimization algorithms, trainers, inferencers, effect evaluators, and so on.

5. modality package: belongs to the application layer. Deep learning is currently applied mainly to three kinds of tasks: computer vision, natural language processing, and reinforcement learning. There are no concrete domain implementations yet; the hope is to implement prototypes such as GPT-2 in version 0.02.

6. example package: some simple runnable examples, mainly machine learning classification and regression problems: curve fitting, spiral-curve classification, handwritten digit recognition, and sequence-data prediction.

Next, let's briefly go through the core concepts and simple implementations involved in each layer, full-stack, from bottom to top.

2. Linear algebra and tensor operations

Let's start with the first layer of deep learning: the tensor operation layer. The manipulation and computation of tensors (multidimensional arrays) is the foundation of deep learning; almost all operations are performed on tensors (a single number is a scalar, a one-dimensional array a vector, a two-dimensional array a matrix, and arrays of three or more dimensions are N-dimensional tensors). These operations are usually implemented by efficient numerical computing libraries, written in C/C++ against specific hardware, and provide a variety of tensor operations such as matrix multiplication, convolution, and pooling.

This part is divided into three pieces: first, some basic linear algebra; then, a minimal CPU-based implementation; and finally, a comparison explaining why deep learning relies so heavily on a new computing paradigm, the GPU.

1. Basic linear algebra

First, a picture that may scare everyone off, even though, back when I was preparing for graduate school, these were just the most basic rote exercises:

[Figure: A page of typical linear algebra exercises]

Then there are some common concepts of linear algebra:

Vector: A vector is a quantity with magnitude and direction; in linear algebra a vector is usually represented as a column of numbers.

Matrix: A matrix is a two-dimensional array of rows and columns, which can represent a system of linear equations or a linear transformation.

Vector space: A vector space is a set of vectors satisfying certain properties, such as closure under addition and scalar multiplication.

Linear transformation: A linear transformation is an operation that maps one vector space to another. It preserves linear combinations (and hence collinearity).

System of linear equations: A system of linear equations is a set of equations in which each equation is of degree 1 in the variables, i.e., linear.

Eigenvalues and eigenvectors: For a matrix, an eigenvalue is a scalar and an eigenvector is a non-zero vector such that the matrix times the vector equals the eigenvalue times that vector (Av = λv).

Inner product and outer product: The inner product is an operation between vectors that measures the angle between them and their lengths; the outer (cross) product produces a new vector perpendicular to the original vectors.

Determinant: A determinant is a scalar value composed from the elements of a square matrix according to specific rules; it is used to compute the inverse of a matrix, determine whether a matrix is singular, and so on.

But once you look at the simple CPU implementation [6], you will find it is actually quite simple (currently only scalars, vectors, and matrices are supported; higher-dimensional tensors are not supported for now).

2. Simple implementation of the CPU version

/**
 * Supports: 1) scalars; 2) vectors; 3) matrices.
 * Higher-dimensional tensors are not supported yet; they can be
 * added later through a Tensor subclass.
 */
public class NdArray {

    protected Shape shape;

    /**
     * The actual data, stored as float32.
     */
    private float[][] matrix;
}

/**
 * Represents the shape of a matrix or vector.
 */
public class Shape {

    /**
     * Number of rows.
     */
    public int row = 1;

    /**
     * Number of columns.
     */
    public int column = 1;

    public int[] dimension;
}

In fact, these are all operations on two-dimensional arrays, in roughly three categories: simple initialization functions; the four basic arithmetic operations (addition, subtraction, multiplication, division); and operations that change the shape of the matrix. Of these, the inner product has the longest implementation.
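As an illustration of what such an operation looks like, here is a minimal sketch (not TinyDL's actual NdArray code) of the naive inner product over float[][]; note the three nested loops:

// A minimal sketch (not TinyDL's actual code) of a naive inner product
// over float[][]: three nested for loops, O(n^3) scalar multiplications.
public static float[][] dot(float[][] a, float[][] b) {
    int rows = a.length;        // rows of a
    int cols = b[0].length;     // columns of b
    int inner = b.length;       // shared dimension: columns of a == rows of b
    float[][] result = new float[rows][cols];
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            float sum = 0f;
            for (int k = 0; k < inner; k++) {
                sum += a[i][k] * b[k][j];
            }
            result[i][j] = sum;
        }
    }
    return result;
}

The triple nesting above is exactly what the next subsection is about.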

3. Why do you need a GPU?

As the matrix operations above show, on the CPU even a simple inner product requires multiple levels of nested for loops. The CPU architecture is designed for control flow and branching, and is not well suited to parallel arithmetic. The rows and columns of matrix operations can in fact be computed in parallel, so the matrix operations deep learning relies on are extremely inefficient on the CPU. For a more visual comparison, see the figure below: compared with the CPU, the GPU has weak control logic units (blue cells) but a large number of ALUs (arithmetic logic units, green cells).

[Figure: CPU vs. GPU: few control units on the GPU, but many ALUs]

Most deep learning frameworks (TensorFlow, PyTorch, etc.) provide GPU support, making parallel computation easy. With the explosion of chatGPT this year, the GPU has become AI infrastructure and quickly established itself as the new mainstream computing paradigm.

3. Computational graphs and automatic differentiation

Now we come to the second layer of the deep learning framework: the func layer, which implements two of the most important features of a deep learning framework, computational graphs and automatic differentiation.

1) A computational graph is a graphical representation describing the flow of data and the dependencies between operations in a computation. In deep learning, both the forward and the backward propagation of a neural network can be represented as computational graphs.

2) Automatic differentiation is a technique for computing the derivatives or gradients of functions. In deep learning, the backpropagation algorithm is an automatic differentiation method used to compute the gradient of each parameter in a neural network with respect to the loss function.

By combining computational graphs with automatic differentiation, the gradients of the large number of parameters in complex neural networks can be computed efficiently, enabling model training and optimization.

1. Numerical and analytic differentiation

1.1. Numerical Differentiation

The derivative is the slope of a function's graph at a point, i.e., the limit of the ratio of the increment in the ordinate (Δy) to the increment in the abscissa (Δx) as Δx → 0. The differential is the increment in the ordinate of the tangent line at a point when the abscissa gains an increment Δx, generally written dy.

[Figure: The derivative as the slope of the tangent line]

Numerical differentiation approximates the derivative of a function by numerical methods: it estimates the derivative at a point by computing finite differences of the function near that point. The central difference approximates the derivative at a point using the function values just before and after it: f'(x) ≈ (f(x + h) - f(x - h)) / (2h), where h is the step size; the smaller the step, the more accurate the result. Numerical differentiation is an approximation, so there is always some error between the computed result and the true derivative.
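A minimal, self-contained sketch of the central-difference formula in plain Java (illustrative; not part of TinyDL's API):

import java.util.function.DoubleUnaryOperator;

public class CentralDiff {
    // f'(x) ≈ (f(x + h) - f(x - h)) / (2h)
    static double diff(DoubleUnaryOperator f, double x, double h) {
        return (f.applyAsDouble(x + h) - f.applyAsDouble(x - h)) / (2 * h);
    }

    public static void main(String[] args) {
        DoubleUnaryOperator f = x -> x * x;      // f(x) = x^2, so f'(x) = 2x
        System.out.println(diff(f, 3.0, 1e-4));  // prints approximately 6.0
    }
}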

1.2. Analytic differentiation

Analytic differentiation is the other method in calculus, used to compute the exact derivative of a function at a point. It derives the derivative by applying the definition of the derivative and the basic differentiation rules to the function's expression, which must first be known. For example, given f(x) = x^2 + 2x + 1, the rules directly yield f'(x) = 2x + 2. Here are the analytic derivatives of some commonly used functions:

[Figure: Analytic derivatives of some common functions]

2. Computational graphs

A computational graph is a directed graph in which nodes correspond to mathematical operations; it is a way of expressing and evaluating a mathematical expression. For example, for the equation g = (x + y) * z, the computational graph is drawn below.

[Figure: Computational graph of g = (x + y) * z]

The computational graph has an addition node (marked "+") and a multiplication node (marked "*"), with three input variables x, y, z and one output g.
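As a quick worked check: with x = 1, y = 2, z = 3, the forward pass gives g = (1 + 2) * 3 = 9. Going backward, ∂g/∂z = x + y = 3, and ∂g/∂x = ∂g/∂y = z = 3, obtained by passing the output gradient 1 first through the multiplication node and then through the addition node.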

Now let's look at a more complex function:

[Figure: A more complex function f(x, y)]

The function f(x, y) is computed as follows:

[Figure: Computational graph of f(x, y)]

The advantage of computational graphs is that they represent complex computations clearly and make backpropagation and parameter updates convenient. In deep learning, computational graphs are often used to build neural network models: each node represents a layer or an operation, and the edges represent the flow of data between layers. By constructing a computational graph, a complex computation is decomposed into a series of simple operations, and the backpropagation algorithm can then compute the gradient at each node, enabling the optimization and training of model parameters.

3. Backpropagation

Here is a vivid example from the internet to illustrate backpropagation. Suppose we want to buy fruit. In everyday thinking this is trivial: compute the price, pay, done. But the process can be abstracted into a computational graph with several steps. In the figure below, the price of the apples is multiplied by the quantity and then by the consumption tax to get the final amount to pay.

[Figure: Forward propagation in the apple-buying example]

The above process is forward propagation. Backpropagation, as the name suggests, runs in the opposite direction, though the computation differs slightly. Suppose, in the fruit-buying example above, we want to know the derivative of the final price with respect to the apple price: how should we compute it?

Forward propagation computes the model's final result; what, then, is backpropagation for? It obtains the coefficient of influence of each parameter in the model on the result, so the parameters can be adjusted according to those coefficients to improve the model. Clearly this requires computing backward layer by layer, multiplying coefficients along the way; these coefficients are precisely the derivatives of each layer.

[Figure: Backpropagation in the apple-buying example]

Backpropagation is indicated by arrows (thick lines) opposite to the forward direction; it passes "local derivatives", with the derivative values written under the arrows. In this example, backpropagation passes derivative values from right to left (1 → 1.1 → 2.2). The result shows that the "derivative of the amount paid with respect to the price of apples" is 2.2: if the price of an apple rises by 1 yen, the final payment increases by 2.2 yen (strictly speaking, if the price increases by some tiny amount, the final payment increases by 2.2 times that tiny amount).
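To check the numbers (assuming the usual setup of this classic example: 100 yen per apple, 2 apples, 10% consumption tax): forward, 100 × 2 = 200 and 200 × 1.1 = 220 yen paid. Backward, start from d(pay)/d(pay) = 1; the tax multiplication node passes 1 × 1.1 = 1.1 back to the subtotal, and the quantity multiplication node passes 1.1 × 2 = 2.2 back to the apple price, which is exactly the 1 → 1.1 → 2.2 sequence above.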

The chain rule

The chain rule is an important theorem in calculus used to differentiate composite functions; the partial derivative is the differentiation of a multivariate function with respect to one of its variables, and the chain rule applies there too. Suppose y = f(u) and u = g(x), so that y is a function of x. The chain rule says the derivative of y with respect to x is the product of the derivative of f with respect to u and the derivative of g with respect to x: dy/dx = (dy/du) * (du/dx). To differentiate a complex function, the derivatives are chained together. A neural network is, in essence, one big composite function, and training it amounts to differentiating that composite function.

[Figure: The chain rule on a computational graph]
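As a small worked instance of the rule: let y = u^2 with u = 3x + 1. Then dy/du = 2u and du/dx = 3, so dy/dx = (dy/du) * (du/dx) = 2u * 3 = 6(3x + 1). Each factor is a "local derivative"; backpropagation simply multiplies these local derivatives along the edges of the computational graph.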

It is hard to explain this simply; after all, calculus is among the hardest general-education courses at university, and many first-year students fail on this very tree, advanced mathematics. Interested readers can refer to "Mathematics of Deep Learning" [7].

4. Design and implementation of the func layer

The core classes of the func layer are Function and Variable, corresponding to the abstract mathematical form y = f(x), where x and y are Variable instances and f() is a Function. Each concrete function must implement two methods: forward and backward.

/**
 * Forward pass of the function.
 *
 * @param inputs input values
 * @return output value
 */
public abstract NdArray forward(NdArray... inputs);

/**
 * Backward pass of the function: computes the derivatives.
 *
 * @param yGrad gradient flowing back from the output
 * @return gradients with respect to each input
 */
public abstract List<NdArray> backward(NdArray yGrad);

For example, the Sigmoid function is implemented as follows:

public class Sigmoid extends Function {

    @Override
    public NdArray forward(NdArray... inputs) {
        return inputs[0].sigmoid();
    }

    @Override
    public List<NdArray> backward(NdArray yGrad) {
        // For y = sigmoid(x), dy/dx = y * (1 - y),
        // so the input gradient is yGrad * y * (1 - y).
        NdArray y = getOutput().getValue();
        return Collections.singletonList(yGrad.mul(y).mul(NdArray.ones(y.getShape()).sub(y)));
    }

    @Override
    public int requireInputNum() {
        return 1;
    }
}

Variable is implemented as follows. It represents the abstraction of a mathematical variable, and its backward method is the entry point for backpropagation.

/**
 * Abstraction of a variable in mathematics.
 */
public class Variable {

    private String name;

    private NdArray value;

    /**
     * Gradient of this variable.
     */
    private NdArray grad;

    /**
     * The Function that produced this Variable (its creator).
     */
    private Function creator;

    private boolean requireGrad = true;

    /**
     * Entry point of backpropagation for this variable.
     */
    public void backward() {}
}

The overall implementation of the func layer is shown in the class diagram below:

[Figure: Class diagram of the func layer]
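To make the mechanics concrete, here is a hedged usage sketch; the exact constructor and method signatures (call(), the way grad is read) are illustrative assumptions rather than a guaranteed mirror of TinyDL's API:

// Hedged sketch: method names are illustrative assumptions.
Variable x = new Variable(NdArray.ones(new Shape(2, 2)));
Variable y = new Sigmoid().call(x);  // forward: Sigmoid records itself as y's creator
y.backward();                        // backward: walk the creator chain and fill x's grad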

4. Neural Networks and Deep Learning

A neural network consists of many interconnected neurons, and processes and learns information by simulating the nervous system of living organisms. Its design is inspired by the interactions between neurons in the human brain. In a neural network, each neuron receives inputs from other neurons, computes a weighted sum of those inputs, passes the result through an activation function, and produces an output. The learning process is usually realized by adjusting the weights of the connections between neurons, allowing the network to make predictions and classifications based on input data. The figure below shows a biological neuron alongside its mathematical model.

[Figure: A biological neuron and its mathematical model]

A neural network is a hierarchy of nodes: each node processes its input through a weighted sum and a nonlinear activation function, and passes the result to the nodes of the next layer. Deep learning refers to deep neural networks that use many hidden layers. Neural networks are the basic model of deep learning; deep learning introduces multi-layer network structures on that basis, which can automatically learn more abstract, higher-level feature representations.

1. Error backpropagation algorithm

In 2006, Hinton et al. proposed a deep neural network model based on unsupervised learning. Its training addressed the vanishing-gradient problem with a new method: layer-by-layer pre-training broke the long-standing dilemma that deep networks were hard to train, laid the foundation for the development of deep learning, and effectively created the field. Modern deep learning, however, is still trained with the error backpropagation algorithm, thanks mainly to new activation functions, improved regularization and parameter-initialization methods, and the efficiency of training the whole network with gradient descent.

The backpropagation algorithm is the main method for training neural networks. It is based on gradient descent: compute the gradient of the loss function with respect to the network parameters, then adjust the parameters in the direction opposite to the gradient so that the network's output moves closer to the true values. Using the chain rule, the algorithm propagates the error between the network's output and the true value backward, layer by layer, to the input layer, computing the gradient of each layer's parameters along the way. The details are as follows:

  1. Forward propagation: pass the input sample through the network's forward computation to obtain the network's output value.
  2. Compute the loss function: feed the difference between the network's output and the true value into the loss function to measure the error between the predicted and true values.
  3. Backpropagation: based on the value of the loss function, compute the gradient of each parameter with respect to the loss, layer by layer. By the chain rule, the gradient from the layer above is multiplied by the derivative of the current layer's activation with respect to its input to obtain the current layer's gradient, which is then passed on to the layer below, until the input layer is reached.
  4. Update network parameters: using gradient descent, update each parameter in the direction opposite to its gradient so that the loss gradually decreases.
  5. Repeat the above steps until the stopping condition is met or the maximum number of iterations is reached.

The core idea of the backpropagation algorithm is to compute each layer's parameter gradients and update the parameters layer by layer, so that the network approximates the true values; a short sketch of the resulting training loop follows the figure below.

[Figure: Error backpropagation through the layers of a network]
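Mapped onto code, the five steps above become a short training loop. This is a hedged sketch with illustrative names (model, lossFunc, optimizer), not necessarily TinyDL's exact API:

// Hedged sketch of the training loop; names are illustrative.
for (int epoch = 0; epoch < maxEpoch; epoch++) {      // 5. repeat until done
    Variable predictY = model.forward(trainX);        // 1. forward propagation
    Variable loss = lossFunc.loss(trainY, predictY);  // 2. compute the loss
    model.clearGrads();                               // clear stale gradients
    loss.backward();                                  // 3. backpropagation
    optimizer.update();                               // 4. update parameters
}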

A common challenge for error backpropagation in deep networks is the vanishing-gradient problem. Many methods have been proposed to address it: 1) activation function choice: nonlinear activations such as ReLU (Rectified Linear Unit) or Leaky ReLU help alleviate vanishing gradients; 2) weight initialization: proper initialization, for example with a suitably small variance, keeps gradients at a reasonable scale; 3) Batch Normalization: normalizing over each mini-batch helps stabilize gradients and accelerates training; 4) residual connections: skip connections let activations and gradients propagate directly through the network; 5) gradient clipping: bounding the gradient with a threshold prevents gradient explosion and, to some extent, mitigates vanishing gradients. These methods can be used individually or in combination, and they have contributed to the explosion of deep learning.
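A one-line calculation shows why the choice of activation matters here: for the sigmoid, the local derivative is y(1 - y), which is at most 0.25 (reached at y = 0.5). The chain rule multiplies one such factor per layer, so across n sigmoid layers the backpropagated gradient is bounded by roughly 0.25^n and vanishes quickly with depth; ReLU's derivative is exactly 1 for positive inputs, which is one reason it alleviates the problem.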

2. Stacking layers and blocks

To implement complex networks, the concept of the neural-network block is generally introduced. A block can describe a single layer, a component made of multiple layers, or the entire model itself. One benefit of working with blocks is that they can be combined into larger components, often recursively. By defining code that generates blocks of arbitrary complexity on demand, we can implement complex neural networks with concise code.

[Figure: Layers composed into blocks, blocks composed into a model]
public interface LayerAble {

    String getName();

    Shape getXInputShape();

    Shape getYOutputShape();

    void init();

    Variable forward(Variable... inputs);

    Map<String, Parameter> getParams();

    void addParam(String paramName, Parameter value);

    Parameter getParamBy(String paramName);

    void clearGrads();
}


/**
 * A block: a larger piece of a neural network, composed of layers.
 */
public abstract class Block implements LayerAble

/**
 * A concrete layer in a neural network.
 */
public abstract class Layer extends Function implements LayerAble

The following figure is the overall class diagram of the nnet layer. The core revolves around the Layer and Block implementations, where Block is to Layer as a directory is to a file: Block is a container of Layers, and everything else is implemented around them. Behind each Layer or Block implementation is a well-known academic paper, with deep mathematical derivations explaining why adding that type of layer to a network is effective:

[Figure: Overall class diagram of the nnet layer]
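For intuition, stacking layers into a block might look like the hedged sketch below; SequentialBlock, LinearLayer, and ReLuLayer are illustrative names, not necessarily the classes TinyDL ships:

// Hedged sketch: class names are illustrative assumptions.
Block mlp = new SequentialBlock("mlp", inputShape, outputShape);
mlp.addLayer(new LinearLayer("fc1", inputShape, hiddenShape));
mlp.addLayer(new ReLuLayer("act1"));
mlp.addLayer(new LinearLayer("fc2", hiddenShape, outputShape));
Variable y = mlp.forward(x);  // forward() recursively calls each contained layer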

5. Machine learning and modeling

Let's first sort out the relationship between machine learning and deep learning. As shown below, deep learning is just one branch of the development of neural networks, and in turn only a branch of machine learning, a special case of its models.

[Figure: The relationship between machine learning, neural networks, and deep learning]

Corresponding to the broader field of machine learning, there is a set of general components: datasets, loss functions, optimization algorithms, trainers, inferencers, effect evaluators, and so on. In TinyDL, these general machine learning components are not tightly bound to deep learning; they are implemented as a separate layer, which also makes it easy to add more non-neural-network models later, such as random forests and support vector machines.

The following figure shows the overall implementation of the mlearning layer:

[Figure: Overall implementation of the mlearning layer]

1. Datasets

The DataSet component loads and preprocesses data into a format the model can learn from. At present there are a few simple data-source implementations based on ArrayDataset, which loads all data into memory at once, such as SpiralDateSet and MnistDataSet.

[Figure: DataSet class hierarchy]

2. Loss function

The loss function measures the difference between the model's predictions and the actual values, i.e., the model's prediction error. It is the objective of model optimization: minimizing the loss function brings the model's predictions closer to the actual values.

The choice of loss function depends on the type of problem: classification, regression, or another task. Common loss functions include: 1) Mean Squared Error (MSE): used in regression; the mean of the squared differences between predicted and true values. 2) Cross entropy: used in classification; compares the predicted class probability distribution with the true class distribution. 3) Log loss: also for classification; based on log-likelihood, it measures the difference between a binary classifier's predicted probabilities and the true labels. The loss function should match the task and the data characteristics; an appropriate choice yields better model performance and training results.

At present, TinyDL implements the two most common losses, MSE and SoftmaxCrossEntropy. SoftmaxCrossEntropy turns a regression-style output into a classification by mapping the raw outputs to a probability distribution over classes. With SoftmaxCrossEntropy, such problems become multi-class classification, and the model is trained and optimized by minimizing the SoftmaxCrossEntropy loss. At prediction time, the class with the highest probability is taken as the result.

[Figure: Loss function implementations in TinyDL]
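As a concrete reference, mean squared error itself is only a few lines. The sketch below works on plain float arrays and is illustrative, not TinyDL's actual MeanSquaredError implementation:

// Illustrative MSE: the mean of the squared prediction errors.
static float mse(float[] predict, float[] label) {
    float sum = 0f;
    for (int i = 0; i < predict.length; i++) {
        float diff = predict[i] - label[i];
        sum += diff * diff;
    }
    return sum / predict.length;
}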

3. Optimizer

Commonly used optimizers in machine learning include: 1) Stochastic Gradient Descent (SGD): uses one sample at a time to compute the gradient and update parameters; faster than full-batch gradient descent. 2) Batch gradient descent: uses the whole training set for each gradient computation and update; converges slowly but is stable. 3) Momentum: accelerates gradient descent by introducing a momentum term that takes previous gradient changes into account when updating, reducing oscillation. 4) AdaGrad: adapts the learning rate per parameter based on its gradient history, lowering it for frequently updated features and raising it for sparse ones. 5) Adam: combines the advantages of momentum and RMSProp, and usually converges faster with better performance. Currently TinyDL implements the two most common ones, SGD and Adam:

[Figure: Optimizer implementations in TinyDL]
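The core of SGD is a one-line update per parameter, w ← w − lr · ∇w. Below is a hedged sketch; the Parameter and NdArray method names (getValue, getGrad, setValue, sub, mulNum) are illustrative assumptions, with only getParams() grounded in the LayerAble interface above:

// Hedged sketch of one SGD step over all model parameters.
for (Parameter param : model.getParams().values()) {
    NdArray updated = param.getValue().sub(param.getGrad().mulNum(learningRate));
    param.setValue(updated);  // w = w - lr * grad
}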

6. Application tasks and a quick test drive

Deep learning is applied in many fields. In computer vision it powers image classification, object detection, object recognition, face recognition, image generation, and more; in natural language processing it is applied to machine translation, text classification, sentiment analysis, semantic understanding, question-answering systems, and products such as intelligent assistants and social-media analysis. The two main AIGC directions of today's large models are text-to-image and text-to-text, with Stable Diffusion and the Transformer as the main model architectures. All of this relies on strong, complete foundational capabilities; TinyDL-0.01 can only support the training and inference of some small models, as follows:

1. Fitting of straight lines and curves

[Figure: Fitting straight lines and curves]

2. Spiral data classification problem

[Figure: Spiral data classification results]

In the third panel of the figure, you can see that the model learns very clear classification boundaries.

3. Handwritten digit classification

[Figure: Examples of handwritten digit images (MNIST)]

In deep learning, handwritten digit classification is a classic problem, often used to introduce and learn deep learning algorithms; the goal is to correctly classify images of handwritten digits into the corresponding numbers.

After a simple 50 epochs of training, the loss dropped from 1.830 to 0.034. On the test dataset, prediction accuracy reached 96% (the best-known accuracy is about 99.8%).

[Figure: Training loss curve for handwritten digit classification]

4. Fitting a cosine curve with an RNN

The computational graphs of the RNN are as follows:

With a recurrent hidden-layer window of 3, the computational graph is:

[Figure: RNN computational graph with a hidden-layer window of 3]

With a recurrent hidden-layer window of 5, the computational graph is:

[Figure: RNN computational graph with a hidden-layer window of 5]

7. Summary

1. TinyDL is only a demo-level framework, friendly for Javaers getting started with AI

I have always believed that clean code speaks for itself, and I strive to let the code be its own documentation. Having written so much above, the best way to learn is really to debug the code directly. TinyDL is only a demo-level deep learning framework for Javaers learning AI; it is not currently suitable for production use, but I hope it plays a small role in helping Java programmers embrace AI. As things stand, Python's ecosystem advantage in AI is already overwhelming: Python has been a math-expression-friendly language from the start, with heavily overloaded operators and better mathematical readability. So if you want to truly use the capabilities of AI, Python is a hurdle that cannot be bypassed.

2. TinyDL-0.02 plans to fill in the missing capabilities and support some advanced network features

TinyDL-0.02 hopes to implement a Transformer prototype on top of the seq2seq framework. Model training is currently the simplest single-threaded loop; a prototype of distributed network training with a parameter server will be implemented in the future.

3. Using chatGPT to assist in writing code can really improve efficiency

About one third of the code in TinyDL was written with the help of chatGPT. As the recently popular saying goes: programmers may become their own gravediggers, and after chatGPT the gears of the coder's fate begin to turn in reverse.


Reference Links:

[1]https://github.com/Leavesfly/linux-0.01/blob/master/README

[2]https://github.com/Leavesfly/TinyDL-0.01

[3]https://github.com/deeplearning4j/deeplearning4j

[4]https://github.com/deepjavalibrary/djl

[5]https://github.com/Leavesfly/TinyDL-0.01

[6]https://github.com/Leavesfly/TinyDL-0.01/blob/main/src/main/java/io/leavesfly/tinydl/ndarr/NdArray.java

[7] "Mathematics of Deep Learning" (PDF): https://github.com/jash-git/Jash-good-idea-20200304-001/blob/master/CN%20AI%20book/

Author: Yamazawa

Source: WeChat public account: Alibaba Cloud developer

Source: https://mp.weixin.qq.com/s/tFl0RQd3ex98_SAOIIfM_Q
