
Matrix factorization can beat deep learning: MIT releases time series database tspDB for machine learning with SQL

Reporting by XinZhiyuan

Editor: LRS

Time series forecasting is often trickier than ordinary machine learning: it requires not only maintaining an ever-growing database but also making predictions in real time. Recently, researchers at MIT released a database that can create machine learning models through SQL, so you no longer have to worry about managing time series data!

The only lesson that mankind has learned from history is that mankind cannot learn any lesson from history.

"But machines can learn." - Wozkishod

Whether it is predicting tomorrow's weather, forecasting future stock prices, spotting suitable opportunities, or estimating a patient's risk of disease, all of these problems can be framed in terms of time series data: a collection of observations recorded over a period of time.

But making predictions from time series data usually requires several preprocessing steps and complex machine learning algorithms, and it is not easy for non-specialists to understand the principles and usage scenarios of these algorithms.

Recently, researchers from MIT developed a powerful system tool, tspDB, that makes time series data easier to work with by integrating forecasting capabilities directly on top of existing time series databases. The system wraps several sophisticated models, so even non-experts can make a prediction in a matter of seconds. When predicting future values and filling in missing data points, the new system is more accurate and efficient than state-of-the-art deep learning methods. The paper was presented at the ACM SIGMETRICS conference.

Address of the paper: http://proceedings.mlr.press/v133/agarwal21a/agarwal21a.pdf

The main reason for the performance improvement of tspDB is that it uses a novel time series forecasting algorithm that is particularly effective in predicting multivariate time series data. Multivariate refers to data that has more than one time-dependent variable, for example in a weather database, the current values for temperature, dew point, and cloud cover all depend on their respective past values.

The algorithm can also estimate the volatility (time-varying variance) of a multivariate time series, giving users a measure of confidence in the model's prediction accuracy.

The authors say that even as time series data becomes more complex, the algorithm can still effectively capture the structure of the series.

Author Anish Agarwal received his Ph.D. from MIT; his key research interests include the interaction of causal inference and machine learning, high-dimensional statistics, and the economics of data. He joined the Simons Institute at the University of California, Berkeley as a postdoctoral fellow in January 2022.

The right way to process time series data

A major bottleneck in today's machine learning workflows is that data processing is too time-consuming and the intermediate steps are prone to errors. Developers need to pull data out of a data store or database and then apply machine learning algorithms for training and prediction, which involves a great deal of manual data handling.

This is getting worse as machine learning swallows more and more data, which becomes ever harder to manage. Good data management matters most for real-time forecasting, especially in time series application scenarios such as finance and real-time control.

If you could make predictions directly inside the database, wouldn't that save the step of fetching the data?

But such a prediction system integrated into the database must not only provide an intuitive prediction query interface that avoids duplicated data engineering; it must also deliver state-of-the-art (SOTA) accuracy, support incremental model updates, and offer short training times and low prediction latency.

tspDB integrates directly with PostgreSQL and natively supports multiple machine learning algorithms, such as generalized linear models, random forests, and neural networks; hyperparameters can also be tuned inside the database while training models.

Unlike other databases, an important starting point for tspDB is how the "end user" interfaces with the system to obtain predicted values.

To make the machine learning interface more broadly usable, tspDB takes a different approach: it abstracts the machine learning model away from the user and strives to serve both standard database queries and prediction queries through a single interface, namely SQL.

In tspDB, prediction queries take the same form as standard SELECT queries. The difference is that a prediction query returns model output, while a normal query retrieves stored values.

For example, suppose the database holds only 100 rows of data. To predict the value on day 101, you can use the PREDICT keyword with WHERE day = 101; with WHERE day = 10 instead, the query is parsed as a request for the estimated (denoised) value of the stock price on day 10, so PREDICT can also be used to impute missing values.

To run a PREDICT query, the user must first build a prediction model on the existing multivariate time series data. The CREATE keyword builds a predictive model in tspDB, and the input features can span multiple columns.
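The single-interface idea can be sketched in a few lines of Python. This is not tspDB's actual API; the class, the trivial linear-trend model, and the data below are all invented for illustration. The point is only that one "predict" call covers both an in-range query (returning a denoised estimate) and an out-of-range query (returning a forecast):

```python
import numpy as np

class TinyPredictDB:
    """Toy sketch of tspDB's idea: one query interface serves both
    retrieval (SELECT-style) and model-based prediction (PREDICT-style).
    This is NOT tspDB's real API; everything here is illustrative."""

    def __init__(self, values):
        self.values = np.asarray(values, dtype=float)
        # "CREATE" step: fit a trivial model (a linear trend) on the data.
        days = np.arange(len(self.values))
        self.slope, self.intercept = np.polyfit(days, self.values, 1)

    def select(self, day):
        # Standard SELECT: return the stored (noisy) observation.
        return self.values[day]

    def predict(self, day):
        # PREDICT: inside the observed range -> denoised estimate,
        # beyond it -> forecast. The query form is the same either way.
        return self.slope * day + self.intercept

# 100 noisy observations of a linear trend (days 0..99).
rng = np.random.default_rng(0)
db = TinyPredictDB(2.0 * np.arange(100) + rng.normal(0.0, 1.0, 100))

stored = db.select(10)      # raw stored value for day 10
denoised = db.predict(10)   # model's denoised estimate for day 10
forecast = db.predict(101)  # forecast for the unseen day 101
```

In tspDB itself, the equivalent steps are expressed in SQL rather than through a Python object, but the contract is the same: build the model once, then query past and future timestamps identically.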

Compared with plain PostgreSQL, creating a predictive model in tspDB on a standard multivariate time series dataset takes 0.58 to 1.52 times as long as a PostgreSQL bulk insert. In terms of query latency, answering a PREDICT query in tspDB takes 1.6 to 2.8 times as long as answering a standard SELECT query.

In absolute terms, this equates to 1.32 milliseconds to answer a SELECT query and 3.36/3.45 milliseconds to answer an imputation/forecasting PREDICT query.

In other words, tspDB's performance is close to the time it takes PostgreSQL to insert and read data, so it can realistically serve real-time prediction systems.

tspDB is currently a proof of concept, implemented as a PostgreSQL extension: users can create predictive models over single or multiple columns of a time series relation and obtain estimates with prediction intervals. Most importantly, the code is open source.

Code link: https://github.com/AbdullahO/tspdb

The paper also proposes a forecasting algorithm based on matrix factorization: the multivariate time series data is stacked into a Page matrix, the matrix is factorized (denoised), the last row of the sub-matrix is taken as the prediction target, and linear regression is used to predict the target value.

For the constant influx of time series data, the algorithm also supports incremental model updates.

To test the algorithm's performance, the researchers selected three real-world datasets: Electricity, Traffic, and Finance. Accuracy is measured with the Normalized Root Mean Squared Error (NRMSE). To quantify the statistical accuracy of the different methods, the researchers also used a weighted variant of the standard Borda Count (WBC) as an evaluation indicator: a value of 0.5 means the algorithm performs on par with the other algorithms, 1 means it has an absolute advantage over all of them, and 0 represents an absolute disadvantage.

Comparing tspDB's predictive performance against the most popular time series methods in academia and industry, such as LSTM, DeepAR, TRMF, and Prophet, shows that tspDB performs on par with the deep learning algorithms (DeepAR and LSTM) and surpasses TRMF and Prophet.

When varying the proportion of missing values and the amount of added noise, tspDB was the best-performing method in 50% of the experiments and at least second best in 80% of them. On both the WBC and NRMSE metrics, tspDB outperforms all other algorithms on the electricity and finance datasets and rivals DeepAR and LSTM on the traffic dataset.

For variance estimation, since the true underlying time-varying variance cannot be recovered from real-world data, the researchers restricted the analysis to synthetic data. Synthetic Dataset II consists of nine multivariate time series, each combining different additive dynamics with a different observation noise model (Gaussian, Poisson, or Bernoulli).
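The three observation models mentioned above can be generated as follows. This is an illustration of the general construction, not the paper's exact dataset; the latent mean signal and noise parameters are invented:

```python
import numpy as np

def observe(latent_mean, model, rng):
    """Produce noisy observations of a latent mean series under the
    three observation models the article mentions. Illustrative only;
    the paper's exact dataset construction may differ."""
    if model == "gaussian":
        return latent_mean + rng.normal(0.0, 0.5, size=latent_mean.shape)
    if model == "poisson":
        # Poisson noise: the variance equals the (nonnegative) mean.
        return rng.poisson(np.clip(latent_mean, 0, None)).astype(float)
    if model == "bernoulli":
        # Bernoulli noise: the latent mean acts as a probability.
        p = np.clip(latent_mean, 0.0, 1.0)
        return rng.binomial(1, p).astype(float)
    raise ValueError(f"unknown model: {model}")

rng = np.random.default_rng(0)
t = np.arange(2000)
mean = 0.5 + 0.4 * np.sin(2 * np.pi * t / 100)  # dynamic additive mean in (0, 1)
obs = {m: observe(mean, m, rng) for m in ("gaussian", "poisson", "bernoulli")}
```

Because the latent mean and variance are known by construction, such synthetic series let the researchers check an estimator's variance output against ground truth, which real data cannot provide.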

The experimental results show that tspDB outperforms TRMF and DeepAR (used for forecasting) in all of these experiments (>98%).

Overall, these experiments demonstrate the robustness of tspDB: the effect of noise can largely be removed when estimating both the mean and the variance of a time series.

Resources:

https://news.mit.edu/2022/tensor-predicting-future-0328
