
In recent years, driven by national policy, big data has become deeply integrated with a wide range of industries, and digital economy sectors built on big data, artificial intelligence, blockchain, and the industrial Internet have flourished.
Big data is an important foundation of the digital economy and holds great potential value. For industrial enterprises in particular, using the digital advantages of "linking, circulation, simulation, feedback, and integration" to achieve digital transformation and upgrading is vital to future development.
As the saying goes, "to do a good job, one must first sharpen one's tools." How to operate and manage data well and release its value is an important test of a big data platform's data processing capability.
On October 22, 2021, under the guidance of the Shanghai Municipal Commission of Economy and Informatization and the Shanghai Municipal Science and Technology Commission, Data Ape and the Shanghai Big Data Alliance jointly organized the "Rubik's Cube Big Data Series Forum: Digital Intelligent Transformation and Upgrading". At the forum, Li Guang, co-founder of Taosi Data, pointed out the pain points of big data applications from the perspective of the industrial Internet and shared Taosi Data's distinctive big data processing methods and industry application cases.
According to public information, Taosi Data was founded in May 2017 as a startup focused on big data solutions. The company developed the Internet of Things big data platform TDengine, for which it holds independent intellectual property rights. It officially released the platform as open source in July 2019 and open sourced the cluster version in August 2020. TDengine has gained a large number of customers and has repeatedly ranked first on GitHub's global trending list, making it a highly popular open source project.
In May this year, Taosi Data completed a $47 million Series B round of financing with backing from leading investors such as Matrix Partners China, Sequoia Capital China Fund, and GGV Jiyuan Capital.
Pain points: Industrial data is large, opaque, and difficult to coordinate
Industrial data is different from data in other fields and has its own characteristics. From the perspective of the underlying data, Li Guang believes that there are three major pain points in the application of data in the industrial Internet industry.
Pain point one: industrial data is large and difficult to process. Industrial data acquisition involves a large number of device endpoints. A single factory, for example, may generate tens of billions of data points every day, with storage reaching the terabyte level. Handling data at this order of magnitude is a major difficulty.
Pain point two: industrial equipment data is opaque during overhaul, maintenance, and operation, which makes the digital transformation of industrial enterprises harder.
Pain point three: in the field of industrial control, most domestic enterprises still rely heavily on foreign industrial software and lack independent, controllable technologies and product solutions. From the edge to the cloud, and from the field station side to the group center side, efficient data coordination faces many problems.
How can these pain points be solved? Li Guang suggests cutting through the complexity and abstracting industrial Internet data as a whole into four steps of data circulation and storage: first, data collection and transmission; second, data access; third, data storage and analysis; and fourth, data application. The third step, data storage and analysis, is the core.
As for data storage and analysis, Li Guang observed that existing approaches in China either use a traditional industrial real-time database or rely on the full open source big data stack from the Internet industry. Both are inefficient, and such a sprawling architecture is very difficult to maintain in industrial settings, where IT staff are relatively scarce. In addition, Li Guang said that some leading foreign industrial control companies still build their data processing solutions on traditional architectures, which struggle to meet big data requirements such as high concurrency, easy scaling, rapid expansion, and SaaS or service-based delivery.
Small product, big impact
To address these industry pain points and meet customer needs, Taosi Data has built an efficient data processing solution specifically for the Internet of Things: the IoT big data platform TDengine.
TDengine is an all-in-one time series database platform built for the Internet of Things. It abandons the traditional Hadoop stack and integrates the message queue, caching, database, stream computing, and data subscription functions needed for big data processing into a single product, addressing the performance problems of data processing, data storage, and technical architecture in one place.
While powerful, the product is only a few megabytes, occupies very little memory, and uses a distributed architecture that scales on demand to handle different data processing scales.
Since being open sourced, TDengine has been widely praised by customers. Li Guang attributes this mainly to Taosi Data's clear understanding of the characteristics of industrial IoT data and of how that data is used. He further explained that industrial IoT data has three main characteristics:
1. Unlike general data, data collected in industry is streaming data with timestamps;
2. Most of it consists of measured values, which are very stable, and each data source is unique;
3. The value of data decreases as it ages. Moreover, in the Internet of Things a single data point has little value; the value lies in analyzing the data as a whole.
In terms of data use, Li Guang started from the market and summarized the application needs of data in the field of industrial Internet in three aspects:
The first is whether the platform can support continuous data writes. The second is whether it can support queries along the time and tag dimensions, as well as aggregation and cross-sectional queries. Taking aggregate queries as an example, the platform needs to aggregate data from all devices for calculation, which involves operations specific to time series, such as differences, time windows, and downsampling. Whether the platform has these processing capabilities matters a great deal to customers. The third is whether it can compress and store data effectively, that is, reduce storage space without affecting how the industry queries and uses the data.
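To make the query operations named above concrete, here is a minimal Python sketch of time-window downsampling and consecutive differences. It is only an illustration of the concepts, not how TDengine implements them; the function names and the sample readings are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, window_s):
    """Average (timestamp, value) points into fixed windows of window_s seconds."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its time window.
        buckets[ts - ts % window_s].append(value)
    return [(start, mean(vals)) for start, vals in sorted(buckets.items())]


def differences(points):
    """Return the change in value between each pair of consecutive points."""
    return [(t2, v2 - v1) for (_, v1), (t2, v2) in zip(points, points[1:])]


# Hypothetical readings from one device: (epoch seconds, measured value).
readings = [(0, 10.0), (20, 12.0), (40, 14.0), (60, 20.0), (80, 22.0)]

per_minute = downsample(readings, 60)  # one averaged point per 60-second window
deltas = differences(readings)         # change since the previous reading
```

Downsampling trades resolution for volume: five raw readings become two per-minute averages, which is usually enough for trend analysis over long time ranges.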
In practice, traditional general-purpose platforms cannot yet meet these specific needs. By contrast, Taosi Data's self-developed TDengine platform shows a very large advantage in core performance.
According to Li Guang, the TDengine platform supports highly concurrent writes, so data generated by massive numbers of devices can be written concurrently. Moreover, to meet needs such as data caching, data subscription, and data storage, Taosi Data integrates these functions into a single product, which keeps the technical architecture very simple. This simplified architecture has freed up a large number of small and medium-sized enterprises, allowing them to take on projects that previously only leading enterprises could handle. For data storage, the TDengine platform takes the characteristics of IoT data into account and uses columnar storage, which greatly reduces storage space: the same data occupies only 1/5 of the space that similar products need.
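The article does not say which compression scheme TDengine applies to its columns. As a hedged illustration of why a columnar layout suits stable, timestamped measurements, the Python sketch below stores each field contiguously and applies delta encoding, a common technique that is assumed here purely for illustration. All names and sample values are hypothetical.

```python
def delta_encode(column):
    """Keep the first value, then store only the change from the previous value."""
    if not column:
        return []
    return [column[0]] + [b - a for a, b in zip(column, column[1:])]


def delta_decode(encoded):
    """Rebuild the original column by accumulating the stored changes."""
    out = []
    for i, v in enumerate(encoded):
        out.append(v if i == 0 else out[-1] + v)
    return out


# Hypothetical row-oriented records: (timestamp in ms, temperature reading).
rows = [(1634860800000, 21.5), (1634860801000, 21.5), (1634860802000, 21.6)]

# Columnar layout: each field is stored contiguously, so similar values sit together.
ts_col = [r[0] for r in rows]
temp_col = [r[1] for r in rows]

# Regularly sampled timestamps collapse to small, repeated deltas.
encoded_ts = delta_encode(ts_col)
```

After encoding, the large timestamps are replaced by the sampling interval repeated over and over, which a general-purpose compressor can then shrink much further than interleaved row data.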
Taosi's innovation: one source, one table, hierarchical storage
How does Taosi Data achieve data processing performance that general-purpose platforms cannot?
Targeting the typical characteristics of IoT time series data, Taosi Data creatively proposed a "one table per data collection point" model and uses a "super table" to handle data aggregation and analysis across multiple devices.
"The advantage is that the data from a single collection point is continuous, with timestamps that increase over time. It can therefore be written by simple appending, which is the most efficient way to store it. The latest data is first written to memory and then flushed to storage media such as hard disks, and we can do a lot of pre-computation while flushing to disk, which makes our product's query capability very strong. We once had a user whose previous solution could not return the results of a query even after several hours, whereas our product may return them in a few seconds. The difference is very large, because we have done a lot of pre-computation," Li Guang revealed.
"It is also very important that IoT data can be divided into hot, warm, and cold according to when it was collected. For example, the latest data is what everyone pays closest attention to, so it is hot data. Data that is a month old may be used somewhat less often, so we define it as warm data. Data from 5 years ago or even earlier is cold data, and is used even less frequently.
"So how do you balance the efficiency and cost of data storage? We automatically store and migrate data through multi-level storage: the latest hot data is kept in memory, warm data on SSD, and cold data on ordinary hard disks.
"When aggregating data, we first split the massive data set and filter it by super table tags, which greatly reduces the data to be scanned and greatly improves processing efficiency," Li Guang said.
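The ideas in the quotes above (one append-only table per collection point, aggregates precomputed when data is flushed, and tag filtering before aggregation) can be sketched together in a few lines of Python. This is a simplified model under stated assumptions, not TDengine's actual implementation: the class `ChildTable`, the function `average`, the `flush_size` parameter, and the wind turbine data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ChildTable:
    """One table per data collection point: static tags plus time-ordered rows."""
    tags: dict
    buffer: list = field(default_factory=list)     # newest rows, held in memory
    summaries: list = field(default_factory=list)  # (count, total) per flushed block
    last_ts: float = float("-inf")
    flush_size: int = 2                            # assumed block size for the sketch

    def append(self, ts, value):
        # Rows from one collection point arrive in time order, so a write is a plain append.
        if ts <= self.last_ts:
            raise ValueError("timestamps must increase within one table")
        self.last_ts = ts
        self.buffer.append((ts, value))
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        # Precompute aggregates while the block leaves memory. A real system would
        # also persist the rows; this sketch keeps only the summary.
        values = [v for _, v in self.buffer]
        if values:
            self.summaries.append((len(values), sum(values)))
            self.buffer = []

    def count_and_total(self):
        # Flushed blocks contribute their precomputed summaries; only the
        # in-memory buffer is scanned row by row.
        count = sum(c for c, _ in self.summaries) + len(self.buffer)
        total = sum(t for _, t in self.summaries) + sum(v for _, v in self.buffer)
        return count, total


def average(super_table, **wanted_tags):
    """Average over all child tables whose tags match wanted_tags."""
    # Step 1: filter on tags only; no row data is read for rejected tables.
    selected = [t for t in super_table.values()
                if all(t.tags.get(k) == v for k, v in wanted_tags.items())]
    # Step 2: combine the per-table counts and totals.
    count = total = 0
    for table in selected:
        c, s = table.count_and_total()
        count += c
        total += s
    return total / count if count else None


# Hypothetical super table of wind turbines, keyed by collection point name.
turbines = {
    "t1": ChildTable(tags={"site": "north"}),
    "t2": ChildTable(tags={"site": "north"}),
    "t3": ChildTable(tags={"site": "south"}),
}
for ts, value in enumerate([5.0, 7.0, 9.0]):
    turbines["t1"].append(ts, value)
turbines["t2"].append(0, 11.0)
turbines["t3"].append(0, 100.0)

north_avg = average(turbines, site="north")
```

Because the tag filter runs before any rows are touched, the "south" turbine never contributes work to the "north" query, and flushed blocks are answered from their summaries rather than rescanned.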
Building an ecosystem through multi-industry applications
A big data platform plays a vital role in discovering industry patterns and in reducing the influence of subjective factors on strategic decision-making. Taosi Data has made many attempts at industry applications.
Take the power industry as an example. On the wind farm side of wind power generation, large amounts of data are generated at every level: collection, data, models, and services. According to Li Guang, data is transmitted from the wind farm side to a message queue, a data service is established, and all generated data is ingested and stored through the cluster before being used by applications. For data needs on the group side, Taosi Data builds a functional architecture for a cloud-based wind power big data platform and sets up a distributed database cluster at the bottom layer to process massive data.
In terms of data collaboration between the field station side and the center side, Li Guang pointed out that there are some problems in the traditional data processing method. For example, data synchronization is semi-automatic, and data partitioning leads to the need for fusion between data, resulting in problems such as reduced data processing efficiency.
Based on this, Taosi Data has proposed a scheme that synchronizes data collaboratively across three levels, from the field station side to regional centralized control and then to the group center, automating the whole process and greatly easing the data coordination problem.
As for further analysis of the stored data, again taking the wind farm as an example, Taosi Data builds a big data platform in the cloud, classifies unstructured and structured business data and structured time series data, and then routes them to different systems. Since more than 80% of industrial IoT data is time series data, this data ultimately enters the distributed, full-stack time series data processing platform.
Beyond proving itself in the power industry, the TDengine platform has been applied in mining, tobacco, petrochemicals, smart travel, and other industries. According to Li Guang, the number of measurement points managed on the platform has passed the 10 million mark. Among these cases, a cut-tobacco data service platform improved its performance tenfold after time series insight analysis, a very clear efficiency gain.
In the digital era, whether digital and intelligent transformation can be carried out smoothly bears on the success or failure of an enterprise. It is also a rare opportunity for domestic software and service companies. Taosi Data's rapid growth over the past four years is inseparable from its deep understanding of the industry's underlying data. With continued exploration and innovation on its big data platform, the company aims to ride the wave of digital economy development and deliver good results.
Editor: MuYang / Data Ape