
Dipu Technology: Why are more and more companies choosing the "lakehouse" (lake warehouse integration)?

The database industry is heading for a watershed.

Over the past few years, the global database industry has grown rapidly. In 2020, Gartner for the first time redefined its database Magic Quadrant as the Magic Quadrant for Cloud Database Management Systems, making cloud databases the sole focus of evaluation. In 2021, the Magic Quadrant saw two key changes: 1. Snowflake and Databricks, two cloud data platform vendors, entered the Leaders quadrant; 2. the revenue threshold for inclusion was lifted, and newer database players such as SingleStore, Exasol, MariaDB and Couchbase made the list for the first time.

To some extent, this change suggests that the global database industry has entered a golden age, and that a number of emerging players are rising fast. The most typical example is the public sparring between Snowflake and Databricks. The former is the representative player in cloud data warehousing and more than doubled its business last year; the latter, after launching its "lakehouse" (lake warehouse integration) architecture, saw its valuation soar to 36 billion US dollars. The dispute between the two is, in essence, a dispute between old and new database architectures.

As enterprise digitalization moves into deeper waters, data usage scenarios are diversifying, and data that enterprises used to overlook is moving from behind the scenes to center stage. Choosing the right database product for each scenario has become a question that every CIO and manager must answer. One thing is certain: yesterday's databases struggle to keep up with today's growing data complexity. Classified by scalability and availability, distributed architectures break through the limits of single-machine, shared and cluster architectures, which explains their rapid growth in recent years. In this article, we mainly look at:

1. What are the data warehouse, the data lake and the lakehouse?

2. How has the architecture evolved, and why is the lakehouse said to represent the future?

3. Is now a good time to invest in a lakehouse?

01: Data lake + data warehouse ≠ lakehouse

Before the lakehouse emerged, data warehouses and data lakes were the most discussed topics.

Before getting into the topic, let's first cover one concept: what does a big data workflow look like? Two relatively unfamiliar terms are involved: the degree of structure of data and the information density of data. The former describes how well the data conforms to a defined format, while the latter describes how much information is contained per unit of storage.

Generally speaking, most raw data is unstructured and has relatively low information density. Cleaning, analysis, mining and similar operations remove useless data and uncover the relationships within it; along the way, both the degree of structure and the information density increase. The final step is to put the refined data to use, turning it into a real means of production.

In short, big data processing is a process of raising the structure and information density of data. Because the characteristics of the data keep changing along the way, different data suits different storage, which is what gave rise to the once-heated debate between the data warehouse and the data lake.
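
The workflow described above can be sketched in a few lines of Python. This is only an illustration: the log format, field names and sample lines are hypothetical and do not come from any product mentioned in this article. Raw lines (low structure, low information density) are parsed into records, noise is dropped, and the records are condensed into a small summary.

```python
from collections import Counter

# Hypothetical raw clickstream lines: low structure, low information density.
RAW_LINES = [
    "2021-11-02 10:01:13 user=u1 action=view item=boot-01",
    "2021-11-02 10:01:15 heartbeat",
    "2021-11-02 10:02:40 user=u2 action=buy item=boot-01",
    "corrupted ###",
    "2021-11-02 10:03:05 user=u1 action=buy item=sneaker-07",
]


def parse(line):
    """Turn one raw line into a structured record, or None if it is noise."""
    parts = line.split()
    # Everything after the date and time is expected to be key=value pairs.
    fields = dict(p.split("=", 1) for p in parts[2:] if "=" in p)
    if {"user", "action", "item"} <= fields.keys():
        return {"ts": f"{parts[0]} {parts[1]}", **fields}
    return None


def clean(lines):
    """Cleaning step: keep only the lines that parse into full records."""
    return [r for r in map(parse, lines) if r is not None]


def purchases_per_item(records):
    """Analysis step: condense many records into a few high-value numbers."""
    return Counter(r["item"] for r in records if r["action"] == "buy")


records = clean(RAW_LINES)             # more structured than RAW_LINES
summary = purchases_per_item(records)  # denser than records
```

Each stage yields less data than the one before it, but the data that remains is more structured and carries more information per byte.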

Let's start with the data warehouse. The concept was born in 1990 and is defined as a subject-oriented, integrated, relatively stable (non-volatile) collection of data that reflects historical change, used mainly to support management decisions and the enterprise-wide sharing of information. Put simply, a data warehouse is like a large library: the data in it must be shelved according to a specification, and you can find the information you want by category.

Today, the mainstream definition of a data warehouse is a high-capacity repository that sits on top of multiple databases. Its role is to store large volumes of structured data and provide unified data support for management analysis and business decisions. Loading and accessing data is relatively cumbersome, and there are restrictions on data types, but at the time the functionality of the data warehouse was sufficient, so until around 2011 the market still belonged to the data warehouse.

In the Internet era, data volumes exploded and data types became heterogeneous. Limited in both the scale and the types of data they could handle, traditional data warehouses could not support business intelligence in the Internet era. As Hadoop and object storage matured, the concept of the data lake emerged; it was proposed by James Dixon in 2011.

Compared with a data warehouse, a data lake is an evolving, scalable infrastructure for big data storage, processing and analysis. It is like a large storehouse that can hold raw data in any form (structured or unstructured) and in any format (including text, audio, video and images). Data lakes are usually larger, and their storage is cheaper. The problem is just as obvious: a data lake lacks structure, and without proper governance it turns into a data swamp.
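
The contrast between the two can be illustrated with a minimal sketch, assuming an in-memory SQLite table as a stand-in for a warehouse and a plain Python list as a stand-in for cheap object storage. The table layout, object names and helper functions are hypothetical. The warehouse checks the schema when data is written; the lake accepts anything and applies a schema only when data is read.

```python
import json
import sqlite3

# Warehouse side: the schema is fixed before any data is written.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales ("
    "order_id INTEGER PRIMARY KEY, item TEXT NOT NULL, amount REAL NOT NULL)"
)


def load_into_warehouse(row):
    """Return True if the row fits the schema, False if it is rejected."""
    try:
        warehouse.execute(
            "INSERT INTO sales (order_id, item, amount) VALUES (?, ?, ?)",
            (row.get("order_id"), row.get("item"), row.get("amount")),
        )
        return True
    except sqlite3.IntegrityError:
        # A missing column violates NOT NULL, so the write is refused.
        return False


# Lake side: a list stands in for object storage; any bytes are accepted.
lake = []


def load_into_lake(name, payload):
    lake.append((name, payload))


def read_sales_from_lake():
    """Schema-on-read: keep only the objects that parse as sales records."""
    rows = []
    for name, payload in lake:
        if not name.endswith(".json"):
            continue
        try:
            doc = json.loads(payload)
        except ValueError:
            continue
        if isinstance(doc, dict) and {"order_id", "item", "amount"} <= doc.keys():
            rows.append(doc)
    return rows


accepted = load_into_warehouse({"order_id": 1, "item": "boot-01", "amount": 129.0})
accepted_incomplete = load_into_warehouse({"order_id": 2, "item": "boot-01"})

load_into_lake("order-1.json", b'{"order_id": 1, "item": "boot-01", "amount": 129.0}')
load_into_lake("order-2.json", b'{"order_id": 2, "item": "boot-01"}')
load_into_lake("runway-show.mp4", b"\x00\x01 binary video bytes")
usable_rows = read_sales_from_lake()
```

The lake happily stores all three objects, including the video and the incomplete order, but nothing stops unusable data from piling up. That is the "data swamp" risk in miniature.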

In terms of product form, a data warehouse is generally a standalone, standardized product, whereas a data lake is more of an architectural guideline that needs a set of surrounding tools to meet business needs. In other words, the flexibility of the data lake suits early-stage development and deployment, while the up-front standardization of the data warehouse suits the later operation of big data and the long-term growth of the company. So is there a new architecture that can combine the advantages of both?

Thus the lakehouse was born. According to Databricks' definition of the Lakehouse, it is a new paradigm that combines the advantages of data lakes and data warehouses by implementing data structures and data management features similar to those of a data warehouse on top of the low-cost storage used for data lakes. The lakehouse is a more open architecture. One analogy likens it to building many small houses by a lake: some handle data analysis, some run machine learning, some retrieve audio and video, and all of them can easily draw their source data from the lake.
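
To make "warehouse-style management on lake-style storage" concrete, here is a toy sketch. It is not the storage format of Databricks or of any other vendor; the class name, file layout and log format are all invented for illustration. Data lives in plain files in a directory (the cheap storage), while a small append-only commit log adds schema checks and versioned reads on top.

```python
import json
import tempfile
from pathlib import Path


class ToyLakehouseTable:
    """Toy table: plain JSON-lines data files plus an append-only commit log."""

    def __init__(self, root, schema):
        self.root = Path(root)
        self.schema = schema  # column name -> expected Python type
        self.log = self.root / "_log"
        self.log.mkdir(parents=True, exist_ok=True)

    def version(self):
        return len(list(self.log.glob("*.json")))

    def append(self, rows):
        # Warehouse-style check: reject the whole batch if any row breaks the schema.
        for row in rows:
            if set(row) != set(self.schema) or not all(
                isinstance(row[col], typ) for col, typ in self.schema.items()
            ):
                raise ValueError(f"row does not match schema: {row}")
        new_version = self.version() + 1
        data_file = self.root / f"part-{new_version:05d}.jsonl"
        data_file.write_text("\n".join(json.dumps(r) for r in rows))
        # Readers only follow the log, so the batch is visible once this entry exists.
        commit = self.log / f"{new_version:05d}.json"
        commit.write_text(json.dumps({"add": data_file.name}))
        return new_version

    def read(self, version=None):
        """Return all rows as of `version` (default: the latest version)."""
        commits = sorted(self.log.glob("*.json"))
        if version is not None:
            commits = commits[:version]
        rows = []
        for commit in commits:
            data_file = self.root / json.loads(commit.read_text())["add"]
            rows.extend(json.loads(line) for line in data_file.read_text().splitlines())
        return rows


table = ToyLakehouseTable(tempfile.mkdtemp(), {"item": str, "amount": float})
v1 = table.append([{"item": "boot-01", "amount": 129.0}])
v2 = table.append([{"item": "sneaker-07", "amount": 89.5}])

try:
    table.append([{"item": "bad-row"}])  # missing "amount"
    bad_batch_rejected = False
except ValueError:
    bad_batch_rejected = True

latest_rows = table.read()
rows_at_v1 = table.read(version=1)
```

Any number of consumers (the "small houses" in the analogy) could read the same directory through the log, each seeing a consistent version of the table without the data being copied into a separate warehouse.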

As for its development trajectory, the early lakehouse was more of a processing approach in which the data lake and the data warehouse were connected to each other. Today's lakehouse, although still at an early stage, is no longer a purely technical concept: vendors have given it additional product-level meaning and value.

Note that a lakehouse is not simply "data lake" + "data warehouse"; this is a major misunderstanding. Many companies today build both storage architectures side by side, with one large data warehouse attached to several small data lakes. That does not mean the company has lakehouse capability. A lakehouse is by no means the same as simply connecting a data lake and a data warehouse; on the contrary, in such a setup the data is heavily duplicated across the two kinds of storage.

02: Why is the lakehouse said to be the future?

Back to the core question raised at the start: why does the lakehouse represent the future?

We can approach this by asking another question: in the era of data intelligence, will the lakehouse become a must-have for enterprises building a big data stack? Judging by both the technology and the application trends, the answer is almost certainly yes. For high-growth enterprises, replacing a standalone warehouse and a standalone lake with a lakehouse architecture has become an irreversible trend.

A telling sign is that major cloud vendors at home and abroad have launched their own lakehouse solutions, such as Amazon Web Services' Redshift Spectrum, Microsoft's Azure Databricks, Huawei Cloud's FusionInsight and Dipu Technology's FastData. These players include both the established leaders of cloud computing and the new forces in data intelligence.

In fact, architectural evolution is driven directly by the business: when the business side demands higher performance, the database architecture has to be upgraded as the big data architecture is built out. Take Dipu Technology, the fastest-growing unicorn in China's digital enterprise services sector, as an example. Relying on FastData, a new-generation data analytics platform that unifies lake and warehouse as well as stream and batch processing, and drawing on deep insight into industries such as advanced manufacturing, biomedicine and consumer retail, Dipu Technology starts from real-world scenarios and provides customers with one-stop digital solutions.

Dipu believes that "in the field of data analysis, the lakehouse is the future. It copes better with the data analysis needs of the AI era and is ahead of earlier analytical databases in storage form, compute engine, data processing and analysis, openness, and AI-oriented evolution." Take AI applications as an example: the lakehouse architecture is naturally suited to AI analysis (it stores unstructured data such as audio and video, is compatible with AI computing frameworks, and provides platform capabilities for model development and the full machine learning life cycle), and it is also better suited to the era of large-scale machine learning.


This view is in line with the broader trend.

Not long ago, Gartner released a forecast of future lakehouse application scenarios. The lakehouse architecture needs to support three types of scenario: first, real-time continuous intelligence; second, real-time on-demand intelligence; third, offline on-demand intelligence. These are delivered to data consumers through snapshot views, real-time views and real-time batch views, and this is also the direction in which the lakehouse architecture needs to keep evolving.

03: Is now a good time to invest in a lakehouse?

Judging by market trends, the lakehouse architecture is an inevitable stage in the course of technological development.

However, because this new open architecture is still at an early stage, and because digital maturity and market awareness differ between enterprises at home and abroad, solutions vary widely. In the words of one industry investor, "Although the enterprise services market in the United States is far more mature than ours and offers many paths to learn from, the Chinese market has many characteristics of its own. Take Dipu Technology, which is benchmarked against Databricks, as an example: in the US enterprise services market vendors mostly sell products, whereas large customers in China need solutions that are more deeply integrated with their own scenarios, and those solutions need to be both general-purpose and customizable."

In its earlier cooperation with Dipu Technology, Belle International completed the construction of a unified data warehouse, consolidating data from multiple business lines and building out data for each business domain. While keeping front-end data running normally and "hot-switching" the underlying applications, Dipu Technology and Belle International worked closely together to merge multiple data warehouses into one unified data warehouse in just a few months. This effectively unified business metric definitions, greatly reduced the development and operations workload, and closed the loop across the entire business value chain.


This is also where lakehouse capability shows its value. As data structures diversify, more and more data assets such as 3D drawings, live-stream video, meeting recordings and audio are being produced. To mine the value of this data in depth, Belle International can rely on a leading lakehouse architecture to first store massive volumes of multimodal data in the lake; later, when computing power allows and in-depth business analysis scenarios have been identified, it can pull the data out of the lake for analysis.

As a simple example, a designer who wants to design a shoe will usually look to historical data for useful reference. With only a photo of a product, the designer may be able to see years of data covering the product's whole life cycle, such as sales performance, brand story and competitive analysis, which informs production and business decisions and maximizes the value of the data.

In general, large enterprises that want to sustain growth need large volumes of effective data output to make intelligent decisions. Many enterprises have been unable to do so because of limited IT capabilities. A lakehouse architecture lets previously locked-up data value be fully used. If an enterprise also pays attention to the value of its data and deliberately preserves it, it will have completed one of the key tasks of digital transformation.

We also have reason to believe that as enterprise digital transformation accelerates, the lakehouse architecture will have even more room to grow.
