
Building the Data Warehouse System at Xingsheng Youxuan

Author: Flash Gene

1. Overview

"Proposed by W.H. Inmon, the father of data warehouse, in 1990, the main function is to systematically analyze and sort out the large amount of data accumulated by the organization through the online transaction processing (OLTP) of the information system over the years through the unique data storage architecture of the data warehouse theory, so as to facilitate the conduct of various analysis methods such as online analysis and processing (OLAP) and data mining, and then support such as decision support system (DSS) and competent information system (EIS). It helps decision-makers quickly and effectively analyze valuable information from a large amount of data, so as to facilitate decision-making and quickly respond to changes in the external environment, and help build business intelligence (BI). The above paragraph is from Wikipedia's definition of a data warehouse.

In the face of such a huge amount of data, a conventional OLTP database engine can no longer cope with statistics and analysis across data from different business departments. The prerequisite for data analysis and application is to integrate the different data sources within the enterprise, develop the data models, and ensure data quality. This article describes how the basic data warehouse team accomplishes this data source integration, data model development, and data quality assurance through the related architecture.

2. The basic construction goal of the data warehouse

The basic data warehouse platform collects business data and behavior log data from every terminal within the group, covering (but not limited to) users, stores, suppliers, goods, and logistics. It processes this data into subject areas of multiple dimensions, levels, and timeliness requirements under four unified standards: unified cleansing, unified naming, unified measurement, and unified specification. The end goal is to satisfy each department's needs for data indicators: indicators are findable (the indicator exists and can be looked up at any time), data is correct (data in the warehouse is consistent with data in the business database), and queries are simple SQL (if a result can be obtained without operators such as join, avoid join).

3. Analysis of the construction of data warehouses in the industry

After the concept of the data warehouse was established, there was much controversy over its implementation methods, paths, and architecture. Around 1994, most companies that implemented data warehouses failed, which led to the concept of the data mart, championed by Ralph Kimball, being proposed and widely adopted. Since a data mart is only a part of a data warehouse, it is far easier to implement and can meet the urgent needs of individual business departments, so it achieved great success early on. However, as data marts multiplied, the shortcomings of this architecture gradually became apparent: marts built independently within a company followed different standards and construction principles, leading to confusion and inconsistency across multiple marts. The only way out was to return to the original construction principles of the data warehouse.

3.1 Development of data warehouse theory

(1) Embryonic stage. The concept of the data warehouse can be traced back to the 1970s, when MIT researchers worked on an optimized technical architecture that tried to separate the business processing system from the analysis system, that is, to divide transactional processing and analytical processing into different layers, each designed according to its own characteristics. However, limited by the information processing capacity of the time, this research remained at the theoretical level.

(2) Exploration stage. In the mid-to-late 1980s, DEC built on MIT's research conclusions to establish the TA2 (Technical Architecture 2) specification, which defined four components of an analysis system: data acquisition, data access, directory, and user services. This was a major shift in system architecture: for the first time, a dedicated architecture for analysis systems was explicitly proposed and put into practice.

(3) Prototype stage. In 1988, to solve the problem of enterprise-wide integration, IBM first proposed the concept of the Information Warehouse, embodied in the VITAL (Virtually Integrated Technical Architecture Lifecycle) specification. VITAL defined 85 types of information warehouse components, including PCs, graphical interfaces, object-oriented components, and local area networks. At this point, the basic principles of the data warehouse, its technical architecture, and the main principles of the analysis system had been determined, and the data warehouse had begun to take shape.

(4) Establishment stage. In 1991, Bill Inmon published his first book on data warehousing, Building the Data Warehouse, which marked the establishment of the data warehouse concept. The book defines a Data Warehouse as a Subject Oriented, Integrated, Non-Volatile, and Time Variant collection of data used to support management decision-making, and provides guidance and basic principles for building one. With this book, Bill Inmon became known as the father of the data warehouse.

3.2 Data warehouse architecture history

(1) Traditional data warehouse architecture

[Figure: traditional data warehouse architecture]

(2) Lambda data warehouse architecture

[Figure: Lambda architecture]

The Lambda architecture can be decomposed into three layers: the Batch Layer, the Real-Time (Speed) Layer, and the Serving Layer.

Batch Layer: stores the dataset, precomputes query functions over it, and builds the views that correspond to queries. The Batch Layer handles offline data well, but in many scenarios data is generated continuously in real time and must be queried and processed in real time; the Speed Layer is better suited to that situation.

Speed Layer: where the Batch Layer processes the entire dataset, the Speed Layer processes only the most recent incremental data stream. For efficiency, the Speed Layer continuously updates the Real-time View as new data arrives, while the Batch Layer derives the Batch View directly from the full offline dataset.

Serving Layer: merges the result datasets of the Batch View and the Real-time View into the final dataset (a toy sketch of this merge follows).
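A minimal sketch of the Serving Layer idea, assuming hypothetical per-store counters rather than any particular storage engine:

```python
# Toy Lambda Serving Layer: assemble query results by merging a precomputed
# Batch View with a Real-time View. View contents and key names are
# hypothetical illustrations, not any specific system's API.

from typing import Dict

def merge_views(batch_view: Dict[str, int], realtime_view: Dict[str, int]) -> Dict[str, int]:
    """Combine per-key counts: the Batch View covers the full historical
    dataset, the Real-time View covers increments since the last batch run."""
    merged = dict(batch_view)
    for key, delta in realtime_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

# Example: order counts per store.
batch_view = {"store_001": 1200, "store_002": 845}   # from the Batch Layer
realtime_view = {"store_001": 3, "store_003": 1}     # from the Speed Layer
print(merge_views(batch_view, realtime_view))
# {'store_001': 1203, 'store_002': 845, 'store_003': 1}
```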

(3) Kappa data warehouse architecture

[Figure: Kappa architecture]

The Kappa architecture can be thought of as a simplified version of the Lambda architecture: simply remove the batch-processing part.

The biggest problem with the Kappa architecture is that the throughput of reprocessing history through streaming is lower than that of batch processing replaying the historical data.

3.3 Common Internet data warehouse architecture

3.3.1 Meituan's data warehouse architecture diagram

(1) Offline architecture

[Figure: Meituan offline data warehouse architecture]

(2) Real-time architecture

[Figure: Meituan real-time data warehouse architecture]

Summary of the real-time architecture: basic processing steps (cleaning, filtering, merging, expansion, transformation, and other routine operations) are abstracted into general-purpose components that produce the corresponding result streams. Where different business lines need to be accommodated, redundant fields are added; since real-time data carries no historical baggage, these redundant fields do not consume much memory. Statistics with less demanding real-time requirements are imported directly into a real-time OLAP engine, whose own computation and query capabilities deliver results quickly; in effect, space is traded for time.

3.3.2 Youzan's data warehouse architecture diagram

(1) Lambda-based architecture

[Figure: Youzan data warehouse architecture]

Youzan's data warehouse architecture is a typical Lambda architecture, with the real-time warehouse and the offline warehouse as two separate pipelines: the offline warehouse's compute engines are Hive SQL and Spark SQL, while the real-time warehouse's compute engine is Flink. Data in the ODS and DW layers is stored in HDFS, and data in the mart/application layer is stored in different engines to suit different usage scenarios.

4. Xingsheng optimizes the architecture & construction practice of data warehouse

4.1 Lambda architecture based on data lake

Our business is still iterating rapidly, and changes to business data models are routine, so the Kappa architecture is clearly unsuitable: it would frequently require re-running historical data. At the same time, our business has high real-time requirements for data, and some scenarios even involve real-time changes to data state, which requires the support of data lake capabilities such as Hudi. Based on our business, we therefore arrived at the following data warehouse architecture:

[Figure: Xingsheng Youxuan data-lake-based Lambda architecture]

As the architecture diagram shows, we are mainly based on the Spark compute engine, with data stored in Hudi. We chose Hudi because some of our businesses modify data long after it is written, and because Hudi also supports quasi-real-time data construction; the visibility delay of the batch line is currently about 5 minutes.

The real-time line computes with Flink, and results are written back to Kafka for real-time computation at the next layer of the warehouse; real-time data visibility delay reaches the second level. We also synchronize this data to Hudi (a hedged sketch of this sync follows the list below), for two purposes:

1. Tables produced by real-time computation can be reused by subsequent batch computation, which shares data and saves resources.

2. Hudi's storage is based on OBS, so its storage cost is much lower than Kafka's: we generally retain about 7 days of data in Kafka but more than 1 year in Hudi, and this data can also be used for subsequent model iteration and computation.
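A hedged sketch of that synchronization, assuming a hypothetical topic, schema, and OBS paths (the Hudi option keys themselves are standard):

```python
# Sketch of the dual write described above: the real-time stream in Kafka is
# also synced into Hudi so batch jobs can reuse it. Topic, schema, and paths
# are assumptions; requires the Spark-Kafka and Hudi bundles on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-to-hudi-sync").getOrCreate()

schema = (StructType()
          .add("order_id", StringType())
          .add("store_id", StringType())
          .add("updated_at", TimestampType()))

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")   # assumption
          .option("subscribe", "dw_rt_orders")               # hypothetical topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

hudi_options = {
    "hoodie.table.name": "dwd_rt_orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",           # incremental updates
}

(events.writeStream
 .format("hudi")
 .options(**hudi_options)
 .option("checkpointLocation", "obs://dw/ckpt/dwd_rt_orders")  # hypothetical path
 .outputMode("append")
 .start("obs://dw/hudi/dwd_rt_orders"))                        # hypothetical path
```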

4.2 Model Architecture

The model architecture of the data warehouse follows the hierarchical structure common across the industry: Source Layer (ODS), Common Detail Layer (DWD), Common Dimension Layer (DIM), Common Summary Layer (DWS), and Common Application Layer (ADS).

Source Layer (ODS): stores raw data that has not been processed; structurally consistent with the business systems, it is the data preparation area of the warehouse, and in principle data must be fully retained.

Common Detail Layer (DWD): driven by business-process modeling, it builds the finest-grained detail-level fact tables based on the characteristics of each specific business process. The granularity is generally the same as the ODS layer, with corresponding data quality assurance. To improve the usability of the detail layer, some dimensions are degenerated into the fact tables, reducing joins between fact and dimension tables; fact tables may even be joined into detailed wide tables, reusing join computation and reducing data scans.

Common Dimension Layer (DIM): a dimension is an angle from which to observe things; it describes and classifies what a business process involves ("who, when, where, why" are all angles of observation). Dimensions are the foundation and soul of dimensional modeling. Based on dimensional modeling concepts, enterprise-wide consistent dimensions are established, reducing the risk of inconsistent calculation calibers and algorithms. For example, in "Xiao Wang spent 5 yuan buying buns at the store in the morning", the time dimension is "morning" and the location dimension is "store". When building the dimension layer, the observation angles are enumerated; the time dimension, for instance, is "morning", not a specific point in time.

Common Summary Layer (DWS): driven by the subject of analysis, it builds common-granularity summary-indicator fact tables according to the indicator requirements of upper-layer applications and products, materializing the model as wide tables. It constructs statistical indicators with standard naming and consistent caliber, provides common indicators to upper layers, and establishes summary wide tables.

Common Application Layer (ADS): the data application layer, also known as the DM (data mart) or APP layer. Oriented toward actual data requirements and directly usable by business staff, it composes various statistical reports from DWD or DWS data; it also includes direct presentations such as subject-oriented large wide tables in data marts. (An illustrative DWD-to-DWS derivation follows this list.)
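As a minimal illustration of the layering, a DWS summary wide table might be derived from a DWD fact table and a DIM dimension table like this; all table and column names are hypothetical stand-ins for the layers described above:

```python
# Illustrative only: deriving a DWS summary table from a DWD fact table
# joined with a DIM dimension table, materialized as a wide table for reuse.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dws-example").getOrCreate()

dwd_orders = spark.table("dwd.fct_order_detail")   # finest-grained facts
dim_store = spark.table("dim.store")               # consistent store dimension

# Summarize at store/day granularity.
dws_store_1d = (dwd_orders
    .join(dim_store, "store_id")                   # join in the consistent dimension
    .groupBy("store_id", "store_name", F.to_date("pay_time").alias("stat_date"))
    .agg(F.count("order_id").alias("order_cnt"),
         F.sum("pay_amount").alias("pay_amount_1d")))

dws_store_1d.write.mode("overwrite").saveAsTable("dws.store_order_1d")
```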

4.3 Data warehouse construction practice

4.3.1 Data Access

The data sources of the warehouse are mainly relational table data from business databases, behavior log data, and business-computed result set data. The main goal of data ingestion is to keep warehouse data consistent with the business databases and tables; the key work is solving data synchronization, data updates, schema changes, and data verification during synchronization.

Relational table data from business databases is ingested in quasi-real time into Hudi by the data synchronization platform (binlog + Canal + Spark Streaming). Hudi's upsert mode guarantees incremental data updates; table schemas are compared dynamically and structure changes are applied promptly, as sketched below.
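A simplified sketch of that dynamic schema comparison, assuming hypothetical table names (Hudi itself can also evolve schemas on write):

```python
# Before each upsert batch, diff the incoming record schema against the
# current table schema and surface added/removed columns so the sync job
# can evolve the table or raise an alert. Names here are hypothetical.

from pyspark.sql.types import StructType

def diff_schemas(incoming: StructType, current: StructType):
    """Return columns newly added upstream and columns that disappeared."""
    incoming_cols = {f.name: f.dataType.simpleString() for f in incoming.fields}
    current_cols = {f.name: f.dataType.simpleString() for f in current.fields}
    added = {c: t for c, t in incoming_cols.items() if c not in current_cols}
    removed = [c for c in current_cols if c not in incoming_cols]
    return added, removed

# In the sync job, e.g.:
# added, removed = diff_schemas(batch_df.schema, spark.table("ods.orders").schema)
```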

To solve the problem of large numbers of small files on the storage side of real-time compute, alleviate high-load problems such as metadata loading, and filter target information, we rely on Flink's filesystem connector, which provides second-to-minute-level storage performance, file rolling policies, file compaction modes (such as compaction within checkpoint periods), and partition-commit visibility configuration.
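A hedged sketch of those settings, using Flink's SQL filesystem connector from PyFlink; the path, table, and columns are hypothetical, while the WITH options are standard filesystem-connector settings for rolling policy, per-checkpoint compaction, and partition-commit visibility:

```python
# Filesystem sink with small-file controls: rolling policy, auto-compaction
# within checkpoints, and a success-file partition-commit policy.

from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
CREATE TABLE behavior_log_sink (
    uid STRING,
    event STRING,
    ts TIMESTAMP(3),
    dt STRING
) PARTITIONED BY (dt) WITH (
    'connector' = 'filesystem',
    'path' = 'obs://dw/ods/behavior_log',                 -- hypothetical path
    'format' = 'parquet',
    'sink.rolling-policy.file-size' = '128MB',            -- roll files by size
    'sink.rolling-policy.rollover-interval' = '30min',    -- and by open time
    'auto-compaction' = 'true',                           -- merge small files per checkpoint
    'compaction.file-size' = '128MB',
    'sink.partition-commit.trigger' = 'process-time',
    'sink.partition-commit.policy.kind' = 'success-file'  -- marks the partition visible
)
""")

# Writing is then a plain INSERT from an upstream source table, e.g.:
# t_env.execute_sql("INSERT INTO behavior_log_sink SELECT uid, event, ts, dt FROM kafka_log_source")
```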

For business-computed result sets, the scheduling platform encapsulates a high-performance synchronization operator that writes the data directly into the warehouse; the data source, database, and table on both the source side and the target side can be filled in visually, and fields can be mapped.

At present, more than 800 source tables have been connected and are running stably.

4.3.2 Data Standards

According to the definition of the China Academy of Information and Communications Technology (CAICT), data standards are normative constraints that ensure the consistency and accuracy of data as it is used and exchanged internally and externally. The goal of data standards is to support data sharing with clear semantics that are easy to understand and use.

So what has the data warehouse team built in this regard?

1. Sorting out business standards and specifications, such as the business function matrix, noun dictionary, and modeling specifications.

[Figures: business function matrix, noun dictionary, and modeling specification examples]

2. Sorting out technical standards and specifications, such as unified real-time and offline table name formats, data type definitions, and encoding rules (a hypothetical naming check is sketched after the figure below).

[Figure: technical standards and specifications example]
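As one illustration, such a table-name format can be checked automatically. The convention below (layer prefix plus an rt_ marker for real-time tables) is a hypothetical stand-in for the in-house rule, not the actual specification:

```python
# Illustrative naming-rule check for layered table names.

import re

# e.g. dwd_trade_order_di, dws_rt_store_order_1d  (hypothetical convention)
TABLE_NAME_RULE = re.compile(r"^(ods|dwd|dim|dws|ads)_(rt_)?[a-z0-9_]+$")

def check_table_name(name: str) -> bool:
    """Return True when a table name follows the layered naming convention."""
    return TABLE_NAME_RULE.match(name) is not None

assert check_table_name("dwd_trade_order_di")
assert not check_table_name("OrderDetail2")   # violates casing and layer prefix
```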

4.3.3 Data Model

How to build a data model is something every member of the data warehouse team faces daily. What preparatory work supports the construction of an enterprise-level data model with wide business coverage, high reuse, reliable data quality, easy comprehension, and fast queries?

1. Business analysis

[Figure: business analysis]

2. Sorting out the business data

[Figure: sorting out the business data]

3. Sorting out the business process

[Figure: sorting out the business process]

4. Build the data model (with the business process as the basic unit)

Currently the common modeling methods are entity modeling and dimensional modeling. Entity modeling is mainly used for OLTP database modeling: business tables are associated and integrated via foreign-key relationships, with the focus on data integration. Dimensional modeling is oriented toward analysis scenarios: model tables are built around data subject areas with redundant related dimensions, supporting data operations and analysis needs quickly and flexibly. A toy contrast follows the figure below.

[Figure: building the data model]
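A toy illustration under hypothetical table and column names: dimensional modeling keeps a fact table at business-process grain plus shared dimension tables, so a typical analysis slices the facts through dimension attributes:

```python
# Star-schema style query: fact table at order grain, dimensions supplying
# the angles of observation (time bucket, location). All names hypothetical.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dim-model-example").getOrCreate()

fct_order = spark.table("dwd.fct_order")   # who bought what, when, where
dim_date = spark.table("dim.date")         # time angle: day part, week, month
dim_store = spark.table("dim.store")       # place angle: store, region

# Typical analysis question: morning sales by store region.
(fct_order
 .join(dim_date, "date_id").join(dim_store, "store_id")
 .where(F.col("day_part") == "morning")    # coarse time bucket, not a timestamp
 .groupBy("region")
 .agg(F.sum("pay_amount").alias("morning_gmv"))
 .show())
```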

5. ETL processing

[Figure: ETL processing]

4.3.4 Data Quality

The data model construction above has produced a number of basic, core data asset tables; data quality work ensures the availability, reliability, and timeliness of this core data.

Data quality detection mainly covers six dimensions: completeness, normativity, consistency, accuracy, uniqueness, and timeliness. A sketch of two such checks follows the table.

Completeness: whether data is missing, either whole records or individual fields. Example: null-value detection on order-related dimension information such as store, product, and quantity.

Normativity: the degree to which data follows predefined syntax rules and conforms to its definition, such as type, format, and value range. Example: range detection on order-type enumeration values.

Consistency: whether the data set follows a uniform specification and maintains a uniform format. Example: checking whether amount values follow the cent (fen) format.

Accuracy: whether the recorded information contains anomalies or errors. Example: detecting orders whose status indicates payment or after-sales but whose payment time is empty.

Uniqueness: whether data is free of duplication. Example: duplicate detection on the uniqueness of the order primary key.

Timeliness: the interval between when data is generated and when it can be viewed. Example: delay monitoring on real-time order data.
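As an illustration, two of the rules above (completeness and uniqueness) might look like the following PySpark checks; table and column names are hypothetical:

```python
# Minimal data-quality rule sketch: a completeness check on dimension fields
# and a uniqueness check on the order primary key.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-rules").getOrCreate()
orders = spark.table("dwd.fct_order")

# Completeness: dimension fields on orders must not be null.
null_cnt = orders.where(F.col("store_id").isNull() | F.col("sku_id").isNull()).count()

# Uniqueness: the order primary key must not repeat.
dup_cnt = (orders.groupBy("order_id").count()
           .where(F.col("count") > 1).count())

for rule, bad in [("completeness: null dims", null_cnt), ("uniqueness: order_id", dup_cnt)]:
    if bad > 0:
        print(f"DQ rule failed: {rule}, offending rows/keys = {bad}")  # alert hook here
```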

Currently, data quality rules are run under the following configurations:

1. Bound to the task node: synchronous serial execution; verification rules run immediately after the data task finishes computing.

2. Staggered from the task execution time: asynchronous scheduled execution; the rule fires at its scheduled time.

3. Task running-status detection: task status is checked periodically, with message notifications and phone-call alarms.

At present, quality verification rules cover 100% of core business tables, though rule coverage of tables at other layers is still relatively low.

4.3.5 Data Services

Once the data warehouse is built, data services must give customers a channel for using data assets; they are the window connecting front-line business. Different service methods are provided for different business users, commonly: ad hoc query, data synchronization, data interfaces, and data reports. The current data warehouse platform supports all of these except data interfaces.

DataStudio's Data Service - Ad Hoc Query:

[Figure: ad hoc query in DataStudio]

DataStudio's Data Service - Data Synchronization:

[Figure: data synchronization in DataStudio]

Lingxi BI - Data Report:

[Figure: data report in Lingxi BI]

5. Summary

At present, Xingsheng Youxuan's basic data warehouse follows the industry-standard layered model, but at the infrastructure level it is shaped by Xingsheng Youxuan's business forms into a Flink + Spark + Hudi architecture that makes extensive use of data lake capabilities, a relatively bold exploration within the industry. Real-time data on the Flink compute engine supports second-level visibility latency, and the Spark batch line combined with Hudi achieves a visibility delay of 5-10 minutes.

Going forward, the data warehouse will continue to focus on usability (working with data analysts to sort out common e-commerce analysis scenarios and support multi-dimensional drill-down analysis), stability and accuracy (warehouse data produced on time and consistent with the business databases), and convenience (continuously enriching multi-scenario wide tables so that queries are fast and, where possible, avoid complex operators such as join).

Author: Xingsheng Youxuan Technology Community

Source: https://zhuanlan.zhihu.com/p/586826059