1. Preface

Following the last episode, Xiao Ming was asked how to perform index tuning during the interview, and Xiao Ming mastered the knowledge of indexing and index optimization by learning the article Database Tuning - Indexing. When he was ready, he went to a new company for an interview......
Interviewer: Xiao Ming, your resume says that you know how to tune the database, how do you tune it?
Xiao Ming: Add index, index optimization......
Interviewer: Are there any other bottlenecks in index optimization, and what if index optimization is no longer improving performance?
As mentioned above, one of the reasons why SQL is slow is that there is too much data, when the amount of data in a database table reaches a certain level, due to the large number of query dimensions, even if an index is added and the index is optimized, the performance of the database is still poor. At this point, it is necessary to consider other ways to reduce the burden on the database and shorten the query time. This article describes the solution for managing large tables in databases

2. Data governance methods

The official statement states that 5 million to 8 million will affect database performance, and we define a table with more than 20 million rows or 10 GB as a large table. When a large table is expressed to a large table standard, the table needs to be governed, which is generally divided into two types: data archiving and data segmentation.

2.1 Data Archiving

Data archiving is a simple and efficient data governance solution. The amount of data in the primary table can be significantly reduced by archiving historically useless data into a separate archive table, which can be deleted directly or archived in a table for a period of time. In an actual project, you can manually archive a batch of data, or you can use a scheduled task script to archive useless data before a certain period of time. For example, for a user message table, we can archive data from a year ago, which greatly reduces the amount of data in the primary table.

2.2 Data segmentation

The basic idea of data sharding is to divide a database into multiple parts and put them on different databases or data tables, so as to alleviate the performance problems of a single database table. Generally speaking, for a database with a large amount of data, if there is a large number of data due to a large number of tables, the number of database connections is too large, and the bandwidth is under pressure, vertical sharding is suitable for using vertical sharding, that is, slicing out tables that are closely related (such as the same module) and putting them on a database. If there are not many tables, but each table has a large amount of data, it is appropriate to split horizontally, that is, to divide the data of the table into multiple data tables according to some rules. Of course, in actual projects, these two mixtures are more used together, and the corresponding sharding scheme needs to be designed according to the actual situation. According to the type of data sharding, data sharding can be divided into two ways: vertical (vertical) sharding and horizontal (horizontal) sharding.

2.2.1 Horizontal slicing

When the number of rows in the database is huge, and there is a bottleneck in the read/write and storage performance of a single database, it is necessary to perform horizontal sharding, which is divided into intra-database table sharding and database sharding, which is to disperse the same table into multiple databases or multiple tables according to different conditions according to the intrinsic logical relationship of the data in the table, and each table only contains a part of the data, so that the amount of data in a single table becomes smaller and the effect of distribution is achieved.

Governance of large tables for database tuning

2.2.2 Vertical Sections

Vertical sharding can be subdivided into vertical sharding and vertical sharding.

Vertical database sharding: Different tables with low business relevance are stored in different databases based on service coupling. Similar to "microservices", each microservice uses a separate database.
In the case of a large number of fields (a table has more than 20 fields), through "large table splitting small tables", the more commonly used fields are put into one table, and the less commonly used fields are put into the extended table, which is more convenient for development and maintenance, reduces disk IO, and can also enable MYSQL to avoid the problem of spreads (the bottom layer of MySQL is stored through data pages, and a record that occupies too much space will lead to spreads, resulting in additional performance overhead.) I won't go into too much detail here). This results in improved database performance.

3. Data segmentation project practice

3.1 Background

In order to cope with the challenge of more than 2.1 billion data volumes and 70 million single tables in the on-site message database, we decided to carry out a data segmentation project. In this context, we have planned in detail the practical steps of the project.

3.2 Segmentation scheme design

3.2.1 Select Horizontal Segmentation

Considering the huge amount of data and future increment of the message table, we decided to use the horizontal sharding method to split the message table from 3 databases with 10 tables per database to 4 databases with 256 tables per database. This design can hold more than 20 billion pieces of data on the basis of the maximum capacity of 20 million in a single table, and can support five years even without any archiving operations.

3.2.2 Rules for database and table sharding

We have formulated detailed rules for database and table sharding to ensure reasonable management and query efficiency after data sharding. The database sharding rule uses the userId % 1024 surplus, and then rounds the userId/256 to avoid data unevenness. The table sharding rule is userId % 256.

3.3 Data Migration Strategy

Due to the huge amount of data and the fact that the old database does not have a globally unique key, data synchronization can only be performed through code scripts. We use segmented queries based on IDs to improve data migration efficiency.

3.4 Data double write

Dual data write-down means that a piece of data is written to two databases or tables at the same time to ensure incremental data consistency between the two databases. In general, you need to consider the following two scenarios:

3.4.1 Before Switching

Before sharding, we need to ensure data consistency because the old database has already adopted database and table sharding. For example, userId=123456, id=123456;unionId="123456_123456". When inserting and updating data, we need to perform a series of operations to ensure the consistency of data between the old and new databases.

Insert data: After the data is written to the old database, extract the userId and id of the old database to form a globally unique key unionId, assign values to the data object of the new database, and then execute the logic of writing to the new database.
Update data: The update logic is updated by userId and id, and the userId and id of the old database are obtained by intercepting the field based on the unionId field in the new database. Then update this data, and the updated data content is based on the data of all fields of the old database, so as to ensure data consistency.

3.4.2 After Cut-Through

After slicing, the logic for updating the data needs to be further considered. You need to intercept the fields to get the userId and id of the old database based on the unionId in the new database, and then update the data of the old database based on this information. This double-write mechanism ensures the consistency of data before and after sharding.

Insert data: After the data is written to the old database, extract the userId and id of the old database to form a globally unique key unionId, assign values to the data object of the new database, and then execute the logic of writing to the new database.
Update data: The update logic is updated by userId and id, at this time, after the new database is updated, because the ID of the new database is not the corresponding ID of the old database, you need to intercept the fields according to the unionId in the new database to get the userId and id of the corresponding old database, and then update the data of the old database according to the userId and id.

3.5 Traffic cut

To ensure a smooth transition, we switch the service to read data from the new database by configuring a switch and monitor whether the service is running normally. At the same time, we promptly notify downstream related parties to switch to the new database, mainly including binary logs, big data, real-time data, and other data warehouse offline tables.

3.6 Stop writing old libraries

Before we stop writing data to the old database, we perform sufficient checks to ensure that no traffic is accessed by the old database (including data synchronization traffic from offline tables such as binlogs). By configuring the switch, we can stop writing data to the old database and monitor whether the service is running normally.

3.7 Reclaim Resources > Clean Code

Once the business monitoring is normal and any traffic from the old database accesses, we can take the old database offline. This includes deleting the configuration of the old data source, the code that writes the data, and the code that links the switches. This step needs to be implemented carefully to ensure the smooth running of the business.

3.8 Project Revenue

By slicing the message table horizontally, we have achieved significant project benefits:

The amount of data has decreased significantly: the data of the new station has decreased from 60 million+ in the old table to 2.3 million+, and the data in a single table has decreased by 95%.
Slow SQL Problem Solved: Slow SQL was reduced from 80+ per month to zero, and system performance was significantly improved.
The query performance of APIs is greatly improved: After switching to a new table, the time taken for 95 lines and 99 lines is increased by 35% and 50%.

Fourth, the advantages and disadvantages of data segmentation

4.1 Pros:

Reduce coupling: Vertical sharding can store different tables in different databases based on business modules, thereby reducing the coupling of system modules. Each business module uses an independent database, which is conducive to independent development, maintenance, and upgrade between modules.
Easy maintenance: Vertical sharding makes the management of different business tables more refined, and different databases can be monitored and maintained to varying degrees according to the importance and characteristics of the business. This helps to identify and solve problems in a timely manner, improving the stability and maintainability of the system.
Improved performance: Horizontal sharding can effectively improve performance in high-concurrency scenarios. By dispersing data into multiple databases or tables, the amount of data in a single database table is reduced, which helps to reduce the bottleneck of I/O, database connections, and hardware resources of a single machine, and improves the concurrent processing capacity of the system.
Flexible Scalability: Vertical slicing allows for more flexible expansion of the system. Since different business tables are stored in different databases, different databases can be independently expanded according to the needs of business growth without affecting the operation of the overall system.

4.2 Cons:

Unable to JOIN across databases: Horizontal sharding prevents some tables from being directly joined, which increases the complexity of development through API aggregation at the business layer. The solution includes asynchronous aggregation through caching, message queues, etc.
Complex distributed transaction processing: In distributed transaction scenarios involving multiple databases, maintaining data consistency becomes more complex. In this case, it is necessary to adopt distributed transaction management tools or ensure transaction consistency through application-layer design.
Difficult to scale data multiple times: As your business grows, it can be difficult to scale your database again. Especially in the case of horizontal slicing, it is necessary to carefully plan for the re-expansion of the data to avoid over-complicating the system.
Query depends on shard key: The query performance of horizontal shard is highly dependent on the shard key. If the query doesn't contain a shard key, it can cause the query to slow down. Therefore, when sharding data, you need to consider the query pattern and try to match the query and the sharding key.

4.3 Applicable Scenarios:

When you decide whether to perform data segmentation, you need to consider the specific business scenarios and requirements. Applicable scenarios include, but are not limited to:

The amount of data is huge, and there are bottlenecks in the read/write and storage performance of a single database.
The system needs to be flexibly expanded, and the correlation between different business tables is not strong.
In high-concurrency scenarios, the concurrent processing capability of the system needs to be improved.
It has high requirements for system stability and maintainability.

In general, data segmentation is a comprehensive strategy, and it is necessary to weigh its advantages and disadvantages according to the actual situation and choose the most suitable solution.

5. Summary

Considering the database optimization scheme for large table governance, we draw the following conclusions:

Use data sharding sparing: Data sharding is a powerful tool, but not all tables need to be sharded. In the early stage of the design, you should evaluate the growth of your business in the next one to two years in advance, and consider whether you can improve the performance of the database through other means, such as hardware upgrades, network upgrades, read/write splitting, and index optimization. Only when the amount of data reaches the bottleneck of a single table, database and table sharding should be considered to avoid over-design and premature optimization.
Horizontal sharding is suitable for scenarios with large amounts of data, which can improve the number of I/O, database connections, and hardware resource bottlenecks to a certain extent, while vertical sharding can reduce the coupling of business systems and perform hierarchical management, maintenance, monitoring, and expansion of data for different businesses.
Consider the advantages and disadvantages in detail: Data sharding has certain advantages, such as improving system stability and load capacity, but it also comes with some disadvantages, such as the inability to join some tables, complex distributed transaction processing, and difficult data expansion. These factors need to be fully considered before implementation, and the pros and cons need to be weighed.
Carefully plan the project practice: When implementing a data sharding project, you need to carefully plan the sharding scheme, data migration, data dual write, business traffic cutting, and stop writing old databases.
Significant benefits: Through the practice of horizontal sharding of the message data table on the site, we have achieved significant project benefits, including a significant reduction in data volume, zeroing of slow SQL statements, and improved interface query performance. This fully shows that data segmentation is an effective optimization method in suitable scenarios.

In summary, database large table governance is a comprehensive project, and various solutions need to be carefully selected and implemented to maximize database performance and ensure system stability and maintainability.

About the Author:

Chen· Small-K，后端研发专家

Source: WeChat public account: Auction yard

Source: https://mp.weixin.qq.com/s/Rg7Le324KKoY-P1nXCTn0g

Governance of large tables for database tuning