
Migrate data from a private message service to a PolarDB cluster

Author: Flash Gene

Introduction

Business Background

As a popular social app, Yingke relies on its private message function as an important bridge for user interaction. As the number of users grew, the original MySQL database could no longer meet the growing demand for data storage. This article describes how we migrated the private messaging service from MySQL to PolarDB to resolve the storage bottleneck and improve the scalability of the system.

Current state of the business

The current database usage of the private message service is as follows:

  • Read and write volumes are both high
  • The database is sharded into N databases and N tables
  • SQL statements use no vendor-specific features
  • The read path is fronted by a Redis caching layer
  • Compute utilization is low, and storage is the bottleneck

The business architecture is shown in Figure 1, and the database usage is shown in Figure 2.


Figure 1


Figure 2

Currently, the private messaging service faces the following challenges:

  • The amount of data keeps growing: 85% of the storage space is already in use, and usage increases every day.
  • The migration must be fast and lossless for the business, require as little intrusion into the current database access code as possible, fully support the MySQL protocol, and support massive storage with dynamic scaling.

Migration scenario exploration

Migration solution selection

Given the situation above, we evaluated which solutions were available based on the current business and cost, considering the feasibility and limitations of each of the following options.

  • Optimize storage
  • Vertical scale-out
  • Horizontal scale-out
  • Distributed databases

Let's analyze the pros and cons of each option one by one.

Optimize storage

1. Archive data from several years ago, which can free up about 20% of the space. However, this introduces a consistency problem between Redis and the DB: if a msgid found in Redis cannot be found in the database, we may need to build a lazy data-loading fallback, and the Redis cache must be cleaned at the same time the data is archived.

2. Compress the content field, freeing up about 30% of the space. This requires modifying the business code and rewriting historical data.

Advantages: no additional machine cost.

Disadvantages: requires program changes plus archiving and compression scripts, and the storage bottleneck remains in the long run.
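The content-field compression mentioned above could be sketched in Go roughly as follows. This is a minimal illustration, not the production code; the function names `compressContent` and `decompressContent` are hypothetical, and the article does not say which compression algorithm was chosen (gzip from the standard library is assumed here):

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// compressContent gzips a message body before it is written to the
// content column; archived historical rows would be rewritten the same way.
func compressContent(plain []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(plain); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decompressContent restores the original message body on the read path.
func decompressContent(packed []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(packed))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

func main() {
	msg := []byte("hello, this is a private message body")
	packed, _ := compressContent(msg)
	restored, _ := decompressContent(packed)
	fmt.Println(bytes.Equal(restored, msg))
}
```

The read path must handle both compressed and legacy uncompressed rows during the transition, which is part of why this option requires touching business code.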

Vertical scale-out

Vertical scale-out means upgrading hardware, which raises two problems:

First, whether the currently selected storage tier can support a larger capacity (for example, growing from 3 TB to 20 TB); our current RDS specification has already reached its maximum disk capacity.

Second, storage expansion may be coupled to compute: storage and compute must be upgraded together, which significantly increases cost. For our current business the bottleneck is storage, not compute, so being forced to upgrade compute just to get more storage drives up the overall cost.

Advantages: transparent to the business; no program changes required.

Disadvantages: higher monthly cost, and compute resources are wasted.


Horizontal scale-out

Because our business is already sharded into multiple databases and tables, we can split half of the shards onto a new cluster to share the storage load; in other words, reduce the amount of data stored per machine by adding machines. If each machine's storage has reached its limit and we want to keep using RDS for MySQL, this solution works. The main steps are: use DTS to synchronize the current data to a newly created cluster, then keep half of the data on each side and delete the other half, and finally serve traffic from both clusters at the same time. The drawback is that we need to write programs to clean up the data.

Advantages: storage per cluster can be cut in half, or even further.

Disadvantages: data-cleaning programs must be written, which adds cost.
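The shard-to-cluster split described above could look roughly like this in Go. This is a hedged sketch, not the actual routing code: the shard count, the FNV hash, and the names `shardFor`/`clusterFor` are all assumptions for illustration. The key point it shows is that the hash routing stays unchanged; only the mapping from shard to cluster changes after the split:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const totalShards = 8 // stand-in for the N databases/tables in the original scheme

// shardFor keeps the original hash routing unchanged across the split.
func shardFor(uid string) int {
	h := fnv.New32a()
	h.Write([]byte(uid))
	return int(h.Sum32()) % totalShards
}

// clusterFor sends the lower half of the shards to the old cluster and
// the upper half to the newly created one; each cluster then deletes
// the rows belonging to the other half.
func clusterFor(uid string) string {
	if shardFor(uid) < totalShards/2 {
		return "rds-old"
	}
	return "rds-new"
}

func main() {
	for _, uid := range []string{"u1001", "u1002", "u1003"} {
		fmt.Printf("%s -> shard %d on %s\n", uid, shardFor(uid), clusterFor(uid))
	}
}
```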


Distributed storage databases

After comparing various distributed database products, we selected PolarDB for MySQL as the migration target. PolarDB's storage-compute separation, high availability, and horizontal scalability address our current problems well. In such a distributed database, the upper layer implements the parsing layer, optimized execution layer, and engine layer for mainstream database syntax, relies on various logs for synchronization and reliability, and finally places the data on a distributed file system to achieve multiple replicas and high availability.

How does PolarDB for MySQL differ from MySQL in terms of storage? As shown in the following figure, the distributed database keeps a single global copy of the data (for availability and consistency, each node still synchronizes and coordinates that data). RDS MySQL, by contrast, was designed in an era with far less data to store, so the primary and each replica must each hold a complete copy of the data.

Because all compute nodes in a PolarDB cluster share one copy of the data, adding read replicas does not increase storage cost, which saves us a great deal of storage space.


Migration implementation strategy

Overview of migration scenarios

For the migration itself, we considered two strategies: offline (shutdown) migration and online migration, and finally adopted the online strategy.

We created a new PolarDB instance and enabled DTS data synchronization. After synchronization caught up, we performed the switchover during off-peak hours to keep the migration as transparent to the service as possible.

Offline migration steps

  1. Create a PolarDB for MySQL instance
  2. Enable DTS data synchronization
  3. After synchronization catches up, stop the service pods during an off-peak window and wait until DTS is fully consistent
  4. Change the program's database connection to PolarDB
  5. Release the service again
  6. Migration complete

Online migration steps

Preparing for the migration:

  1. DBA: Create a PolarDB for MySQL instance in advance
  2. DBA: Enable DTS data synchronization
  3. R&D: Program modification
    1. Add dual-write connection information so that connection pools for both MySQL and PolarDB exist at the same time
    2. Implement stop-write handling. Because a Redis cache layer sits in front of the database, data can be written to the cache first and written to MySQL with a delay. Specifically:
      1. Insert operations: use a channel to temporarily hold the written data
      2. Update and delete operations: block the goroutine with a sleep loop before executing against the database
    3. Add a migration switch, stored in Redis, to control the state at each step:
      1. 1: read and write MySQL
      2. 2: stop writing
      3. 3: dual write, read PolarDB
      4. 4: read and write PolarDB
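The four switch states above can be sketched as a routing table in Go. This is an illustrative sketch only: in the real service the state is fetched from Redis on each request, while here it is passed in directly, and the names `routeFor`/`route` are hypothetical:

```go
package main

import "fmt"

// Migration switch states, mirroring the values stored in Redis.
const (
	stateMySQL     = 1 // read and write MySQL
	stateStopWrite = 2 // writes suspended, reads still hit MySQL
	stateDualWrite = 3 // write both, read PolarDB
	statePolarDB   = 4 // read and write PolarDB only
)

type route struct {
	readFrom string
	writeTo  []string // empty while writes are suspended
}

// routeFor maps the switch value (fetched from Redis per request in
// the real service) to a read/write routing decision.
func routeFor(state int) route {
	switch state {
	case stateStopWrite:
		return route{readFrom: "mysql", writeTo: nil}
	case stateDualWrite:
		return route{readFrom: "polardb", writeTo: []string{"mysql", "polardb"}}
	case statePolarDB:
		return route{readFrom: "polardb", writeTo: []string{"polardb"}}
	default: // stateMySQL and any unknown value fall back to MySQL
		return route{readFrom: "mysql", writeTo: []string{"mysql"}}
	}
}

func main() {
	for s := stateMySQL; s <= statePolarDB; s++ {
		r := routeFor(s)
		fmt.Printf("state %d: read %s, write %v\n", s, r.readFrom, r.writeTo)
	}
}
```

Falling back to MySQL for unknown values is a deliberately conservative default, since MySQL remains the source of truth until state 4.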

During the migration:

  1. R&D: During an off-peak period, set the switch state to 2 to stop writing data
  2. DBA: Observe until DTS has synchronized all MySQL data to PolarDB, then stop DTS (this takes about 1-2 minutes)
  3. R&D: Set the switch state to 3 to start dual writing
  4. R&D and DBA: Check that the row counts in the two databases match, watch the error logs, and send private messages in the app to verify availability

After the migration is complete:

  1. R&D: If an error is found, set the switch state to 1 to switch back to reading and writing MySQL
  2. R&D: Observe for 1-2 days; if there is no problem, set the switch state to 4 to read and write PolarDB exclusively, replace the MySQL connection information in the program with PolarDB's, remove the switch logic from the code, and release again
  3. DBA: Observe whether MySQL still receives traffic; if not, drop the instance
  4. Migration complete

Online migration explained

Our goal is a Redis-backed switch that lets the program move between MySQL and PolarDB at any time, plus a stop-write state. Stopping writes lets us guarantee that the data in MySQL is fully synchronized to PolarDB and that the two are exactly the same, after which traffic writes to both MySQL and PolarDB. If any exception is found, everything can be switched back to MySQL; otherwise everything is switched to PolarDB to complete the changeover. The switch is then retired: before the final release, the code is changed once to read and write PolarDB only.

One problem remains: how to stop writing data at runtime for insert, update, and delete operations. After analyzing the code, we found we could stage inserts in a channel and delay the other writes by blocking their goroutines. Whether this strategy fits depends on the business: can it tolerate 1-2 minutes of data inconsistency? If so, estimate how much memory the staged channel data and blocked goroutines will occupy in those 1-2 minutes and whether memory needs to be increased in advance; then pick the moment of day with the least traffic and use the write QPS at that time to calculate the required channel buffer size and the number of goroutines that will block. After doing the math, this was entirely acceptable for us. It is also best to log every one of these steps.
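The channel staging and goroutine blocking could be sketched as follows. This is a simplified illustration under stated assumptions: `switchState` stands in for the value polled from Redis, the buffer capacity of 50 comes from the sizing example, and the database calls are elided:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// switchState stands in for the migration switch polled from Redis.
var switchState atomic.Int32

// addBuffer temporarily holds inserts while writes are suspended
// (state 2); its capacity is sized from the off-peak write rate.
var addBuffer = make(chan string, 50)

// doInsert stages the row in the channel instead of writing it while
// writes are stopped.
func doInsert(row string) {
	if switchState.Load() == 2 {
		addBuffer <- row // staged in memory, flushed after the switch
		return
	}
	// ... normal database insert ...
}

// doUpdate blocks its goroutine with a sleep loop until writing is
// allowed again, then executes.
func doUpdate(row string) {
	for switchState.Load() == 2 {
		time.Sleep(100 * time.Millisecond)
	}
	// ... normal database update ...
}

func main() {
	switchState.Store(2) // enter stop-write
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		doUpdate("msg-42") // blocks until the state changes
	}()
	doInsert("msg-41") // buffered immediately

	switchState.Store(3) // DTS caught up: resume writes, dual-write
	wg.Wait()
	fmt.Println("buffered inserts:", len(addBuffer))
}
```

Note that, as the Precautions below point out, anything staged in `addBuffer` lives only in process memory, so a panic during the stop-write window would lose it; logging each staged write mitigates this.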

For example, at the 6 a.m. traffic trough there are 500 inserts and 100 updates per minute (QPM), spread across 10 business pods:

Insert: 500 / 10 = 50 (channel buffer length per pod)

Update: 100 / 10 = 10 (blocked goroutines per pod)

That is, in that minute each pod blocks at most 10 goroutines and stages at most 50 records in the channel, which has a negligible memory footprint.
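The sizing arithmetic can be written down as a trivial check (the constants are the example numbers above; `perPod` is just a naming convenience):

```go
package main

import "fmt"

// perPod spreads a cluster-wide per-minute write rate across the pods.
func perPod(qpm, pods int) int { return qpm / pods }

func main() {
	const (
		addQPM    = 500 // inserts per minute at the 6 a.m. trough
		updateQPM = 100 // updates per minute at the 6 a.m. trough
		pods      = 10  // business instances
	)
	// Channel buffer length and blocked-goroutine count needed per pod
	// to absorb one minute of suspended writes.
	fmt.Println("chan buffer per pod:", perPod(addQPM, pods))
	fmt.Println("blocked goroutines per pod:", perPod(updateQPM, pods))
}
```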

The following diagram shows the switch states across the whole process:


PolarDB metrics after the migration

After the migration, PolarDB's monitoring metrics also met expectations.


Summary

Precautions

  1. If the process panics while writes are stopped, the staged in-memory data is lost. Rehearse the code thoroughly in a test environment, and to be safe keep a log on an attached persistent storage volume, so that even if the pod dies there is still a place to recover the data from.
  2. Online migration requires a complete grasp of the current business code; no database operation can be missed, otherwise the data migration will be incomplete.

Migration results

  1. Compared with an offline migration requiring a service outage, online migration has far less impact on the business
  2. After the migration, the database cost was reduced by 18% compared with MySQL
  3. P99 latency is about 40 ms, in line with business expectations
  4. After the migration, compute and storage are separated, and up to 100 TB of storage is supported

Author: Wang Jichao

Source: WeChat public account "Yingke Technology"

Source: https://mp.weixin.qq.com/s/AywOQDw4t_XmChZ0TU_YIA
