
Migrate data from a private message service to a PolarDB cluster

Author: Flash Gene

Introduction

Business Background

As a popular social app, Yingke relies on its private message function as an important bridge for user interaction. As the number of users grew, the original MySQL database could no longer meet the growing demand for data storage. This article describes how we migrated the private messaging service from MySQL to PolarDB to resolve the storage bottleneck and improve the scalability of the system.

Current state of the business

The current database usage of the private message service is as follows:

  • Read and write volumes are both high
  • The database is sharded into N databases and N tables
  • SQL statements use no vendor-specific features
  • The read path is fronted by a Redis caching layer
  • Compute utilization is low, and storage is the bottleneck

The business architecture is shown in Figure 1, and the database usage is shown in Figure 2.


Figure 1


Figure 2

Currently, the private messaging service faces the following challenges:

  • The amount of data keeps growing: 85% of the storage space is already in use, and usage increases every day.
  • The migration must be fast and lossless for the business, require as little intrusion into the current database access code as possible, fully support the MySQL protocol, and support massive storage with dynamic scaling.

Migration scenario exploration

Migration solution selection

Given the situation above, we evaluated which solutions were available based on the current business and cost, considering the feasibility and limitations of each of the following options.

  • Optimize storage
  • Vertical scale-out
  • Horizontal scale-out
  • Distributed databases

Let's analyze the pros and cons of each option one by one.

Optimize storage

1. Archive data from several years ago, which can free up about 20% of the space. However, this introduces a consistency problem between Redis and the DB: if a msgid found in Redis cannot be found in the database, we may need to build a lazy data-loading fallback, and the Redis cache must be cleaned at the same time the data is archived.

2. Compress the content field, freeing up about 30% of the space. This requires modifying the business code and rewriting historical data.

Advantages: no additional machine cost.

Disadvantages: requires program changes plus archiving and compression scripts, and the storage bottleneck remains in the long run.
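The content-field compression mentioned above could be sketched in Go roughly as follows. This is a minimal illustration, not the production code; the function names `compressContent` and `decompressContent` are hypothetical, and the article does not say which compression algorithm was chosen (gzip from the standard library is assumed here):

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// compressContent gzips a message body before it is written to the
// content column; archived historical rows would be rewritten the same way.
func compressContent(plain []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(plain); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decompressContent restores the original message body on the read path.
func decompressContent(packed []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(packed))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

func main() {
	msg := []byte("hello, this is a private message body")
	packed, _ := compressContent(msg)
	restored, _ := decompressContent(packed)
	fmt.Println(bytes.Equal(restored, msg))
}
```

The read path must handle both compressed and legacy uncompressed rows during the transition, which is part of why this option requires touching business code.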

Vertical scale-out

Vertical scale-out means upgrading hardware, which raises two problems:

First, whether the currently selected storage tier can support a larger capacity (for example, growing from 3 TB to 20 TB); our current RDS specification has already reached its maximum disk capacity.

Second, storage expansion may be coupled to compute: storage and compute must be upgraded together, which significantly increases cost. For our current business the bottleneck is storage, not compute, so being forced to upgrade compute just to get more storage drives up the overall cost.

Advantages: transparent to the business; no program changes required.

Disadvantages: higher monthly cost, and compute resources are wasted.


Horizontal scale-out

Because our business is already sharded into multiple databases and tables, we can split half of the shards onto a new cluster to share the storage load; in other words, reduce the amount of data stored per machine by adding machines. If each machine's storage has reached its limit and we want to keep using RDS for MySQL, this solution works. The main steps are: use DTS to synchronize the current data to a newly created cluster, then keep half of the data on each side and delete the other half, and finally serve traffic from both clusters at the same time. The drawback is that we need to write programs to clean up the data.

Advantages: storage per cluster can be cut in half, or even further.

Disadvantages: data-cleaning programs must be written, which adds cost.
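The shard-to-cluster split described above could look roughly like this in Go. This is a hedged sketch, not the actual routing code: the shard count, the FNV hash, and the names `shardFor`/`clusterFor` are all assumptions for illustration. The key point it shows is that the hash routing stays unchanged; only the mapping from shard to cluster changes after the split:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const totalShards = 8 // stand-in for the N databases/tables in the original scheme

// shardFor keeps the original hash routing unchanged across the split.
func shardFor(uid string) int {
	h := fnv.New32a()
	h.Write([]byte(uid))
	return int(h.Sum32()) % totalShards
}

// clusterFor sends the lower half of the shards to the old cluster and
// the upper half to the newly created one; each cluster then deletes
// the rows belonging to the other half.
func clusterFor(uid string) string {
	if shardFor(uid) < totalShards/2 {
		return "rds-old"
	}
	return "rds-new"
}

func main() {
	for _, uid := range []string{"u1001", "u1002", "u1003"} {
		fmt.Printf("%s -> shard %d on %s\n", uid, shardFor(uid), clusterFor(uid))
	}
}
```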


Distributed storage databases

After comparing various distributed database products, we selected PolarDB for MySQL as the migration target. PolarDB's storage-compute separation, high availability, and horizontal scalability address our current problems well. In such a distributed database, the upper layer implements the parsing layer, optimized execution layer, and engine layer for mainstream database syntax, relies on various logs for synchronization and reliability, and finally places the data on a distributed file system to achieve multiple replicas and high availability.

How does PolarDB for MySQL differ from MySQL in terms of storage? As shown in the following figure, the distributed database keeps a single global copy of the data (for availability and consistency, each node still synchronizes and coordinates that data). RDS MySQL, by contrast, was designed in an era with far less data to store, so the primary and each replica must each hold a complete copy of the data.

Because all compute nodes in a PolarDB cluster share one copy of the data, adding read replicas does not increase storage cost, which saves us a great deal of storage space.


Migration implementation strategy

Overview of migration scenarios

For the migration itself, we considered two strategies: offline (shutdown) migration and online migration, and finally adopted the online strategy.

We created a new PolarDB instance and enabled DTS data synchronization. After synchronization caught up, we performed the switchover during off-peak hours to keep the migration as transparent to the service as possible.

Offline migration steps

  1. Create a PolarDB for MySQL instance
  2. Enable DTS data synchronization
  3. After synchronization catches up, stop the service pods during an off-peak window and wait until DTS is fully consistent
  4. Change the program's database connection to PolarDB
  5. Release the service again
  6. Migration complete

Online migration steps

Preparing for the migration:

  1. DBA: Create a PolarDB for MySQL instance in advance
  2. DBA: Enable DTS data synchronization
  3. R&D: Program modification
    1. Add dual-write connection information so that connection pools for both MySQL and PolarDB exist at the same time
    2. Implement stop-write handling. Because a Redis cache layer sits in front of the database, data can be written to the cache first and written to MySQL with a delay. Specifically:
      1. Insert operations: use a channel to temporarily hold the written data
      2. Update and delete operations: block the goroutine with a sleep loop before executing against the database
    3. Add a migration switch, stored in Redis, to control the state at each step:
      1. 1: read and write MySQL
      2. 2: stop writing
      3. 3: dual write, read PolarDB
      4. 4: read and write PolarDB
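The four switch states above can be sketched as a routing table in Go. This is an illustrative sketch only: in the real service the state is fetched from Redis on each request, while here it is passed in directly, and the names `routeFor`/`route` are hypothetical:

```go
package main

import "fmt"

// Migration switch states, mirroring the values stored in Redis.
const (
	stateMySQL     = 1 // read and write MySQL
	stateStopWrite = 2 // writes suspended, reads still hit MySQL
	stateDualWrite = 3 // write both, read PolarDB
	statePolarDB   = 4 // read and write PolarDB only
)

type route struct {
	readFrom string
	writeTo  []string // empty while writes are suspended
}

// routeFor maps the switch value (fetched from Redis per request in
// the real service) to a read/write routing decision.
func routeFor(state int) route {
	switch state {
	case stateStopWrite:
		return route{readFrom: "mysql", writeTo: nil}
	case stateDualWrite:
		return route{readFrom: "polardb", writeTo: []string{"mysql", "polardb"}}
	case statePolarDB:
		return route{readFrom: "polardb", writeTo: []string{"polardb"}}
	default: // stateMySQL and any unknown value fall back to MySQL
		return route{readFrom: "mysql", writeTo: []string{"mysql"}}
	}
}

func main() {
	for s := stateMySQL; s <= statePolarDB; s++ {
		r := routeFor(s)
		fmt.Printf("state %d: read %s, write %v\n", s, r.readFrom, r.writeTo)
	}
}
```

Falling back to MySQL for unknown values is a deliberately conservative default, since MySQL remains the source of truth until state 4.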

During the migration:

  1. R&D: During an off-peak period, set the switch state to 2 to stop writing data
  2. DBA: Observe until DTS has synchronized all MySQL data to PolarDB, then stop DTS (this takes about 1-2 minutes)
  3. R&D: Set the switch state to 3 to start dual writing
  4. R&D and DBA: Check that the row counts in the two databases match, watch the error logs, and send private messages in the app to verify availability

After the migration is complete:

  1. R&D: If an error is found, set the switch state to 1 to switch back to reading and writing MySQL
  2. R&D: Observe for 1-2 days; if there is no problem, set the switch state to 4 to read and write PolarDB exclusively, replace the MySQL connection information in the program with PolarDB's, remove the switch logic from the code, and release again
  3. DBA: Observe whether MySQL still receives traffic; if not, drop the instance
  4. Migration complete

Online migration explained

Our goal is a Redis-backed switch that lets the program move between MySQL and PolarDB at any time, plus a stop-write state. Stopping writes lets us guarantee that the data in MySQL is fully synchronized to PolarDB and that the two are exactly the same, after which traffic writes to both MySQL and PolarDB. If any exception is found, everything can be switched back to MySQL; otherwise everything is switched to PolarDB to complete the changeover. The switch is then retired: before the final release, the code is changed once to read and write PolarDB only.

One problem remains: how to stop writing data at runtime for insert, update, and delete operations. After analyzing the code, we found we could stage inserts in a channel and delay the other writes by blocking their goroutines. Whether this strategy fits depends on the business: can it tolerate 1-2 minutes of data inconsistency? If so, estimate how much memory the staged channel data and blocked goroutines will occupy in those 1-2 minutes and whether memory needs to be increased in advance; then pick the moment of day with the least traffic and use the write QPS at that time to calculate the required channel buffer size and the number of goroutines that will block. After doing the math, this was entirely acceptable for us. It is also best to log every one of these steps.
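The channel staging and goroutine blocking could be sketched as follows. This is a simplified illustration under stated assumptions: `switchState` stands in for the value polled from Redis, the buffer capacity of 50 comes from the sizing example, and the database calls are elided:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// switchState stands in for the migration switch polled from Redis.
var switchState atomic.Int32

// addBuffer temporarily holds inserts while writes are suspended
// (state 2); its capacity is sized from the off-peak write rate.
var addBuffer = make(chan string, 50)

// doInsert stages the row in the channel instead of writing it while
// writes are stopped.
func doInsert(row string) {
	if switchState.Load() == 2 {
		addBuffer <- row // staged in memory, flushed after the switch
		return
	}
	// ... normal database insert ...
}

// doUpdate blocks its goroutine with a sleep loop until writing is
// allowed again, then executes.
func doUpdate(row string) {
	for switchState.Load() == 2 {
		time.Sleep(100 * time.Millisecond)
	}
	// ... normal database update ...
}

func main() {
	switchState.Store(2) // enter stop-write
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		doUpdate("msg-42") // blocks until the state changes
	}()
	doInsert("msg-41") // buffered immediately

	switchState.Store(3) // DTS caught up: resume writes, dual-write
	wg.Wait()
	fmt.Println("buffered inserts:", len(addBuffer))
}
```

Note that, as the Precautions below point out, anything staged in `addBuffer` lives only in process memory, so a panic during the stop-write window would lose it; logging each staged write mitigates this.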

For example, at the 6 a.m. traffic trough there are 500 inserts and 100 updates per minute (QPM), spread across 10 business pods:

Insert: 500 / 10 = 50 (channel buffer length per pod)

Update: 100 / 10 = 10 (blocked goroutines per pod)

That is, in that minute each pod blocks at most 10 goroutines and stages at most 50 records in the channel, which has a negligible memory footprint.
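The sizing arithmetic can be written down as a trivial check (the constants are the example numbers above; `perPod` is just a naming convenience):

```go
package main

import "fmt"

// perPod spreads a cluster-wide per-minute write rate across the pods.
func perPod(qpm, pods int) int { return qpm / pods }

func main() {
	const (
		addQPM    = 500 // inserts per minute at the 6 a.m. trough
		updateQPM = 100 // updates per minute at the 6 a.m. trough
		pods      = 10  // business instances
	)
	// Channel buffer length and blocked-goroutine count needed per pod
	// to absorb one minute of suspended writes.
	fmt.Println("chan buffer per pod:", perPod(addQPM, pods))
	fmt.Println("blocked goroutines per pod:", perPod(updateQPM, pods))
}
```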

The following diagram shows the switch states across the whole process:


PolarDB metrics after the migration

After the migration, PolarDB's monitoring metrics also met expectations.


Summary

Precautions

  1. If the process panics while writes are stopped, the staged in-memory data is lost. Rehearse the code thoroughly in a test environment, and to be safe keep a log on an attached persistent storage volume, so that even if the pod dies there is still a place to recover the data from.
  2. Online migration requires a complete grasp of the current business code; no database operation can be missed, otherwise the data migration will be incomplete.

Migration results

  1. Compared with an offline migration requiring a service outage, online migration has far less impact on the business
  2. After the migration, the database cost was reduced by 18% compared with MySQL
  3. P99 latency is about 40 ms, in line with business expectations
  4. After the migration, compute and storage are separated, and up to 100 TB of storage is supported

Author: Wang Jichao

Source: WeChat public account "Yingke Technology"

Source: https://mp.weixin.qq.com/s/AywOQDw4t_XmChZ0TU_YIA
