
Breaking through the data storage bottleneck: optimizing 100-million-row data storage in Zhuanzhuan's business-finance system

Author: Flash Gene

1. Background

1.1 Status

At present, Zhuanzhuan's business-finance system receives data from various upstream business systems (order, OMS, payment, after-sales, and so on), converts it into financial data, and ultimately outputs financial reports and indicator data to support the company's financial management and decision-making.

Construction of the business-finance system began in 2021. To meet a tight launch schedule, the team chose to take in the upstream business systems' data directly. As time went on, however, the data volume kept growing until the system reached the edge of what it could handle, causing many problems. We therefore needed to optimize the data storage.

1.2 Data volume statistics

Row counts and sizes of the largest tables in the business-finance system:

Table Name | Rows | Data Length | Index Length
Outbound detail table | 106,280,176 | 29.48 GB | 34 GB
Outbound order header table | 25,344,110 | 7 GB | 6 GB
Inbound detail table | 22,766,910 | 8 GB | 5 GB
Sales order table | 29,578,659 | 10 GB | 9 GB
Notes receivable table | 24,686,267 | 5 GB | 2 GB
Inbound order table | 20,777,457 | 4 GB | 6 GB
Accounts payable table | 15,387,724 | 4 GB | 2 GB
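Figures like these can be pulled straight from MySQL's data dictionary. As a reference point, a sketch along the following lines reads per-table row counts and data/index sizes from information_schema (the connection details and schema name are placeholders, and TABLE_ROWS is only an estimate for InnoDB):

```python
# A minimal sketch: list per-table row counts and data/index sizes
# from MySQL's information_schema. Schema name and credentials are
# hypothetical; TABLE_ROWS is an InnoDB estimate, not an exact count.
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="reader", password="...",
                       database="information_schema")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT TABLE_NAME,
               TABLE_ROWS,
               ROUND(DATA_LENGTH  / 1024 / 1024 / 1024, 2) AS data_gb,
               ROUND(INDEX_LENGTH / 1024 / 1024 / 1024, 2) AS index_gb
        FROM TABLES
        WHERE TABLE_SCHEMA = 'finance_db'   -- hypothetical schema name
        ORDER BY DATA_LENGTH DESC
        """
    )
    for name, rows, data_gb, index_gb in cur.fetchall():
        print(f"{name}: {rows} rows, {data_gb} GB data, {index_gb} GB index")
conn.close()
```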

The chart below shows the monthly data growth of the tables with the largest volumes. Because new business lines have been added in recent months, the monthly increment has reached the 10-million-row level.


1.3 Slow queries

The slow-query monitoring platform shows roughly a thousand slow queries per day. Slow queries not only hurt the user experience but also consume a large amount of machine resources, and in serious cases can bring a machine down. Moreover, the MySQL deployment is single-machine multi-instance: several cluster instances share one physical machine, so a problem affects not only this system but can also drag down other clusters, snowballing.
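For readers without such a platform, the heaviest statement patterns can also be surfaced directly from performance_schema. A sketch, with the threshold and connection details purely illustrative:

```python
# A sketch of finding the heaviest statement patterns via
# performance_schema digests. Timer columns are in picoseconds,
# so 1e12 corresponds to one second.
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="reader", password="...",
                       database="performance_schema")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT DIGEST_TEXT,
               COUNT_STAR,
               ROUND(AVG_TIMER_WAIT / 1e12, 3) AS avg_seconds
        FROM events_statements_summary_by_digest
        WHERE DIGEST_TEXT IS NOT NULL
          AND AVG_TIMER_WAIT > 1e12   -- statements averaging over 1 second
        ORDER BY SUM_TIMER_WAIT DESC
        LIMIT 20
        """
    )
    for digest, count, avg_s in cur.fetchall():
        print(f"{count:>8} runs, avg {avg_s}s: {digest[:120]}")
conn.close()
```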


2. Design goals

2.1 Solve the data volume problem

For the next five years, the system should not need to worry about database data volume: it should comfortably absorb future business growth, cover the company's entire business, scale well, and keep stably delivering more data reports externally.

2.2 Improve read and write performance

Through this optimization we aim to improve report query efficiency, shorten the execution time of scheduled tasks, avoid the task failures and interface timeouts caused by slow queries, and improve service stability.

3. Solution selection

3.1 DB storage solution selection

To solve the data volume problem in the underlying tables, we compared the following four solutions:

  • Solution 1: Database and table sharding
  • Advantages
  1. Data is spread across multiple databases and tables, reducing the load on any single database. This improves read/write performance and response speed and reduces query latency.
  2. The split tables share the same structure, so program changes are small.
  • Disadvantages
  1. Sharding rules must be planned in advance; once set, they are hard to change, so scalability is relatively poor.
  2. Split rules are hard to abstract.
  3. Cross-database transactions are a problem.
  • Applicable scenarios
  1. Under high-concurrency access and massive data, both table sharding and database sharding are needed to expand the system's concurrent processing capacity and improve single-table query performance.
  2. The data has a unified business primary key, so it can be distributed evenly.
  • Applicability to the business-finance system
  1. As an underlying system, the business-finance system ingests data from many business systems. The data is diverse and complex, a business primary key is hard to define, and even distribution is hard to achieve.
  2. If one business's data grows rapidly, or more business data is onboarded, data volume problems may recur.
  • Solution 2: Hot/cold data separation
  • Advantages
  1. Moving infrequently accessed data from online storage to archive storage reduces online storage capacity requirements, and thus storage costs.
  2. With less data in online storage, database read/write performance improves.
  3. Historical data can be kept long-term, avoiding data loss.
  4. Data can be backed up to a separate storage location for recovery when needed.
  • Disadvantages
  1. Archiving must be transactional, so that archived data never appears in both the hot and cold stores at once and gets duplicated.
  2. An archiving strategy must be chosen that does not impact service access.
  3. Clear business boundaries are required; complex business data is not a good fit.
  • Applicable scenarios
  1. The database holds a large amount of historical data that is queried infrequently.
  2. Writes to the database are more frequent than reads.
  3. Database storage costs are high and need to be reduced.
  • Applicability to the business-finance system
  1. The system's business data is complex, historical data is still modified and queried at this stage, time calibers are not uniform, and the hot/cold boundary is blurred, so no precise boundary can be drawn.
  2. With more business data to be onboarded later and no unified data format today, the boundary questions would have to be revisited repeatedly.
  • Solution 3: TiDB
  • Advantages
  1. Highly compatible with MySQL: in most cases you can migrate from MySQL to TiDB without changing code.
  2. Horizontal elastic scaling: TiDB scales out simply by adding nodes, so throughput and storage can grow on demand, easily handling high-concurrency, massive-data scenarios.
  • Disadvantages
  1. A few MySQL features and behaviors are still unsupported in TiDB or behave differently.
  2. The system is complex, with many components.
  • Applicable scenarios
  1. Scenarios requiring strong data consistency, high reliability, high system availability, scalability, and disaster recovery.
  2. OLTP scenarios demanding large storage capacity, scalability, and high concurrency.
  3. Data aggregation and secondary processing.
  • Applicability to the business-finance system
  1. Since TiDB is compatible with MySQL, few changes are needed.
  2. Data volume stops being a concern for years to come, and more diverse data can be onboarded.
  3. TiDB supports adding and dropping columns on large tables, giving the scalability that fits the current state of the business-finance system.
  • Solution 4: OceanBase
  • Advantages
  1. High performance: a read/write-splitting architecture divides data into baseline data and incremental data. Incremental data lives in memory (MemTable) while baseline data lives on SSD; modifications are written as incremental data and touch only memory, so DML is an all-in-memory operation with very high performance.
  2. High compatibility: compatible with common MySQL/Oracle features and wire protocols, so services can be migrated from MySQL/Oracle to OceanBase with zero or minimal changes.
  3. High availability: data is stored in multiple replicas, and failure of a minority of replicas does not affect data availability.
  • Disadvantages
  1. Environment requirements are demanding; the specified servers must be purchased and used.
  2. Learning and O&M costs are relatively high.
  3. Although OceanBase offers high availability, that still depends on the stability of the underlying hardware and network.
  • Applicable scenarios
  1. Financial-grade data reliability requirements: in financial settings, every committed transaction's logs must be synchronized and persisted across multiple data centers in real time.
  2. Databases facing rapidly growing business data volumes.
  • Applicability to the business-finance system
  1. Our O&M team does not currently maintain OceanBase, so this solution was not pursued; readers can judge whether it fits their own systems.

Based on the analysis above, TiDB is currently the most suitable choice for transforming the business-finance system: it solves the data volume problem in a short time, at a relatively low modification cost.
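One reason the modification cost is low: TiDB speaks the MySQL wire protocol, so existing driver code largely works unchanged. A sketch of the point, with hypothetical hostnames (TiDB listens on port 4000 by default):

```python
# Because TiDB is MySQL-wire-compatible, the same driver code can target
# MySQL or TiDB; only the endpoint changes. Hostnames are hypothetical.
import pymysql

def get_conn(use_tidb: bool):
    host, port = ("tidb.internal", 4000) if use_tidb else ("mysql.internal", 3306)
    return pymysql.connect(host=host, port=port, user="app",
                           password="...", database="finance_db")

conn = get_conn(use_tidb=True)
with conn.cursor() as cur:
    cur.execute("SELECT VERSION()")   # TiDB reports a MySQL-compatible version string
    print(cur.fetchone()[0])
conn.close()
```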

3.2 Slow-query optimization solution

Analysis of the slow statements showed that most slow queries were caused by multi-table join queries, so that is the main problem addressed this time. The comparison below covers the candidate solutions for eliminating joins; based on the applicability analysis, we chose the ES solution.

  • Wide table
  1. A wide table may contain a large amount of duplicate data, wasting storage space and increasing the database's storage requirements, especially on large datasets.
  2. With many columns and associated data involved, later performance tuning takes more care and may require complex indexing strategies.
  3. Complexity increases, and the amount of change is relatively large.
  • ES
  1. Solves the join problem by building indexes, which also improves query efficiency.
  2. Good follow-up extensibility; new query conditions are easy to add.
  3. The data source must be kept consistent with the Elasticsearch data.
  4. Reduces the index volume in the existing database.
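To make the comparison concrete, here is a sketch of what a former multi-table join looks like once the data is denormalized into a single ES index. The index name, fields, and values are invented for illustration, not the actual model:

```python
# A sketch: a query that used to join several tables collapses into one
# filter query against a denormalized index. All names are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://es.internal:9200")

resp = es.search(
    index="finance_receivable",
    query={
        "bool": {
            "filter": [
                {"term":  {"business_line": "after_sales"}},
                {"range": {"biz_time": {"gte": "2023-01-01", "lt": "2023-02-01"}}},
            ]
        }
    },
    size=50,
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["order_no"], hit["_source"]["amount"])
```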

4. Solution practice

4.1 Implementation steps of the solution

According to the solution-selection analysis, the best fit for the financial system's current state is to switch the underlying data store first and connect to ES afterwards. Before implementing the two schemes we had to decide their order based on the system's current state. Given the sudden growth in data volume, and considering both existing and upcoming business without disrupting current use, data volume was the first problem to solve, so we recommended switching the underlying data store first. The advantage is that even if later steps hit problems, we can still roll back to the original data store, which preserves data integrity and reduces implementation risk. Had we connected to Elasticsearch first, we would have had to work out how to guarantee data integrity during the later storage switch, and the synchronization mechanism would have had to bridge two different storage schemes, adding considerable extra work and risk.

In summary, our chosen order was to switch the underlying data store first and connect it to Elasticsearch once it was stable. This effectively solves the immediate data volume problem while preserving system stability and data integrity. Afterwards we can keep integrating ES to further optimize the financial system's performance.

4.2 Steps for switching the underlying data store

When choosing the migration method, we considered that the business-finance system's real-time requirements are not strict. After evaluating how data is currently accessed and written, a write stop of a few minutes was acceptable, which greatly reduced the cost of the whole migration.

The migration had to:

  1. Verify that TiDB is compatible with the SQL statements in the current services, so that nothing breaks after migration.
  2. Guarantee data integrity: after migration, the MySQL and TiDB data must match exactly (a minimal consistency check is sketched below).
  3. Allow an immediate rollback to MySQL if anything goes wrong during migration, without affecting system availability.
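For point 2, PingCAP ships sync-diff-inspector for comparing MySQL and TiDB data properly. Purely to illustrate the idea, a minimal hand-rolled check might compare per-table row counts and a coarse checksum on both sides; the table names, key columns, and hosts below are hypothetical:

```python
# A minimal sketch of a post-migration consistency check: compare row
# counts and a coarse checksum per table between MySQL and TiDB.
# (sync-diff-inspector does this job for real; this only shows the idea.)
import pymysql

TABLES = ["outbound_detail", "sales_order", "notes_receivable"]  # hypothetical

def snapshot(host, port):
    conn = pymysql.connect(host=host, port=port, user="checker",
                           password="...", database="finance_db")
    result = {}
    with conn.cursor() as cur:
        for t in TABLES:
            # CRC32 over key columns gives a cheap, order-independent checksum.
            cur.execute(
                f"SELECT COUNT(*), "
                f"COALESCE(SUM(CRC32(CONCAT_WS(',', id, updated_at))), 0) "
                f"FROM {t}"
            )
            result[t] = cur.fetchone()
    conn.close()
    return result

mysql_side = snapshot("mysql.internal", 3306)
tidb_side = snapshot("tidb.internal", 4000)
for t in TABLES:
    status = "OK" if mysql_side[t] == tidb_side[t] else "MISMATCH"
    print(f"{t}: {status} mysql={mysql_side[t]} tidb={tidb_side[t]}")
```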

4.3 Connecting to ES

  1. Based on the functions of the report query pages and analysis of the join SQL, we designed the index model; its core goal is to optimize query performance and improve the system's response speed (a sketch of such a mapping follows this list).
  2. With the index model in place, we had to decide how incremental data would be synchronized from the database (DB) to Elasticsearch (ES).
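As referenced in point 1, here is a sketch of creating a denormalized index for report queries. The index name and fields are illustrative, not the actual model:

```python
# A sketch of a denormalized report index: one document carries fields
# that previously lived in several joined tables. Names are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://es.internal:9200")

es.indices.create(
    index="finance_receivable",
    mappings={
        "properties": {
            "order_no":      {"type": "keyword"},   # from the order table
            "business_line": {"type": "keyword"},   # from the upstream system
            "amount":        {"type": "scaled_float", "scaling_factor": 100},
            "biz_time":      {"type": "date"},
            "status":        {"type": "keyword"},
        }
    },
)
```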

The comparison below covers four synchronization methods. Given that each index draws on many tables, that the related service code is not yet fully converged, and that the real-time requirements are high, we decided to synchronize via data subscription; among the implementations available in the company, we chose Kafka (a consumer sketch follows the comparison).

  • Synchronous dual write
  • Advantages: simple and direct, with high real-time performance.
  • Disadvantages
  1. Business coupling: highly invasive to the code, weaving in a large amount of data-synchronization logic; ES must be written wherever the DB is written.
  2. Performance impact: writing to two stores lengthens response times, so system performance inevitably degrades.
  3. Hard to extend: the search side may have bespoke needs such as data aggregation, which are awkward to implement.
  4. High risk: a partial dual-write failure can lose data.
  • Asynchronous dual write
  • Advantages
  1. High performance.
  2. Data loss is unlikely.
  3. Writes from multiple sources are isolated from one another, making it easy to add more data sources.
  • Disadvantages
  1. Hard-coding: each new data source requires new consumer code.
  2. System complexity increases with the introduction of message middleware.
  3. MQ is an asynchronous consumption model, so data a user just wrote may not be immediately visible, causing delay.
  • Scheduled synchronization
  • Advantages
  1. Relatively simple to implement.
  • Disadvantages
  1. Real-time performance is hard to guarantee.
  2. High pressure on storage.
  • Data subscription
  • Advantages
  1. Little business intrusion.
  2. Relatively high real-time performance.
  • Disadvantages
  1. A data-subscription framework must be selected, and system complexity increases.
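A sketch of the data-subscription path: consume row-change events from Kafka and upsert them into ES. The topic name, message layout, and id field below are assumptions, not the actual in-house format:

```python
# A sketch of the subscription consumer: read binlog change events from
# Kafka and upsert them into ES. Topic and message layout are assumed.
import json
from kafka import KafkaConsumer
from elasticsearch import Elasticsearch

es = Elasticsearch("http://es.internal:9200")
consumer = KafkaConsumer(
    "finance-binlog",                      # hypothetical topic
    bootstrap_servers=["kafka.internal:9092"],
    group_id="finance-es-sync",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for msg in consumer:
    event = msg.value
    doc = event["after"]                   # row image after the change
    # Use the DB primary key as the ES _id so replayed events stay idempotent.
    es.index(index="finance_receivable", id=doc["id"], document=doc)
```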

  3. After incremental synchronization was in place, the last step was to backfill the historical data. For this we used ECP, a tool provided by the company; see the article "Incredible! Billions of data can be synchronized to ES so easily!" (a generic equivalent is sketched below).
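ECP is an in-house tool, so its API cannot be shown here; a generic equivalent backfill simply walks each table by primary key and bulk-indexes the batches. All names below are hypothetical:

```python
# A sketch of a historical backfill: page through a table by primary key
# and bulk-index each batch into ES. Table, index, and hosts are assumed.
import pymysql
import pymysql.cursors
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://es.internal:9200")
conn = pymysql.connect(host="tidb.internal", port=4000, user="app",
                       password="...", database="finance_db",
                       cursorclass=pymysql.cursors.DictCursor)

last_id, batch = 0, 5000
while True:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT * FROM notes_receivable WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, batch),
        )
        rows = cur.fetchall()
    if not rows:
        break
    helpers.bulk(es, ({"_index": "finance_receivable", "_id": r["id"],
                       "_source": r} for r in rows))
    last_id = rows[-1]["id"]
conn.close()
```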

5. Summary and results

The business-finance system has now completed the switch of its underlying data store; data volume storage is no longer a worry for the years ahead, and more business data has already been onboarded successfully. With the introduction of Elasticsearch (ES), business users no longer report page timeouts and similar problems. This storage optimization was essentially a refactoring of the system, and the solution was chosen to minimize impact on the system and avoid disrupting business users, which is the core of the optimization.

For historical reasons, many aspects of the business-finance system still need optimization, such as ongoing governance of slow SQL and tuning of scheduled tasks. We therefore need to hold on to the core ideas of this optimization and keep improving in subsequent refactoring to make the business-finance system ever more stable.

About the author

Dai Meiqi, R&D engineer, Zhuanzhuan Transaction Middle Platform

Source-WeChat public account: Zhuan Zhuan Technology

Source: https://mp.weixin.qq.com/s/jyUunH1NyGUMo_0o0OZbWg
