
After five years of refinement, WCDB, WeChat's full-platform terminal database, receives a major upgrade!

Author: Flash Gene

1. Introduction

WCDB is a SQLite-based terminal database open-sourced by the WeChat team. Since its release in June 2017, it has been widely recognized and adopted across the industry, with more than a dozen versions published to date. Throughout, WCDB maintained good backward compatibility, refining the details of the original interfaces while adding new features.

2. Challenges

However, WeChat is one of the most heavily used apps in China and worldwide: its internal databases serve hundreds of different business modules, and a single user's message store can reach millions or even tens of millions of records. This huge volume of data and the increasingly rich application scenarios placed continuous demands on WCDB, and the original code framework gradually struggled to keep up.

3. Major upgrades

For these reasons, starting in 2019 we decided to give up interface backward compatibility and focus on building a new, more powerful WCDB. After several iterations, both the interface layer and the core logic layer have been comprehensively reworked, and many new features have accumulated. We are now open-sourcing this majorly upgraded version of WCDB, whose main changes include:

  • Richer development language support: added C++ support and full ORM support for Java and Kotlin, covering more terminal platforms;
  • Stronger SQL expression capability: Winq has been rewritten and enhanced;
  • More secure data storage: new data backup and repair schemes;
  • More flexible data expansion: data migration, data compression, etc.;
  • More granular performance optimization: FTS5 optimization, interruptible transactions, and more.

Change 1: Richer language support

WCDB 1.0 supported the Objective-C, Swift, and Java development languages, but apart from sharing the same version of SQLite and the same backup/repair logic, the three language versions were developed independently. As WCDB iterated, many new capabilities were developed and verified on the ObjC version, while the Swift and Java versions essentially stopped iterating, so the gap between them kept growing. Ideally, all language versions of WCDB should have the same capabilities, but re-implementing the ObjC version's new logic in Swift and Java would not only be a lot of work, it would also be error-prone and require re-verification in production, which is unrealistic.

Fortunately, the core logic of the ObjC version of WCDB is implemented in C++, with ObjC used only for the interface layer. Many libraries that support multiple development languages take this approach: implement the core logic in C++ and use each target language only for the interface layer, as the popular client-side NoSQL database Realm does. WCDB could be redesigned along the same lines: the ObjC version needs only slight adjustment to move its core logic entirely into C++, with Swift and Java bridged to that C++ core. In addition, to fully support database development in the many different scenarios across WeChat, WCDB has also extended support to C++ and Kotlin, completely covering the mainstream languages of terminal development today.

Code architecture


Figure 1: Interface layer code structure

Under this architecture, the different language versions of WCDB can be integrated into the same project as needed, which saves code and reduces package size. It also avoids conflicts between the interface logic of different languages; interfaces in different languages can even operate on the same database without any risk of logical conflict.

Example of an ORM implementation

In the process of supporting each language, the key problem to solve is designing the ORM (Object-Relational Mapping) mechanism for each language. With ORM, you can read and write the database using objects of the native language. Taking C++ as an example, here is a simple object:

class Sample {
public:
    int id;
    std::string content;
    WCDB_CPP_ORM_DECLARATION(Sample)
};

You can use this C++ object directly in WCDB to read and write to the database, and you can also write expressions in the native language:

// INSERT INTO myTable(id, content) VALUES(1, 'text')
database.insertObjects<Sample>(Sample(1, "text"), myTable);
// SELECT id, content FROM myTable WHERE id > 0
auto objects = database.getAllObjects<Sample>(myTable, WCDB_FIELD(Sample::id) > 0);           

WCDB_FIELD(Sample::id) above can not only represent the id column in the table, and thus be used to compose conditional expressions; it can also access the id member variable of a Sample instance, acting like a getter and setter for that member, so that a C++ object can be serialized and written to the database, or read from the database and deserialized.

Readers may wonder: C++ is a static language that cannot obtain the metadata of a class's member variables, nor their read/write accessors, at runtime, so how does WCDB_FIELD(Sample::id) work? This is precisely the difficulty of C++ ORM design. Early mature C++ ORM solutions used precompilation, hardcoding the metadata into the source through code generation.

Later, as C++ template type deduction matured, some solutions tried to record all of this metadata in the types of variables; whenever the information was needed, template deduction recovered it from the object's type, which is quite clever. The drawback is that variable types become very complex, the solution must ship as a template library that is hard to iterate on, and it causes code bloat. Taking the well-known sqlite_orm as an example, if you use it to create the table corresponding to the Sample class above, the type of the resulting DB object becomes extremely complex, and the template-bloat problem is plain to see:


Figure 2: sqlite_orm example

Implementing C++ ORM with class member pointers

C++ cannot obtain class metadata at runtime, but it is easy to obtain at compile time, through class member pointers, which have existed since before C++98. A class member pointer does not point to a specific memory location; it points to a specific member of a class, and its value is related to that member's position in the class's memory layout. A class member pointer can be used to read and write a member variable of an object, and its type encodes both the member's data type and the class it belongs to, as the following example shows:

// memberPointer points to the id member; its type encodes both Sample and int
int Sample::* memberPointer = &Sample::id;
Sample obj;
// write a member variable through the class member pointer
obj.*memberPointer = 1;
// read a member variable through the class member pointer
int id = obj.*memberPointer;

A class member pointer therefore carries all the information an ORM needs, and we can build the ORM on top of it. Because class member pointers come in many different types, any function that receives one must be a template, and each combination of member types would produce a distinct template instantiation. To avoid this code bloat, we first erase the type of the class member pointer:

// Erase the type of a class member pointer by reinterpreting its bytes
// as a void*; T is the member's type, O is the class that contains it.
template<typename T, typename O>
void* castMemberPointer(T O::*memberPointer) {
    union {
        T O::*memberp;
        void* voidp;
    };
    memberp = memberPointer;
    return voidp;
}

The value of such a type-erased member pointer, while not globally unique, is unique within a given class, so it can serve as a key that associates a member variable's metadata with its column name in the table:


Figure 3: Class member pointer as key

With this mapping in place, a class member pointer can be resolved to a column name, and can therefore stand for a column of the data table.
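To make this concrete, here is a minimal sketch of such a registry, reusing castMemberPointer from above (ColumnBinding, registerColumn, and columnName are illustrative names, not WCDB's internals):

#include <string>
#include <unordered_map>

struct ColumnBinding {
    std::string columnName;  // name of the column in the data table
    // ... column type, accessors, constraints, etc.
};

// keyed by the type-erased member pointer; unique within one class
static std::unordered_map<void*, ColumnBinding> g_bindings;

template<typename T, typename O>
void registerColumn(T O::*memberPointer, const std::string& name) {
    g_bindings[castMemberPointer(memberPointer)] = ColumnBinding{name};
}

template<typename T, typename O>
const std::string& columnName(T O::*memberPointer) {
    return g_bindings.at(castMemberPointer(memberPointer)).columnName;
}

// usage: registerColumn(&Sample::id, "id"); columnName(&Sample::id) yields "id"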

The next step is obtaining the data type of the member variable. Types are compile-time information only; they do not exist at runtime, so they must be converted to values before they can be used at runtime. Converting an arbitrary type to a value is impossible, since C++ has infinitely many data types. But a database stores only five kinds of data: integer, floating-point, text, binary, and null, so we only need to map types onto these five values. Since the class member pointer's type already carries the member's data type, we can extract it and use the C++ template SFINAE mechanism to map every type that can be written to the database onto one of these values, completing the type-to-value conversion:


Figure 4: Data Type Conversion
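A minimal sketch of this SFINAE mapping, assuming an illustrative five-value enum (the names are not WCDB's):

#include <cstdint>
#include <string>
#include <type_traits>
#include <vector>

enum class ColumnType { Null, Integer, Float, Text, BLOB };

// primary template left undefined: unsupported types fail to compile
template<typename T, typename Enable = void>
struct ColumnInfo;

// every integral type maps to Integer via SFINAE
template<typename T>
struct ColumnInfo<T, std::enable_if_t<std::is_integral<T>::value>> {
    static constexpr ColumnType type = ColumnType::Integer;
};

// every floating-point type maps to Float
template<typename T>
struct ColumnInfo<T, std::enable_if_t<std::is_floating_point<T>::value>> {
    static constexpr ColumnType type = ColumnType::Float;
};

template<>
struct ColumnInfo<std::string> {
    static constexpr ColumnType type = ColumnType::Text;
};

template<>
struct ColumnInfo<std::vector<uint8_t>> {
    static constexpr ColumnType type = ColumnType::BLOB;
};

// compile-time type becomes a runtime value:
static_assert(ColumnInfo<int>::type == ColumnType::Integer, "int is stored as Integer");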

Finally, we need to generate the member variable's read and write methods. Since a class member pointer can read and write the member directly, a straightforward idea is to construct a getter and a setter from the member pointer and save function pointers to them. However, the types of these two function pointers still contain the member's data type; if that type is kept, templates would creep back into every part of the storage layer, so the type must be erased and only an untyped pointer saved. When reading or writing, directly calling an untyped function pointer would jump to the correct code address, but the compiler would not know the parameter and return types, so arguments would be passed incorrectly; we therefore need a way to restore the function pointer's parameter types at the call site.


Figure 5: Read/write pointer type conversion

Restoring the type of a read/write function pointer requires both the object's type and the member variable's type. The object itself must be present when reading or writing, so its type can be passed down via a template parameter from the upper-level calling logic. The member variable's type, however, is only described by the stored enumerated value, which identifies the category of data but cannot recover the exact original type. Our approach is to designate a standard type for each category; for example, the standard type for integers is long long and the standard type for floating-point is double, each able to hold every value of every type in its category without losing precision. The standard type is then used as the parameter and return type of the getter and setter, and inside the getter and setter we convert between the standard type and the concrete type.


Figure 6: Standard type conversion
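To make the idea concrete, a minimal sketch of the type-erased accessors built on standard types (illustrative, integer category only; it reuses the Sample class from earlier, and the names are not WCDB's internals):

// The accessors are generated with the member's concrete type T, but their
// signatures use only the standard type (long long) and the object type O;
// T is absorbed inside the function body.
template<typename T, typename O, T O::*memberPointer>
long long standardGetter(const O& obj) {
    return static_cast<long long>(obj.*memberPointer);   // widen to the standard type
}

template<typename T, typename O, T O::*memberPointer>
void standardSetter(O& obj, long long value) {
    obj.*memberPointer = static_cast<T>(value);          // narrow back to the real type
}

// the binding stores them as untyped function pointers...
using ErasedFn = void (*)();
ErasedFn erasedGetter =
    reinterpret_cast<ErasedFn>(&standardGetter<int, Sample, &Sample::id>);

// ...and the caller, which knows O from its own template parameter and the
// category from the stored enum, restores the signature before calling:
template<typename O>
long long readIntegerMember(const O& obj, ErasedFn erased) {
    auto getter = reinterpret_cast<long long (*)(const O&)>(erased);
    return getter(obj);   // calls through the restored, correct signature
}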

Change 2: Stronger SQL expression capabilities

The CRUD operations above are convenience interfaces that cover most database usage, but a small number of complex operations still require writing SQL; in fact, some of the conditional expressions shown earlier are already fragments of SQL. WCDB 1.0 provided Winq (WCDB Integrated Query) to make assembling SQL statements easier. Winq 1.0 used C++ to abstract and implement SQLite's SQL syntax rules, letting developers say goodbye to the glue code of string concatenation. Combined with the ORM at the interface layer, even complex queries can be completed in a single line of code, and with the IDE's code completion and compile-time checks, development efficiency improves greatly. For example, a query on the sqlite_sequence table can be written with Winq as follows:

WCDB::StatementSelect().select({WCDB::ColumnResult(WCTSequence.seq)})
            .from("sqlite_sequence")
            .where(WCTSequence.seq > 1000)
            .orderBy({WCTSequence.seq.order(WCTOrderedAscending)})
            .limit(10)
            .offset(100)           

As you can see, Winq abstracts the tokens of an SQL statement into C++ classes and the ways tokens can connect into class interfaces, so that a Winq chain of calls reads close to the actual SQL statement, with good readability. However, as its use spread within WeChat, this version of Winq showed several clear problems:

  1. Each interface call immediately appends its content to the SQL string. The interfaces must therefore be called strictly in SQL's syntactic order, and content cannot be modified after a call. This contradicts the intuition of chained calls and makes mistakes easy.
  2. A Winq statement does not independently store the configured state of each token; it stores only an SQL string. When the internal logic receives a Winq statement from business code, it faces nothing but that string, making it hard to analyze or modify the statement, which limits WCDB's extensibility.
  3. Java, Kotlin, Swift, and other languages that cannot call C++ directly also need Winq.
  4. It cannot express every SQL statement; some rarely used complex statements can only be written as raw SQL strings.
  5. Some token-receiving interfaces are not concise enough; for example, to pass an ORM Property to .select(), you must first construct a WCDB::ColumnResult and then explicitly convert it to an array.

Storing the state of every token in the SQL statement

To address these issues, we completely rewrote Winq. The new Winq is split into an interface layer and a core layer, whose objects correspond one-to-one. The core layer is the foundation: it stores the state of every token in the SQL statement, converts the current tokens into the corresponding SQL string, and verifies that the configured tokens conform to SQL's syntax rules, preventing invalid SQL from being produced. Each interface-layer object holds its corresponding core-layer object, provides a highly readable editing interface for it, and manages a cache of the SQL string generated from the core-layer object, avoiding repeated string concatenation when the SQL string is fetched multiple times. As shown below:


Figure 7: Winq 2.0

Bridging Winq to other languages

This design already covers SQL assembly for C++ and ObjC, but Java, Kotlin, and Swift need Winq too. Could each of them re-implement it the way C++ does? No. One reason is the heavy workload and the tedium of keeping the implementations aligned. The more important reason is that if every language had its own implementation, the in-memory structure of an assembled Winq statement could not be handed to WCDB's core logic layer; only a string could be passed, which would greatly reduce Winq's usefulness. A key goal of this rewrite is to store each token's configured state independently, making it easy to analyze and modify Winq statements; the following chapters will show what these capabilities enable.

So that WCDB's core logic layer faces a single, unified in-memory representation of Winq statements, namely the unified core-layer objects, we bridge each token object and its interfaces in Winq to Java, Kotlin, and Swift (Kotlin in fact calls the Java implementation directly). Assembling a Winq statement in any language thus operates on the same Winq core-layer objects, producing the same memory structure for the core logic layer. Taking the StatementSelect object as an example, the overall structure is as follows:


Figure 8: StatementSelect bridging example

With the new Winq, the query above can be written as follows; the parameters are simplified, and the correct SQL is produced even when the chained calls are made out of SQL's syntactic order, which is easier and more intuitive to use:

WCDB::StatementSelect().select(WCTSequence.seq)
        .from("sqlite_sequence")
        .offset(100)
        .limit(10)
        .order(WCTSequence.seq)
        .where(WCTSequence.seq > 1000)           

In particular, SQL statements are often assembled differently depending on runtime conditions. In such scenarios string concatenation becomes painful, because you must handle how the fragments join; with Winq there is no such trouble. The upper layer simply calls interfaces, and the lower layer automatically handles how the SQL connects, which is far more flexible. Below is an example from a Java business scenario:


Figure 9: Winq conditional assembly example
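Since Figure 9 is an image, here is a rough C++ equivalent of such conditional assembly, using only the chained calls shown earlier (the two flags are hypothetical; a sketch, not code from the article):

// Tokens are stored independently, so the statement can be built up
// incrementally and in any order; the correct SQL is produced at execution.
WCDB::StatementSelect select;
select.select(WCTSequence.seq).from("sqlite_sequence");
if (filterLargeSeq) {               // hypothetical business flag
    select.where(WCTSequence.seq > 1000);
}
if (paged) {                        // hypothetical business flag
    select.limit(10).offset(100);
}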

Winq supports the full range of SQL syntax

The new Winq fully encapsulates, in every language, all 26 SQL statements supported by SQLite and all 23 token types those statements involve, so developers can assemble arbitrary SQL with native syntax in the five languages WCDB supports, saying goodbye to the lack of input hints and the easy mistakes of concatenated SQL strings. Moreover, every string parameter in Winq is properly quoted, which completely avoids SQL injection and improves security.


Figure 10: All Winq files

Here's an example of using the new version of Winq in Java to assemble complex SQL:


Figure 11: Winq complex SQL example

As the example shows, even for a complex SQL statement, the order in which the Winq statement is written stays essentially the same as the SQL itself; anyone who knows SQL can read Winq with no learning curve, and it introduces little code bloat.

Change 3: More secure data storage capabilities

The first two sections gave an overall picture of how WCDB is used; their design goal is to make storing data easier. Storing data more safely is the more important goal of database design; it has been a problem we keep thinking about, and it was the original motivation for extending and strengthening SQLite. Chat history, the most important digital asset a user produces on WeChat, is stored only on the user's own device. If the database is corrupted, that history is lost forever, which is unacceptable to the vast majority of users. To improve data security, the new WCDB introduces the following two designs.

1. New data backup and repair scheme

In WCDB 1.0 we introduced a database backup and repair scheme, described in detail in an earlier article; its overall logic is as follows:

An SQLite database file is organized page by page into a two-level hierarchy of BTrees, with SQLite's master table at the top and each user-defined table below it; leaf pages are where the real data lives. When corruption hits an intermediate node, all data in the branches beneath it becomes unreachable and is lost. By backing up the mapping from each table's name to the page number of its root node, we solve the most severe case, corruption of the master table; when a lower-level table is corrupted, only that single table is lost.

WCDB 1.0's backup and repair scheme solved the then-urgent problem of losing all data after corruption, with an average repair rate of 70~80%. However, corruption usually stems from disk damage, which typically destroys a large contiguous region, so the repaired result was often still a mess. The new WCDB therefore goes a step further: in addition to the master table, the backup now records, for each ordinary table, the mapping from its name to the page numbers and CRC checksums of its leaf pages. During repair, a table's data can be located directly by page number, and an unchanged CRC confirms the data is neither damaged nor altered, so all undamaged data can be fully restored into a new database.


Figure 12: Master table vs. user table
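For illustration, the backup image described above might be modeled like this (an illustrative layout, not WCDB's on-disk format):

#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct LeafPage {
    uint32_t pageNo;   // page number of one leaf page of the table's BTree
    uint32_t crc;      // checksum used to verify the page during repair
};

struct TableBackup {
    uint32_t rootPageNo;           // root of the table's BTree (the 1.0 scheme)
    std::vector<LeafPage> leaves;  // new in this scheme: leaf pages + CRCs
};

// the backed-up master-table mapping: table name -> backup info
using BackupImage = std::map<std::string, TableBackup>;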

There are two challenges to this solution: performance and timeliness.

Performance issues

The old scheme needed only the root page numbers of the lower-level BTrees, which required traversing just the master table; that table is very small, often entirely in memory, so it was very fast. Getting the leaf pages of every table, however, requires traversing the entire database, which is very time-consuming. Apart from IO, most of the time goes to memory allocation; this kind of problem almost immediately suggests mmap as the solution. But three further problems follow:

  1. Segmented, on-demand mmap (sketched below). A single database file may be large, so mapping it in one go can fail because the virtual address space is insufficient. Instead, each mapping covers roughly 1MB around the required page. A small virtual memory block is far less likely to fail for lack of address space, and 1MB is generally much larger than the device's default memory page, which reduces the number of map calls.
  2. LRU cache + reference counting. Although mapped virtual memory can be swapped out under memory pressure, it still occupies virtual address space. 64-bit machines have a large virtual address space, but a single process's free portion is limited; if WCDB consumes too much, memory allocation elsewhere fails and the app crashes. So we add an LRU layer to cap the address space used. Each mapping is reference counted and released by its last holder, guaranteeing the mapped pointer stays valid; a request that hits an existing mapping needs no new mmap call.
  3. Backup inside a read transaction. The backup runs inside an SQLite read transaction. During it, the database cannot be checkpointed and writes are only appended to the WAL; in other words, the main database file itself does not change. Normally, to guard against the file being modified underneath, data read from it is copied into newly allocated memory before use; but since the data here is guaranteed unmodified, with the LRU cache and reference counting we can parse it directly out of the mmapped virtual memory, with no allocation and no copy.

Figure 13: LRU cache for mmapped memory

This scheme's performance really rests on two points: reading data through mmapped virtual memory, and avoiding memory allocation. The three measures above exist to make both workable. And because backup is a read operation, it can run on a background thread without affecting other reads and writes.
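A minimal sketch of the segmented mmap from point 1 above (POSIX mmap; the LRU cache, reference counting, and error handling are elided, and the names are illustrative):

#include <sys/mman.h>
#include <sys/types.h>
#include <cstddef>
#include <cstdint>

constexpr size_t kSlotSize = 1024 * 1024;   // 1MB per mapping

// map the 1MB-aligned window that contains byte `offset` of the db file,
// returning a pointer to the requested byte inside the window
const uint8_t* mapWindow(int fd, size_t offset) {
    size_t slotStart = offset / kSlotSize * kSlotSize;   // align down to 1MB
    void* mem = mmap(nullptr, kSlotSize, PROT_READ, MAP_SHARED,
                     fd, static_cast<off_t>(slotStart));
    if (mem == MAP_FAILED) {
        return nullptr;   // fall back to pread + heap allocation
    }
    return static_cast<const uint8_t*>(mem) + (offset - slotStart);
}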

After this backup scheme shipped, backing up a 500MB database on an iPhone took about 3 seconds on average, acceptable in most scenarios. But a small number of users have databases over 1GB, and their backups could still noticeably heat up the device; in high-load scenarios such as live streaming or video calls, a backup happening to run would visibly degrade the WeChat experience.

This version of the backup logic is a full backup: it still has to read the entire database once and compute CRC checksums for all leaf pages, so both the IO and the computation remain fairly high. To solve these two problems, we optimized the backup logic further and introduced incremental backup.

WCDB databases run in WAL mode with a self-designed asynchronous checkpoint scheme: every write appends the new content to the end of the WAL file, and an asynchronous checkpoint later writes the WAL content back into the main DB. During checkpoint, each updated page is read into memory anyway, so the page numbers of those pages and the CRC checksums of the leaf pages can be collected at very low cost. This information is first stored in an incremental backup file; when a backup is triggered, the accumulated updates are merged into the main backup file, yielding an incremental backup of the leaf pages. The overall flow is as follows:


Figure 14: The overall incremental backup process

At this point, however, we only know which page numbers were updated, not which tables those pages belong to. An SQLite file records only parent-to-child relationships, not child-to-parent, so we must traverse the root and interior pages of the tables to learn which table each updated page belongs to. This traversal never touches leaf pages, and by our statistics root and interior pages make up only 1%-2% of all pages, so the IO and computation are drastically lower than before. We also traverse first the tables that were updated at the time of the previous backup, and stop as soon as all updated pages have been accounted for.

After this scheme shipped, backing up a 500MB database takes only 63ms on average, and even a database as large as 10GB backs up in under 0.9 seconds, fully meeting the backup needs of any scenario.

Timeliness issues

The root page numbers backed up by the old scheme change only when tables are created or dropped, whereas the leaf page numbers of the new scheme may change on any write. Moreover, with incremental backup, new database content passes through two asynchronous steps, checkpoint and backup, flowing into the incremental backup file and then into the main backup file, and it flows as increments, which is error-prone. Either step can be cut short by the user killing WeChat or the device losing power, and there is no atomicity guarantee. A backup that is stale, or whose versions cannot be stitched together, can cause data confusion or a low repair rate. This is where the concept of a database savepoint comes in.

A savepoint is represented by the combination of the WAL file's salt values and its nBackfill value. The salt is stored in the WAL file header and is modified each time the WAL is written back into the main DB file, invalidating the WAL's old content, while nBackfill marks how far the WAL content has been written back into the main DB file. After every checkpoint this pair is updated, and it happens to be monotonic, so it can be stored in both the incremental and the main backup files as the version number of the database content.
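For illustration, such a savepoint can be modeled as follows (field names are illustrative; this is not SQLite's actual struct layout):

#include <cstdint>

struct BackupSavepoint {
    uint32_t salt1, salt2;   // the two salt values from the WAL header
    uint32_t nBackfill;      // how far the WAL has been written back to the main DB
    bool operator==(const BackupSavepoint& o) const {
        return salt1 == o.salt1 && salt2 == o.salt2 && nBackfill == o.nBackfill;
    }
};

// If the savepoint in the incremental backup does not match the one in the
// main backup, the increment cannot be applied; fall back to a full backup.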


Figure 15: Savepoint

Thus every update to a backup file first checks that the savepoint values match; a mismatch means some information was lost during incremental backup, and a full backup is performed to reset all savepoints. In practice the probability of version mismatch is only about 1 in 1,000, so the occasional full backup has little impact on overall performance.

Performance data for the new scheme

Since the new scheme shipped in WeChat, the repair rate after database corruption has risen above 99%, and the backup content is only about 1/1000th of the database's size. The scheme dramatically reduces the data loss caused by disk corruption.

2. Prevent external logic from writing bad databases

Protecting data through backup and repair is relatively passive, and the repair rate can never reach 100%, so it is still not safe enough. Backup and repair can cope with the low-probability case of disk corruption, but not with large-scale corruption written into the database by buggy external logic. We need a more proactive way to keep the database from being corrupted in the first place.

External logic can corrupt the database in two ways. One is misusing the database's path, for example deleting or overwriting the database file by mistake; this is hard to guard against without hooking the system calls, which cannot be built into WCDB. The other is misusing the database's file handle to write to it, and this one WCDB can defend against.

To prevent corruption through a misused file handle, a simple idea is to open the database file read-only, so external logic cannot use the handle to change the database. For most database components this is complex to implement: when opening a handle you must be able to determine whether the coming operations will modify the database, and when a modifying operation arrives after a read-only open, the database file handle must be reopened.

Under WCDB's configuration, even when business logic writes to the database, only the WAL file needs to be modified; the main DB file changes only at checkpoint. WCDB can therefore open the main DB file read-only for business reads and writes, and open it writable only for checkpointing. This minimizes the lifetime of writable handles to the main DB file and prevents external logic from writing through them by mistake.


Figure 16: Read-only open DB file
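A minimal sketch of the idea against the POSIX API (illustrative; error handling elided, and not WCDB's actual code):

#include <fcntl.h>
#include <unistd.h>

// normal reads and writes only touch the WAL, so the main DB handle
// can stay read-only and external logic cannot corrupt it through us
int openMainDbForQuery(const char* path) {
    return open(path, O_RDONLY);
}

void checkpoint(const char* path) {
    int fd = open(path, O_RDWR);   // the only writable handle, short-lived
    if (fd < 0) return;
    // ... write WAL frames back into the main DB file ...
    close(fd);                     // minimize the writable handle's lifetime
}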

Change 4: More flexible data expansion capabilities

As user data accumulates and features grow more complex, early database and table designs become harder and harder to live with; WeChat has run into many such problems during its iteration:

  • In early feature development, for convenience, many unrelated tables were placed in the same database. Since SQLite does not support parallel writes, this limits parallel updates across tables, easily creating performance bottlenecks once data accumulates and call frequency rises; and since a database under frequent reads and writes is more likely to be corrupted, putting everything in one database greatly increases the risk of corruption and data loss.
  • Many business tables have one or two fields storing serialized data such as XML, JSON, or PB. As the business evolves, these accumulate more and more attributes, the data grows longer and the database grows larger; some XML in our business scenarios exceeds 10KB. This not only causes space problems but also hurts read and write performance, and the problem is usually discovered only after a considerable amount of data has accumulated, by which point the existing data is hard to deal with.
  • As functionality expands, besides adding content to XML or JSON fields, new columns also need to be added to existing tables.

For these three scenarios, WCDB provides industry-first solutions: data migration, data compression, and automatic addition of new columns.

1. Data migration capability

In the early days, iOS WeChat performed two data migrations at the business-logic level: one for message operation instructions, where the data volume was small, and one for contacts, where the volume was large and a non-blocking migration was required. During a non-blocking migration the data can be in three states: not yet migrated, with only the old table; migration complete, with only the new table; and mid-migration, with both tables present. The code for this is not hard, but it is long and tightly coupled with the business, making it hard to develop and maintain; worse, it is hard to find a suitable time to delete the compatibility code, which may have to live forever and drag on later iterations.

To solve this problem, WCDB moves the compatibility handling into the framework itself, letting developers write code on the assumption that the migration has already completed. And because it is framework-level code, it is naturally write once, run everywhere; development needs no time spent on gradual rollout.

WCDB's migration scheme works as follows. When a database operation request arrives, the database handle it uses is first prepared for migration; for a cross-database migration, the other database is attached to the current handle so that SQL can span both. WCDB then checks whether the old table exists: if not, the migration has finished and the SQL executes directly; if it does, a temp view is created for later compatibility. Before actually executing the request, WCDB preprocesses it, in hook-like fashion: it intercepts the SQL the developer wants to execute and modifies it, sustaining the illusion that the migration has already completed. The preprocessing mainly covers insert, delete, update, and select operations.


Figure 17: The migration process

SQL statement preprocessing methods

Next, the preprocessing of the common CRUD statements, starting with SELECT. Assume the old and new tables are oldTable and newTable. Since the developer assumes the migration has completed, queries are issued only against the new table. WCDB preprocesses the statement, replacing the new table's name with a unionView. This unionView is created when migration is configured, and its content is the merged result of the two tables. The developer queries only the new table, yet WCDB returns the merged result of old and new:


Figure 18: Preprocessing SELECT

For UPDATE and DELETE, WCDB likewise makes operations on the new table take effect on both tables, with a slight twist: since SQLite can update or delete only one table per statement, WCDB first updates the new table, then rewrites the table name in the SQL to the old table and updates again, using a transaction to keep the pair atomic.


Figure 19: Preprocessing UPDATE

For the INSERT operation, the situation is more complicated; the following cases must be considered:

  1. SQLite has a hidden rowid column. Developers do not insert it manually, but it auto-increments, and if rowids diverge between the old and new tables, behavior may differ from what it would be after the migration completed.
  2. Constraints: tables may carry constraints such as UNIQUE or PRIMARY KEY, so an insert may succeed in the new table even though a row with the same primary key already exists in the old table.
  3. Redundancy: when a row is inserted into the new table, the same row may already exist in the old table. If the old row is not deleted, redundancy appears and causes new problems, such as wrong affected-row counts for updates and deletes, or duplicated SELECT results.

To solve these problems, the data is first inserted into the old table; this settles the constraint problem, since a conflict with old data makes the insert fail and abort. The rowid generated by the old table is then saved, and the row is inserted into the new table together with that rowid; because the rowid came from the old table, it keeps incrementing the way the old table did. Next, the rowid is used to delete the row just inserted into the old table, which removes the redundancy. Finally the transaction commits. Performance-wise, since all of this happens inside one savepoint, the insert and delete on the old table cancel out at commit, and only the insert into the new table reaches the file; exactly one insert, as originally expected, so the impact on performance is negligible.


Figure 20: Preprocessing INSERT

Preprocessing with the new version of Winq

Preprocessing thus rewrites the developer's SQL into SQL compatible with both tables, sustaining the illusion that the migration has completed. But this is easier said than done: the few SQL statements involved are not hard, the hard part is modifying SQL at all. SQL has real syntactic complexity; in the scenarios above we must first classify a statement as SELECT, INSERT, UPDATE, or DELETE, and any moderately complex SQL cannot be identified with string matching or regexes. The general approach would be to develop an SQL parser that parses statements into a modifiable structure and then regenerates the modified SQL with logic like SQLite's own, but that is not only highly complex, its performance is also hard to guarantee.

All SQL executed through WCDB is expressed in Winq, and the new Winq stores the structured form of the full SQL syntax, so we can easily analyze a Winq statement and precisely modify any part of it, achieving the effect of modifying the SQL to be executed. This is why this migration scheme, imperceptible to both developers and users, is currently offered only by WCDB; only WCDB can pull it off.


Figure 21: Winq syntax analysis

Performing the actual data migration

With preprocessing in place, developers simply code against the new table; all that remains is performing the actual data migration. Developers who do not care about the migration's cadence can use WCDB's automatic migration, which migrates for 10 milliseconds every 2 seconds until the migration completes. To migrate faster, WCDB also provides an interface to drive the migration manually.

Read and write performance during data migration


Figure 22: Migrating database performance

Performance is essentially unaffected: with or without a migration in progress, there is no significant difference. For inserts, the insert and delete on the old table cancel out inside the same savepoint, leaving a single insert at commit. For update/delete/select, the scheme's no-redundancy design keeps the amount of data operated on consistent before, during, and after migration, so there is no performance loss. And once the migration completes, the database returns to the original non-migration logic, with genuinely identical behavior, so there is no legacy code to delete.

More generalized migration capabilities

In the description above, the new table is configured identically to the old one, and both have rowids; in fact, WCDB's migration extends to the following more general capabilities:

  • The new table's configuration may differ from the old one's. Only the new table's fields must be a subset of the old table's; the two need not share the same constraints or indexes (though the new table's constraints must not conflict with the old data), need not both have rowids, and need not have primary keys.
  • The old and new databases may have different encryption configurations, so an unencrypted database can be migrated into an encrypted one, or an encrypted one re-encrypted. Without migration, changing a database's encryption requires rewriting the whole database, a very heavy operation.
  • A migrated table can be configured with an SQL expression that filters the migrated data, which can be used to split one table's data across several tables or to clean up redundant data.

Space does not permit covering these capabilities in detail; interested developers can explore the possibilities themselves.

2. Data compression capability

To deal with long serialized data such as XML, JSON, and PB in the database, the direct approach is to compress the data before writing it. Ordinarily a developer must choose a suitable compression algorithm, add compression and decompression logic at every read and write path, mark which data is compressed, find a way to process the existing data, and, for peak performance, cache the various memory states of the compressor and decompressor. That is a lot of work. WCDB's data compression does all of it for the developer in one step, with a simple, non-intrusive configuration.

For the algorithm, a lossless one is obviously required. Early lossless compression fell into two camps: Huffman coding and arithmetic coding. Huffman coding, familiar to everyone, compresses by assigning shorter codewords to more probable characters. It encodes fast, but its compression ratio reaches the Shannon limit only when every character's probability is a negative integer power of 2, and falls short otherwise, so the ratio is low.

Arithmetic coding instead converts the whole string into a decimal between 0 and 1 based on the string's probability of occurring. Because that decimal represents the string's probability exactly, arithmetic coding's compression ratio approaches the Shannon limit; but the encoding and decoding involve many multiplications and divisions, so its performance trails Huffman coding.

ANS+FSE, a newer algorithm published in 2014, encodes the whole string as an integer greater than 1 that is likewise tied precisely to the string's probability, so its compression ratio can also approach the Shannon limit. At the same time, its computation involves only addition, shifts, and masks, giving performance close to Huffman coding; it is currently considered the best trade-off between compression ratio and performance. Its best-known implementation is zstd.

However, zstd's normal compression mode only removes redundancy within a single XML or JSON payload. Different XML or JSON payloads share similar tags, and storing those tags over and over is itself a lot of redundancy. zstd's dictionary compression mode eliminates exactly this similarity between different pieces of data, significantly improving both compression ratio and performance, and is currently considered the optimal way to compress serialized data; it also seems unlikely that an algorithm with dramatically better ratios will appear soon. WCDB therefore compresses data mainly with zstd dictionary compression.
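As an illustration of dictionary compression, a minimal sketch against the zstd C API (sample gathering, error handling, and dictionary persistence are elided; this is not WCDB's internal code):

#include <zdict.h>   // ZDICT_trainFromBuffer
#include <zstd.h>    // dictionary compression API
#include <vector>

// train a shared dictionary from concatenated sample payloads
// (samples holds all samples back to back; sampleSizes gives each length)
std::vector<char> trainDictionary(const std::vector<char>& samples,
                                  const std::vector<size_t>& sampleSizes) {
    std::vector<char> dict(100 * 1024);   // 100KB dictionary budget
    size_t written = ZDICT_trainFromBuffer(dict.data(), dict.size(),
                                           samples.data(), sampleSizes.data(),
                                           (unsigned)sampleSizes.size());
    if (ZDICT_isError(written)) {
        return {};
    }
    dict.resize(written);
    return dict;
}

// compress one field value with the shared dictionary; the cctx and cdict
// are cached and reused across rows to avoid repeated setup cost
size_t compressField(ZSTD_CCtx* cctx, const ZSTD_CDict* cdict,
                     const void* src, size_t srcSize,
                     void* dst, size_t dstCapacity) {
    return ZSTD_compress_usingCDict(cctx, dst, dstCapacity, src, srcSize, cdict);
}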

Now that we've decided on the compression algorithm, let's take a look at the overall framework of data compression:


Figure 23: The overall process of data compression

When external logic writes new data, WCDB compresses it internally before writing it to the file; when data is read, WCDB decompresses it before handing it out. Meanwhile, WCDB processes existing data on a background thread: reading uncompressed rows, compressing them, and writing them back. Externally, you only configure which field of which table should be compressed; all CRUD can then proceed as if the data were uncompressed, with no attention to the details or internal state of compression. The whole process is imperceptible and non-intrusive, which makes it easy to roll compression out across business scenarios and platforms.

To hide compression and decompression from external CRUD, the SQL to be executed must be transformed inside WCDB. First, for every field configured for compression, a companion field is added to the table to store the compression state: whether the value is compressed, and with which algorithm. Then SQL is preprocessed so that every read or write of a compressed field becomes a combined read or write of the field and its state field, at which point the compression and decompression logic can be inserted.


Figure 24: CRUD-compatible approach to data compression

The preprocessing principle is similar to the data-migration chapter above, but differs for the INSERT, UPDATE, SELECT, and DELETE statements, introduced one by one below.

SQL statement preprocessing methods

First the INSERT statement. If the statement is simple, the uncompressed value it writes is easy to intercept inside WCDB; the value is compressed and then written together with the compression state value.


Figure 25: Compression preprocessing INSERT (1)

Here WCDB_CT_content is the compression state field for the content field, formed by adding a prefix. In practice most insert statements take this simple form and can be handled this way, the only performance cost being the compression itself. Occasionally there are more complex inserts, such as an INSERT with an on-conflict update clause, or an INSERT whose values come from a SELECT. In these cases the value about to be written is hard to determine, so it cannot be compressed directly; handling each case individually would be too complex, so WCDB treats all complex INSERT statements uniformly:


Figure 26: Compression preprocessing INSERT (2)

The INSERT is first executed as-is, which yields the rowid of the newly inserted row; the new uncompressed content is then read back by that rowid, compressed, and updated into the table. Although more statements execute this way, they all run inside one transaction and the data is still in memory, so the IO volume does not grow and the performance impact is not obvious.

UPDATE statements are similar. If the statement's structure is simple, the uncompressed value being written can be obtained just as with INSERT, compressed, and then updated together with the compression state value.


Figure 27: Compression preprocessing UPDATE (1)

Most UPDATE statements can be handled this way, but some complications need separate treatment; for example, when the updated value in the following UPDATE statement comes from a SELECT, the concrete value to be written cannot be obtained easily:


Figure 28: Compression preprocessing UPDATE (2)

WCDB first reads the rowids of the rows matching the UPDATE's condition, then updates row by row through those rowids, reading back each updated value, compressing it, and writing it in. This looks like many more statements, but they all run in one transaction and each updated row is still in memory, so the IO volume does not grow and the performance impact is limited.

SELECT and DELETE statements are much simpler. It suffices to define a decompression function that takes the compressed field and its state field, such as the decompress function in the example below, and to replace every use of the compressed field in the SELECT/DELETE with a call to that function, so the data is decompressed before use:


Figure 29: Compression preprocessing SELECT

Compress the stock data

With CRUD covered, WCDB only needs to compress the existing data, gradually. The general process is: read a batch of rows needing compression in full, compress them and cache them in memory, delete those rows, then re-insert the compressed rows one by one. Deleting and re-inserting, rather than updating in place, prompts SQLite to re-lay-out the rows' storage compactly; updating compressed values back in place would leave the rows loosely spaced and fail to reclaim the saved space.

This does require that each batch be processed wholly within a single transaction, with no mid-way commit, or data would be lost. To reformat storage effectively, each batch covers 100 rows, so a transaction may run on the order of 100 milliseconds, which could block a UI thread's writes and cause visible stutter. Fortunately, WCDB has a complete SQLite lock-monitoring mechanism that can detect whether external logic is blocked by the current thread's operations; the batch checks after each small step, and if external logic is waiting, the transaction is rolled back and retried later. This avoids any performance impact on the external logic.


Figure 30: Asynchronous compression process

Performance

WCDB's data compression is now applied to Official Account messages in WeChat, where the XML fields compress by as much as 89%, a very high rate.

As for read/write performance: for sequential reads and writes, performance drops about 3%~4%, mainly the cost of compressing and decompressing; being sequential, the IO saved brings little gain. For random writes the gain is slight, because in WAL mode writes are appended to the end of the WAL file and are never truly random, so the improvement is not obvious. Random reads, however, are genuinely random IO, so the reduction in IO pays off clearly: random read performance improves by more than 30%, which is considerable.

So data compression is a net positive even for read/write performance, and on top of that it greatly reduces the data footprint; it is worth promoting across business scenarios, being simple to configure, non-intrusive, and beneficial on every axis.

Extended compression capabilities

To cope with the diverse demand scenarios and complex data environment inside WeChat, we have also extended data compression with a number of capabilities:

  • Support for multiple compression modes: the default Zstd compression, single-dictionary compression, and multi-dictionary compression. Multi-dictionary compression can select different compression dictionaries according to the value of some field in the table, which is mainly used when heterogeneous XML/JSON/PB contents are stored in the same field (a sketch of this idea follows the list).
  • Support for compressing multiple fields: a table that is already being compressed can have newly compressed fields added at any time, meeting scalability needs.
  • Data compression and data migration can run independently and at the same time: developers can configure compression for a table that is being migrated, so that data is compressed during migration before being written to the new table; compression and migration start and end independently, without interfering with each other.
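
As a sketch of the multi-dictionary idea, the snippet below picks a Zstd dictionary by the value of a type field and falls back to plain Zstd when no dictionary matches. The mapping and the type values are assumptions; only the Zstd calls are real API:

```cpp
// Sketch of multi-dictionary compression: choose a Zstd dictionary by the value
// of a "type" field, e.g. 1 -> XML dictionary, 2 -> JSON dictionary, and so on.
#include <zstd.h>
#include <map>
#include <string>
#include <vector>

std::map<int, ZSTD_CDict *> dictsByType;  // filled from pre-trained dictionaries

size_t compressWithTypedDict(ZSTD_CCtx *cctx, int type, const std::string &src,
                             std::vector<char> &dst) {
    dst.resize(ZSTD_compressBound(src.size()));
    auto it = dictsByType.find(type);
    if (it == dictsByType.end()) {
        // No dictionary trained for this type: fall back to plain Zstd.
        return ZSTD_compressCCtx(cctx, dst.data(), dst.size(),
                                 src.data(), src.size(), /*level=*/3);
    }
    return ZSTD_compress_usingCDict(cctx, dst.data(), dst.size(),
                                    src.data(), src.size(), it->second);
}
```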

Due to the length of this article, the implementation details of these capabilities will not be covered here; interested readers can try them out for themselves.

3. The ability to automatically fill new columns

Business logic may add new columns to an existing table as development iterates. SQLite supports adding columns to existing tables, and WCDB automatically adds the newly configured columns of the ORM class when createTable is called; even so, missing-column errors are still very common in our practice. One reason is developer oversight: the logic that adds new columns must be actively invoked before the table is used, and relying on developers to remember this is error-prone, especially in multi-person collaboration. Another reason is that there may be no suitable time to add new columns, for example when many tables share one unified ORM class. Heavy users may have thousands of such tables; adding the new column to all of them at once is time-consuming and easily causes lag, while checking whether a new column is needed on every read and write would significantly reduce performance.

All columns of a table are configured in its corresponding ORM class. Ideally, once a developer configures a new column in the ORM class, it should take effect immediately, without the developer caring about when the column is actually added. To achieve this, WCDB added the ability to automatically fill in new columns. The core idea is: when a read or write reports an unrecognized-column error, immediately check whether the ORM class of the table being accessed has a newly configured column with the same name as the unrecognized one; if so, add that column to the table and then retry the failed logic. This check-on-error approach minimizes the number of times the new-column check runs, while comprehensively handling the problems caused by new columns not being added to the database in time.


Figure 31: The process of auto-filling new columns
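
The following sketch shows the check-on-error idea in simplified form (not WCDB's actual code): on a "no such column" error, look the missing column up in the ORM configuration, add it with ALTER TABLE, and retry. OrmBinding and its methods are hypothetical stand-ins:

```cpp
// Sketch of "check on error, then retry" (not WCDB's actual code).
// OrmBinding is a hypothetical view of an ORM class's configuration.
#include <sqlite3.h>
#include <string>

struct OrmBinding {
    bool hasColumn(const std::string &name) const;        // configured in ORM?
    std::string columnDef(const std::string &name) const; // e.g. "seq INTEGER"
};

bool execWithAutoAddColumn(sqlite3 *db, const char *sql,
                           const OrmBinding &orm, const std::string &table) {
    char *err = nullptr;
    if (sqlite3_exec(db, sql, nullptr, nullptr, &err) == SQLITE_OK) return true;
    std::string msg = err ? err : "";
    sqlite3_free(err);
    const std::string prefix = "no such column: ";
    size_t pos = msg.find(prefix);
    if (pos == std::string::npos) return false;      // some other error
    std::string missing = msg.substr(pos + prefix.size());
    if (!orm.hasColumn(missing)) return false;       // guards against false positives
    std::string alter =
        "ALTER TABLE " + table + " ADD COLUMN " + orm.columnDef(missing);
    if (sqlite3_exec(db, alter.c_str(), nullptr, nullptr, nullptr) != SQLITE_OK)
        return false;
    return sqlite3_exec(db, sql, nullptr, nullptr, nullptr) == SQLITE_OK;  // retry
}
```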

The ability to auto-fill new columns is ideal in terms of performance impact and completeness of the solution, but it is difficult to implement. Two main problems must be solved: first, how to obtain the ORM class corresponding to the table when an execution error occurs; second, how to avoid adding a wrong column to the table.

For the first problem: when a column configured in an ORM class is used, the column name can be obtained from the class's internal information and used to construct a Column object for assembling Winq statements. For example, the WCTSequence used in the earlier example calls a method of the WCTSequence class to obtain the column name configured for the seq property and construct a Column. So when constructing a Column this way, we can also pass in the database configuration information of the whole ORM class and store it in the Column, which lets us recover, from a Winq statement, the full ORM configuration of the class that each column belongs to. Since the ORM information is a global object stored on the heap, this change merely passes and stores one extra pointer and has no performance impact on Winq.

However, this alone is not enough: what we actually need is the ORM information of the table involved in the Winq statement, not of the column. Here we choose to give up some scenarios and only handle reads and writes of a single table; cases where the missing columns in a Winq statement correspond to two different ORM classes are also abandoned. Operating multiple tables, or using multiple ORM classes, within a single SQL statement is very rare in practice.
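
A minimal sketch of this design, with assumed names, might look like the following; the point is only that a Column additionally carries one pointer to its ORM class's global configuration, which is then consulted when the statement touches a single table:

```cpp
// Sketch: a Column that also carries its ORM class's configuration (names assumed).
#include <string>

struct OrmBinding;  // per-ORM-class configuration, a global object on the heap

struct Column {
    std::string name;            // the configured column name, e.g. "seq"
    const OrmBinding *binding;   // one extra pointer: no measurable Winq overhead
};
```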

For the second problem, there are two main situations to handle:

  1. Prevent SQLite's false positives for unrecognized columns. For example, SELECT city FROM China WHERE city MATCH 'Guangdong: Guangzhou' reports no such column: Guangdong, yet no such column is actually referenced; the FTS query syntax mistakenly parses the part before the colon as a column name. This can be solved by extracting the column name from the error message and matching it against the column names in the Winq statement.
  2. Prevent a developer who mistakenly uses the wrong ORM class from adding all of that class's configured columns. Even with input hints, mistakes when writing Winq statements cannot be completely avoided. This is solved by requiring that at least half of the columns configured in the matching ORM class already exist in the table, as sketched below. In extreme cases, even if a few columns are added by mistake, as long as no data is actually written to them, they occupy no storage space and do not affect read/write performance.
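
The second safeguard can be sketched as a simple counting check, where the table's existing columns would come from PRAGMA table_info (names here are assumptions):

```cpp
// Sketch of the safety check: only treat an ORM class as matching the table if
// at least half of its configured columns already exist in the table.
#include <set>
#include <string>
#include <vector>

bool ormMatchesTable(const std::vector<std::string> &ormColumns,
                     const std::set<std::string> &tableColumns) {
    size_t present = 0;
    for (const std::string &col : ormColumns)
        if (tableColumns.count(col) != 0) ++present;
    return present * 2 >= ormColumns.size();  // at least half already added
}
```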

Change 5: More extreme performance optimization capabilities

1. FTS5 optimization

From 2020 to 2021, iOS WeChat switched its three main local search features (contact search, chat history search, and favorites search) to SQLite's FTS5 component. WCDB took this opportunity to improve its FTS5 support: optimizing FTS5 read/write performance, redesigning the FTS5 tokenizer and enriching its capabilities, and adding pinyin search. For details, see "iOS WeChat full-text search technology optimization": https://mp.weixin.qq.com/s/Ph0jykLr5CMF-xFgoJw5UQ

2. Interruptible transactions

In scenarios where a large amount of data needs to be written to the database, our development habit is to move these update operations to sub-threads, so as to avoid blocking the main thread and hurting the user experience. In WeChat, such scenarios include receiving messages, cleaning up Moments data, cleaning up Channels data, and so on: receiving messages may mean hundreds or thousands of messages at a time, while Moments and Channels data is stored in the database after being pulled but does not need to be kept permanently, so expired data must be cleaned up regularly.

In these scenarios, simply putting the update operations on a sub-thread does not solve the problem. SQLite does not support parallel writes to the same DB, so if a sub-thread's update takes too long while the main thread also needs to write, for example the user sends a message at the same time as messages are being received, the main thread will still be blocked. In the past we split the sub-thread's update into small independent operations executed separately, for example writing received messages to the database one by one. That avoids blocking the main thread, but causes a large amount of disk IO and increases the sub-thread's time cost: SQLite reads and writes the database in units of data pages (4 KB in WCDB's configuration), a single page can usually hold multiple messages, and writing messages one at a time easily causes the same page to be read and written repeatedly. To reduce the amount written to disk, all the update operations would have to execute in a single transaction, which brings back the main-thread blocking problem.


Figure 32: Example of writing received messages one by one

To solve the problem of large transactions blocking the main thread, we developed interruptible transactions in WCDB. An interruptible transaction treats a long transaction as a loop; each iteration performs one short DB operation, such as writing one new message. After each operation, it first decides, based on parameters passed in from outside, whether the transaction can finish; if so, it commits the transaction and persists the changes to disk. If the transaction cannot finish yet, it then checks whether the main thread is blocked by the current transaction; if not, it calls back into the external logic and continues with subsequent iterations until the external logic is fully processed. If it detects that the main thread is blocked by the current transaction, it commits immediately, writing the changes so far to disk first, and wakes the main thread to perform its DB operation. After the main thread's DB operation finishes, a new transaction is opened so that the previously interrupted logic can continue. The overall flow of an interruptible transaction is shown below:


Figure 33: Interruptible transactions
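
Below is a simplified sketch of this loop, assuming a message-writing scenario. execSQL(), writeOneMessage() and mainThreadIsWaitingOn() are hypothetical stand-ins for WCDB internals, and the "can the transaction finish" decision is reduced here to draining the queue:

```cpp
// Simplified sketch of an interruptible transaction (illustrative only).
#include <sqlite3.h>
#include <queue>

struct Message { /* ... */ };
void execSQL(sqlite3 *db, const char *sql);            // hypothetical helper
void writeOneMessage(sqlite3 *db, const Message &m);   // one short DB operation
bool mainThreadIsWaitingOn(sqlite3 *db);               // approximates WCDB's
                                                       // lock monitoring

void runInterruptibleTransaction(sqlite3 *db, std::queue<Message> &pending) {
    execSQL(db, "BEGIN IMMEDIATE");
    while (!pending.empty()) {                 // loop until external logic is done
        writeOneMessage(db, pending.front());
        pending.pop();
        if (mainThreadIsWaitingOn(db)) {
            execSQL(db, "COMMIT");             // persist the partial changes first
            // ... the main thread acquires the DB and performs its write here ...
            execSQL(db, "BEGIN IMMEDIATE");    // reopen and continue where we left off
        }
    }
    execSQL(db, "COMMIT");                     // nothing left: finish normally
}
```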

Interruptible transactions allow a series of DB operations to execute within one transaction as much as possible, while still responding promptly to main-thread blocking events and avoiding main-thread lag. Note that because one logical transaction may be split into multiple commits, the atomicity of the transaction as a whole is no longer guaranteed; users should be aware of this, and if necessary add an extra mechanism to guarantee atomicity.

3. WAL file header update optimization

The WAL file header stores key information such as the WAL version number, page size, salt values, and checksums, and it has to be updated every time the first page of the WAL file is written (this is the only time the WAL header is updated). Earlier versions of SQLite (those prior to the version used by WCDB 1.0.8) wrote the WAL header simply by writing it into the disk cache, without calling fsync. SQLite later discovered that if the disk cache is flushed to disk out of order, content other than the WAL header may reach disk before the header itself has been updated, which can corrupt the database (see https://sqlite.org/src/info/ff5be73dee for details). Therefore, current versions of SQLite call fsync after writing the WAL header, which raises the time to write the first frame of the WAL file from about 5 ms to 100 ms. This easily causes lag, and it used to be the number one cause of database lag in iOS WeChat.

To solve this problem, WCDB modified the SQLite source to change when the WAL header is updated. Under WCDB's configuration, there are two occasions on which the first page of the WAL file is written: the first write after the database is created, and the first write after the content of the WAL file has been fully checkpointed. The first occasion cannot be optimized, but for the second, the WAL header update can be moved forward to checkpoint time.

The specific logic is as follows: when a checkpoint finishes, if no other thread is reading or writing the WAL file at that moment, take a lock to prevent other threads from writing the WAL file, and rewrite the WAL header with a sync. When writing the first frame of the WAL file, if the WAL file has not been created or its header has not yet been rewritten, try to rewrite the header with a sync at that point. Because checkpoints run on sub-threads, and the cases where the header still has to be rewritten on the read/write path are few, this optimization moves most WAL-header updates to sub-threads and avoids UI lag. After the optimization shipped, the number of lag reports in iOS WeChat dropped by 5%~10%.


Figure 34: WAL header update optimization
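
As a highly simplified model (not SQLite's real internals), the essence of the change is to pay the header fsync on the checkpointing sub-thread. The 32-byte header at offset 0 of the -wal file is a real SQLite layout fact; everything else below, including the locking, is assumed away:

```cpp
// Highly simplified model of the optimization: after a checkpoint has copied all
// WAL content back to the main database, rewrite the 32-byte WAL header and
// fsync it on the checkpointing sub-thread, so that a later first-frame write on
// the UI thread does not have to pay the fsync. Error handling is omitted.
#include <unistd.h>

void rewriteWalHeaderAfterCheckpoint(int walFd, const unsigned char header[32]) {
    // Caller must already hold a lock that excludes other WAL writers/readers.
    pwrite(walFd, header, 32, 0);  // WAL header: 32 bytes at the start of -wal
    fsync(walFd);                  // the expensive sync, off the UI thread
}
```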

This concludes the introduction of the main changes and updates in the new version of WCDB.

4. Open source address

The new version of WCDB has been open-sourced on GitHub: https://github.com/Tencent/wcdb. Stars are welcome!

5. Summary

At the interface level, the new version of WCDB fully supports five terminal development languages, C++, Java, Kotlin, Swift, and ObjC, and covers the four major terminal platforms: Android, iOS, Windows, and Linux. We have also rewritten and enhanced Winq, enabling developers to write arbitrary SQL with native syntax in each of these languages.

At the functional level, the new version of WCDB introduces a new data backup and repair solution, which greatly improves the data recovery rate while reducing the performance cost of backup to a negligible level. In addition, we have highlighted two new features, data migration and data compression, which let developers handle the over-aggregation and over-bloat of data in complex businesses efficiently with simple configuration. The new version also introduces FTS5 optimization, interruptible transactions, and other features that allow developers to achieve extreme performance optimization in specific scenarios.

Author: qiuwen

Source: WeChat public account: WeChat client technical team

Source: https://mp.weixin.qq.com/s/RWCqLD0M_WGCrCcz0oQIcQ
