laitimes

40x faster text retrieval performance and in-depth interpretation of inverted indexes in Apache Doris

author:SelectDB

In the OLAP space, Apache Doris has become synonymous with high performance, high concurrency, and high timeliness. In addition to factors such as hardware configuration, cluster size, and network bandwidth, the core of performance improvement lies in how to minimize the CPU, memory, and IO overhead of SQL execution, in which database indexes play a crucial role. A reasonable index structure design can skip a large number of unnecessary underlying data reads, quickly retrieve and locate the required data, further improve the execution efficiency of subsequent calculations, and reduce the running time and resource consumption of SQL queries.

Apache Doris provides rich indexes to speed up data reading and filtering, and can be roughly divided into two types: intelligent built-in indexes and user-created indexes according to whether they need to be manually created by users, where intelligent built-in indexes refer to indexes that are automatically generated when data is written without user intervention, including prefix indexes and ZoneMap indexes. To create an index, you need to manually create an index based on your business characteristics, including the Bloom Filter index, the inverted index, and the NGram Bloom Filter index added in version 2.0.

Compared with the familiar prefix index and Bloom Filter index, the new inverted index and NGram Bloom Filter in version 2.0 have more significant performance improvements in scenarios such as text search, fuzzy matching, and non-primary key column search. This article will take the Amazon customer reviews dataset as an example to describe how Apache Doris makes full use of inverted indexes and NGram Bloom Filter indexes for query acceleration when querying this dataset and similar scenarios, and explains how it works and best practices in detail.

Sample dataset

For the purposes of this article, we used a dataset of about 130 million user reviews for Amazon products. The dataset exists in the form of a Snappy-compressed Parquet file with a total size of about 37GB. Here's an example of a dataset:

40x faster text retrieval performance and in-depth interpretation of inverted indexes in Apache Doris

In the subset, each row contains 15 columns of information, such as user ID (customer_id), review ID (review_id), purchased product ID (product_id), product category (product_category), rating (star_rating), review title (review_headline), review content (review_body), etc. As you can see from the above, the columns contain various characteristics that are suitable for index acceleration. For example, customer_id is a high-cardinality numeric column, product_id is a low-cardinality fixed-length short text column, product_title is a short text column suitable for text search, and review_body is a long text column for text search.

With these columns, we can simulate two typical index query scenarios, as follows:

  • Text search query: Search for product information that contains specific content in the Review body field.
  • Non-primary key column detail query: Query the review information of a specific product ID (product_id) or a specific user ID (customer_id).

Next, we will compare the difference in query performance with and without indexes using text search and non-primary key column detail queries as the main direction. At the same time, we will also analyze in detail the principle of indexing to reduce query time and improve query efficiency.

Environment setup

In order to quickly set up the environment, create clusters, and import data, we use a single-node cluster (1FE, 1BE) and follow the steps below:

  1. Setting up Apache Doris: For more information, see Quick Start
  1. Create a data table: Perform the following table creation statement to create a data table
CREATE TABLE `amazon_reviews` (  
  `review_date` int(11) NULL,  
  `marketplace` varchar(20) NULL,  
  `customer_id` bigint(20) NULL,  
  `review_id` varchar(40) NULL,
  `product_id` varchar(10) NULL,
  `product_parent` bigint(20) NULL,
  `product_title` varchar(500) NULL,
  `product_category` varchar(50) NULL,
  `star_rating` smallint(6) NULL,
  `helpful_votes` int(11) NULL,
  `total_votes` int(11) NULL,
  `vine` boolean NULL,
  `verified_purchase` boolean NULL,
  `review_headline` varchar(500) NULL,
  `review_body` string NULL
) ENGINE=OLAP
DUPLICATE KEY(`review_date`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`review_date`) BUCKETS 16
PROPERTIES (
"replication_allocation" = "tag.location.default: 1",
"compression" = "ZSTD"
);           

3. Download the dataset: Download the dataset separately from the link below, the dataset is in Parque format, compressed by Snappy, and the total size is about 37GB

  • amazon_reviews_2010
  • amazon_reviews_2011
  • amazon_reviews_2012
  • amazon_reviews_2013
  • amazon_reviews_2014
  • amazon_reviews_2015

4. Import datasets: After the download is complete, run the following commands to import datasets:

curl --location-trusted -u root: -T amazon_reviews_2010.snappy.parquet -H "format:parquet" http://${BE_IP}:${BE_PORT}/api/${DB}/amazon_reviews/_stream_load
curl --location-trusted -u root: -T amazon_reviews_2011.snappy.parquet -H "format:parquet" http://${BE_IP}:${BE_PORT}/api/${DB}/amazon_reviews/_stream_load
curl --location-trusted -u root: -T amazon_reviews_2012.snappy.parquet -H "format:parquet" http://${BE_IP}:${BE_PORT}/api/${DB}/amazon_reviews/_stream_load
curl --location-trusted -u root: -T amazon_reviews_2013.snappy.parquet -H "format:parquet" http://${BE_IP}:${BE_PORT}/api/${DB}/amazon_reviews/_stream_load
curl --location-trusted -u root: -T amazon_reviews_2014.snappy.parquet -H "format:parquet" http://${BE_IP}:${BE_PORT}/api/${DB}/amazon_reviews/_stream_load
curl --location-trusted -u root: -T amazon_reviews_2015.snappy.parquet -H "format:parquet" http://${BE_IP}:${BE_PORT}/api/${DB}/amazon_reviews/_stream_load
           

5. View and verify: After completing the preceding steps, you can run the following statement on the MySQL client to view the number of imported data rows and the space occupied. As you can see from the code below, a total of 135589433 rows of data are imported, which takes up 25.873GB of space in Doris, which is a further 30% reduction compared to the compressed Parquet columnar storage.

mysql> SELECT COUNT() FROM amazon_reviews;
+-----------+
| count(*)  |
+-----------+
| 135589433 |
+-----------+
1 row in set (0.02 sec)
mysql> SHOW DATA FROM amazon_reviews;
+----------------+----------------+-----------+--------------+-----------+------------+
| TableName      | IndexName      | Size      | ReplicaCount | RowCount  | RemoteSize |
+----------------+----------------+-----------+--------------+-----------+------------+
| amazon_reviews | amazon_reviews | 25.873 GB | 16           | 135589433 | 0.000      |
|                | Total          | 25.873 GB | 16           |           | 0.000      |
+----------------+----------------+-----------+--------------+-----------+------------+
2 rows in set (0.00 sec)
           

Text search query acceleration

No index hard match

Once the environment and data are ready, we try to do a text search query on the review_body column. The specific requirement is to find the top 5 products in the dataset that contain the keyword "is super awesome" in reviews, and sort them in descending order by number of reviews, showing the ID of each product, a random product title, an average star rating, and the total number of reviews. Columns review_body are characterized by long comments content, so there is a performance pressure on text search.

First, let's go straight to the query, and here's an example statement for the query:

SELECT
    product_id,
    any(product_title),
    AVG(star_rating) AS rating,
    COUNT() AS count
FROM
    amazon_reviews
WHERE
    review_body LIKE '%is super awesome%'
GROUP BY
    product_id
ORDER BY
    count DESC,
    rating DESC,
    product_id
LIMIT 5;
           

The query took 7.6 seconds

+------------+------------------------------------------+--------------------+-------+
| product_id | any_value(product_title)                 | rating             | count |
+------------+------------------------------------------+--------------------+-------+
| B00992CF6W | Minecraft                                | 4.8235294117647056 |    17 |
| B009UX2YAC | Subway Surfers                           | 4.7777777777777777 |     9 |
| B00DJFIMW6 | Minion Rush: Despicable Me Official Game |              4.875 |     8 |
| B0086700CM | Temple Run                               |                  5 |     6 |
| B00KWVZ750 | Angry Birds Epic RPG                     |                  5 |     6 |
+------------+------------------------------------------+--------------------+-------+
5 rows in set (7.60 sec)
           

利用 Ngram BloomFilter 索引加速查询

Next, let's try using the Ngram BloomFilter index for query acceleration

ALTER TABLE amazon_reviews ADD INDEX review_body_ngram_idx(review_body) USING NGRAM_BF PROPERTIES("gram_size"="10", "bf_size"="10240");
           

After adding the Ngram BloomFilter index, execute the same query again. The query time is shortened to 0.93 seconds, which is 8 times more efficient than that without indexes.

+------------+------------------------------------------+--------------------+-------+
| product_id | any_value(product_title)                 | rating             | count |
+------------+------------------------------------------+--------------------+-------+
| B00992CF6W | Minecraft                                | 4.8235294117647056 |    17 |
| B009UX2YAC | Subway Surfers                           | 4.7777777777777777 |     9 |
| B00DJFIMW6 | Minion Rush: Despicable Me Official Game |              4.875 |     8 |
| B0086700CM | Temple Run                               |                  5 |     6 |
| B00KWVZ750 | Angry Birds Epic RPG                     |                  5 |     6 |
+------------+------------------------------------------+--------------------+-------+
5 rows in set (0.93 sec)
           

Next, let's expand on the code sample. When you use the ALTER TABLE statement to add the Ngram BloomFilter index to a table, the gram_size and bf_size parameters have specific meanings:

  • gram_size: Represents the n-value in the n-gram, which is the length of consecutive characters. In the code example above, "gram_size" = "10" means that each n-gram contains 10 characters. This means that the text will be cut into several strings of characters in length of 10 characters, which will be used to build the index.
  • bf_size: indicates the size of the Bloom Filter, in bytes. For example, "bf_size" = "10240" means that the Bloom Filter used has a data size footprint of 10240 bytes.

Now that we've learned the basic parameter definitions, let's explore how Ngram BloomFilter accelerates queries:

  • Ngram tokenization: Use gram_size to tokenize each row of data, and when gram_size=5, "hello world" is split into ["hello", "ello", "llo w", "lo wo", "o wor", "worl", "world"]. These substrings are hashed and then added to the Bloom Filter of the appropriate size (bf_size). Since Doris data is stored on a page-by-page basis, the corresponding Bloom Filter is also generated on a per-page basis.
  • Query acceleration: Taking hello as an example, it will also be split and generate a corresponding Bloom Filter during the matching process, which can be used to compare with the Bloom Filter of each page. If the Bloom Filter judges to contain a matching string (which may be a false positive), the page is loaded for further matching, otherwise, the page is skipped. The principle is to reduce the amount of data that needs to be scanned by skipping pages that do not need to be loaded, thereby significantly reducing query latency.
40x faster text retrieval performance and in-depth interpretation of inverted indexes in Apache Doris
40x faster text retrieval performance and in-depth interpretation of inverted indexes in Apache Doris

The size of gram_size directly affects the efficiency of matching, while the size of bf_size affects the storage capacity and false positive rate. In general, a larger bf_size can reduce the false positive rate, but it also takes up more storage space. Therefore, we recommend that you consider the configuration parameters from the following two aspects:

Data attributes: Consider the type of data you want to index. For text data, it needs to be determined based on the average length of the text and the distribution of characters.

  • For shorter text, such as words or phrases: A smaller gram_size (e.g., 2-4) and a smaller bf_size may be more appropriate.
  • For longer texts such as sentences or large paragraph descriptions: larger gram_size (e.g., 5-10) and larger bf_size may be more effective.

Query pattern: Consider the typical pattern of the query.

  • If the query typically contains phrases or near-complete words, a larger gram_size may be better.
  • For fuzzy matches or queries with multiple variations, a smaller gram_size can provide a more flexible match.

Accelerate queries with inverted indexes

In addition to using the Ngram BloomFilter index for query acceleration, you can also choose to use the inverted index to further accelerate the efficiency of text search. You can use the following steps to build an inverted index:

1. Add an inverted index: Add an inverted index to the review_body column of the amazon_reviews table, which uses English word segmentation and supports Phrase query, which means that when the phrase query is a text search, the order of the words after word segmentation will affect the search results. 2. Create an index for historical data: Construct an index of historical data based on the new index information, so that historical data can also be queried using inverted indexes.

ALTER TABLE amazon_reviews ADD INDEX review_body_inverted_idx(`review_body`) 
    USING INVERTED PROPERTIES("parser" = "english","support_phrase" = "true"); 
BUILD INDEX review_body_inverted_idx ON amazon_reviews;           

3. View and verify: After the index is built, you can view the index construction status in the following ways:

mysql> show BUILD INDEX WHERE TableName="amazon_reviews";
+-------+----------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+
| JobId | TableName      | PartitionName  | AlterInvertedIndexes                                                                                                              | CreateTime              | FinishTime              | TransactionId | State    | Msg  | Progress |
+-------+----------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+
| 10152 | amazon_reviews | amazon_reviews | [ADD INDEX review_body_inverted_idx (
review_body
) USING INVERTED PROPERTIES("parser" = "english", "support_phrase" = "true")],  | 2024-01-23 15:42:28.658 | 2024-01-23 15:48:42.990 | 11            | FINISHED |      | NULL     |
+-------+----------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+
1 row in set (0.00 sec)
           

IF YOU ARE UNSURE ABOUT THE EFFECT OF WORD SEGMENTATION, YOU CAN USE THE TOKENIZE FUNCTION TO PERFORM A WORD SEGMENTATION TEST. THE TOKENIZE FUNCTION RECEIVES TWO INPUTS: A TEXT THAT NEEDS TO BE TOKENIZED, AND A PROPERTY FIELD THAT IS TOKENIZED.

mysql> SELECT TOKENIZE('I can honestly give the shipment and package 100%, it came in time that it was supposed to with no hasels, and the book was in PERFECT condition.
super awesome buy, and excellent for my college classs', '"parser" = "english","support_phrase" = "true"');
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize('I can honestly give the shipment and package 100%, it came in time that it was supposed to with no hasels, and the book was in PERFECT condition. super awesome buy, and excellent for my college classs', '"parser" = "english","support_phrase" = "true"')                                              |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["i", "can", "honestly", "give", "the", "shipment", "and", "package", "100", "it", "came", "in", "time", "that", "it", "was", "supposed", "to", "with", "no", "hasels", "and", "the", "book", "was", "in", "perfect", "condition", "super", "awesome", "buy", "and", "excellent", "for", "my", "college", "classs"] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.05 sec)
           

After the inverted index is created, we use MATCH_PHRASE to query product reviews that contain the keyword "is super awesome" (see earlier for specific needs).

SELECT
    product_id,
    any(product_title),
    AVG(star_rating) AS rating,
    COUNT() AS count
FROM
    amazon_reviews
WHERE
    review_body MATCH_PHRASE 'is super awesome'
GROUP BY
    product_id

ORDER BY
    count DESC,
    rating DESC,
    product_id
LIMIT 5;
           

To illustrate with the code example above, review_body MATCH_PHRASE 'is super awesome' means a phrase match query on column review_body. Specifically, the query looks for snippets of text that contain both the words "is", "super", and "awesome" after English word segmentation in review_body, with the order of "is" first, "super" in the middle, and "awesome" last, with no spacing between the words (case-insensitive).

It should be noted here that the difference between MATCH and LIKE queries is that MATCH queries ignore case and divide sentences into words for matching, which can more quickly locate the results that meet the conditions, especially in the case of large-scale datasets, the efficiency of MATCH is more obvious.

As shown in the following results, the query time after the inverted index is enabled is only 0.19 seconds, which is 4 times higher than that when the Ngram BloomFilter index is only enabled, and nearly 40 times higher than when the index is not enabled, greatly improving the efficiency of text retrieval.

+------------+------------------------------------------+-------------------+-------+
| product_id | any_value(product_title)                 | rating            | count |
+------------+------------------------------------------+-------------------+-------+
| B00992CF6W | Minecraft                                | 4.833333333333333 |    18 |
| B009UX2YAC | Subway Surfers                           |               4.7 |    10 |
| B00DJFIMW6 | Minion Rush: Despicable Me Official Game |                 5 |     7 |
| B0086700CM | Temple Run                               |                 5 |     6 |
| B00KWVZ750 | Angry Birds Epic RPG                     |                 5 |     6 |
+------------+------------------------------------------+-------------------+-------+
5 rows in set (0.19 sec)
           

The reason for this acceleration is that inverted indexing is a mapping from words to a list of line numbers by breaking down text into words. These mappings are sorted by word and a skip table index is constructed. When querying a specific word, you can quickly locate the corresponding row number list in the ordered mapping by using methods such as skip table index and binary search, and then obtain the content of the row. This query method avoids row-by-row matching, reduces the complexity of the algorithm from O(n) to O(logn), and significantly improves query performance when processing large-scale data.

40x faster text retrieval performance and in-depth interpretation of inverted indexes in Apache Doris

In order to understand the acceleration principle of inverted indexes, it is necessary to start with the internal referral, read-write logic of inverted indexes. In Doris, inverted indexes are logically applied at the column level of a table, whereas from a physical storage and implementation perspective, inverted indexes are actually built at the data file level. The details are as follows:

  • Write phase: When data is written to the data file, it is also synchronously written to the indexing file, and the row number of each written data is corresponding to the row number in the inverted index.
  • Query stage: If the query WHERE condition contains columns that have been indexed inverted, Doris will automatically query the index file to return a list of row numbers that meet the conditions, and then use Doris's common row number filtering mechanism to skip unnecessary rows and pages and read only the rows that meet the conditions to achieve the query acceleration effect.

In general, Doris's inverted indexing mechanism works together with data files and index files at the physical level, and at the logical level, it achieves efficient data retrieval and query acceleration through the mapping of columns and rows.

Non-primary key column queries are accelerated

To further validate the impact of inverted indexes on query acceleration for non-primary key columns, we chose to query the dimension information for product IDs and user IDs.

Inverted indexes are not enabled

When querying the comments of user 13916588 on product B002DMK1R0, you need to run the following SQL statement to query the entire table data, which takes 1.81 seconds.

mysql> SELECT product_title,review_headline,review_body,star_rating 
FROM amazon_reviews 
WHERE product_id='B002DMK1R0' AND customer_id=13916588;
+-----------------------------------------------------------------+----------------------+-----------------------------------------------------------------------------------------------------------------------------+-------------+
| product_title                                                   | review_headline      | review_body                                                                                                                 | star_rating |
+-----------------------------------------------------------------+----------------------+-----------------------------------------------------------------------------------------------------------------------------+-------------+
| Magellan Maestro 4700 4.7-Inch Bluetooth Portable GPS Navigator | Nice Features But... | This is a great GPS. Gets you where you are going. Don't forget to buy the seperate (grr!) cord for the traffic kit though! |           4 |
+-----------------------------------------------------------------+----------------------+-----------------------------------------------------------------------------------------------------------------------------+-------------+
1 row in set (1.81 sec)
           

Inverted index queries are accelerated

Next, let's add inverted indexes for product_id and customer_id. In this scenario, the use of inverted indexes is different from that of text search, and there is no need to tokenize product_id and customer_id, but only to create an inverted mapping table for the Values → RowIDs of these two columns.

First, create an inverted index by executing the following SQL statement:

ALTER TABLE amazon_reviews ADD INDEX product_id_inverted_idx(product_id) USING INVERTED ;
ALTER TABLE amazon_reviews ADD INDEX customer_id_inverted_idx(customer_id) USING INVERTED ;
BUILD INDEX product_id_inverted_idx ON amazon_reviews;
BUILD INDEX customer_id_inverted_idx ON amazon_reviews;
           

Second, when the index is built and the same query statement is executed, the query time is reduced from 1.81 seconds to 0.06 seconds, which is significantly reduced, and the query efficiency is improved by about 30 times compared with the situation without indexing.

mysql> SELECT product_title,review_headline,review_body,star_rating FROM amazon_reviews WHERE product_id='B002DMK1R0' AND customer_id='13916588';
+-----------------------------------------------------------------+----------------------+-----------------------------------------------------------------------------------------------------------------------------+-------------+
| product_title                                                   | review_headline      | review_body                                                                                                                 | star_rating |
+-----------------------------------------------------------------+----------------------+-----------------------------------------------------------------------------------------------------------------------------+-------------+
| Magellan Maestro 4700 4.7-Inch Bluetooth Portable GPS Navigator | Nice Features But... | This is a great GPS. Gets you where you are going. Don't forget to buy the seperate (grr!) cord for the traffic kit though! |           4 |
+-----------------------------------------------------------------+----------------------+-----------------------------------------------------------------------------------------------------------------------------+-------------+
1 row in set (0.06 sec)           

Observations show that inverted indexes have a very good speedup effect in dimension queries that resemble non-primary key columns. For a more in-depth and intuitive view of the acceleration effect, you can explore it further with Doris Profile information.

Profile 分析

It should be noted that before opening the query profile, you need to run SET enable_profile=true on the MySQL client. Command. After completing the query, execute the query statement and visit http://FE_IP:FE_HTTP_PORT/QueryProfile to view the profile ID and detailed profile information associated with the query.

In this article, only the SegmentIterator Profile information for a specific fragment is taken to illustrate the reason for inverted index query acceleration.

SegmentIterator:
  - FirstReadSeekCount: 0
  - FirstReadSeekTime: 0ns
  - FirstReadTime: 13.119ms
  - IOTimer: 19.537ms
  - InvertedIndexQueryTime: 11.583ms
  - RawRowsRead: 1
  - RowsConditionsFiltered: 0
  - RowsInvertedIndexFiltered: 16.907403M (16907403)
  - RowsShortCircuitPredInput: 0
  - RowsVectorPredFiltered: 0
  - RowsVectorPredInput: 0
  - ShortPredEvalTime: 0ns
  - TotalPagesNum: 27
  - UncompressedBytesRead: 3.71 MB
  - VectorPredEvalTime: 0ns           

From RowsInvertedIndexFiltered: 16.907403M (16907403) and RawRowsRead: 1 in the above profile, we can observe that the inverted index filters 16907403 rows of data, and finally only retains 1 row of data (i.e., the one that hit). According to FirstReadTime: 13.119ms, it takes 13.119 ms to read the page where this row of data is located, and according to InvertedIndexQueryTime: 11.583ms, it takes only 11.58 ms to execute the inverted index. This means that the inverted index filters only 16907403 rows of data in 11.58 ms, which is very efficient.

For a more direct comparison, let's show how the SegmentIterator performs without adding an inverted index:

SegmentIterator:
  - FirstReadSeekCount: 9.374K (9374)
  - FirstReadSeekTime: 400.522ms
  - FirstReadTime: 3s144ms
  - IOTimer: 2s564ms
  - InvertedIndexQueryTime: 0ns
  - RawRowsRead: 16.680706M (16680706)
  - RowsConditionsFiltered: 226.698K (226698)
  - RowsInvertedIndexFiltered: 0
  - RowsShortCircuitPredInput: 1
  - RowsVectorPredFiltered: 16.680705M (16680705)
  - RowsVectorPredInput: 16.680706M (16680706)
  - RowsZonemapFiltered: 226.698K (226698)
  - ShortPredEvalTime: 2.723ms
  - TotalPagesNum: 5.421K (5421)
  - UncompressedBytesRead: 277.05 MB
  - VectorPredEvalTime: 8.114ms
           

Based on the above profile observations, it takes FirstRead 3.14s to load 16680706 rows of data without indexes for filtering, and then uses Predicate Evaluate for conditional filtering to filter out 16680705 rows, while the conditional filtering itself consumes less than 10ms, so it can be seen that most of the time is spent loading raw data.

Through comparison, it can be seen that establishing an inverted index can greatly reduce the time of loading raw data and improve the execution efficiency of queries. Indexes can quickly locate rows that meet the criteria, reducing unnecessary data loading and processing, saving time and resources.

Low-cardinality text column indexing acceleration

It's no secret that inverted indexes are very effective for querying high-cardinality text columns. However, in the case of low-cardinality columns, it can be a negative impact on query performance due to the need to create too many index items, resulting in greater overhead. Next, we'll filter by using product_category as the predicate column to test how well Apache Doris inverted indexes speed up low-cardinality text columns.

mysql> SELECT COUNT(DISTINCT product_category) FROM amazon_reviews ;
+----------------------------------+
| count(DISTINCT product_category) |
+----------------------------------+
|                               43 |
+----------------------------------+
1 row in set (0.57 sec)
           

As you can see from the above operations, there are only 43 categories by product_category, which is a typical low-cardinality text column. Next, we add an inverted index to it

ALTER TABLE amazon_reviews ADD INDEX product_category_inverted_idx(`product_category`) USING INVERTED;
BUILD INDEX product_category_inverted_idx ON amazon_reviews;           

After adding an inverted index, run the following SQL query to query the top three products with the highest number of reviews among the products classified as Mobile_Electronics

SELECT 
    product_id,
    product_title,
    AVG(star_rating) AS rating,
    any(review_body),
    any(review_headline),
    COUNT(*) AS count 
FROM 
    amazon_reviews 
WHERE 
    product_category = 'Mobile_Electronics' 
GROUP BY 
    product_title, product_id 
ORDER BY 
    count DESC 
LIMIT 10;
           

As can be seen from the results below, the query time is 1.54s after adding the inverted index.

+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+-------+
| product_id | product_title                                                                                                                                                                                          | rating             | any_value(review_body)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | any_value(review_headline)      | count |
+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+-------+
| B00J46XO9U | iXCC Lightning Cable 3ft, iPhone charger, for iPhone X, 8, 8 Plus, 7, 7 Plus, 6s, 6s Plus, 6, 6 Plus, SE 5s 5c 5, iPad Air 2 Pro, iPad mini 2 3 4, iPad 4th Gen [Apple MFi Certified](Black and White) | 4.3766233766233764 | Great cable and works well. Exact fit as Apple cable. I would recommend this to anyone who is looking to save money and for a quality cable.                                                                                                                                                                                                                                                                                                                                                             | Apple certified lightning cable |  1078 |
| B004911E9M | Wall AC Charger USB Sync Data Cable for iPhone 4, 3GS, and iPod                                                                                                                                        | 2.4281805745554035 | A total waste of money for me because I needed it for a iPhone 4.  The plug will only go in upside down and thus won't work at all.                                                                                                                                                                                                                                                                                                                                                                      | Won't work with a iPhone 4!     |   731 |
| B002D4IHYM | New Trent Easypak 7000mAh Portable Triple USB Port External Battery Charger/Power Pack for Smartphones, Tablets and more (w/built-in USB cable)                                                        | 4.5216095380029806 | I bought this product based on the reviews that i read and i am very glad that i did. I did have a problem with the product charging my itouch after i received it but i emailed the company and they corrected the problem immediately. VERY GOOD customer service, very prompt. The product itself is very good. It charges my power hungry itouch very quickly and the imax battery power lasts for a long time. All in all a very good purchase that i would recommend to anyone who owns an itouch. | Great product & company         |   671 |
+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+-------+
3 rows in set (1.54 sec)
           

Next, we turn off the inverted index to observe the query time when the inverted index is not added. It should be noted here that when you need to close the index or find that the effect is not satisfactory after adding the index, you can run set enable_inverted_index_query=false in the MySQL client; to quickly and easily turn off inverted indexes temporarily. We run the query SQL again, as shown below, and the query takes 1.8s.

+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------+-------+
| product_id | product_title                                                                                                                                                                                          | rating             | any_value(review_body)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | any_value(review_headline)            | count |
+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------+-------+
| B00J46XO9U | iXCC Lightning Cable 3ft, iPhone charger, for iPhone X, 8, 8 Plus, 7, 7 Plus, 6s, 6s Plus, 6, 6 Plus, SE 5s 5c 5, iPad Air 2 Pro, iPad mini 2 3 4, iPad 4th Gen [Apple MFi Certified](Black and White) | 4.3766233766233764 | These cables are great. They feel quality, and best of all, they work as they should. I have no issues with them whatsoever and will be buying more when needed.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | Just like the original from Apple     |  1078 |
| B004911E9M | Wall AC Charger USB Sync Data Cable for iPhone 4, 3GS, and iPod                                                                                                                                        | 2.4281805745554035 | I ordered two of these chargers for an Iphone 4. Then I started experiencing weird behavior from the touch screen. It would select the wrong area of the screen, or it would refuse to scroll beyond a certain point and jump back up to the top of the page. This behavior occurs whenever either of the two that I bought are attached and charging. When I remove them, it works fine once again. Needless to say, these items are being returned.                                                                                                                                                                                                                                                                                                                                                                              | Beware - these chargers are defective |   731 |
| B002D4IHYM | New Trent Easypak 7000mAh Portable Triple USB Port External Battery Charger/Power Pack for Smartphones, Tablets and more (w/built-in USB cable)                                                        | 4.5216095380029806 | I received this in the mail 4 days ago, and after charging it for 6 hours, I've been using it as the sole source for recharging my 3Gs to see how long it would work.  I use my Iphone A LOT every day and usually by the time I get home it's down to 50% or less.  After 4 days of using the IMAX to recharge my Iphone, it finally went from 3 bars to 4 this afternoon when I plugged my iphone in.  It charges the iphone very quickly, and I've been topping my phone off (stopping around 95% or so) twice a day.  This is a great product and the size is very similar to a deck of cards (not like an iphone that someone else posted) and is very easy to carry in a jacket pocket or back pack.  I bought this for a 4 day music festival I'm going to, and I have no worries at all of my iphone running out of juice! | FANTASTIC product!                    |   671 |
+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------+-------+
3 rows in set (1.80 sec)
           

In summary, inverted indexes also have a 15% improvement in query performance for low-cardinality column scenarios, which is not as good as that of high-cardinality column scenarios, but does not have a degrading effect or negative impact. In addition, Apache Doris uses better encoding (e.g., dictionary encoding) and compression techniques for low-cardinality columns, and can be effectively filtered by built-in indexes (e.g., zonemap). Therefore, even if you don't add an inverted index, you can still show good query results.

Concluding remarks

All in all, inverted indexes in Apache Doris significantly optimize filtering operations on predicate columns, i.e., where clauses in SQL queries. By matching row numbers exactly, the amount of data that needs to be scanned at the storage layer is reduced, which improves query performance. Even with limited performance gains, inverted indexes don't negatively impact query efficiency. IN ADDITION, INVERTED INDEXES SUPPORT LIGHTWEIGHT INDEX MANAGEMENT OPERATIONS, SUCH AS ADD/DROP INDEXES AND BUILD INDEX OPERATIONS. At the same time, it also provides the ability to easily enable or disable indexes (enable_inverted_index_query=true/false) in the MySQL client, so that users can easily use inverted indexes to verify query acceleration effect.

Inverted indexes and NGram Bloom Filter indexes provide query acceleration scenarios for different scenarios, and the specific characteristics of the dataset and query patterns are key considerations when choosing an index type. Here are some common adaptation scenarios:

  • Large-scale non-primary key column query scenario: In this scenario, there are often a large number of scattered values listed in the value, and the query value hit is very low. To speed up queries, in addition to using Doris's built-in intelligent indexing capabilities when creating tables, you can also add inverted indexes to the corresponding columns to speed up queries. Inverted indexes have complete support for scalar types such as character types, numeric types, and dates.
  • Text retrieval scenario of short text columns: If the distribution of short text is relatively discrete (i.e., the similarity between texts is low), the Ngram Bloom Filter index is suitable for processing, which can effectively handle the fuzzy matching query (LIKE) of short text. At the same time, the vectorization processing capabilities of Apache Doris can be applied and exerted more fully and efficiently in short text scenarios. If the distribution of short texts is relatively concentrated (for example, a large number of texts are similar and a small number of texts are different), it is suitable to use an inverted word segmentation index, which can ensure that the dictionary is relatively small and suitable for quick search to obtain a list of line numbers.
  • Text search scenario for long text columns: For long text columns, inverted word segmentation is a better solution. Compared with brute-force string matching, inverted indexes provide more efficient query performance and avoid a lot of CPU resource consumption.

It has been nearly a year since Apache Doris first introduced inverted indexes, from the early 2.0 Preview version to the recently released 2.0.4, this year has been polished and verified by a large number of open source users under the massive data of real business environments, and the performance and stability have been fully verified. In the follow-up planning, we will continue to iterate and optimize on the existing basis, including:

  • The custom inverted index word segmentation capability provides users with custom word tokenizers according to their needs for word segmentation effects in different scenarios.
  • More types of inverted indexes are supported, and complex data types such as Array and Map will be added in the future to more comprehensively meet the needs of various queries.