天天看點

Principles and Applications of the Index Types Supported by PostgreSQL

Principles and Applications of the Index Types Supported by PostgreSQL

PostgreSQL supports a wide range of features:

Open data interfaces that allow PostgreSQL to support a wide range of different data types. Apart from those supported by traditional databases, it also supports GIS, JSON, RANGE, IP, ISBN, image characteristics, chemistry, DNA and a host of others. Users can also expand support for more data types based on their actual business needs.

Open operator interfaces that allow PostgreSQL to support not only common operators but also extensional operators, including distance, logical operators (And, Or, Not), image similarity, and geometric , etc. Users can also expand support for more operators based on their actual business needs.

An open external data source interface allows PostgreSQL to support a wide range of different external data sources. You can read and write MySQL, redis, mongo, oracle, sqlserver, hive, www, hbase, ldap, and many other data sources with the FDW interface.

An open language interface allows PostgreSQL to support the use of almost any programming language in the world, including plpython, plperl, pljava, plR, plCUDA, and plshell, as database functions and scripts. Users can expand the languages to be supported by PostgreSQL through the language handler.

Open index interfaces allow PostgreSQL to support extensive indexes, such as btree, hash, gin, gist, sp-gist, brin, bloom, rum, zombodb, and bitmap (greenplum extend). Users can choose different indexes based on different data and query types.

PostgreSQL also supports BitmapAnd and BitmapOr optimization methods internally, and supports the merger of scanning operations on multiple indexes, so as to more efficiently access data on multiple indexes.

PostgreSQL has different index interfaces for different data types and business scenarios. Let's introduce the principles and application scenarios for each index.

<a href="https://github.com/digoal/blog/blob/master/201605/20160528_01.md">In-depth Introduction to PostgreSQL B-Tree Index Structure</a>

B-tree indexes are applicable to all data types. They can be used in sorting as well as greater than, less than, equals, greater than or equal to, less than or equal to searches.

If we use indexes in combination with Recursive Query, we can also achieve rapid searches for sparse codes.

<a href="https://github.com/digoal/blog/blob/master/201705/20170519_01.md">PostgrSQL - a few applications of recursive SQL - thoughts of geeks and normal persons</a>

src/backend/access/hash/README

(hash index entries store only the hash code, not the actual data value, for each indexed item. )

A hash index stores the hash of the value of the indexed field, and only supports equality queries.

Hash indexes are particularly useful in scenarios where you have very long VALUE fields (an index page of the B-tree indexes needs to store at least 3 Entries, so B-tree indexes do not support very long VALUE fields). For example, when users need to run “Equal To” searches on extremely long strings, it is recommended to use the Hash indexes.

A gin index is an inverted index that stores the value or value element of the indexed field, as well as the list or tree of the row.

(col_val:(tid_list or tid_tree), col_val_elements:(tid_list or tid_tree))

<a href="https://github.com/digoal/blog/blob/master/201702/20170204_01.md">PostgreSQL - GIN index principles</a>

<a href="https://github.com/digoal/blog/blob/master/201702/20170205_01.md">The best sword for the best man - equivalent searches on arbitrary combinations of fields - exploring PostgreSQL multi-array expansion of B-tree indexes (GIN)</a>

Gin indexes are applicable to searches on values with multi-value data types, such as array, full-text search, and token. Gin indexes support multiple searches for different types of data, including intersection, contains, greater than, left, and right.)

When user data is relatively sparse and the user wishes to search for a value, B-tree_gin may be adapted to support any data type supported by B-tree. (Supports B-tree operators)

When a user needs to search by random columns, gin allows the creation of independent index fields in multiple columns, and supports the merger of bitmapAnd and bitmapOr to quickly return data for random column search requests.

1.Search over multi-value data

2.Search over single-value sparse data

3.Search over multiple random columns

GiST stands for Generalized Search Tree.

It is a balanced, tree-structured access method that acts as a base template in which to implement arbitrary indexing schemes.

B-trees, R-trees and many other indexing schemes can be implemented in GiST.

<a href="https://github.com/digoal/blog/blob/master/201612/20161231_01.md">Getting started with complicated fuzzy search - PostgreSQL unique skills - I. Principles and technical backgrounds of GIN, Gist, SP-GiST, and RUM indexes</a>

GiST is a generalized index interface. You can use GiST to create B-tree, R-tree and other index structures.

Different types of indexes support different indexed searches. For example:

The geometry index type supports location searches (Contain, Intersection, Top, Bottom, Left, Right, etc.), and sorting by distance.

The range type supports location searches (Contain, Intersection, Left, Right, etc.).

The IP type supports location searches (Contain, Intersection, Left, Right, etc.).

The spatial type (PostGIS) supports location searches (Contain, Intersection, Top, Bottom, Left, Right, etc.), and sorting by distance.

The scale type supports sorting by distance.

<a href="https://github.com/digoal/blog/blob/master/201601/20160119_01.md">Performance of PostgreSQL for querying neighboring geographical locations among tens of billions of data entries</a>

1.Searching with a geometry index

2.Sorting with a scale index

SP-GiST is an abbreviation for space-partitioned GiST.

SP-GiST supports partitioned search trees, which facilitate development of a wide range of different non-balanced data structures, such as quad-trees, k-d trees, and radix trees (tries).

The common feature of these structures is that they repeatedly divide the search space into partitions that need not be of equal size.

Searches that are well matched to the partitioning rule can be very fast.

Similar to GiST, SP-GiST is a generic index interface. The difference is that the SP-GiST uses spatial partitioning, which allows SP-GiST to better support unevenly distributed data structures, such as quad-trees, k-d trees, and radis trees.

<a href="https://github.com/digoal/blog/blob/master/201706/20170627_01_pdf_001.pdf">《Space-partitioning trees in PostgreSQL》</a>

<a href="https://github.com/digoal/blog/blob/master/201706/20170627_01_pdf_002.pdf">《SP-GiST for PostgreSQL User Manual》</a>

Searching with a range index

A BRIN index is a block range index. It is different from other indexes, such as B-tree indexes, in that a BRIN index does not store information by row number. Rather, it stores the statistical information of each data block or each consecutive data block. The result is that a BRIN index requires much less space than other indexes, and has very low impact on the process of writing, updating, and deleting data.

BRIN indexes are lossy, and perform particularly well when the values of the indexed columns are strongly correlated with the physical storage.

Taking chronological data for example, BRIN indexes would perform very well on Equals and Range searches if created based on the time or serial fields.

<a href="https://github.com/digoal/blog/blob/master/201504/20150419_01.md">BRIN (block range index) index</a>

<a href="https://github.com/digoal/blog/blob/master/201604/20160414_01.md">The cutting-edge technology of PostgreSQL for IoT - creating an index (BRIN Index) a fraction of the normal size</a>

<a href="https://github.com/digoal/blog/blob/master/201702/20170219_01.md">PostgreSQL aggregates storage and BRIN indexes - searching high throughput data on highly concurrent behavior and tracks</a>

<a href="https://github.com/digoal/blog/blob/master/201706/20170611_02.md">How to ensure chronologically linear storage when writing in parallel to heap tables with PostgreSQL - BRIN index optimization</a>

<a href="https://github.com/postgrespro/rum">https://github.com/postgrespro/rum</a>

RUM is a Postgres Pro Enterprise extension that serves as an enhanced version GIN index and is perfect for full-text search.

Enhancements include:

A RUM index stores the positional information of lexems. So, when calculating the ranking, RUM does not need to query back in the table (required for GIN indexes).

RUM supports phrase searches, which are not supported by GIN.

A RUM index allows the user to store field values apart from the ctid (line number) in the posting tree, e.g. the timestamp.

As a result, a RUM index supports not only the full-text searches supported by GIN indexes, but also calculating and sorting by the similarity of texts. It also supports positional matching, e.g. (for Fast and Furious, we can do the matching with “Fast” &lt;2&gt; “Furious”, which can not be achieved with GIN indexes).

Positional information is as follows:

<a href="https://github.com/digoal/blog/blob/master/201610/20161019_01.md">Full-text search with PostgreSQL is way too fast - RUM index interface (Pandora box)</a>

<a href="https://github.com/digoal/blog/blob/master/201701/20170116_04.md">Application of PostgreSQL in combination with cosine and linear relational algorithms in fields of text, images, arrays, and similarities - Analysis of RUM and Smlar in 3 application scenarios</a>

GIN VS RUM

GIN

RUM

Bloom provides a PostgreSQL index access method based on the Bloom filter structure. A bloom index is lossy, and it can collect the results set (excluding results that do not meet the conditions, and select those that meet the conditions from the rest of the results). Therefore, a secondary check is required. Bloom indexes support equality queries on arbitrary column combinations.

A Bloom index stores signatures. A larger signature requires larger space, but it also means more accurate exclusion. Of course, there are pros and cons to this method.

Bloom provides an index access method based on Bloom filters.

A Bloom filter is a space-efficient data structure that is used to test whether an element is a member of a set. In the case of an index access method, it allows fast exclusion of non-matching tuples via signatures whose size is determined at index creation.

This type of index is most useful when a table has many attributes and queries test arbitrary combinations of them.

Bloom indexes are useful for queries on arbitrary combinations of multiple columns

<a href="https://github.com/digoal/blog/blob/master/201605/20160523_01.md">Cutting-edge technology of PostgreSQL 9.6 – a single Bloom algorithm index to support queries on arbitrary combinations of columns</a>

ZomboDB is a PostgreSQL extension that allows you to create indexes that use Elasticsearch (“ES”).

<a href="https://github.com/zombodb/zombodb">https://github.com/zombodb/zombodb</a>

Combining with ES allows the creation of a search engine that uses the SQL interface, as well as transparent data searches.

A Bitmap index is the index interface of Greenplum, similar to GIN. The difference is that the KEY of a bitmap index is the value of the index column, while the VALUE of a bitmap is BIT (each BIT matches a line) rather than list of row numbers or a tree.

<a href="https://github.com/digoal/blog/blob/master/201705/20170512_01.md">Greenplum best practices - when to use Bitmap indexes</a>

When the number of unique values of a field is between 100 and 100,000 (bitmap is not recommended if the number is beyond this range), if the number of records in the table is very big and it does not change frequently (or AO table), then using Bitmap indexes is probably your best option. Bitmap indexes can allow you to quickly search for multiple values or a single value. We only need to perform bit operations on or compute the Bitmap of the row numbers to obtain the final Bitmap, and then retrieve the results from lines based on that.

Similar to B-tree indexes, Bitmap indexes support equals, greater than, smaller than, greater than or equal to, and smaller than or equal to searches.

Varbitx is an extension pack for Alibaba Cloud RDS that has function interfaces of extensive bit types. Varbitx is actually not an index interface, but it can replace and produce the same effect as Bitmap indexes when used in conjunction with PostgreSQL.

Application scenarios

<a href="https://github.com/digoal/blog/blob/master/201705/20170502_01.md">Introduction to the Alibaba Cloud RDS for PostgreSQL varbitx extension and real-time image application scenarios</a>

<a href="https://github.com/digoal/blog/blob/master/201610/20161021_01.md">Creation of a real-time image recommendation system based on Alibaba Cloud RDS PostgreSQL</a>

<a href="https://github.com/digoal/blog/blob/master/201706/20170612_01.md">PostgreSQL (varbit, roaring bitmap) VS pilosa (bitmap library)</a>

Omitted. Please refer to contents above.

PostgreSQL allows users to create partial indexes. For example, if your business only cares about active users, you can create indexes for only active users.

Another unique feature in PostgreSQL is expression indexes. Expression indexes are great for scenarios where the user data needs to be converted before the search can be run. For example, when geographic coordinates submitted by certain devices do not conform to the Chinese national standard, we need to convert them into Chinese coordinates to run the search.

We can create expression indexes specifically for these fields, and build the conversion process into the expression, which can also be used in searches.

<a href="https://www.postgresql.org/docs/9.6/static/pageinspect.html">https://www.postgresql.org/docs/9.6/static/pageinspect.html</a>

You can read the contents of the index pages in the pageinspect module.

繼續閱讀