天天看點

ClickHouse 概述What Is ClickHouse? 烏拉

What Is ClickHouse? 烏拉

ClickHouse® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP). https://clickhouse.tech/docs/en/

ClickHouse 是戰鬥民族的 Yandex号稱小谷歌 于 2016 年開源的列式存儲資料庫(DBMS),使用 C++ 語言編寫,主要用于線上分析處理查詢(OLAP),能夠使用 SQL 查詢實時生成分析資料報告,2020年底在國内遍地開花,很多大廠都在部署大量的節點,從戰略層面直接撇開Hadoop生态(戰略核潛艇級别),烏拉。   ClickHouse 采用類 LSM Tree 的結構(同學們是不是很驚喜,有沒有想起來HBASE),資料寫後定期在背景 Compaction 。通過類 LSM tree 的結構,ClickHouse 在資料導入時全部是順A ppend 寫,寫入後資料段不可更改,在背景Compaction 時也是多個段M erge sort 後順序寫回磁盤。順序寫的特性,充分利用了磁盤的吞吐能力,即便在 HDD 上也有着優異的寫入性能。

Key Properties of OLAP Scenario 

  • The vast majority of requests are for read access.
  • Data is updated in fairly large batches (> 1000 rows), not by single rows; or it is not updated at all.
  • Data is added to the DB but is not modified.
  • For reads, quite a large number of rows are extracted from the DB, but only a small subset of columns.
  • Tables are “wide,” meaning they contain a large number of columns.
  • Queries are relatively rare (usually hundreds of queries per server or less per second).
  • For simple queries, latencies around 50 ms are allowed.
  • Column values are fairly small: numbers and short strings (for example, 60 bytes per URL).
  • Requires high throughput when processing a single query (up to billions of rows per second per server).
  • Transactions are not necessary.
  • Low requirements for data consistency.
  • There is one large table per query. All tables are small, except for one.
  • A query result is significantly smaller than the source data. In other words, data is filtered or aggregated, so the result fits in a single server’s RAM.

It is easy to see that the OLAP scenario is very different from other popular scenarios (such as OLTP or Key-Value access). So it does not make sense to try to use OLTP or a Key-Value DB for processing analytical queries if you want to get decent performance. For example, if you try to use MongoDB or Redis for analytics, you will get very poor performance compared to OLAP databases.

OLAP 場景的關鍵屬性 

  • 絕大多數請求是讀取通路。
  • 資料以相當大的批次(> 1000 行)更新,而不是按單行更新;或者它根本沒有更新。
  • 資料被添加到資料庫中,但不會被修改。
  • 對于讀取,從資料庫中提取了相當多的行,但隻提取了一小部分列。
  • 表是“寬的”,這意味着它們包含大量列。
  • 查詢相對較少(通常每台伺服器有數百個查詢或每秒更少)。
  • 對于簡單查詢,允許大約 50 毫秒的延遲。
  • 列值相當小:數字和短字元串(例如,每個 URL 60 個位元組)。
  • 處理單個查詢時需要高吞吐量(每台伺服器每秒高達數十億行)。
  • 交易不是必需的。
  • 對資料一緻性要求低。
  • 每個查詢有一個大表。除了一張桌子,所有桌子都很小。
  • 查詢結果明顯小于源資料。換句話說,資料被過濾或聚合,是以結果适合單個伺服器的 RAM。

不難看出,OLAP 場景與其他流行的場景(如 OLTP 或 Key-Value 通路)有很大的不同。是以,如果您想獲得不錯的性能,嘗試使用 OLTP 或鍵值資料庫來處理分析查詢是沒有意義的。例如,如果您嘗試使用 MongoDB 或 Redis 進行分析,與 OLAP 資料庫相比,您将獲得非常差的性能。

Why Column-Oriented Databases Work Better in the OLAP Scenario 

Column-oriented databases are better suited to OLAP scenarios: they are at least 100 times faster in processing most queries. The reasons are explained in detail below, but the fact is easier to demonstrate visually:

面向列的資料庫更适合 OLAP 場景:它們處理大多數查詢的速度至少快 100 倍。下面詳細解釋原因,但事實更容易直覺地展示:

Row-oriented DBMS

ClickHouse 概述What Is ClickHouse? 烏拉

 Column-oriented DBMS

ClickHouse 概述What Is ClickHouse? 烏拉

See the difference?

Input/output 

  1. For an analytical query, only a small number of table columns need to be read. In a column-oriented database, you can read just the data you need. For example, if you need 5 columns out of 100, you can expect a 20-fold reduction in I/O.
  2. Since data is read in packets, it is easier to compress. Data in columns is also easier to compress. This further reduces the I/O volume.
  3. Due to the reduced I/O, more data fits in the system cache.

For example, the query “count the number of records for each advertising platform” requires reading one “advertising platform ID” column, which takes up 1 byte uncompressed. If most of the traffic was not from advertising platforms, you can expect at least 10-fold compression of this column. When using a quick compression algorithm, data decompression is possible at a speed of at least several gigabytes of uncompressed data per second. In other words, this query can be processed at a speed of approximately several billion rows per second on a single server. This speed is actually achieved in practice.

輸入輸出 

  1. 對于分析查詢,隻需要讀取少量的表列。在面向列的資料庫中,您可以隻讀取您需要的資料。例如,如果您需要 100 列中的 5 列,您可以預期 I/O 減少 20 倍。
  2. 由于資料是以包的形式讀取的,是以更容易壓縮。列中的資料也更容易壓縮。這進一步減少了 I/O 量。
  3. 由于減少了 I/O,系統緩存中可以容納更多資料。

例如,查詢“統計每個廣告平台的記錄數”需要讀取一個“廣告平台ID”列,未壓縮占用1個位元組。如果大部分流量不是來自廣告平台,那麼您可以預期此列至少會壓縮 10 倍。使用快速壓縮算法時,資料解壓縮速度至少可以達到每秒幾 GB 的未壓縮資料。換句話說,該查詢可以在單個伺服器上以每秒大約數十億行的速度處理。這個速度實際上是在實踐中達到的。

CPU 

Since executing a query requires processing a large number of rows, it helps to dispatch all operations for entire vectors instead of for separate rows, or to implement the query engine so that there is almost no dispatching cost. If you do not do this, with any half-decent disk subsystem, the query interpreter inevitably stalls the CPU. It makes sense to both store data in columns and process it, when possible, by columns.

There are two ways to do this:

  1. A vector engine. All operations are written for vectors, instead of for separate values. This means you do not need to call operations very often, and dispatching costs are negligible. Operation code contains an optimized internal cycle.
  2. Code generation. The code generated for the query has all the indirect calls in it.

This is not done in “normal” databases, because it does not make sense when running simple queries. However, there are exceptions. For example, MemSQL uses code generation to reduce latency when processing SQL queries. (For comparison, analytical DBMSs require optimization of throughput, not latency.)

Note that for CPU efficiency, the query language must be declarative (SQL or MDX), or at least a vector (J, K). The query should only contain implicit loops, allowing for optimization.

中央處理器 

由于執行查詢需要處理大量行,是以它有助于為整個向量而不是單獨的行排程所有操作,或者實作查詢引擎以便幾乎沒有排程成本。如果你不這樣做,對于任何半體面的磁盤子系統,查詢解釋器不可避免地會停止 CPU。将資料存儲在列中并在可能的情況下按列進行處理是有意義的。

有兩種方法可以做到這一點:

  1. 矢量引擎。所有操作都是為向量編寫的,而不是為單獨的值編寫的。這意味着您不需要經常調用操作,并且排程成本可以忽略不計。操作碼包含優化的内部循環。
  2. 代碼生成。為查詢生成的代碼中包含所有間接調用。

這不是在“普通”資料庫中完成的,因為在運作簡單查詢時它沒有意義。但是,也有例外。例如,MemSQL 使用代碼生成來減少處理 SQL 查詢時的延遲。(相比之下,分析型 DBMS 需要優化吞吐量,而不是延遲。)

請注意,為了 CPU 效率,查詢語言必須是聲明性的(SQL 或 MDX),或者至少是向量(J,K)。查詢應該隻包含隐式循環,允許優化。

相關學習資料

連結:https://pan.baidu.com/s/1DjCYs9PhqJpdKqoqXk2fjQ  密碼:lk21

繼續閱讀