
Big Data Analysis: A Survey of Common Distributed OLAP Query Engines

Big Data Query Analysis: Distributed OLAP Query Engine Design

Bigdata OLAP Product = Cube + OLAP (QueryParser + QueryOptimizer + QueryEngine) + FileSystem (HDFS/GFS/S3)

OLAP      Type     Link                                                        Description
Mondrian  ROLAP    http://mondrian.pentaho.com/documentation/architecture.php  Multidimensional data modeling
Impala    BQ-OLAP  https://github.com/cloudera/Impala/wiki                     Interactive SQL-on-Hadoop (poor fault tolerance)
PrestoDB           https://prestodb.io/                                        Interactive SQL-on-Hadoop
Dremel                                                                         Interactive BigQuery
Kylin     MOLAP    http://kylin.apache.org/                                    Pre-aggregation & cache
Druid     RT-OLAP  http://druid.io/                                            Incremental computation & search
Pinot              https://github.com/linkedin/pinot                           Incremental computation
  • OLAP types: ROLAP / BigQuery-OLAP / RT-OLAP / MOLAP; among these, Kylin is a special MOLAP designed for big-data scenarios
  • The OLAP field today faces large-data-volume analysis demands; lightweight solutions with neither query-engine nor storage optimizations (e.g. Mondrian) have limited applicability
  • Attention is shifting toward the query itself rather than Cube/Dimension/Measure and the other conventions of traditional data warehousing; that shift is what fits the practical needs of today's big-data field
  • Tableau, though an excellent visual analysis tool, is weak at analyzing very large data volumes

Key Characteristics of the OLAP Scenario

  • The vast majority of requests are reads
  • Data is written in fairly large batches (> 1000 rows) rather than single rows
  • Data is not modified once added
  • Each query reads a large number of rows, but only a small subset of columns
  • Tables are "wide," i.e. each table contains a large number of columns
  • Queries are relatively rare (typically hundreds per server per second, or fewer)
  • For simple queries, latencies of around 50 ms are tolerated
  • Column values are fairly small: numbers and short strings (e.g. 60 bytes per URL)
  • High throughput is required when processing a single query (up to billions of rows per second per server)
  • Transactions are not required
  • Low requirements for data consistency
  • Every query involves one large table; all other tables are small
  • Query results are significantly smaller than the source data; in other words, data is filtered or aggregated so the result fits in a single server's RAM

1. ROLAP Engine: Mondrian

Characteristics: multidimensional data modeling + no built-in query engine

[Mondrian ROLAP analysis]

Comments: agile BI products mostly build their OLAP query engines on top of the Mondrian architecture

2. SQL-on-Hadoop Query Engines: Impala / Presto / Dremel / Redshift

Characteristics: distributed SQL query engine for interactive queries

  • [SQL-on-Hadoop technical analysis]
  • [Impala]
  • [PrestoDB]
  • BigQuery&Dremel
  • Amazon Redshift

3. Kylin (MOLAP)

Characteristics: cube pre-computation + multidimensional queries (its QueryEngine is presumably weaker than Impala/Presto)

3.1. Kylin Architecture

[Kylin architecture diagram]
  • DataSource: Hive
  • MapReduce aggregation
  • Spark in-memory computation
  • AggregateTable: HBase
  • Pre-aggregation
  • Incremental CubeSegments / CubeSegment merges
  • Trie-based dimension value encoding

3.2. Technical Concepts
  • Star Schema/Snowflake Schema
  • Cube
  • DIMENSION & MEASURE
  • CUBE ACTIONS

3.3. Table Descriptions

  • FactTable
  • LookupTable (fact description table)
  • DimensionTable

3.4. Measure Calculations

  • Sum
  • Count
  • Max
  • Min
  • Average
  • Distinct Count (approximate, based on HyperLogLog)
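The approximate distinct count can be illustrated with a minimal HyperLogLog sketch. This is a from-scratch illustration of the algorithm's idea, not Kylin's implementation, and the precision parameter p is an assumption:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch, illustrating the idea behind the
    approximate Distinct Count measure (not Kylin's implementation)."""

    def __init__(self, p=14):
        self.p = p                       # precision: 2**p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # Derive a 64-bit hash of the value.
        h = int(hashlib.md5(str(value).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h >> (64 - self.p)         # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        if raw <= 2.5 * self.m:          # small-range correction
            zeros = self.registers.count(0)
            if zeros:
                raw = self.m * math.log(self.m / zeros)
        return int(raw)

hll = HyperLogLog()
for i in range(100_000):
    hll.add(i)
print(hll.count())                       # within a few percent of 100000
```

With p = 14 the sketch uses 16 384 one-byte registers, so a distinct count over millions of rows fits in a few kilobytes at roughly 1% standard error, which is why pre-aggregated cubes can afford to store it per cell.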

4. RT (Realtime, Search-Style) OLAP Query Engines: Druid / Pinot

Druid is a space-for-time design based on the MOLAP model. Its strength is an overall improvement in query performance; its weakness is limited flexibility in multidimensional analysis.

4.1. Druid Characteristics

Characteristics: search-engine-like + incremental computation + real-time data ingestion

  • Supports both offline data analysis and online real-time analysis in a single system
  • Pluggable query system supporting multiple storage backends
  • Sub-second interactive queries with fairly high concurrency, serving OLAP queries for user-facing platforms (a notable advantage over other OLAP engines)
  • Real-time ingestion: data is queryable as soon as it is ingested; high-concurrency ingestion is supported
  • Distributed shared-nothing architecture that scales to the PB level
  • Supports aggregate functions such as count and sum, plus custom UDFs implemented in JavaScript
  • Supports complex aggregators, including approximate ones such as HyperLogLog and Yahoo's open-source DataSketches
  • Supports GroupBy, Select, and Search queries (GroupBy performance is poor; timeseries/TopN are recommended)
  • Does not support joins between large tables, though its lookup feature covers joins with dimension tables
  • Columnar storage, inverted indexes, roll-up (summarization), Roaring or Concise bitmap indexes, plus LZ4 data compression
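The roll-up point above is the heart of the space-for-time trade-off: raw events are pre-aggregated at ingest down to a configured time granularity. A toy sketch of the idea (illustrative event data, not Druid code):

```python
from collections import defaultdict

# (timestamp, country, count) raw events -- illustrative data only.
raw_events = [
    ("2023-01-01T10:15", "us", 1),
    ("2023-01-01T10:40", "us", 1),
    ("2023-01-01T10:05", "jp", 1),
]

rolled = defaultdict(int)
for ts, country, count in raw_events:
    bucket = ts[:13]                 # truncate to queryGranularity = hour
    rolled[(bucket, country)] += count

print(dict(rolled))
# {('2023-01-01T10', 'us'): 2, ('2023-01-01T10', 'jp'): 1}
```

Only the rolled-up rows are stored, so anything finer than the chosen granularity is no longer queryable; that is the "limited flexibility" cost of the MOLAP trade-off.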

4.2. Druid Architecture Analysis

[Druid architecture diagram]

Druid ingestion specs define this granularity as the queryGranularity of the data. The lowest supported queryGranularity is millisecond.

Druid shards are called segments and Druid always first shards data by time.

Segments contain data stored in compressed column orientations, along with the indexes for those columns. Druid queries only understand how to scan segments.

Following search infrastructure, Druid creates immutable snapshots of data, stored in data structures highly optimized for analytic queries.

Druid is a column store, which means each individual column is stored separately. Druid indexes data on a per-shard (segment) level.

Druid has two means of ingestion, real-time and batch. Real-time ingestion in Druid is best effort.

One common approach to operating Druid is to have a real-time pipeline for recent insights, and a batch pipeline for the accurate copy of the data.

Druid's native query language is JSON over HTTP. Apache Calcite, a SQL parser, planner, and query engine, offers a Druid adapter that can query data residing in Druid.
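A native query is simply a JSON document POSTed to the Broker. A minimal timeseries query might look like the following sketch; the endpoint, datasource, and field names are hypothetical:

```python
import json
import urllib.request

# Hypothetical Broker endpoint and datasource -- adjust for your cluster.
BROKER_URL = "http://localhost:8082/druid/v2/"

# A minimal native timeseries query expressed as JSON.
query = {
    "queryType": "timeseries",
    "dataSource": "pageviews",
    "granularity": "hour",
    "intervals": ["2023-01-01/2023-01-02"],
    "aggregations": [
        {"type": "longSum", "name": "views", "fieldName": "views"},
    ],
}

request = urllib.request.Request(
    BROKER_URL,
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(request) would return one JSON result row per hour bucket.
```

Through Calcite's Druid adapter, roughly the same question could instead be phrased as a SQL GROUP BY over the time column.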

Druid is designed to perform single table operations and does not currently support joins. Many production setups do joins at ETL because data must be denormalized before loading into Druid.

Druid is designed to have no single point of failure. Different node types are able to fail without impacting the services of the other node types.

A. Storage

  • Segment: Druid inverted index + time-based sharding

    Segments contain the various dimensions and metrics in a data set, stored in a column orientation, as well as the indexes for those columns.

    Segments are stored in a "deep storage" LOB store/file system.

    Druid stores its index in segment files, which are partitioned by time.

    Segment core data structure: three basic column types: the timestamp column, dimension columns, and metric columns.

Sharding Data to Create Segments

  • Dimensions: Bitmap Index

    Dimensions columns are different because they support filter and group-by operations, so each dimension requires the following three data structures:

    1.A dictionary that maps values (which are always treated as strings) to integer IDs,

    2.A list of the column’s values, encoded using the dictionary in 1

    3.For each distinct value in the column, a bitmap that indicates which rows contain that value.

The bitmaps in 3 (also known as inverted indexes) allow for quick filtering operations; specifically, bitmaps are convenient for quickly applying AND and OR operators.

The list of values in 2 is needed for group by and TopN queries.

  • Multi-value columns
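The three per-dimension structures above can be sketched as follows, using plain Python integers as bitmaps (a toy illustration; Druid actually uses compressed Concise/Roaring bitmaps):

```python
rows = ["us", "jp", "us", "us", "jp"]    # one dimension column, five rows

# 1. Dictionary mapping each value (treated as a string) to an integer ID.
dictionary = {v: i for i, v in enumerate(sorted(set(rows)))}

# 2. The column's values, encoded with the dictionary.
encoded = [dictionary[v] for v in rows]

# 3. One bitmap per distinct value: bit i is set iff row i contains the value.
bitmaps = {v: 0 for v in dictionary}
for row, v in enumerate(rows):
    bitmaps[v] |= 1 << row

# Filters reduce to bitwise operations on the bitmaps (the inverted index):
us_rows = bitmaps["us"]                   # 0b1101 -> rows 0, 2, 3
us_or_jp = bitmaps["us"] | bitmaps["jp"]  # OR filter: 0b11111 -> all rows
```

The encoded value list (structure 2) is what group-by and TopN scans read, while filters touch only the bitmaps.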

B. Core Modules

  • Broker module:

The Broker is the node to route queries to when running a distributed cluster. It also merges the result sets from all of the individual nodes together.

Broker nodes employ a cache with an LRU cache invalidation strategy.

Analogous to the meta search engine in distributed search, the Broker does not execute queries against any segment itself; it is only a proxy. It obtains the timeline from ZooKeeper, which records the mapping intervals -> List(Server); upon receiving a client request, it looks up in the timeline, by time range, which servers hold the relevant segments.

  • Coordinator module:

The Coordinator is responsible for loading new segments, dropping outdated segments, managing segment replication, and balancing segment load.

It coordinates the balanced distribution and loading of segments: the Coordinator fetches from the metadata store (MySQL/PostgreSQL) the segments that have not yet been loaded, and assigns them evenly to the LoadQueues of Historical nodes according to each node's current load capacity.

  • Historical module:

Downloads segments from deep storage, loads them via mmap (memory mapping), and serves queries from the Broker against these segments.

Historical nodes do not communicate directly with each other or with the coordinator nodes but instead rely on Zookeeper for coordination.

  • Indexing Service module:

The indexing service is a highly-available, distributed service that runs indexing-related tasks. Indexing service tasks create (and sometimes destroy) Druid segments.

The indexing service is composed of three main components: a peon component that can run a single task, a Middle Manager component that manages peons, and an overlord component that manages task distribution to middle managers.

Druid's index layout consists of a dictionary, a forward index (column store), and an inverted index whose posting lists use LZ4-compressed bitmap indexes; both Concise and Roaring bitmaps are supported.

  • Realtime process module:

Realtime nodes will periodically build segments representing the data they’ve collected over some span of time and transfer these segments off to Historical nodes.

Realtime nodes provide real-time data indexing, generating realtime indexes (segments) and periodically pushing them to Historical nodes. The realtime path adopts an LSM-tree model.

4.3. Druid Adapter & SQL Parser

Full SQL is currently not supported with Druid (the current SQL support is not as flexible as Druid's native query language).

Calcite's Druid adapter allows you to query the data using SQL, combining it with data in other Calcite schemas.

4.4. Multitenancy Considerations

Multitenant workloads can either use a separate datasource for each tenant, or can share one or more datasources between tenants using a "tenant_id" dimension. When deciding which path to go down, consider that each path has pros and cons.

Shared datasources or datasource-per-tenant

Druid's official compromise is to use more than one datasource, but fewer datasources than tenants.

Partitioning shared datasources

Customizing data distribution

Supporting high query concurrency

Druid's fundamental unit of computation is a segment. Nodes scan segments in parallel and a given node can scan druid.processing.numThreads concurrently.

Druid internally stores requests to scan segments in a priority queue.
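The scan-per-segment model above can be sketched as a bounded worker pool whose per-segment results are merged afterwards. NUM_THREADS plays the role of druid.processing.numThreads, and the segment data is illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy segments: each holds a few metric values.
segments = [{"id": i, "rows": [i, i + 1, i + 2]} for i in range(8)]

def scan(segment):
    # Stand-in for running one aggregation over a single segment.
    return sum(segment["rows"])

NUM_THREADS = 4                     # analogue of druid.processing.numThreads
with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
    partials = list(pool.map(scan, segments))

total = sum(partials)               # broker-style merge of per-segment results
```

Because each unit of work is one segment, concurrency is bounded by segment count and thread count rather than query count, which is why the priority queue of pending segment scans governs fairness under load.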

4.5. Query Caching

Druid supports query result caching through an LRU cache. Results are stored on a per segment basis, along with the parameters of a given query.
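The per-segment keying with LRU eviction can be sketched as follows (a toy illustration of the strategy, not Druid's implementation; all names are made up):

```python
from collections import OrderedDict

class SegmentQueryCache:
    """Toy result cache keyed by (segment, query parameters), with LRU eviction."""

    def __init__(self, max_entries=1024):
        self.max_entries = max_entries
        self._cache = OrderedDict()

    def get(self, segment_id, query_params):
        key = (segment_id, query_params)
        if key in self._cache:
            self._cache.move_to_end(key)       # mark as recently used
            return self._cache[key]
        return None

    def put(self, segment_id, query_params, result):
        key = (segment_id, query_params)
        self._cache[key] = result
        self._cache.move_to_end(key)
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)    # evict least recently used

cache = SegmentQueryCache(max_entries=2)
cache.put("seg_2023-01-01", "q1", [("views", 42)])
cache.put("seg_2023-01-01", "q2", [("views", 7)])
cache.get("seg_2023-01-01", "q1")              # touch q1 so q2 becomes LRU
cache.put("seg_2023-01-02", "q1", [("views", 3)])  # evicts the q2 entry
```

Caching per segment, rather than per whole query, lets a broker reuse results for the immutable historical segments of a query while recomputing only the segments that changed.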

4.6. Sorting Orders

These sorting orders are used by the TopNMetricSpec, SearchQuery, GroupByQuery's LimitSpec, and BoundFilter.

  • Lexicographic
  • Alphanumeric
  • Numeric
  • Strlen
  • Sorting a given dimension by a metric value is not supported
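The difference between these orders can be illustrated directly (hand-rolled comparison keys for illustration, not Druid's implementation):

```python
import re

values = ["file10", "file2", "file1"]

# Lexicographic: plain string comparison, so "file10" sorts before "file2".
lexicographic = sorted(values)                   # ['file1', 'file10', 'file2']

# Alphanumeric: split into text/number chunks so embedded numbers compare as ints.
def alphanum_key(s):
    return [int(t) if t.isdigit() else t for t in re.split(r"(\d+)", s)]

alphanumeric = sorted(values, key=alphanum_key)  # ['file1', 'file2', 'file10']

# Numeric: values interpreted as numbers.
numeric = sorted(["10", "2", "1"], key=float)    # ['1', '2', '10']

# Strlen: ordered by string length (Python's sort is stable for equal lengths).
strlen = sorted(values, key=len)                 # ['file2', 'file1', 'file10']
```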