天天看點

【Druid】IndexGeneratorJob源碼

【Druid】IndexGeneratorJob源碼

功能

hadoop_index任務中,IndexGeneratorJob負責啟動一個MapReduce任務将原始資料形成Segment導入Druid的Storage中。是以,IndexGeneratorJob主要邏輯集中在IndexGeneratorMapper,IndexGeneratorReducer類中。

IndexGeneratorMapper

按行從hadoopShardSpecLookup條件中判斷該行資料屬于哪個Bucket,hadoopShardSpecLookup是determine_partitions過程計算結果,儲存的是目前hadoop_index任務的資料分區規則。hadoopShardSpecLookup的産生邏輯在DeterminePartitionsJob或者DetermineHashedPartitionsJob中。

【Druid】IndexGeneratorJob源碼

Mapper輸出格式中,格式<SortableBytes(groupKey, sortedKey), SerializedRow>。

groupKey包含了shardNum,time,partitionNum。MapReduce任務再Partition過程中用到shardNum作為分區的判斷條件,DateTime time辨別目前bucket所在的資料起始時間範圍,partitionNum目前bucket下拆分的partition。

sortedKey包含truncated timestamp和hashed dimensions。reduce的spilling是什麼過程,為什麼有sortedKey就能減少reducer的spill?

SerializedRow,将一個inputRow二進制序列化

context.write(
    new SortableBytes(
        bucket.get().toGroupKey(),
        // sort rows by truncated timestamp and hashed dimensions to help reduce spilling on the reducer side
        ByteBuffer.allocate(Long.BYTES + hashedDimensions.length)
                  .putLong(truncatedTimestamp)
                  .put(hashedDimensions)
                  .array()
    ).toBytesWritable(),
    new BytesWritable(serializeResult.getSerializedRow())
);           

IndexGeneratorReducer

【Druid】IndexGeneratorJob源碼

IndexGeneratorJob最重要邏輯都集中在IndexGeneratorReducer當中

1、PersistExecutor,用來持久化index檔案

2、按行處理資料,累加總行數(maxRowCount)和記憶體資料量(maxBytesInMemory),并判斷是否達到門檻值。

如果達到門檻值,會将該Index進行持久化生成檔案。

persist(index, interval, mergedBase, progressIndicator);過程當中調用IndexMergerV9.persist()方法,形成version.bin、factory.json、0000.smoosh、meta.smoosh等檔案。

疑問:version.bin、factory.json、0000.smoosh、meta.smoosh這些檔案具體是什麼作用?

[gh@20:47:40]~$ ll /var/folders/pv/dkqbk3wj51b7yb4r5wd9qgdw0000gn/T/base-flush4512362737889488592/merged
total 32
-rw-r--r--  1 gh  staff     4  9  3 20:47 version.bin
-rw-r--r--  1 gh  staff    29  9  3 20:47 factory.json
-rw-r--r--  1 gh  staff  1306  9  3 20:47 00000.smoosh
-rw-r--r--  1 gh  staff   135  9  3 20:47 meta.smoosh           

檔案中内容:

[gh@20:51:24]/var/folders/pv/dkqbk3wj51b7yb4r5wd9qgdw0000gn/T/base-flush4512362737889488592/merged$ cat version.bin

 [gh@20:51:26]/var/folders/pv/dkqbk3wj51b7yb4r5wd9qgdw0000gn/T/base-flush4512362737889488592/merged$ cat factory.json
{"type":"mMapSegmentFactory"}

[gh@20:51:29]/var/folders/pv/dkqbk3wj51b7yb4r5wd9qgdw0000gn/T/base-flush4512362737889488592/merged$ cat 00000.smoosh
d{"valueType":"LONG","hasMultipleValues":false,"parts":[{"type":"long","byteOrder":"LITTLE_ENDIAN"}]} �,'5I,'5Ig{"valueType":"COMPLEX","hasMultipleValues":false,"parts":[{"type":"complex","typeName":"hyperUnique"}]}(�d{"valueType":"LONG","hasMultipleValues":false,"parts":[{"type":"long","byteOrder":"LITTLE_ENDIAN"}]} "d�2�{"valueType":"STRING","hasMultipleValues":false,"parts":[{"type":"stringDictionary","bitmapSerdeFactory":{"type":"roaring","compressRunOnSerialization":true},"byteOrder":"LITTLE_ENDIAN"}]}."a.example.comb.exmaple.com 8,:0:07'unique_hostsvisited_numhoshostI5',I:M�4{"type":"roaring","compressRunOnSerialization":true}{"container":{},"aggregators":[{"type":"hyperUnique","name":"unique_hosts","fieldName":"unique_hosts","isInputHyperUnique":false,"round":false},{"type":"longSum","name":"visited_num","fieldName":"visited_num","expression":null}],"timestampSpec":{"column":"timestamp","format":"yyyyMMddHH","missingValue":null},"queryGranularity":{"type":"none"},"rollup":true}

[gh@20:51:32]/var/folders/pv/dkqbk3wj51b7yb4r5wd9qgdw0000gn/T/base-flush4512362737889488592/merged$ cat meta.smoosh
v1,2147483647,1
__time,0,0,150
host,0,449,792
index.drd,0,792,947
metadata.drd,0,947,1306
unique_hosts,0,150,303
visited_num,0,303,449           
[gh@13:35:43]/private/var/folders/pv/dkqbk3wj51b7yb4r5wd9qgdw0000gn/T/junit2870916712364956185/junit7070039265224578711/website/2021-09-03T124308.104Z_4bd0315d3edf460daeb97264534a14a5/segmentDescriptorInfo$ ll
total 8
-rw-r--r--  1 gh  staff  612  9  4 13:34 website_2014-10-22T000000.000Z_2014-10-23T000000.000Z_2021-09-03T124308.104Z.json
[gh@13:35:43]/private/var/folders/pv/dkqbk3wj51b7yb4r5wd9qgdw0000gn/T/junit2870916712364956185/junit7070039265224578711/website/2021-09-03T124308.104Z_4bd0315d3edf460daeb97264534a14a5/segmentDescriptorInfo$ cat website_2014-10-22T000000.000Z_2014-10-23T000000.000Z_2021-09-03T124308.104Z.json
{
  "binaryVersion": 9,
  "dataSource": "website",
  "dimensions": "host",
  "identifier": "website_2014-10-22T00:00:00.000Z_2014-10-23T00:00:00.000Z_2021-09-03T12:43:08.104Z",
  "interval": "2014-10-22T00:00:00.000Z/2014-10-23T00:00:00.000Z",
  "loadSpec": {
    "path": "/private/var/folders/pv/dkqbk3wj51b7yb4r5wd9qgdw0000gn/T/junit2870916712364956185/junit7070039265224578711/website/2014-10-22T00:00:00.000Z_2014-10-23T00:00:00.000Z/2021-09-03T12:43:08.104Z/0/index.zip",
    "type": "local"
  },
  "metrics": "visited_num,unique_hosts",
  "shardSpec": {
    "partitionNum": 0,
    "partitions": 5,
    "type": "numbered"
  },
  "size": 1474,
  "version": "2021-09-03T12:43:08.104Z"
}
[gh@13:35:49]/private/var/folders/pv/dkqbk3wj51b7yb4r5wd9qgdw0000gn/T/junit2870916712364956185/junit7070039265224578711/website/2021-09-03T124308.104Z_4bd0315d3edf460daeb97264534a14a5/segmentDescriptorInfo$           

numReducers處理邏輯: