The trouble with timestamps when Hive and Impala operate on Parquet files

2023-05-03 08:28:57

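Foreword: I planned to use Hive as the data warehouse. For legacy reasons, all existing data processing had been done in Impala, with the data stored as Parquet files. Because the cluster has few resources and the files to process are large, the plan was to run offline analysis in Hive and push the resulting small files to a database or to Impala for presentation.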

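Preparation: set up CDH 5.9 and migrate the existing data from the old cluster to the new one. The data is dynamically partitioned by day, and the partitioned data still uses the Parquet format.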

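Problem: the partition field is of type timestamp, and by chance I noticed something strange: the times returned by Hive were 8 hours later than the times returned by Impala. Comparing both against the raw data showed that the timestamps as processed by Hive were the wrong ones.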

Based on this discussion it seems that when support for saving timestamps in Parquet was added to Hive, the primary goal was to be compatible with Impala's implementation, which probably predates the addition of the timestamp_millis type to the Parquet specification.

Impala's timestamp representation maps to the int96 Parquet type (4 bytes for the date, 8 bytes for the time, details in the linked discussion).

So no, storing a Hive timestamp in Parquet does not use the timestamp_millis type, but Impala's int96 timestamp representation instead.
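To make the int96 representation concrete, here is a minimal Python sketch of how such a value can be packed and unpacked. It assumes the commonly described layout, which the quote above does not spell out: 12 little-endian bytes, with 8 bytes of nanoseconds within the day followed by 4 bytes holding the Julian day number. The function names encode_int96 and decode_int96 are made up for illustration and are not part of Hive, Impala, or any Parquet library.

```python
import struct
from datetime import datetime, timedelta

# Julian day number of 1970-01-01, used to convert to and from Unix-epoch days.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
EPOCH = datetime(1970, 1, 1)


def encode_int96(ts: datetime) -> bytes:
    """Pack a naive datetime into 12 bytes (assumed layout: nanos-of-day, then Julian day)."""
    delta = ts - EPOCH
    nanos_of_day = (delta.seconds * 1_000_000 + delta.microseconds) * 1000
    julian_day = delta.days + JULIAN_DAY_OF_UNIX_EPOCH
    # "<" = little-endian without padding, "q" = 8-byte int, "i" = 4-byte int.
    return struct.pack("<qi", nanos_of_day, julian_day)


def decode_int96(raw: bytes) -> datetime:
    """Unpack 12 bytes back into a naive datetime (microsecond precision)."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return EPOCH + timedelta(days=days, microseconds=nanos_of_day // 1000)


if __name__ == "__main__":
    raw = encode_int96(datetime(2023, 5, 3, 8, 28, 57))
    print(len(raw), decode_int96(raw))
```

Note that the encoded value carries no time zone at all. Whether the 12 bytes mean local time or UTC is purely a convention of whichever engine writes or reads them, which is where two engines can disagree.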

The above is the cause of the problem as far as I could find out. The passage is not hard to follow, so I have quoted it as is.

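Now for my fix. Since I intended to use Hive rather than Impala in the long run, I converted the timestamp column with to_utc_timestamp(insert_time, 'GMT+8') (look the function up if it is unfamiliar), then repartitioned the data and stored it as ORC files. In brief, ORC is a columnar storage format whose data files take up little space.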
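The effect of that conversion can be sketched in Python. This is a simplified model, not Hive's actual code. It assumes that Hive on this cluster treats the stored Parquet value as UTC and shifts it into the cluster's local zone (UTC+8) when reading, while Impala returns the stored value unchanged. That assumption is consistent with the 8-hour offset described above but is not stated in the quoted discussion. The function hive_reads_impala_value is hypothetical, and to_utc_timestamp here only imitates the Hive function of the same name.

```python
from datetime import datetime, timedelta, timezone

# Assumed cluster time zone: a fixed UTC+8 offset, matching 'GMT+8' in the Hive call.
CLUSTER_TZ = timezone(timedelta(hours=8))


def hive_reads_impala_value(stored: datetime) -> datetime:
    """Model of the assumed read path: treat the stored value as UTC, render it in local time."""
    as_utc = stored.replace(tzinfo=timezone.utc)
    return as_utc.astimezone(CLUSTER_TZ).replace(tzinfo=None)


def to_utc_timestamp(ts: datetime, tz: timezone) -> datetime:
    """Imitation of Hive's to_utc_timestamp: interpret ts as wall-clock time in tz, return UTC."""
    return ts.replace(tzinfo=tz).astimezone(timezone.utc).replace(tzinfo=None)


if __name__ == "__main__":
    written_by_impala = datetime(2023, 5, 3, 8, 28, 57)
    seen_by_hive = hive_reads_impala_value(written_by_impala)
    repaired = to_utc_timestamp(seen_by_hive, CLUSTER_TZ)
    print(seen_by_hive - written_by_impala)  # the offset Hive shows
    print(repaired == written_by_impala)     # the conversion undoes it
```

Under these assumptions the conversion simply subtracts the 8 hours that the read path added, so the values rewritten into the new table match what Impala and the raw data show.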

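Unfortunately, Impala (as shipped with this CDH release) does not support ORC data files, so I had to settle for a compromise: process the large data files offline in Hive, then push the results to Impala or a database, stored in a format Impala supports.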

This post is dedicated to the brain cells that were lost solving this problem!
