A Deep Dive into Hive's Locking and Transaction Mechanisms
1. Starting from How Schedulers Schedule Hive Jobs
As you may know, when building a data warehouse on top of Hive, we generally use a scheduling system to carefully orchestrate Hive SQL jobs so that multiple jobs never write to the same table or the same table partition concurrently. Likewise, ad hoc queries are usually scheduled to avoid overlapping in time with the regular ETL writes, again to avoid concurrent reads and writes against the same table or partition.
The reason schedulers apply this policy to Hive SQL jobs is that when multiple jobs read and write the same table or partition concurrently, Hive's locking mechanism causes heavy lock contention and waiting underneath, and job throughput drops sharply. Applications therefore have to rely on the scheduler to steer clear of concurrent reads and writes to the same table or partition.
2. A Production Case: Slow Hive Jobs Caused by a Misconfigured Scheduler
The author recently investigated a slow production application at a securities-industry customer, and the root cause turned out to be exactly the kind of improper Hive SQL job scheduling described above. The case is shared here.
- Symptom: While profiling the Hive jobs, we observed on the Spark Web UI of a Hive on Spark job that after the series of Spark tasks for one SQL statement finished, most executors were dynamically released back to YARN, and then, seven to eight minutes later, a new batch of executors was dynamically requested to run the next SQL statement. This is abnormal: although dynamic resource allocation for Spark on YARN was enabled, the two SQL statements were submitted sequentially in the same session, so ideally the executors should not be released and re-requested unless the gap between the two statements exceeds spark.dynamicAllocation.executorIdleTimeout (default 60s).
- Initial diagnosis: The application logs and YARN logs confirmed that the two SQL statements were indeed submitted sequentially, with a gap well under 60s, and the corresponding YARN resource queue had plenty of free resources at the time. Digging into the HiveServer2 logs showed that although HS2 received the next SQL statement from the client right after the previous one finished, and parsed and compiled it successfully, the Spark tasks were not generated and submitted until seven to eight minutes later! It was this gap that triggered the dynamic release and re-acquisition of executors under Spark on YARN dynamic resource allocation.
- Root cause: A closer look at the detailed HS2 logs, together with the business team's Hive SQL code, showed that the seven-to-eight-minute gap between parsing the SQL and running the corresponding Spark job was caused by lock contention on a Hive table partition. The business team had written multiple Hive SQL jobs that all insert into the same partition of the same table, and in the scheduler these jobs were placed in different workflows and scheduled concurrently. The concurrently scheduled jobs therefore contended underneath for the lock on that one partition and waited on each other, in this case for seven to eight minutes, which in turn triggered the dynamic release and re-acquisition of the Spark executors.
- Fix: Change the business SQL and the scheduling logic so that multiple jobs no longer write to the same partition concurrently; for example, first write the data into separate tables such as report_data1-5, and finally union the data and insert it into the report_data table.
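A sketch of the reworked SQL under that fix (table, column, and partition names are illustrative, not the customer's actual schema):

```sql
-- Staging jobs: safe to schedule concurrently, since each one
-- writes to its own staging table and takes its own locks.
INSERT OVERWRITE TABLE report_data1 PARTITION (dt = '2024-06-01')
SELECT id, metric FROM src_orders;

INSERT OVERWRITE TABLE report_data2 PARTITION (dt = '2024-06-01')
SELECT id, metric FROM src_trades;

-- Final job: scheduled to run only after all staging jobs finish,
-- so it is the single writer of the report_data partition.
INSERT OVERWRITE TABLE report_data PARTITION (dt = '2024-06-01')
SELECT id, metric FROM report_data1 WHERE dt = '2024-06-01'
UNION ALL
SELECT id, metric FROM report_data2 WHERE dt = '2024-06-01';
```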
3. Hive's Locking and Transaction Mechanisms
Hive's traditional locking mechanism can be summarized as follows:
- For a non-partitioned table:
    - When the table is being read, an 'S' (shared) lock is acquired;
    - An 'X' (exclusive) lock is acquired for all other operations (insert into the table, alter table of any kind, etc.).
- For a partitioned table, the idea is as follows:
    - An 'S' lock on the table and the relevant partition is acquired when a read is performed;
    - For all other operations, an 'X' lock is taken on the partition. However, if the change is applicable only to the newer partitions, an 'S' lock is acquired on the table, whereas if the change is applicable to all partitions, an 'X' lock is acquired on the table. Thus, older partitions can be read and written into while the newer partitions are being converted to RCFile. Whenever a partition is locked in any mode, all of its parents are locked in 'S' mode.
The following official page describes this in detail:
https://cwiki.apache.org/confluence/display/Hive/Locking
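To observe these locks in practice, Hive provides the SHOW LOCKS statement, and newer Hive versions (3.x, if memory serves) also support EXPLAIN LOCKS; table and partition names below are illustrative:

```sql
-- List current locks on a table, or on one of its partitions:
SHOW LOCKS report_data;
SHOW LOCKS report_data PARTITION (dt = '2024-06-01');
SHOW LOCKS report_data EXTENDED;  -- also shows query id, user, host, etc.

-- Newer Hive versions can show which locks a statement WOULD acquire,
-- without running it:
EXPLAIN LOCKS
INSERT OVERWRITE TABLE report_data PARTITION (dt = '2024-06-01')
SELECT * FROM report_data1;
```

In the production case above, SHOW LOCKS on the contended partition during the seven-to-eight-minute stall would have shown the blocked and blocking jobs directly.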
On top of the traditional locking mechanism, and in order to support both Hive ACID transactional tables and non-transactional tables, Hive introduced the transaction manager and the lock manager:
- A new logical entity called the "transaction manager" was added, which incorporates the previous notion of the "database/table/partition lock manager"; the transaction manager is now additionally responsible for managing transaction locks.
- The default DummyTxnManager emulates the behavior of old Hive versions: it has no transactions and uses the hive.lock.manager property to create a lock manager for tables, partitions and databases. In other words, configuring the transaction manager as DummyTxnManager keeps the traditional locking behavior, without transaction support.
- The newly added DbTxnManager manages all locks/transactions in the Hive metastore via DbLockManager (transactions and locks are durable in the face of server failure). This means the previous behavior of locking in ZooKeeper is no longer present once transactions are enabled. In other words, configuring the transaction manager as DbTxnManager provides transaction support, with DbLockManager managing transactions and locks underneath.
- To avoid clients dying and leaving transactions or locks dangling, lock holders and transaction initiators send a heartbeat to the metastore on a regular basis. If a heartbeat is not received within the configured amount of time, the lock or transaction is aborted.
- Note that the lock manager used by DbTxnManager acquires locks on all tables, even those without the "transactional=true" property. That is, with DbTxnManager, reads and writes take locks on both transactional and non-transactional tables.
- By default, an insert into a non-transactional table acquires an exclusive lock and thus blocks other inserts and reads. While technically correct, this is a departure from how Hive traditionally worked (i.e. without a lock manager). So by default, writes to a non-transactional table take an X lock.
- For backwards compatibility, hive.txn.strict.locking.mode (see below) is provided, which makes this lock manager acquire shared locks on insert operations into non-transactional tables. This restores the previous semantics while still providing the benefits of a lock manager, such as preventing a table from being dropped while it is being read. That is, with hive.txn.strict.locking.mode set to false, writes to a non-transactional table take an S lock.
- Note that for transactional tables, an insert always acquires shared locks, since these tables implement an MVCC architecture at the storage layer and are able to provide strong read consistency (snapshot isolation) even in the presence of concurrent modification operations.
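For reference, a minimal session-level setup for DbTxnManager and a transactional table might look as follows; parameter names are from the Hive documentation, and the values and table definition are illustrative:

```sql
-- Client/session side:
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Metastore side (hive-site.xml), so that compaction of ACID delta
-- files actually runs:
--   hive.compactor.initiator.on = true
--   hive.compactor.worker.threads = 1

-- A full-ACID table is stored as ORC and marked transactional:
CREATE TABLE t_acid (id INT, name STRING)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
```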
4. Configuration Parameters for Transactions and Locks
The main configuration parameters for transactions and locks are:
- hive.txn.manager:
    - Set to org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager for the traditional, pre-Hive-0.13 locking behavior; this provides no transaction support;
    - Set to org.apache.hadoop.hive.ql.lockmgr.DbTxnManager to enable support for transactional tables.
- hive.lock.manager:
    - Set to org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager for ZooKeeper-based locks, i.e. the traditional, pre-Hive-0.13 locking behavior, with no transaction support;
    - Set to org.apache.hadoop.hive.ql.lockmgr.DbLockManager for metastore-DB-based locks, which support transactional tables.
- hive.support.concurrency: true/false; must be set to true to support concurrent reads/writes and transactions.
- hive.txn.strict.locking.mode: true/false:
    - In strict mode, non-ACID resources use standard R/W lock semantics, e.g. an INSERT acquires an exclusive (X) lock;
    - In non-strict mode, an INSERT into a non-ACID resource acquires only a shared (S) lock, which allows two concurrent writes to the same partition, but still lets the lock manager prevent DROP TABLE etc. while the table is being written to (as of Hive 2.2.0).
- Note that when hive.txn.manager is set to org.apache.hadoop.hive.ql.lockmgr.DbTxnManager, the lock manager actually used underneath is always org.apache.hadoop.hive.ql.lockmgr.DbLockManager, regardless of whether hive.lock.manager is set to org.apache.hadoop.hive.ql.lockmgr.DbLockManager or org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager, as can be seen in the Hive source code.
- hive.txn.timeout
- hive.txn.heartbeat.threadpool.size
- spark.dynamicAllocation.enabled
- spark.dynamicAllocation.{initial/min/max}Executors
- spark.dynamicAllocation.executorAllocationRatio
- spark.dynamicAllocation.schedulerBacklogTimeout
- spark.dynamicAllocation.sustainedSchedulerBacklogTimeout
- spark.dynamicAllocation.executorIdleTimeout
- spark.dynamicAllocation.cachedExecutorIdleTimeout
- hive.metastore.server.min.threads
- hive.server2.thrift.min.worker.threads
- hive.server2.thrift.max.worker.threads
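For reference, the Spark dynamic allocation parameters listed above might be set as follows in spark-defaults.conf; the values are illustrative, not recommendations (the defaults are documented in the Spark configuration reference):

```properties
spark.dynamicAllocation.enabled                          true
spark.dynamicAllocation.initialExecutors                 2
spark.dynamicAllocation.minExecutors                     1
spark.dynamicAllocation.maxExecutors                     50
# How long an executor may sit idle before being released (default 60s);
# raising this would also have masked the lock-wait symptom in the case above:
spark.dynamicAllocation.executorIdleTimeout              120s
spark.dynamicAllocation.cachedExecutorIdleTimeout        600s
spark.dynamicAllocation.schedulerBacklogTimeout          1s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout 1s
```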
- HIVE-15774: Ensure DbLockManager backward compatibility for non-ACID resources
- https://cwiki.apache.org/confluence/display/Hive/Locking
- https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-LockManager