
Hive Programming Guide Reading Notes

Author: JD Cloud developer

Preamble:

I recently started writing Hive SQL and found that many of my queries did not run as fast as expected. To improve query efficiency, I read the Hive Programming Guide, hoping that understanding Hive's underlying mechanisms would point me toward optimizations and lay a foundation for writing efficient SQL. These are my notes on the book.

1. Why did Hive come about?

Hadoop came first; Hive came after.

Hadoop implements the MapReduce computational model, which divides a computation into multiple processing units and distributes them across a cluster of commodity or server-grade machines, reducing computing costs and providing horizontal scalability. However, this programming model is complex: for most data analysts, and even for Java developers, writing MapReduce programs takes considerable time and effort. Hive addresses this by providing a SQL-like query language (HiveQL), so that anyone who knows SQL can use Hadoop for big data analysis; under the hood, Hive automatically converts these queries into MapReduce jobs.
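To make this concrete, here is a sketch of the kind of aggregation that would require a full Mapper, Reducer, and driver class in Java, expressed as a single HiveQL statement (the table and column names are hypothetical):

```sql
-- Counting page views per URL: one statement in HiveQL,
-- which Hive compiles into a MapReduce job behind the scenes.
SELECT url, COUNT(*) AS views
FROM page_visits          -- hypothetical table
GROUP BY url;
```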

2. Hive's component modules

(Figure: Hive architecture diagram)

All commands and queries enter the Driver, which parses and compiles the input, optimizes the required computation, and then executes it according to the prescribed steps.

Hive initiates MapReduce jobs by communicating with the JobTracker; the data files to be processed are stored in HDFS, which is managed by the NameNode.

The Metastore is a separate relational database in which Hive stores table schemas and other system metadata.

3. HQL execution process

(Figure: HQL execution flow)

In simple terms, Hive reads raw data from the Hadoop Distributed File System (HDFS) and then processes it either on a single node (local mode) or across a Hadoop cluster (cluster mode), according to the query definition. When processing completes, Hive writes the result to HDFS or another specified storage location.

So, where is the execution time of Hive mainly spent, and what are the parts that we can optimize?

Hive's execution time is mainly spent in the following stages:

1. Query compilation: Hive compiles the HiveQL query into a logical execution plan that describes how to execute the query. This phase includes syntax parsing, semantic parsing, generating a logical plan, logical plan optimization, and generating a physical plan (typically a MapReduce job).

2. Task scheduling: The compiled and generated MapReduce job is submitted to the resource manager of the Hadoop cluster (such as YARN) and waits for resource scheduling and job execution.

3. Data Read and Write: The process of reading data stored on HDFS and writing the final result to HDFS involves a lot of disk I/O operations, especially when dealing with large data sets.

4. MapReduce job execution, which includes:

1. Map stage: performs operations such as filtering and projection;

2. Shuffle stage: intermediate data output by the Map tasks is transferred over the network, then sorted and merged on the Reduce nodes;

3. Reduce stage: performs operations such as aggregation and sorting.

5. Network Transfer: During the Shuffle phase of MapReduce, intermediate data needs to be transferred between cluster nodes, which can cause significant network latency.

In general, MapReduce job execution (especially the Shuffle and Reduce phases) and data read/write operations are the most time-consuming parts of a Hive query, and they are also the main targets for optimization.
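To see how a specific query breaks down into stages, you can inspect the plan Hive compiles with EXPLAIN (the table name below is hypothetical):

```sql
-- Print the plan Hive compiles for this query (parse tree, logical plan,
-- and the MapReduce stages) without actually executing it.
EXPLAIN
SELECT dept, COUNT(*) AS cnt
FROM employees
GROUP BY dept;
```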

4. Common optimization methods for Hive

Local mode

-- Enable local mode (default: false)
SET hive.exec.mode.local.auto=true;

Principle: Hive's input data is sometimes very small. In such cases, the overhead of launching a distributed job can far exceed the actual execution time of the job itself. For most of these cases, Hive can process the entire task on a single machine in local mode, which noticeably shortens execution time for small data sets. Setting hive.exec.mode.local.auto to true lets Hive apply this optimization automatically when appropriate. This works well in practice, but running too many SQL statements in parallel can easily exhaust local memory.
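Local mode is only chosen when the job is "small enough"; as a sketch, the companion thresholds below define what that means (the values shown are the customary defaults):

```sql
-- Let Hive choose local mode automatically for small jobs (default: false)
SET hive.exec.mode.local.auto=true;
-- Local mode is only selected when the job stays under these limits
SET hive.exec.mode.local.auto.inputbytes.max=134217728;  -- total input <= 128 MB
SET hive.exec.mode.local.auto.input.files.max=4;         -- at most 4 input files
```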

           

Map-side JOIN optimization

-- Before Hive v0.7, a map join had to be triggered with the hint /*+ MAPJOIN(x) */
-- From Hive v0.7 on, it is enabled by setting hive.auto.convert.join to true
SET hive.auto.convert.join=true;
-- Size threshold (in bytes) below which a table is considered "small"
SET hive.mapjoin.smalltable.filesize=25000000;

Principle: if one of the tables in a join is small enough to be loaded entirely into memory, Hive can perform a map-side JOIN: the small table is placed fully in memory, and Hive matches rows against it directly in the mappers. This eliminates the reduce phase the join would otherwise need, and can sometimes reduce the number of map tasks as well.
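A minimal sketch of a map-side JOIN, assuming a large orders table and a small dim_city dimension table (both names hypothetical):

```sql
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;  -- tables under ~25 MB count as "small"

-- dim_city fits in memory, so Hive broadcasts it to every mapper
-- and the join completes without a reduce phase.
SELECT o.order_id, o.amount, c.city_name
FROM orders o
JOIN dim_city c ON o.city_id = c.city_id;

-- Pre-0.7 equivalent using the explicit hint:
-- SELECT /*+ MAPJOIN(c) */ o.order_id, c.city_name
-- FROM orders o JOIN dim_city c ON o.city_id = c.city_id;
```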

           

Concurrent execution

-- Enable parallel execution by setting hive.exec.parallel to true (default: false)
SET hive.exec.parallel=true;

Principle: Hive converts a query into one or more stages, such as MapReduce stages, sampling stages, merge stages, or limit stages. By default, Hive executes only one stage at a time. But some stages do not depend on each other and can run in parallel, shortening the overall execution time of the job. Setting hive.exec.parallel to true enables this parallel execution.
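For example (table names hypothetical), the two branches of this UNION ALL are independent stages, so with parallel execution enabled they can run at the same time; hive.exec.parallel.thread.number caps how many stages run concurrently:

```sql
SET hive.exec.parallel=true;
SET hive.exec.parallel.thread.number=8;  -- upper bound on concurrent stages (default: 8)

-- The two COUNT(*) branches do not depend on each other,
-- so Hive can execute them as parallel stages.
SELECT 'web' AS channel, COUNT(*) AS orders FROM web_orders
UNION ALL
SELECT 'app' AS channel, COUNT(*) AS orders FROM app_orders;
```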

           

Dynamic partition adjustments

-- Enable dynamic partitioning (default: false)
SET hive.exec.dynamic.partition=true;
-- Use non-strict mode for dynamic partitioning. In strict mode, at least one partition key must be static.
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Maximum number of partitions a single dynamic-partition insert may create
SET hive.exec.max.dynamic.partitions=1000;
-- Maximum number of partitions each node may create
SET hive.exec.max.dynamic.partitions.pernode=100;

When a query's conditions include the partition key, Hive can scan only the relevant partitions, reducing the amount of data read and improving query efficiency. During a dynamic-partition insert, the partitions can also be written in parallel, improving write parallelism and performance. The parameters above help you make better use of dynamic partitioning.
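A minimal dynamic-partition insert, assuming hypothetical sales_staging and sales_partitioned tables where dt is the partition column:

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Partitions of sales_partitioned are created on the fly from the dt values
-- in the source data; the partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE sales_partitioned PARTITION (dt)
SELECT order_id, amount, dt
FROM sales_staging;
```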

           

Merge small files

-- Whether to merge the output files of map-only jobs (default: true)
SET hive.merge.mapfiles=true;
-- Whether to merge the output files of MapReduce jobs (default: false)
SET hive.merge.mapredfiles=true;
-- Target size for merged files, in bytes
SET hive.merge.size.per.task=256000000;
-- Average-size threshold below which output files are considered small enough to merge
SET hive.merge.smallfiles.avgsize=128000000;

Small batch writes, MapReduce job splitting, data skew, and other factors can leave Hive with large numbers of small files. The parameters above merge small files, reducing the overhead of reading many files, easing pressure on the NameNode, and improving query efficiency.
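One common way to apply these settings is to compact an existing partition by rewriting it in place; a sketch with hypothetical table and column names:

```sql
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=256000000;

-- Rewriting the partition lets the merge settings consolidate its small files.
-- Note: the partition column (dt) is supplied statically, so it is not selected.
INSERT OVERWRITE TABLE logs PARTITION (dt='2023-01-01')
SELECT log_id, user_id, message
FROM logs
WHERE dt='2023-01-01';
```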

           

Data skew optimization

Data skew means that, during distributed processing, data is unevenly distributed across the processing nodes, leaving some nodes overloaded while others sit nearly idle, which hurts overall computing efficiency. Data skew can occur for the following reasons:

1. Uneven distribution of key values: some keys correspond to far more data than others;

2. Hot key values: a large amount of data shares the same key (such as NULL or a specific default value);

3. Unreasonable JOIN operations: when joining against a large table, if a key from the small table is unevenly distributed in the large table, the JOIN result will be skewed;

4. Unreasonable partitioning strategy: the actual distribution of the data is not considered when partitioning, resulting in uneven partitions.

The main solutions are:

1. Custom partitioning strategy: implement a custom partitioner that distributes data more evenly based on its characteristics;

2. Key salting: add a random prefix or suffix to skewed keys to spread them across multiple partitions;

3. Handle heavy keys separately: identify the skewed key values (such as NULL or empty values) and process them separately, or filter them out if they are unimportant.
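The key-salting idea (solution 2) can be sketched as a two-stage aggregation; table and column names are hypothetical. (Hive also has a built-in two-stage variant for skewed GROUP BY: set hive.groupby.skewindata=true.)

```sql
-- Stage 1: add a random salt 0-9 so the rows of a hot user_id spread
-- across up to ten groups (and hence ten reducers).
-- Stage 2: drop the salt and combine the partial counts.
SELECT user_id, SUM(cnt) AS clicks
FROM (
  SELECT user_id, salt, COUNT(*) AS cnt
  FROM (
    SELECT user_id, CAST(FLOOR(RAND() * 10) AS INT) AS salt
    FROM clicks
  ) salted
  GROUP BY user_id, salt
) partial_counts
GROUP BY user_id;
```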

Finally, the optimization methods commonly used in relational databases also apply to Hive. For example, Hive also has the concept of indexes, which reduce the amount of input data fed to MapReduce. As with relational databases, whether to use an index needs careful evaluation: maintaining an index requires extra storage space, and building one consumes computing resources. The EXPLAIN EXTENDED statement generates more detailed plan output, which you can explore on your own if you are interested.
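For reference, the index DDL from that era of Hive looked like the sketch below (table and column names hypothetical). Note that this feature existed only up to Hive 2.x; it was removed in Hive 3.0, where columnar formats and materialized views are recommended instead.

```sql
-- Create a compact index on the dept column, deferred until explicitly built.
CREATE INDEX idx_employees_dept
ON TABLE employees (dept)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

-- Populate (build) the index; until this runs, the index is empty.
ALTER INDEX idx_employees_dept ON employees REBUILD;
```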

Overall, this book was very rewarding for me as a beginner writing HQL: it gave me a basic understanding of Hive and a deeper understanding of the SQL I write. The Hive version covered in the book is fairly old and may differ from what we use today, but it is more than enough to get started. Beyond the book's content, this article includes some of my own interpretations; if there are any mistakes, please point them out.

Author: Ma Zhuang

Source: JD Cloud Developer Community. Please indicate the source when reprinting.
