Table of Contents
- Importing data into Kudu
- TPC-DS testing
  - Preparing the SQL
  - Batch execution script
  - Generating test results
Importing data into Kudu

Assume the TPC-DS dataset has already been generated in Hive using hive-testbench; see the previous article on running TPC-DS tests with hive-testbench.
![](https://img.laitimes.com/img/__Qf2AjLwojIjJCLyojI0JCLicGcq5ybzBTa0ATb5AzMqNTY6FGZ5F3cycWM5dWO3MmT0ZDMw8CXldmchx2Lc5WaucWbpFmbpNnLzM3dvw1LcpDc0RHaiojIsJye.jpg)
Next we still need to create the Kudu tables and populate them with data. First, prepare the SQL:
```sql
drop database if exists ${VAR:DB} cascade;
create database ${VAR:DB};
use ${VAR:DB};

drop table if exists call_center;
create table ${VAR:DB}.call_center
PRIMARY KEY (cc_call_center_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.call_center;

drop table if exists catalog_page;
create table ${VAR:DB}.catalog_page
PRIMARY KEY (cp_catalog_page_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.catalog_page;

drop table if exists catalog_returns;
create table ${VAR:DB}.catalog_returns
PRIMARY KEY (cr_returned_date_sk,cr_returned_time_sk,cr_item_sk,cr_refunded_customer_sk)
PARTITION BY HASH(cr_returned_date_sk,cr_returned_time_sk,cr_item_sk,cr_refunded_customer_sk) PARTITIONS 3
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.catalog_returns;

drop table if exists catalog_sales;
create table ${VAR:DB}.catalog_sales
PRIMARY KEY (cs_sold_date_sk,cs_sold_time_sk,cs_ship_date_sk,cs_bill_customer_sk)
PARTITION BY HASH(cs_sold_date_sk,cs_sold_time_sk,cs_ship_date_sk,cs_bill_customer_sk) PARTITIONS 3
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.catalog_sales;

drop table if exists customer_address;
create table ${VAR:DB}.customer_address
PRIMARY KEY (ca_address_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.customer_address;

drop table if exists customer_demographics;
create table ${VAR:DB}.customer_demographics
PRIMARY KEY (cd_demo_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.customer_demographics;

drop table if exists customer;
create table ${VAR:DB}.customer
PRIMARY KEY (c_customer_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.customer;

drop table if exists date_dim;
create table ${VAR:DB}.date_dim
PRIMARY KEY (d_date_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.date_dim;

drop table if exists household_demographics;
create table ${VAR:DB}.household_demographics
PRIMARY KEY (hd_demo_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.household_demographics;

drop table if exists income_band;
create table ${VAR:DB}.income_band
PRIMARY KEY (ib_income_band_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.income_band;

drop table if exists inventory;
create table ${VAR:DB}.inventory
PRIMARY KEY (inv_date_sk,inv_item_sk,inv_warehouse_sk)
PARTITION BY HASH(inv_date_sk,inv_item_sk,inv_warehouse_sk) PARTITIONS 3
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.inventory;

drop table if exists item;
create table ${VAR:DB}.item
PRIMARY KEY (i_item_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.item;

drop table if exists promotion;
create table ${VAR:DB}.promotion
PRIMARY KEY (p_promo_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.promotion;

drop table if exists reason;
create table ${VAR:DB}.reason
PRIMARY KEY (r_reason_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.reason;

drop table if exists ship_mode;
create table ${VAR:DB}.ship_mode
PRIMARY KEY (sm_ship_mode_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.ship_mode;

drop table if exists store_returns;
create table ${VAR:DB}.store_returns
PRIMARY KEY (sr_returned_date_sk,sr_return_time_sk,sr_item_sk,sr_customer_sk)
PARTITION BY HASH(sr_returned_date_sk,sr_return_time_sk,sr_item_sk,sr_customer_sk) PARTITIONS 3
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.store_returns;

drop table if exists store_sales;
create table ${VAR:DB}.store_sales
PRIMARY KEY (ss_sold_date_sk,ss_sold_time_sk,ss_item_sk,ss_customer_sk)
PARTITION BY HASH(ss_sold_date_sk,ss_sold_time_sk,ss_item_sk,ss_customer_sk) PARTITIONS 3
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.store_sales;

drop table if exists store;
create table ${VAR:DB}.store
PRIMARY KEY (s_store_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.store;

drop table if exists time_dim;
create table ${VAR:DB}.time_dim
PRIMARY KEY (t_time_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.time_dim;

drop table if exists warehouse;
create table ${VAR:DB}.warehouse
PRIMARY KEY (w_warehouse_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.warehouse;

drop table if exists web_page;
create table ${VAR:DB}.web_page
PRIMARY KEY (wp_web_page_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.web_page;

drop table if exists web_returns;
create table ${VAR:DB}.web_returns
PRIMARY KEY (wr_returned_date_sk,wr_returned_time_sk,wr_item_sk,wr_refunded_customer_sk)
PARTITION BY HASH(wr_returned_date_sk,wr_returned_time_sk,wr_item_sk,wr_refunded_customer_sk) PARTITIONS 3
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.web_returns;

drop table if exists web_sales;
create table ${VAR:DB}.web_sales
PRIMARY KEY (ws_sold_date_sk,ws_sold_time_sk,ws_ship_date_sk,ws_item_sk,ws_bill_customer_sk)
PARTITION BY HASH(ws_sold_date_sk,ws_sold_time_sk,ws_ship_date_sk,ws_item_sk,ws_bill_customer_sk) PARTITIONS 3
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.web_sales;

drop table if exists web_site;
create table ${VAR:DB}.web_site
PRIMARY KEY (web_site_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.web_site;
```
Then execute the SQL file with impala-shell:

```shell
impala-shell -l -u cloudera --auth_creds_ok_in_clear -f kudu.sql --var=DB=tpcds_kudu_10 --var=HIVE_DB=tpcds_text_10
```

Note that the command above targets an Impala cluster with authorization enabled, so impala-shell prompts for a username and password. On a cluster without authentication, the auth options can simply be omitted:

```shell
impala-shell -f kudu.sql --var=DB=tpcds_kudu_10 --var=HIVE_DB=tpcds_text_10
```
Once it finishes, the Kudu test database has been created.

Next, run the following SQL in HUE to compute statistics for each table:
```sql
compute stats call_center;
compute stats catalog_page;
compute stats catalog_returns;
compute stats catalog_sales;
compute stats customer_address;
compute stats customer_demographics;
compute stats customer;
compute stats date_dim;
compute stats household_demographics;
compute stats income_band;
compute stats inventory;
compute stats item;
compute stats promotion;
compute stats reason;
compute stats ship_mode;
compute stats store_returns;
compute stats store_sales;
compute stats store;
compute stats time_dim;
compute stats warehouse;
compute stats web_page;
compute stats web_returns;
compute stats web_sales;
compute stats web_site;
```
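Typing out two dozen COMPUTE STATS statements by hand is tedious and error-prone. As a sketch (assuming the same 24 tables as the CTAS statements above), the statements can also be generated with a small shell loop and fed to impala-shell instead of pasting them into HUE:

```shell
# Generate one COMPUTE STATS statement per TPC-DS table.
# The table list mirrors the CTAS statements above.
tables="call_center catalog_page catalog_returns catalog_sales \
customer_address customer_demographics customer date_dim \
household_demographics income_band inventory item promotion reason \
ship_mode store_returns store_sales store time_dim warehouse \
web_page web_returns web_sales web_site"

for t in ${tables}; do
    echo "compute stats ${t};"
done > compute_stats.sql

# Then run it against the Kudu database, e.g.:
#   impala-shell -l -u cloudera --auth_creds_ok_in_clear \
#       -d tpcds_kudu_10 -f compute_stats.sql
```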
TPC-DS testing

Preparing the SQL
Prepare the 99 benchmark queries. Note that Impala does not support some of the syntax in the official TPC-DS queries, and those queries will fail.

I obtained the query SQL from Cloudera's open-source impala-tpcds-kit project: http://github.com/cloudera/impala-tpcds-kit

It currently contains only 79 queries, so the unsupported ones have evidently been removed. Download the project, extract it, and take out the queries folder.

Later I found a complete set of Impala test queries in another repository: http://github.com/fayson/cdhproject/tree/master/impala-tpcds/queries. I have not tested them myself; feel free to try them out.
Batch execution script
Here I run the queries through beeline, because our cluster has authentication enabled and impala-shell does not appear to support supplying the username and password in a single command line.

Port 21050 is the HiveServer2-compatible port that beeline uses to connect to Impala.
```shell
#!/bin/bash
database_name=tpcds_kudu_10
current_path=$(pwd)
queries_dir=${current_path}/queries
rm -rf logs
mkdir logs

for t in $(ls ${queries_dir})
do
    echo "current query will be ${queries_dir}/${t}"
    # connect to the Kudu test database defined above
    beeline -n username -p password -u jdbc:hive2://localhost:21050/${database_name} -f ${queries_dir}/${t} &> logs/${t}.log
done
echo "all queries execution are finished, please check logs for the result!"
```
All the execution output is saved under the logs folder. Then, from inside the logs directory, extract every query's execution time with:

```shell
grep selected *.log
```
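To turn that grep output into something Excel can consume directly, a small helper can collect one `query,seconds` line per log. This is a sketch that assumes beeline's usual closing summary line of the form "100 rows selected (3.456 seconds)"; adjust the pattern if your beeline version prints a different summary:

```shell
# Collect each query's wall-clock time from the beeline logs as CSV.
summarize_logs() {
    local dir=$1
    for f in "${dir}"/*.log; do
        [ -e "${f}" ] || continue                       # skip when no logs exist yet
        # the last "(N.NNN seconds)" in the log is the query's total runtime
        secs=$(grep -Eo '\([0-9.]+ seconds\)' "${f}" | tail -n 1 | tr -dc '0-9.')
        echo "$(basename "${f}" .log),${secs}"
    done
}

summarize_logs logs > results.csv    # one "query,seconds" line per log
```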
Generating test results
After running the script above against both Impala on Parquet and Impala on Kudu, process the results and import them into Excel to produce a comparison bar chart.

For most queries, Kudu and Parquet perform about the same, and on some queries Kudu is even slightly faster.

Note that these results were obtained on a 10 GB dataset; the outcome may differ at larger data volumes.
- Reference: Using TPC-DS in Big Data