
TPC-DS Benchmark: Impala on Parquet vs. Impala on Kudu

Table of Contents

    • Loading Data into Kudu
    • TPC-DS Benchmark
      • Preparing the SQL
      • Batch Execution Script
      • Generating the Test Results

Loading Data into Kudu

This assumes you have already generated the TPC-DS dataset in Hive with hive-testbench; see the previous article on running TPC-DS with hive-testbench.


Next we also need to create the Kudu tables and load them with data. First, prepare the SQL:

drop database if exists ${VAR:DB} cascade;
create database ${VAR:DB};
use ${VAR:DB};
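
-- Dimension tables use their single surrogate key as the Kudu primary key.
-- The large fact tables (*_sales, *_returns and inventory) use a composite primary key
-- and are hash-partitioned into 3 partitions so rows are spread across Kudu tablets.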

drop table if exists call_center;
create table ${VAR:DB}.call_center
PRIMARY KEY (cc_call_center_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.call_center;

drop table if exists catalog_page;
create table ${VAR:DB}.catalog_page
PRIMARY KEY (cp_catalog_page_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.catalog_page;

drop table if exists catalog_returns;
create table ${VAR:DB}.catalog_returns
PRIMARY KEY (cr_returned_date_sk,cr_returned_time_sk,cr_item_sk,cr_refunded_customer_sk)
PARTITION BY HASH(cr_returned_date_sk,cr_returned_time_sk,cr_item_sk,cr_refunded_customer_sk) PARTITIONS 3
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.catalog_returns;

drop table if exists catalog_sales;
create table ${VAR:DB}.catalog_sales
PRIMARY KEY (cs_sold_date_sk,cs_sold_time_sk,cs_ship_date_sk,cs_bill_customer_sk)
PARTITION BY HASH(cs_sold_date_sk,cs_sold_time_sk,cs_ship_date_sk,cs_bill_customer_sk) PARTITIONS 3
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.catalog_sales;

drop table if exists customer_address;
create table ${VAR:DB}.customer_address
PRIMARY KEY (ca_address_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.customer_address;

drop table if exists customer_demographics;
create table ${VAR:DB}.customer_demographics
PRIMARY KEY (cd_demo_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.customer_demographics;

drop table if exists customer;
create table ${VAR:DB}.customer
PRIMARY KEY (c_customer_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.customer;

drop table if exists date_dim;
create table ${VAR:DB}.date_dim
PRIMARY KEY (d_date_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.date_dim;

drop table if exists household_demographics;
create table ${VAR:DB}.household_demographics
PRIMARY KEY (hd_demo_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.household_demographics;

drop table if exists income_band;
create table ${VAR:DB}.income_band
PRIMARY KEY (ib_income_band_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.income_band;

drop table if exists inventory;
create table ${VAR:DB}.inventory
PRIMARY KEY (inv_date_sk,inv_item_sk,inv_warehouse_sk)
PARTITION BY HASH(inv_date_sk,inv_item_sk,inv_warehouse_sk) PARTITIONS 3
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.inventory;

drop table if exists item;
create table ${VAR:DB}.item
PRIMARY KEY (i_item_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.item;

drop table if exists promotion;
create table ${VAR:DB}.promotion
PRIMARY KEY (p_promo_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.promotion;

drop table if exists reason;
create table ${VAR:DB}.reason
PRIMARY KEY (r_reason_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.reason;

drop table if exists ship_mode;
create table ${VAR:DB}.ship_mode
PRIMARY KEY (sm_ship_mode_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.ship_mode;

drop table if exists store_returns;
create table ${VAR:DB}.store_returns
PRIMARY KEY (sr_returned_date_sk,sr_return_time_sk,sr_item_sk,sr_customer_sk)
PARTITION BY HASH(sr_returned_date_sk,sr_return_time_sk,sr_item_sk,sr_customer_sk) PARTITIONS 3
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.store_returns;

drop table if exists store_sales;
create table ${VAR:DB}.store_sales
PRIMARY KEY (ss_sold_date_sk,ss_sold_time_sk,ss_item_sk,ss_customer_sk)
PARTITION BY HASH(ss_sold_date_sk,ss_sold_time_sk,ss_item_sk,ss_customer_sk) PARTITIONS 3
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.store_sales;

drop table if exists store;
create table ${VAR:DB}.store
PRIMARY KEY (s_store_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.store;

drop table if exists time_dim;
create table ${VAR:DB}.time_dim
PRIMARY KEY (t_time_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.time_dim;

drop table if exists warehouse;
create table ${VAR:DB}.warehouse
PRIMARY KEY (w_warehouse_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.warehouse;

drop table if exists web_page;
create table ${VAR:DB}.web_page
PRIMARY KEY (wp_web_page_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.web_page;

drop table if exists web_returns;
create table ${VAR:DB}.web_returns
PRIMARY KEY (wr_returned_date_sk,wr_returned_time_sk,wr_item_sk,wr_refunded_customer_sk)
PARTITION BY HASH(wr_returned_date_sk,wr_returned_time_sk,wr_item_sk,wr_refunded_customer_sk) PARTITIONS 3
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.web_returns;

drop table if exists web_sales;
create table ${VAR:DB}.web_sales
PRIMARY KEY (ws_sold_date_sk,ws_sold_time_sk,ws_ship_date_sk,ws_item_sk,ws_bill_customer_sk)
PARTITION BY HASH(ws_sold_date_sk,ws_sold_time_sk,ws_ship_date_sk,ws_item_sk,ws_bill_customer_sk) PARTITIONS 3
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.web_sales;

drop table if exists web_site;
create table ${VAR:DB}.web_site
PRIMARY KEY (web_site_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.web_site;
           

Then run the SQL file with impala-shell:

impala-shell -l -u cloudera  --auth_creds_ok_in_clear -f kudu.sql --var=DB=tpcds_kudu_10 --var=HIVE_DB=tpcds_text_10
           

Note that this command targets an Impala cluster with authorization enabled, so impala-shell prompts for a username and password. On a cluster without authentication you can omit the credential options:

impala-shell -f kudu.sql --var=DB=tpcds_kudu_10 --var=HIVE_DB=tpcds_text_10
           

Once the script finishes, the Kudu test database has been created.

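As a quick sanity check (a minimal sketch, reusing the database name and authentication flags from the commands above), you can list the tables Impala now sees in the new Kudu database:

impala-shell -l -u cloudera --auth_creds_ok_in_clear -q "show tables in tpcds_kudu_10"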

In Hue, run the following SQL to compute statistics for each table:

compute stats call_center ;
compute stats catalog_page ;
compute stats catalog_returns ;
compute stats catalog_sales ;
compute stats customer_address ;
compute stats customer_demographics ;
compute stats customer ;
compute stats date_dim ;
compute stats household_demographics ;
compute stats income_band ;
compute stats inventory ;
compute stats item ;
compute stats promotion ;
compute stats reason ;
compute stats ship_mode ;
compute stats store_returns ;
compute stats store_sales ;
compute stats store ;
compute stats time_dim ;
compute stats warehouse ;
compute stats web_page ;
compute stats web_returns ;
compute stats web_sales ;
compute stats web_site ;
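
To confirm the statistics were collected (a minimal check; substitute any of the tables above), SHOW TABLE STATS should now report a row count instead of -1:

impala-shell -l -u cloudera --auth_creds_ok_in_clear -q "show table stats tpcds_kudu_10.store_sales"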
           

TPC-DS Benchmark

Preparing the SQL

Prepare the 99 benchmark queries. Note that some of the syntax in the official TPC-DS queries is not supported by Impala, and those queries will fail.

I took the query SQL from impala-tpcds-kit, an open-source project from Cloudera: http://github.com/cloudera/impala-tpcds-kit

It currently contains only 79 queries, so the unsupported ones have evidently been removed. Download the project, extract it, and take the queries folder out of it.
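
A minimal sketch of that step (the repository URL is the one linked above; paths are illustrative):

# Fetch Cloudera's impala-tpcds-kit and keep its queries directory next to the run script
git clone https://github.com/cloudera/impala-tpcds-kit.git
cp -r impala-tpcds-kit/queries ./queries
# Count how many query files the kit ships with
ls queries | wc -l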

Later I found a full set of Impala test queries in another repository, http://github.com/fayson/cdhproject/tree/master/impala-tpcds/queries, but I have not tested them; feel free to try them yourself.

Batch Execution Script

Here I use beeline to run the SQL, because our cluster has authentication enabled and impala-shell does not seem to support supplying the username and password in a single non-interactive command.

21050 is the port beeline uses to connect to Impala.

#!/bin/bash
# Run every query in ./queries against the target database; one log file per query under ./logs
# Switch database_name to the Parquet database (e.g. tpcds_bin_partitioned_parquet_10) for the Parquet run
database_name=tpcds_kudu_10
current_path=`pwd`
queries_dir=${current_path}/queries
rm -rf logs
mkdir logs
for t in `ls ${queries_dir}`
do
    echo "current query will be ${queries_dir}/${t}"
    beeline -n username -p password -u jdbc:hive2://localhost:21050/${database_name} -f ${queries_dir}/${t} &>logs/${t}.log
done
echo "all queries have finished, please check the logs directory for the results!"
           

All of the query output is saved under the logs folder. Then, from inside the logs directory, use the following command to pull out the execution time of every query:

grep selected *.log
           

Generating the Test Results

After running the script above against both Impala on Parquet and Impala on Kudu, process the collected results and import them into Excel to build a comparison bar chart.
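
A minimal sketch of that processing step, assuming each beeline log ends with a line of the form "N rows selected (X seconds)" (the exact wording can vary by beeline version); it turns the logs into a CSV that Excel can chart:

# Turn "<query>.log: ... (X seconds)" lines into query,seconds rows for Excel
cd logs
echo "query,seconds" > ../results.csv
grep -H selected *.log \
  | sed -E 's/^([^:]+)\.log:.*[(]([0-9.]+) seconds[)].*/\1,\2/' \
  >> ../results.csv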

(Bar chart: per-query execution time, Impala on Parquet vs. Impala on Kudu)

For most queries Kudu and Parquet perform about the same, and on some queries Kudu is even slightly faster.

Note that these results were produced on the 10 GB dataset; at larger data volumes the results may differ.

  • Reference: Using TPC-DS in big data