使用Zeppelin分析電子商務消費行為分析

任務描述
- 需求概述
問題分析
- Customer表
- Transaction表
- Store表
- Review表
- 資料結構
連接配接至Zeppelin
- 1.資料擷取
- - Download Data
- 2.檢查行數和header行
- - Understand the Data
- 3.上傳資料到HDFS
- - Upload the file to HDFS
- 4.建表查表
- - 4.1 Clear all tables if exists
  - 4.2 Verify all Tables are Created
- 5.資料清洗
- - 5.1 Clean and Mask customer_details
  - 5.2 Clean transaction_details into partition table
  - 5.3 Clean store_review table
- 6.Customer分析
- - 6.1找出顧客最常用的信用卡
  - 6.2找出客戶資料中排名前五的職位名稱
  - 6.3在美國女性最常用的信用卡
  - 6.4按性别和國家進行客戶統計
- 7.Transaction分析
- - 7.1計算每月總收入
  - 7.2計算每個季度的總收入
  - 7.3按年計算總收入
  - 7.4按工作日計算總收入
  - 7.5按時間段計算總收入(需要清理資料)
  - 7.6按時間段計算平均消費
  - 7.7按工作日計算平均消費
  - 7.8計算年、月、日的交易總數
  - 7.9找出交易量最大的10個客戶
  - 7.10找出消費最多的前10位顧客
  - 7.11統計該期間交易數量最少的使用者
  - 7.12計算每個季度的獨立客戶總數
  - 7.13計算每周的獨立客戶總數
  - 7.14計算整個活動客戶平均花費的最大值
  - 7.15統計每月花費最多的客戶
  - 7.16統計每月通路次數最多的客戶
  - 7.17按總價找出最受歡迎的5種産品
  - 7.18根據購買頻率找出最暢銷的5種産品
  - 7.19根據客戶數量找出最受歡迎的5種産品
  - 7.20驗證前5個details
- 8.Store分析
- - 8.1按客流量找出最受歡迎的商店
  - 8.2根據顧客消費價格找出最受歡迎的商店
  - 8.3根據顧客交易情況找出最受歡迎的商店
  - 8.4根據商店和唯一的顧客id擷取最受歡迎的産品
  - 8.5擷取每個商店的員工與顧客比
  - 8.6按年和月計算每家店的收入
  - 8.7按店鋪制作總收益餅圖
  - 8.8找出每個商店最繁忙的時間段
  - 8.9找出每家店的忠實顧客
  - 8.10根據每位員工的最高收入找出明星商店
- 9.Review分析
- - 9.1在ext_store_review中找出存在沖突的交易映射關系
  - 9.2了解客戶評價的覆寫率
  - 9.3根據評分了解客戶的分布情況
  - 9.4根據交易了解客戶的分布情況
  - 9.5客戶給出的最佳評價是否總是同一家門店

電子商務消費行為分析檔案

提取碼：39r1

任務描述

需求概述

對某零售企業最近一年門店收集的資料進行資料分析

潛在客戶畫像
使用者消費統計
門店的資源使用率
消費的特征人群定位
資料的可視化展現

問題分析

Customer表

customer_details	details
customer_id	Int，1-500
first_name	string
last_name	string
email	string，such as [email protected]
gender	string，Male or Female
address	string
country	string
language	string
job	string，job title/position
credit_type	string，credit card type。such as visa
credit_no	strin，credit card number

問題：language字段資料存在錯誤

Transaction表

transaction_details	details
transaction_id	Int，1-1000
customer_id	Int，1-500
store_id	Int，1-5
price	decimal，such as 5.08
product	string，things bought
date	string，when to purchase
time	string，what time to purchase

問題：表中transa_id有重複，但資料有效，需要修複資料

Store表

store_details	details
store_id	Int，1-5
store_name	string
employee_number	Int，在store有多少employee

Review表

store_review	details
stransaction_id	Int，1-8000
store_id	Int，1-5
review_store	Int，1-5

問題：表中有效的score資料

表中有獎transaction_id映射到錯誤的store_id

資料結構

Customer表

customer_details

Transaction表

transaction_details

Store表

store_details

Review表

store_review

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

連接配接至Zeppelin

Zeppelin安裝和配置

打開Zeppelin網站

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

導入電子商務消費行為分析資料及模闆

1.資料擷取

Download Data

%sh
--建立資料庫
mkdir /mnt/hivetest
cd /mnt/hivetest
wget -p https://raw.githubusercontent.com/datafibers/big_data_training/master/data/customer_details.csv
wget -p https://raw.githubusercontent.com/datafibers/big_data_training/master/data/transaction_details.csv
wget -p https://raw.githubusercontent.com/datafibers/big_data_training/master/data/store_details.csv
wget -p https://raw.githubusercontent.com/datafibers/big_data_training/master/data/store_review.csv

2.檢查行數和header行

Understand the Data

%sh
## /tmp/data/
cd /mnt/hivetest
head -2 customer_details.csv
#head -2 store_details.csv
#head -2 store_review.csv
#head -2 transaction_details.csv

3.上傳資料到HDFS

Upload the file to HDFS

%sh
hdfs dfs -mkdir -p /apps/shopping/customer
hdfs dfs -put /mnt/hivetest/customer_details.csv /apps/shopping/customer
hdfs dfs -mkdir -p /apps/shopping/transaction
hdfs dfs -put /mnt/hivetest/transaction_details.csv /apps/shopping/transaction
hdfs dfs -mkdir -p /apps/shopping/store
hdfs dfs -put /mnt/hivetest/store_details.csv /apps/shopping/store
hdfs dfs -mkdir -p /apps/shopping/review
hdfs dfs -put /mnt/hivetest/store_review.csv /apps/shopping/review
##/tmp/shopping/data/customer/
##/tmp/shopping/data/transaction/
##/tmp/shopping/data/store/
##/tmp/shopping/data/review/

4.建表查表

4.1 Clear all tables if exists

%hive
create database shopping
use shopping

%hive
--建立顧客表
drop table ext_transaction_details
%hive
create external table if not exists ext_customer_details(
customer_id string,
first_name string,
last_name string,
email string,
gender string,
address string,
country string,
language string,
job string,
credit_type string,
credit_no string)
row format SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties(
"separatorChar"=",",
"escapeChar"="\\")
stored as textfile
location '/apps/shopping/customer'

%hive
--建立交易流水表
create external table if not exists ext_transaction_details(
transaction_id string,
customer_id string,
store_id string,
price decimal(10,2),
product string,
purchase_date string,
purchase_time string)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties(
"separatorChar"=",",
"quoteChar"="\"",
"escapeChar"="\\")
stored as textfile
location '/apps/shopping/transaction'

%hive
--建立商店詳情表
create external table if not exists ext_store_details(
store_id string,
store_name string,
employee_number int)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties(
"separatorChar"=",",
"escapeChar"="\\")
stored as textfile
location '/apps/shopping/store'
tblproperties("skip.head.line.count"="1")

%hive
--建立評價表
create external table if not exists ext_store_review(
stransaction_id string,
store_id string,
review_score int)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties(
"separatorChar"=",",
"escapeChar"="\\")
stored as textfile
location '/apps/shopping/review'

4.2 Verify all Tables are Created

%hive
--select * from ext_customer_details limit 20
--select distinct language from ext_customer_details
--select * from ext_transaction_details limit 20
select transaction_id,count(transaction_id) cnt from ext_transaction_details group by transaction_id order by cnt desc 
--select * from ext_store_details limit 20
--select * from ext_store_review limit 20

5.資料清洗

解決以下有問題的資料

對transaction_details中的重複資料生成新ID
過濾掉store_review中沒有評分的資料
可以把清洗好的資料放到另一個表或者用View表示
找出PII（personal information identification）或PCI（personal confidential information）資料進行加密或hash
重新組織transaction資料按照日期YYYY-MM做分區

5.1 Clean and Mask customer_details

%hive
-- 敏感資訊加密
-- drop view vm_customer_details
create view if not exists vm_customer_details as
select
customer_id ,
first_name ,
unbase64(last_name) lastname,
unbase64(email) email,
gender ,
unbase64(address) address,
country ,
language,
job ,
credit_type ,
unbase64(credit_no) credit_no
from 
ext_customer_details

5.2 Clean transaction_details into partition table

%hive
-- 建立流水詳情表
create table if not exists transaction_details
(
transaction_id string,
customer_id string,
store_id string,
price decimal(8,2),
product string,
purchase_date date,
purchase_time string
)
partitioned by(purchase_month string)
-- select transaction_id,count(1) from ext_transaction_details group by transaction_id having count(1)>1
-- select * from ext_transaction_details where transaction_id=8001
set hive.exec.dynamic.partition.mode=nonstrict -- 開啟動态分區
-- 重寫資料
with base as (
select
transaction_id,
customer_id ,
store_id ,
price ,
product,
purchase_date,
purchase_time,
from_unixtime(unix_timestamp(purchase_date,'yyyy-MM-dd'),'yyyy-MM') as purchase_month,
row_number() over (partition by transaction_id order by store_id) as rn
from ext_transaction_details
)
insert overwrite table transaction_details partition(purchase_month)
select
if(rn=1,transaction_id,concat_ws('-',transaction_id,'_fix')) ,
customer_id ,
store_id ,
price ,
product,
purchase_date ,
purchase_time,
purchase_month
from base
-- 檢視修複資訊
select * from transaction_details where transaction_id like '%fix%'

5.3 Clean store_review table

%hive
create view if not exists vw_store_review as
select
transaction_id,
review_score
from  ext_store_review where review_score <> ''
show tables

最終會有七個表

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

6.Customer分析

6.1找出顧客最常用的信用卡

%hive
select credit_type,count(credit_type) cnt from customer_details group by credit_type order by cnt desc

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

6.2找出客戶資料中排名前五的職位名稱

%hive
select job ,count(1) as pn from customer_details group by job
order by pn desc limit 5

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

6.3在美國女性最常用的信用卡

%hive
select credit_type,count(credit_type) cnt from customer_details where gender='Female' and country='United States' group by credit_type order by cnt desc limit 10

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

6.4按性别和國家進行客戶統計

%hive
select country,gender,count(1) cnt from customer_details group by country,gender

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

7.Transaction分析

7.1計算每月總收入

%hive
select month,store_id,sum(price) total from transaction_details group by store_id,month

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

7.2計算每個季度的總收入

%hive
select year(purchase_date) year,concat
("第",floor((month(purchase_date)-1)/3)+1,"季度") season,sum(price)
 from transaction_details group by year(purchase_date) ,
 concat("第",floor((month(purchase_date)-1)/3)+1,"季度") order by year desc,season

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

7.3按年計算總收入

%hive
select year(purchase_date) yy,sum(price) toatal
from transaction_details group by year(purchase_date) order by yy desc

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

7.4按工作日計算總收入

%hive
select dayofweek(purchase_date) work_date,sum(price) from transaction_details
 where dayofweek(purchase_date) between 1 and 5 group by dayofweek(purchase_date)

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

7.5按時間段計算總收入(需要清理資料)

%hive
-- 使用正規表達式清理資料然後使用case when 分組查詢
with
t1 as(
select *, if(instr(purchase_time,'PM')>0,
				if(cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12>=24,
					0,
					cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12),
				cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)) as timeTrans
from transaction_details), t2 as(
select t1.*,case when t1.timeTrans<=8 and t1.timeTrans>5 then 'early morning'
				 when t1.timeTrans<=11 and t1.timeTrans>8 then 'morning'
				 when t1.timeTrans<=13 and t1.timeTrans>11 then 'noon'
				 when t1.timeTrans<=18 and t1.timeTrans>13 then 'afternoon'
				 when t1.timeTrans<=22 and t1.timeTrans>18 then 'evening'
				 else 'night'
			end as timeSplit
from t1)
select t2.timeSplit,sum(price)
from t2 
group by t2.timeSplit

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

7.6按時間段計算平均消費

%hive
-- 使用正規表達式清理資料然後使用case when 分組查詢
with
t1 as(
select *, if(instr(purchase_time,'PM')>0,
				if(cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12>=24,
					0,
					cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12),
				cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)) as timeTrans
from transaction_details), t2 as(
select t1.*,case when t1.timeTrans<=8 and t1.timeTrans>5 then 'early morning'
				 when t1.timeTrans<=11 and t1.timeTrans>8 then 'morning'
				 when t1.timeTrans<=13 and t1.timeTrans>11 then 'noon'
				 when t1.timeTrans<=18 and t1.timeTrans>13 then 'afternoon'
				 when t1.timeTrans<=22 and t1.timeTrans>18 then 'evening'
				 else 'night'
			end as timeSplit
from t1)
select t2.timeSplit,avg(price)
from t2 
group by t2.timeSplit

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

7.7按工作日計算平均消費

%hive
select dayofweek(purchase_date) work_date,avg(price) from transaction_details
 where dayofweek(purchase_date) between 1 and 5 group by dayofweek(purchase_date)

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

7.8計算年、月、日的交易總數

%hive
--select purchase_date,count(1) from transaction_details group by purchase_date
--select concat(year(purchase_date),"-",month(purchase_date)),count(1) from transaction_details group by year(purchase_date),month(purchase_date)
--select year(purchase_date),count(1) from transaction_details group by year(purchase_date)\
select count(1) over(partition by year(purchase_date)) year,
       count(1) over(partition by year(purchase_date),month(purchase_date)) month,
       count(1) over(partition by year(purchase_date),month(purchase_date),day(purchase_date))day from transaction_details

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

7.9找出交易量最大的10個客戶

%hive
select customer_id,count(transaction_id)a from transaction_details group by customer_id order by a desc limit 10

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

7.10找出消費最多的前10位顧客

%hive
select customer_id,sum(price)a from transaction_details group by customer_id order by a desc limit 10

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

7.11統計該期間交易數量最少的使用者

%hive
select customer_id,count(transaction_id)a from transaction_details group by customer_id order by a limit 1

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

7.12計算每個季度的獨立客戶總數

%hive
select year(purchase_date) yy,
concat("季度",floor((month(purchase_date)-1)/3)+1) season,
count(distinct customer_id) from transaction_details group by year(purchase_date),
concat("季度",floor((month(purchase_date)-1)/3)+1) 
order by yy desc,season

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

7.13計算每周的獨立客戶總數

%hive
select concat(year(purchase_date),'年第',weekofyear(purchase_date),'周'),count(distinct customer_id) from transaction_details group by year(purchase_date),weekofyear(purchase_date)

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

7.14計算整個活動客戶平均花費的最大值

%hive
select a.customer_id,max(a.av) from (select customer_id,avg(price)av from
 transaction_details group by customer_id)a group by a.customer_id

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

7.15統計每月花費最多的客戶

%hive
select b.m,b.id,b.s
from(
select a.m,a.id,a.s ,row_number() over(partition by  a.m order by a.s desc) as win1
from(
select concat(year(purchase_date),'-',month(purchase_date)) m,customer_id id,sum(price) s
from transaction_details
group by year(purchase_date),month(purchase_date),customer_id)a) b 
where b.win1=1

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

7.16統計每月通路次數最多的客戶

%hive
select b.m,b.id,b.c
from(
select a.m,a.id,a.c,row_number() over(partition by a.m order by a.c desc) as win1 
from(
select concat(year(purchase_date),'-',month(purchase_date)) m,customer_id id, count(1) c
from transaction_details
group by year(purchase_date),month(purchase_date),customer_id) a) b 
where b.win1=1

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

7.17按總價找出最受歡迎的5種産品

%hive
select product,sum(price)a from transaction_details group by product order by a desc limit 5

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

7.18根據購買頻率找出最暢銷的5種産品

%hive
select product,count(transaction_id)a from transaction_details group by product order by a desc limit 5

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

7.19根據客戶數量找出最受歡迎的5種産品

%hive
select product,count(distinct customer_id)a from transaction_details group by product order by a desc limit 5

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

7.20驗證前5個details

%hive
select * from transaction_details where product in ('Goat - Whole Cut')

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

8.Store分析

8.1按客流量找出最受歡迎的商店

%hive
select store_id,count(transaction_id) a from transaction_details group by store_id order by a desc limit 1

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

8.2根據顧客消費價格找出最受歡迎的商店

%hive
select store_id,sum(price)a from transaction_details group by store_id order by a desc limit 1

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

8.3根據顧客交易情況找出最受歡迎的商店

%hive
select store_id,count(transaction_id)a,sum(price)b from transaction_details group by store_id order by a desc,b desc limit 1

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

8.4根據商店和唯一的顧客id擷取最受歡迎的産品

%hive
select b.store_id,b.product
from (
select a.store_id,a.product,a.c ,row_number() over(partition by store_id order by a.c desc )as win1 
from(
select store_id,product,count(distinct customer_id) c 
from transaction_details
group by store_id,product) a )b 
where b.win1 =1

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

8.5擷取每個商店的員工與顧客比

%hive
select a.store_id,concat_ws(':',cast(ceil(round(s.employee_number/a.c*100))as string),'100')
from(
select t.store_id,count(distinct customer_id) c
from transaction_details t 
group by t.store_id)a join ext_store_details s 
on a.store_id=s.store_id

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

8.6按年和月計算每家店的收入

%hive
select store_id,year(purchase_date),month(purchase_date), sum(price) from transaction_details group by store_id,year(purchase_date),month(purchase_date)

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

8.7按店鋪制作總收益餅圖

%hive
select store_id,sum(price)a from transaction_details group by store_id

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

8.8找出每個商店最繁忙的時間段

%hive
with
t1 as(
select *, if(instr(purchase_time,'PM')>0,
				if(cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12>=24,
					0,
					cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12),
				cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)) as timeTrans
from transaction_details), t2 as(
select t1.*,case when t1.timeTrans<=8 and t1.timeTrans>5 then 'early morning'
				 when t1.timeTrans<=11 and t1.timeTrans>8 then 'morning'
				 when t1.timeTrans<=13 and t1.timeTrans>11 then 'noon'
				 when t1.timeTrans<=18 and t1.timeTrans>13 then 'afternoon'
				 when t1.timeTrans<=22 and t1.timeTrans>18 then 'evening'
				 else 'night'
			end as timeSplit
from t1),
t3 as(
select t2.store_id,t2.timeSplit,count(1) c 
from t2 
group by t2.store_id,t2.timeSplit),
t4 as(
select t3.store_id,t3.timeSplit,row_number() over(partition by store_id order by t3.timeSplit desc)as win1
from t3 )
select t4.store_id,t4.timeSplit
from t4
where t4.win1=1

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

8.9找出每家店的忠實顧客

%hive
select b.* from (select store_id,customer_id,count(transaction_id)a from transaction_details group by store_id,customer_id)b where b.a>=5

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

8.10根據每位員工的最高收入找出明星商店

%hive
-- 求總收入與雇員比值的最大值
with
t1 as (
select  store_id,sum(price) s 
from transaction_details 
group by store_id)
select t1.store_id,t1.s/s.employee_number ss
from t1 join ext_store_details s  on s.store_id= t1.store_id
order by ss desc 
limit 1

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

9.Review分析

9.1在ext_store_review中找出存在沖突的交易映射關系

%hive
select stransaction_id
from store_review
group by stransaction_id
having count(1)>1

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

9.2了解客戶評價的覆寫率

%hive
-- 求各個店共有多少顧客評價
with 
t1 as(
select t2.store_id,t1.transaction_id,t2.customer_id
from store_review t1 join transaction_details t2 on t1.transaction_id=t2.transaction_id)
select t1.store_id,count(distinct t1.customer_id)
from t1 
group by t1.store_id

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

9.3根據評分了解客戶的分布情況

%hive
-- 求每家店每個評分有多少個客戶給的
with
t1 as(
select  t2.store_id ,t1.review_score,t2.customer_id
from store_review t1 join  transaction_details t2 on t1.transaction_id=t2.transaction_id)
select t1.store_id,t1.review_score,count(distinct customer_id)
from t1
group by t1.store_id,t1.review_score

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

9.4根據交易了解客戶的分布情況

%hive
-- 求每家店每個客戶的訂單數
select store_id,customer_id,count(1)
from transaction_details
group by store_id,customer_id

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

9.5客戶給出的最佳評價是否總是同一家門店

%hive
with yes as(select customer_id,count(distinct s.store_id)cnt,"yes" from store_review s join transaction_details on substr(transaction_id,0,length(transaction_id)-2)=stransaction_id where substr(transaction_id,length(transaction_id),length(transaction_id))<>'2' group by customer_id having cnt=1),
 no as(select customer_id,count(distinct s.store_id)cnt,"no" from store_review s join transaction_details on substr(transaction_id,0,length(transaction_id)-2)=stransaction_id where substr(transaction_id,length(transaction_id),length(transaction_id))<>'2' group by customer_id having cnt>1)
select * from yes union all select * from no

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

大資料——項目實戰（使用Zeppelin分析電子商務消費行為分析）任務描述問題分析連接配接至Zeppelin

使用Zeppelin分析電子商務消費行為分析

任務描述

需求概述

問題分析

Customer表

Transaction表

Store表

Review表

資料結構

連接配接至Zeppelin

1.資料擷取

Download Data

2.檢查行數和header行

Understand the Data

3.上傳資料到HDFS

Upload the file to HDFS

4.建表查表

4.1 Clear all tables if exists

4.2 Verify all Tables are Created

5.資料清洗

5.1 Clean and Mask customer_details

5.2 Clean transaction_details into partition table

5.3 Clean store_review table

6.Customer分析

6.1找出顧客最常用的信用卡

6.2找出客戶資料中排名前五的職位名稱

6.3在美國女性最常用的信用卡

6.4按性别和國家進行客戶統計

7.Transaction分析

7.1計算每月總收入

7.2計算每個季度的總收入

7.3按年計算總收入

7.4按工作日計算總收入

7.5按時間段計算總收入(需要清理資料)

7.6按時間段計算平均消費

7.7按工作日計算平均消費

7.8計算年、月、日的交易總數

7.9找出交易量最大的10個客戶

7.10找出消費最多的前10位顧客

7.11統計該期間交易數量最少的使用者

7.12計算每個季度的獨立客戶總數

7.13計算每周的獨立客戶總數

7.14計算整個活動客戶平均花費的最大值

7.15統計每月花費最多的客戶

7.16統計每月通路次數最多的客戶

7.17按總價找出最受歡迎的5種産品

7.18根據購買頻率找出最暢銷的5種産品

7.19根據客戶數量找出最受歡迎的5種産品

7.20驗證前5個details

8.Store分析

8.1按客流量找出最受歡迎的商店

8.2根據顧客消費價格找出最受歡迎的商店

8.3根據顧客交易情況找出最受歡迎的商店

8.4根據商店和唯一的顧客id擷取最受歡迎的産品

8.5擷取每個商店的員工與顧客比

8.6按年和月計算每家店的收入

8.7按店鋪制作總收益餅圖

8.8找出每個商店最繁忙的時間段

8.9找出每家店的忠實顧客

8.10根據每位員工的最高收入找出明星商店

9.Review分析

9.1在ext_store_review中找出存在沖突的交易映射關系

9.2了解客戶評價的覆寫率

9.3根據評分了解客戶的分布情況

9.4根據交易了解客戶的分布情況

9.5客戶給出的最佳評價是否總是同一家門店

繼續閱讀