hive 與 sql

原文位址：hive與sql 的點滴

1. 順序

SQL的執行順序：

第一步：執行FROM
第二步：WHERE條件過濾
第三步：GROUP BY分組
第四步：執行SELECT投影列
第五步：HAVING條件過濾
第六步：執行ORDER BY 排序

I2. sql常用操作:

DISTINCT

即去重.

as

: 定義别名.

having:

查詢平均年齡在20歲以上的班級

錯誤,(聚合函數在

WHERE

之後執行，是以這裡在WHERE判斷條件裡加入聚合函數是做不到的),下面sql是正确

SELECT student_class,AVG(student_age) AS 平均年齡 FROM t_student GROUP BY (student_class) HAVING
AVG(student_age)>20;

通配符過濾用like 單字元通配符‘_' 多字元通配符‘%'

A LIKE B 模糊比對 A RLIKE B 正規表達式比對

WHERE column_name BETWEEN value1 AND value2 
操作符 BETWEEN ... AND 會選取介于兩個值之間的資料範圍。這些值可以是數值、文本或者日期。

group by : 分組,隻能針對某一個列或者某個列的聚合函數
having : 針對聚合函數分組的過濾(多行), where 針對每一行的過濾
group by 多個字段, group by name,number ，我們可以把 name 和 number 看成一個整體字段，以他們整體來進行分組的
判斷是否null, where field is null

安全模式下無法批量更新
批量更新時候，安全模式關閉，結束後再開啟
SET SQL_SAFE_UPDATES = 0;
update table_name set field_name= 1000+filed_name;
SET SQL_SAFE_UPDATES = 1;

更新update操作

insert行

alter增加字段

I3.sql聚合函數彙總

count():求滿足條件的記錄數(COUNT()函數會忽略指定列的值為空的行，但如果 COUNT()函數中用的是星号（*），則不忽略)(count(1))
max():求最大值
min():求最小值
sum():求和隻能對數值進行計算不能運用于日期和字元串
avg():求平均值

I4. hive内置函數

hive内置函數

floor(21.2),傳回21
round(21.263,2),傳回21.26
abs(double a)絕對值
from_unixtime(tmsp): 傳回目前時區的時間YYYY- MM – DD HH：MM：SS
unix_timestamp():如果不帶參數的調用，傳回一個Unix時間戳（從’1970- 01 – 0100:00:00′到現在的UTC秒數）為無符号整數
unix_timestamp(string date): 指定日期參數調用UNIX_TIMESTAMP（），它傳回參數值’1970- 01 – 0100:00:00′到指定日期的秒數。
date(timesp): 将時間戳轉換為年-月-日
to_date(string timestamp): 傳回時間中的年月日： to_date(“1970-01-01 00:00:00″) = “1970-01-01″
year(string date),month(string date),day(string date)等
length(string A): 傳回字元串的長度
reverse(string A): 傳回倒序字元串
concat(string A, string B…):連接配接多個字元串，合并為一個字元串
lower(string A): 小寫
upper(string A): 大寫
get_json_object(string json_string, string path): 處理json資料
regexp_extract(string subject, string pattern, int index): 通過下标傳回正規表達式指定的部分。regexp_extract(‘foothebar’, ‘foo(.*?)(bar)’, 2) returns ‘bar.’
parse_url(string urlString, string partToExtract [, string keyToExtract]):傳回URL指定的部分。parse_url(‘http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1′, ‘HOST’) 傳回：’facebook.com’
A IN (val1, val2, ...)
A NOT IN (val1, val2, ...)

json_tuple():Takes a set of names (keys) and a JSON string, and returns a tuple of values. This is a more efficient version of the get_json_object UDF because it can get multiple keys with just one call.

example:
select a.timestamp, get_json_object(a.appevents, '$.eventid'), get_json_object(a.appenvets, '$.eventname') from log a;

應該用下面替換上面的:

select a.timestamp, b.*
from log a lateral view json_tuple(a.appevent, 'eventid', 'eventname') b as f1, f2;

parse_url_tuple比parser_url更高效
資料:
url1 http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1
url2 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-getjsonobject

create external table if not exists t_url(f1 string, f2 string) row format delimited fields TERMINATED BY ' ' location '/test/url';

SELECT f1, b.* FROM t_url LATERAL VIEW parse_url_tuple(f2, 'HOST', 'PATH', 'QUERY', 'QUERY:k1') b as host, path, query, query_id;
結果:
url1 facebook.com /path1/p.php k1=v1&k2=v2 v1
url2 cwiki.apache.org /confluence/display/Hive/LanguageManual+UDF NULL NULL

I5. hive聚合函數

包括了sql的聚合函數
var_pop(col): 傳回指定列的方差
var_samp(col): 傳回指定列的樣本方差
還有偏差，協方差，相關系數等

I6. hive常用指令

hive常用互動指令官方
hive官方手冊指令中文
複合資料結構與一些函數用法

hive的常用參數如下

-d,–define <key=value> Variable substitution to apply to hive commands. e.g. -d A=B or –define A=B
-e SQL from command line
-f SQL from files
-i Initialization SQL file
-S,–silent Silent mode in interactive shell
-v,–verbose Verbose mode (echo executed SQL to the console)

需要說明的

LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
local表示從本地導入資料到table
沒有local表示從hdfs導入資料
關鍵詞 'OVERWRITE' 意味着目前表中已經存在的資料将會被删除掉。
如果沒有給出 'OVERWRITE'，則意味着資料檔案将追加到目前的資料集中。

通過Hive導出到本地檔案系統
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/t_hive' SELECT * FROM t_hive;
沒有local則導出到hdfs中(s3)

CREATE EXTERNAL TABLE table1(date_1 string,resorce string,data string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as textfile
LOCATION 's3://**';
建立s3(or hdfs)上的映射表

insert overwrite directory "s3://**"
select *
from table1;
查詢table1内容查詢結果導入到s3(or hdfs)裡

關于分區:
分區表使得查詢時指定分區而不必查詢整個表，節約時間
分區表的字段在整個表裡其實就是一個普通的字段,每個分區以檔案夾的形式單獨存在表檔案夾的目錄下

建（映射表)分區:
create external table table_name(filed type) PARTITIONED BY (partition_field type)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as textfile
location '**'
添加資料到分區:
alter table table_name add IF NOT EXISTS partition(partition_field='**') location '**';
删除分區:
ALTER TABLE table_name DROP partition_spec, partition_spec,...

将本地資料添加到分區中:
load data local inpath '/home/hadoop/Desktop/data.txt' overwrite into table rable_name partition (ds='2015-11-11'); 
往一個分區表的某個分區中添加資料:
insert overwrite table table_name partition (ds='**') select id,max(name) from test group by id; 
分區中查詢:
select * from table_name where partition_field ='**';

hive 與 sql

繼續閱讀

Hbase自帶工具

分布式Apache ZooKeeper-3.4.6叢集安裝

HBase學習之路(八)：HBase協處理器+Hadoop的表操作行為日志監控案例實戰

spark(十)RDD的groupByKey和reduceByKey實作

Spark性能調優之RDD持久化

docker安裝cloudera版本的hadoop

pyspark中ip地理位置統計案例代碼實作

Hadoop開源大資料平台逐漸淘汰了嗎？Hadoop是一個開源的大資料處理架構，它在過去的幾年中一直是大資料領域中最流行

介紹大資料YARN容器。YARN是Hadoop生态系統中的關鍵元件，也是實作大資料處理任務管理和排程的重要工具。通過将資

spark-shell

HIVE 基本指令hive筆記：

Cloudera Manager安裝Kafka叢集

HIve安裝踩坑手冊一、安裝環境二、安裝Hadoop三、安裝hive四、錯誤記錄

spark為什麼比hadoop快