
Importing a local CSV into a Hive table

Background: data exported from MySQL was saved as CSV files.

Two files were exported: record1.csv, which has a header row, and record0.csv, which does not.
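
For context, this kind of file typically comes out of a statement like the one below; the table name and output path here are assumptions, not taken from the original export:

-- Hypothetical MySQL export. ENCLOSED BY '"' is what wraps every
-- field in double quotes, which causes the trouble further down.
SELECT * INTO OUTFILE '/tmp/record0.csv'
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM records;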

try 1

Create the table with a bare CREATE TABLE statement and load the file directly.

sql = '''
CREATE TABLE IF NOT EXISTS default.records
(
    exer_recore_id  BIGINT,
    user_id int,
    channel tinyint,
    item_id int,
    question_id int,
    `type` tinyint,
    question_num int,
    orgin tinyint,
    orgin_id int,
    cate_id int,
    is_hf tinyint,
    user_answer string,
    correct_answer string,
    is_correct tinyint comment "correct or not",
    dutration int,
    record_id int,
    subject_1 int,
    subject_2 int,
    subject_3 int,
    status tinyint,
    is_signed tinyint,
    create_time int,
    practice_level_1 string,
    practice_level_2 string,
    practice_level_3 string,
    update_time int
)
comment "exercise records"
'''
spark.sql(sql)  # actually execute the CREATE TABLE
spark.sql("LOAD DATA LOCAL INPATH '*/Desktop/record0.csv' OVERWRITE INTO TABLE "
          "records")
df = spark.sql("select exer_recore_id, user_id, is_correct from records")
df.show()
df.printSchema()

Output

+--------------+-------+----------+
|exer_recore_id|user_id|is_correct|
+--------------+-------+----------+
|          null|   null|      null|
|          null|   null|      null|
|          null|   null|      null|
|          null|   null|      null|
|          null|   null|      null|
           

The load failed: every value is null. No delimiter was specified, so Hive fell back to its default field separator (\001, Ctrl-A); each whole CSV line therefore landed in the first column, failed the BIGINT cast, and was read as null.
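
A quick way to confirm what Hive is actually being fed, independent of the table's schema and SerDe (a sketch, using the same local path):

# Read the raw file as plain text: each line arrives as a single string,
# so the delimiter and any quoting can be inspected directly.
raw = spark.read.text("*/Desktop/record0.csv")
raw.show(3, truncate=False)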

try 2

Add a field delimiter to the CREATE TABLE statement.

comment "做題記錄"
ROW FORMAT delimited fields terminated by ','
           
  • Result: still failed.

    Adding the CSV delimiter ',' was not enough; the values still came back null.

Opening the file in a text editor shows why, as in the figure below.

[figure: record0.csv opened in a text editor]
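
Judging from the quoted values that show up later in the output, a raw line presumably looks something like this (illustrative; not copied from the figure):

"1001110000021295","11",...

Every field is wrapped in double quotes, so ROW FORMAT DELIMITED splits on the commas but keeps the quote characters, and the quoted text then fails the int/bigint casts.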

try 3

comment "做題記錄"
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
      "separatorChar" = ",",
      "quoteChar"     = "'"
    )

           

Success: the query now returns data instead of nulls.
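
Put together, this attempt looks like the following minimal sketch; the column list is trimmed to three fields, and the table name records_min is made up for the example:

# OpenCSVSerde parses quoted fields, which ROW FORMAT DELIMITED could not.
spark.sql('''
CREATE TABLE IF NOT EXISTS default.records_min (
    exer_recore_id BIGINT,
    user_id INT,
    is_correct TINYINT
)
COMMENT "exercise records"
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar"     = "'"
)
''')
spark.sql("LOAD DATA LOCAL INPATH '*/Desktop/record0.csv' "
          "OVERWRITE INTO TABLE records_min")
spark.sql("select * from records_min").show(3)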

Next problem: keeping the header row out of the table (record1.csv has one).

Add the table property tblproperties("skip.header.line.count"="1").

Here "1" means: skip the first line of the file.

comment "做題記錄"
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
      "separatorChar" = ",",
      "quoteChar"     = "'"
    )
tblproperties("skip.header.line.count"="1")
           

Solved.
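
A quick check, loading the file that has the header (a sketch):

# With skip.header.line.count = 1, the first row returned should be data,
# not the column-name line from record1.csv.
spark.sql("LOAD DATA LOCAL INPATH '*/Desktop/record1.csv' "
          "OVERWRITE INTO TABLE records")
spark.sql("select exer_recore_id, user_id, is_correct "
          "from records limit 3").show()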

again

Spoke too soon; that was careless.

The output from a moment ago:

+------------------+-------+----------+---------+
|    exer_recore_id|user_id|is_correct|record_id|
+------------------+-------+----------+---------+
|"1001110000021295"|   "11"|       "0"|    "109"|
|"1001110000021296"|   "11"|       "0"|    "109"|
|"1001110000021297"|   "11"|       "1"|    "109"|

root
 |-- exer_recore_id: string (nullable = true)
 |-- user_id: string (nullable = true)
 |-- is_correct: string (nullable = true)
 |-- record_id: string (nullable = true)
           

Every column came back as a string, and the values still carry literal " characters: quoteChar was set to the single quote, but the file actually quotes its fields with ".

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
      "separatorChar" = ",",
      "quoteChar"     = "\\"",
      "escapeChar"    = "\\\\"        
    )
tblproperties("skip.header.line.count"="1")
           

The quote character is now the double quote ". Note the Python-level escaping: "\\"" in the source becomes \" in the SQL that Hive parses, and "\\\\" becomes \\.

Output

+----------------+-------+----------+---------+
|  exer_recore_id|user_id|is_correct|record_id|
+----------------+-------+----------+---------+
|1001110000021295|     11|         0|      109|
|1001110000021296|     11|         0|      109|
           

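One caveat remains: OpenCSVSerde exposes every column as a string regardless of the types declared in the DDL (the printSchema output above already showed this), so cast at query time when numeric types are needed. A sketch:

# Cast the string columns that OpenCSVSerde produces back to the
# intended numeric types on read.
df = spark.sql('''
    select cast(exer_recore_id as bigint)  as exer_recore_id,
           cast(user_id        as int)     as user_id,
           cast(is_correct     as tinyint) as is_correct
    from records
''')
df.printSchema()  # long / integer / byte instead of string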