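The parquet-tools utility
Two parquet-tools utilities are currently in common use.
1. The open-source parquet-tools written by wesleypeck (more widely used, and customizable)
Fixing the org/apache/hadoop/conf/Configuration error in parquet-tools
The original author no longer maintains this version, and most of the builds that can be found online do not work. The cause is that pom.xml in the source does not declare the hadoop-core dependency, so the jar fails with the following error when a command is run: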
NoClassDefFoundError: org/apache/hadoop/conf/Configuration
Alternatively, the command appears to do nothing and prints only the following:
along with a series of other Hadoop-related problems.
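One quick way to tell whether a particular jar is affected is to look inside it for the missing class. The following is a minimal sketch in Python and is not part of parquet-tools; the function name jar_contains_class is hypothetical, and the check only makes sense for a single self-contained jar (a tar.gz distribution may ship Hadoop in separate jars instead).

```python
import zipfile

# Class whose absence produces the NoClassDefFoundError for
# org/apache/hadoop/conf/Configuration.
HADOOP_CONFIGURATION_CLASS = "org/apache/hadoop/conf/Configuration.class"


def jar_contains_class(jar_path, class_entry=HADOOP_CONFIGURATION_CLASS):
    """Return True if the jar (a zip archive) contains the given class entry.

    jar_contains_class is a hypothetical helper name used for illustration.
    """
    with zipfile.ZipFile(jar_path) as jar:
        return class_entry in jar.namelist()
```

If the function returns False for a jar that is meant to run on its own, the Hadoop classes were not bundled and the error above is to be expected.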
Solutions:
(1) Download the prebuilt tools from the links below
jar:
https://download.csdn.net/download/weixin_42532968/87431652
tar.gz:
https://download.csdn.net/download/weixin_42532968/87431657
(2) Download the source code, add the following dependency to pom.xml, rebuild the package, and use the tar.gz file that the build produces
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.1</version>
</dependency>
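If you also need a jar, add the following plugin to the build plugins section of pom.xml (it is a plugin, not a dependency):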
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-assembly-plugin</artifactId>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>single</goal>
            </goals>
            <configuration>
                <archive>
                    <manifest>
                        <mainClass>parquet.tools.Main</mainClass>
                    </manifest>
                </archive>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
        </execution>
    </executions>
</plugin>
The build produces the following files:
This resolves the problem of the missing Hadoop components.
Usage:
# Dump the DEVICE_NUMBER column of the parquet file
parquet_tools dump -c DEVICE_NUMBER -d /opt/trafodion/bss_userinfo_20180812_0
# Dump the whole parquet file
parquet_tools dump -d /opt/trafodion/bss_userinfo_20180812_0
# Show the first 10 rows of the parquet file
parquet_tools head -n 10 /opt/trafodion/bss_userinfo_20180812_0
# Show the metadata of the parquet file
parquet_tools meta /opt/trafodion/bss_userinfo_20180812_0
# Show the schema of the parquet file
parquet_tools schema /opt/trafodion/bss_userinfo_20180812_0
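When the same commands have to be run over many files, it can be convenient to drive them from a script. The following is a rough sketch only: it assumes the launcher is on PATH under the name parquet_tools, as in the commands above, and the helper names build_command and run are hypothetical.

```python
import subprocess

# Assumption: the launcher is reachable on PATH under this name.
PARQUET_TOOLS = "parquet_tools"


def build_command(subcommand, path, *options):
    """Build the argument list for one invocation: options first, file path last."""
    return [PARQUET_TOOLS, subcommand, *options, path]


def run(subcommand, path, *options):
    """Run the command and return its standard output as text.

    Raises subprocess.CalledProcessError if the tool exits with an error.
    """
    result = subprocess.run(
        build_command(subcommand, path, *options),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

For example, run("head", "/opt/trafodion/bss_userinfo_20180812_0", "-n", "10") corresponds to the head command shown above.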
2. The Apache Arrow parquet-tools
Installation
pip install parquet-tools
Usage
parquet-tools --help
usage: parquet-tools [-h] {show,csv,inspect} ...

parquet CLI tools

positional arguments:
  {show,csv,inspect}
    show              Show human readble format. see `show -h`
    csv               Cat csv style. see `csv -h`
    inspect           Inspect parquet file. see `inspect -h`

optional arguments:
  -h, --help          show this help message and exit
Examples
$ parquet-tools show test.parquet
+-------+-------+---------+
|   one | two   | three   |
|-------+-------+---------|
|    -1 | foo   | True    |
|   nan | bar   | False   |
|   2.5 | baz   | True    |
+-------+-------+---------+
$ parquet-tools inspect /path/to/parquet
############ file meta data ############
created_by: parquet-cpp version 1.5.1-SNAPSHOT
num_columns: 3
num_rows: 3
num_row_groups: 1
format_version: 1.0
serialized_size: 2226
############ Columns ############
one
two
three
############ Column(one) ############
name: one
path: one
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
############ Column(two) ############
name: two
path: two
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(three) ############
name: three
path: three
max_definition_level: 1
max_repetition_level: 0
physical_type: BOOLEAN
logical_type: None
converted_type (legacy): NONE
$ parquet-tools csv s3://bucket-name/test.parquet |csvq "select one, three where three"
+-------+-------+
|  one  | three |
+-------+-------+
| -1.0  | True  |
|  2.5  | True  |
+-------+-------+
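If csvq is not installed, the same filter can be approximated with Python's standard csv module. This is a minimal sketch, not a feature of parquet-tools: the function name select_true_rows is hypothetical, and it assumes that the csv subcommand prints a header row and writes booleans as the text True and False, as in the output above.

```python
import csv
import io


def select_true_rows(csv_text, flag_column="three", keep_columns=("one", "three")):
    """Return the rows whose flag column reads 'True', keeping only keep_columns.

    select_true_rows is a hypothetical helper; all values stay as strings.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    selected = []
    for row in reader:
        # Booleans arrive as text, so compare against the string form.
        if row[flag_column].strip().lower() == "true":
            selected.append({name: row[name] for name in keep_columns})
    return selected
```

To use it in a pipeline, pass the text read from standard input (for example sys.stdin.read()) to select_true_rows and print the returned rows.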