Using the parquet-tools utility

parquet-tools

There are currently two tools named parquet-tools:

1. The open-source parquet-tools written by wesleypeck (the more widely used of the two, and customizable)

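Fixing the org/apache/hadoop/conf/Configuration error in parquet-tools

The original author no longer maintains this version, and most of the builds that can be found online do not work. The cause is that the pom.xml in the source does not declare the hadoop-core dependency, so the resulting jar fails when running commands with:

NoClassDefFoundError: org/apache/hadoop/conf/Configuration

Alternatively, a command may appear to do nothing and print only the following output:

[Figure: screenshot of the command output]

along with a range of other Hadoop-related problems.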
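One way to tell in advance whether a given build is affected is to look inside the jar for the class named in the error. The following is a minimal sketch in Python, using only the standard library; the function name `jar_bundles_hadoop` is a hypothetical helper, not part of parquet-tools, and jar paths are supplied on the command line.

```python
import sys
import zipfile

# Class named in the NoClassDefFoundError message, as a path inside the jar.
HADOOP_CLASS = "org/apache/hadoop/conf/Configuration.class"


def jar_bundles_hadoop(jar_path):
    """Return True if the jar archive contains the Hadoop Configuration class."""
    # A jar is a zip archive, so the entry list can be read directly.
    with zipfile.ZipFile(jar_path) as jar:
        return HADOOP_CLASS in jar.namelist()


if __name__ == "__main__":
    # Check every jar path given on the command line.
    for path in sys.argv[1:]:
        print(path, jar_bundles_hadoop(path))
```

A `False` result only means the class is not packed into that jar; the tool can still work if the Hadoop classes reach the classpath some other way.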

Solutions:

(1) Download the prebuilt tool from the links below

jar:
https://download.csdn.net/download/weixin_42532968/87431652
tar.gz:
https://download.csdn.net/download/weixin_42532968/87431657
           

(2) Download the source code, add the following dependency to pom.xml, rebuild the package, and use the tar.gz file that the build produces

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>1.2.1</version>
</dependency>
           

If you also need a standalone jar, add the following plugin configuration as well:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
      <configuration>
        <archive>
          <manifest>
            <mainClass>parquet.tools.Main</mainClass>
          </manifest>
        </archive>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
    </execution>
  </executions>
</plugin>
           

The build produces the following files:

[Figure: screenshot of the generated files]

This resolves the problem of the missing Hadoop components.

Usage:

# Show the dump information for the DEVICE_NUMBER column of the Parquet file
parquet_tools dump -c DEVICE_NUMBER -d /opt/trafodion/bss_userinfo_20180812_0
# Show the dump information for the whole Parquet file
parquet_tools dump -d /opt/trafodion/bss_userinfo_20180812_0
# Show the first 10 rows of the Parquet file
parquet_tools head -n 10 /opt/trafodion/bss_userinfo_20180812_0
# Show the metadata of the Parquet file
parquet_tools meta /opt/trafodion/bss_userinfo_20180812_0
# Show the schema of the Parquet file
parquet_tools schema /opt/trafodion/bss_userinfo_20180812_0
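
Before pointing these commands at a file, it can be useful to confirm that the file really is Parquet. The sketch below is a rough pre-check in Python, not something parquet-tools provides. It relies on the Parquet file layout for unencrypted files: the 4-byte magic `PAR1` at both the start and the end, with a 4-byte little-endian footer length just before the trailing magic. The function name `parquet_footer_length` is hypothetical.

```python
import os
import struct

MAGIC = b"PAR1"  # magic bytes of an unencrypted Parquet file


def parquet_footer_length(path):
    """Return the footer (metadata) length in bytes, or None if the file
    does not look like an unencrypted Parquet file."""
    size = os.path.getsize(path)
    # Smallest possible layout: leading magic + footer length + trailing magic.
    if size < 12:
        return None
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if head != MAGIC or tail[4:] != MAGIC:
        return None
    (footer_len,) = struct.unpack("<I", tail[:4])
    # The footer must fit between the leading magic and the 8 trailing bytes.
    if footer_len > size - 12:
        return None
    return footer_len
```

A return value of `None` suggests the file is truncated, encrypted, or not Parquet at all, which is worth ruling out before debugging the tool itself.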
           

2. The Apache Arrow-based parquet-tools

Installation

pip install parquet-tools
           

Usage

parquet-tools --help
usage: parquet-tools [-h] {show,csv,inspect} ...

parquet CLI tools

positional arguments:
  {show,csv,inspect}
    show              Show human readble format. see `show -h`
    csv               Cat csv style. see `csv -h`
    inspect           Inspect parquet file. see `inspect -h`

optional arguments:
  -h, --help          show this help message and exit
           

Examples

$ parquet-tools show test.parquet
+-------+-------+---------+
|   one | two   | three   |
|-------+-------+---------|
|  -1   | foo   | True    |
| nan   | bar   | False   |
|   2.5 | baz   | True    |
+-------+-------+---------+

$ parquet-tools inspect /path/to/parquet
############ file meta data ############
created_by: parquet-cpp version 1.5.1-SNAPSHOT
num_columns: 3
num_rows: 3
num_row_groups: 1
format_version: 1.0
serialized_size: 2226


############ Columns ############
one
two
three

############ Column(one) ############
name: one
path: one
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE

############ Column(two) ############
name: two
path: two
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8

############ Column(three) ############
name: three
path: three
max_definition_level: 1
max_repetition_level: 0
physical_type: BOOLEAN
logical_type: None
converted_type (legacy): NONE

$ parquet-tools csv s3://bucket-name/test.parquet |csvq "select one, three where three"
+-------+-------+
|  one  | three |
+-------+-------+
| -1.0  | True  |
| 2.5   | True  |
+-------+-------+
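
If csvq is not available, the same kind of filter can be applied to the CSV output with a few lines of Python. This is only a sketch: it assumes the CSV text has a header row and that boolean values are written as `True`/`False`, as in the output shown above. The function name `select_true_rows` is hypothetical.

```python
import csv
import io


def select_true_rows(csv_text, flag_column, keep_columns):
    """Keep rows whose flag_column equals "True", projected to keep_columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {name: row[name] for name in keep_columns}
        for row in reader
        if row[flag_column] == "True"
    ]
```

For example, `select_true_rows(text, "three", ["one", "three"])` mirrors the `select one, three where three` query, where `text` holds the CSV captured from `parquet-tools csv`.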