
Implementing Word Segmentation in Hive with a Custom UDF

1. Required Dependencies

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.1.0</version>
</dependency>
<dependency>
    <groupId>com.janeluo</groupId>
    <artifactId>ikanalyzer</artifactId>
    <version>2012_u6</version>
</dependency>

2. Implementation Code

package com.link.datawarehouse.hive;

/**
 * @author 包菜
 * @date 2020/12/8 15:08
 */
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class IkParticiple extends UDF {
    public String evaluate(String input) {
        // Return null directly for empty input
        if (input == null || input.trim().length() == 0) {
            return null;
        }
        // IKSegmenter consumes a Reader, so wrap the string:
        // String -> bytes -> InputStream -> Reader.
        // Pin the charset explicitly to avoid platform-dependent defaults.
        byte[] bt = input.getBytes(StandardCharsets.UTF_8);
        InputStream ip = new ByteArrayInputStream(bt);
        Reader read = new InputStreamReader(ip, StandardCharsets.UTF_8);
        // true enables IK's "smart" (coarse-grained) segmentation mode
        IKSegmenter iks = new IKSegmenter(read, true);
        StringBuilder output = new StringBuilder();
        Lexeme t;
        try {
            // Append each lexeme, lowercased and separated by a space
            while ((t = iks.next()) != null) {
                output.append(t.getLexemeText().toLowerCase()).append(" ");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return output.toString().trim();
    }

    /* main method for local testing */
    public static void main(String[] args) {
        System.out.println(new IkParticiple().evaluate("超級喜歡寫代碼"));
    }
}
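The Reader plumbing inside evaluate can be checked with the JDK alone, before wiring in IKSegmenter (which needs the ikanalyzer jar). ReaderRoundTrip is a throwaway class name for this sketch; it only demonstrates that the String-to-Reader wrapping is lossless when the charset is pinned on both sides:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReaderRoundTrip {
    // Same wrapping used in IkParticiple.evaluate:
    // String -> UTF-8 bytes -> InputStream -> Reader -> String
    static String roundTrip(String input) {
        byte[] bt = input.getBytes(StandardCharsets.UTF_8);
        Reader read = new InputStreamReader(new ByteArrayInputStream(bt),
                StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        int c;
        try {
            while ((c = read.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Pinning UTF-8 on both sides makes the round trip lossless
        System.out.println(roundTrip("超級喜歡寫代碼").equals("超級喜歡寫代碼")); // prints "true"
    }
}
```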

3. Sample Output

[Screenshot in the original post: the query returns the input sentence as space-separated, lowercased IK lexemes.]

4. Package, Upload, and Create the Function

Note: a custom UDF registered this way can only be used in the database it was created in.
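For example, assuming the jar has been built and uploaded to HDFS (the path and jar name below are placeholders, not from the original post), a permanent function matching the query used here could be registered like this:

```sql
-- Build the jar (e.g. mvn clean package) and upload it first:
--   hdfs dfs -put ik-udf.jar /user/hive/jars/
-- Register a permanent function in the target database:
CREATE FUNCTION linkdata_warehouse.fenciqi
  AS 'com.link.datawarehouse.hive.IkParticiple'
  USING JAR 'hdfs:///user/hive/jars/ik-udf.jar';
```

Because the function is created as linkdata_warehouse.fenciqi, it must be invoked with that database prefix (or from within that database), as the query below does.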

select linkdata_warehouse.fenciqi('超級喜歡寫代碼');