Notes on Using HanLP 1.7 Word Segmentation in Distributed Mode on Spark

2019-03-10 23:50:00

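Version 1.7.0 of the HanLP natural language processing toolkit was released almost half a year ago, and I have been organizing material on the new version of the HanLP segmenter ever since. At the current pace it will still be a while before I can share that material in detail. Yesterday I happened to come across this article on using HanLP 1.7.0 segmentation in Spark, so I am sharing it here for everyone to learn from.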

The shared article follows:

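As the HanLP README says, if you have no special requirements you can pull in the segmenter through Maven. If you want to add a custom dictionary, you need to download the "dependency jar and user dictionary".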
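For the Maven route, the dependency would look roughly like the fragment below. This is a sketch based on the coordinates the HanLP 1.x README publishes for the portable build; check the version string against the release you actually want before copying it.

```xml
<!-- Portable build: bundles a small default data set, no custom dictionaries. -->
<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
    <version>portable-1.7.0</version>
</dependency>
```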

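Simply run "jar xf hanlp-1.6.8-sources.jar" to unpack the sources and add them to the project. (Depending on a local jar is somewhat cumbersome, and sometimes the jar cannot be found once the job runs on the server.)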

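If you follow the documentation, segmentation in Spark looks for the dictionaries in a local directory by default, so segmenting in the driver works without problems. For distributed segmentation, however, the dictionary directory has to be placed on HDFS so that every machine can access it (see the reference code below).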

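It is best to put the newly added dictionary first in the list. On first use, HanLP builds a bin file from the new txt file, which is fairly slow. This only has to happen once, though: the bin file is written to the HDFS path, so the second and later runs are faster.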
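As an illustration of "new dictionary first", a hanlp.properties along the following lines is one way this could be configured. It is a sketch, not the author's actual file: the namenode address and my_words.txt are hypothetical, and an hdfs:// root is assumed to work only once an HDFS-aware IOAdapter (like the one in the reference code) is installed.

```properties
# Hypothetical HDFS location of the HanLP data directory.
root=hdfs://namenode:8020/user/hanlp/
# Entries are separated by ';'. An entry without a directory reuses the previous one.
# The newly added dictionary (my_words.txt, a made-up name) is listed first.
CustomDictionaryPath=data/dictionary/custom/my_words.txt; CustomDictionary.txt
```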

Note that, according to one of the issues, it can only be used inside mapPartitions.
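The issue's reasoning is not quoted here, but one practical benefit of mapPartitions is that setup code runs once per partition rather than once per record. The sketch below shows that pattern in plain Scala, without Spark: expensiveSetup is a hypothetical stand-in for work such as installing the IOAdapter, and the two inner sequences merely imitate partitions.

```scala
object PerPartitionSetup {
  // Counts how often the stand-in setup runs.
  var setupCalls = 0

  // Hypothetical stand-in for one-time work (e.g. setting HanLP.Config.IOAdapter).
  def expensiveSetup(): String => Int = {
    setupCalls += 1
    (line: String) => line.length
  }

  // Same shape as a function passed to mapPartitions: Iterator in, Iterator out.
  def perPartition(iter: Iterator[String]): Iterator[Int] = {
    val process = expensiveSetup() // runs once per partition, not once per record
    iter.map(process)
  }

  def main(args: Array[String]): Unit = {
    // Two fake "partitions" standing in for the partitions of an RDD.
    val partitions = Seq(Seq("a", "bb", "ccc"), Seq("dddd", "ee"))
    val results = partitions.map(p => perPartition(p.iterator).toList)
    println(results)    // one list of lengths per partition
    println(setupCalls) // one setup call per partition
  }
}
```

With a plain map, the equivalent setup would have to be either repeated for every record or hidden in a lazily initialized singleton.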

Reference Scala code:

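import java.net.URI
import scala.collection.JavaConverters._
import scala.collection.mutable
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import com.hankcs.hanlp.HanLP
import com.hankcs.hanlp.corpus.io.IIOAdapter

// Lets HanLP read and write its dictionary files on HDFS instead of the local disk.
class HadoopFileIoAdapter extends IIOAdapter {
  // Resolve the file system from the path itself (e.g. an hdfs:// URI).
  private def fileSystem(path: String): FileSystem =
    FileSystem.get(URI.create(path), new Configuration())

  override def create(path: String): java.io.OutputStream =
    fileSystem(path).create(new Path(path))

  override def open(path: String): java.io.InputStream =
    fileSystem(path).open(new Path(path))
}

// stopChar, white_single_word and stopWords are defined elsewhere in the job (not shown).
def myfuncPerPartition_(iter: Iterator[String]): Iterator[(Int, mutable.Buffer[String])] = {
  println("run in partition")
  val keyWordNum = 6
  // Install the HDFS adapter once per partition, inside the executor.
  HanLP.Config.IOAdapter = new HadoopFileIoAdapter
  // Each line is "text,label": keep well-formed lines, key the result by the label's hash.
  iter.map(_.split(",", 2)).filter(_.length == 2).map { fields =>
    val keywords = HanLP.extractKeyword(fields(0), keyWordNum).asScala
      .map(str => str.filterNot(stopChar.contains(_)))
      .filter(w => w.length > 1 || (w.length == 1 && white_single_word.contains(w(0))))
      .filterNot(stopWords.contains(_))
      .take(keyWordNum).distinct
    (fields(1).trim.hashCode, keywords)
  }
}

// Call site
raw_data.repartition(100).mapPartitions(myfuncPerPartition_)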
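The keyword clean-up chain in the reference code can be tried on its own, without HanLP or Spark. The sketch below reproduces only that chain. The article does not show the contents of stopChar, white_single_word or stopWords, so the sample sets here are invented purely for illustration.

```scala
object KeywordFilter {
  // Hypothetical sample values; the real lists are not shown in the article.
  val stopChar: Set[Char] = Set('，', '。', ' ')
  val whiteSingleWord: Set[Char] = Set('水')
  val stopWords: Set[String] = Set("我们")

  // Same steps, in the same order, as the chain in the reference code.
  def clean(raw: Seq[String], keyWordNum: Int): Seq[String] =
    raw
      .map(w => w.filterNot(c => stopChar.contains(c)))  // strip unwanted characters
      .filter(w => w.length > 1 ||
        (w.length == 1 && whiteSingleWord.contains(w(0)))) // drop short, non-whitelisted words
      .filterNot(w => stopWords.contains(w))               // drop stop words
      .take(keyWordNum)
      .distinct

  def main(args: Array[String]): Unit = {
    println(clean(Seq("水", "我们", "分词，", "a", "分词", "词典"), 6))
  }
}
```

Because take runs before distinct, a partition that yields duplicates can end up with fewer than keyWordNum keywords. Swapping the two calls would change that, if it matters for the job.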
