Distributed word segmentation with HanLP (1.7.0) in Spark: an example

2019-05-07 23:50:00

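As the HanLP README says, if you have no special requirements you can simply pull HanLP in through Maven. If you want to add a custom dictionary, you need to download the "dependency jar and user dictionary" package.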
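For the plain dependency setup mentioned above, an sbt equivalent of the Maven configuration might look like the line below. It is only a sketch: it assumes the portable build published under the coordinates `com.hankcs:hanlp` with version `portable-1.7.0`, so check the README of the release you actually use.

```scala
// build.sbt fragment (assumed coordinates of the HanLP portable build)
libraryDependencies += "com.hankcs" % "hanlp" % "portable-1.7.0"
```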

Here is the approach shared by an experienced user:

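Extract the source directly with "jar xf hanlp-1.6.8-sources.jar" and add the source files to the project. (Depending on a local jar is somewhat troublesome, and on the server the jar sometimes cannot be found.)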

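If you follow the documentation, HanLP running in Spark looks for its dictionaries in a local directory by default, so segmentation on the driver works without problems. For distributed segmentation, however, the dictionary directory has to be placed on HDFS, because only then can every machine access it (see the reference code below).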

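It is best to list the newly added dictionary first (when it was not first, it did not seem to take effect). On first use, HanLP builds a bin file from the newly added txt file, which is fairly slow. This only has to run once: the bin file is written to the HDFS path, and later runs are faster.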
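The dictionary setup described above can be sketched in Scala as follows. This is a minimal sketch, not the original author's code: the object `HanLPSetup`, the function `configureHanLP`, the `dictRoot` parameter, and the file name `my_words.txt` are hypothetical. It assumes that an HDFS-capable `IIOAdapter` (such as the `HadoopFileIoAdapter` in the reference code) is passed in, and that the remaining dictionary paths already point at HDFS, for example through `hanlp.properties`.

```scala
import com.hankcs.hanlp.HanLP
import com.hankcs.hanlp.corpus.io.IIOAdapter

object HanLPSetup {
  // Hypothetical helper: call it on each executor before the first HanLP call,
  // so that the settings are in place when the dictionaries get loaded.
  def configureHanLP(adapter: IIOAdapter, dictRoot: String): Unit = {
    // Route dictionary reads and writes through the adapter (e.g. HDFS).
    HanLP.Config.IOAdapter = adapter
    // The newly added dictionary is listed first, then the stock custom dictionary.
    HanLP.Config.CustomDictionaryPath = Array(
      s"$dictRoot/data/dictionary/custom/my_words.txt", // hypothetical file name
      s"$dictRoot/data/dictionary/custom/CustomDictionary.txt"
    )
  }
}
```

A call could look like `HanLPSetup.configureHanLP(new HadoopFileIoAdapter, "hdfs://namenode:8020/hanlp")`, where the HDFS URI is a placeholder.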

Note that, according to an issue, HanLP can only be used inside mapPartitions.

Reference Scala code

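import java.net.URI

import scala.collection.JavaConverters._
import scala.collection.mutable

import com.hankcs.hanlp.HanLP
import com.hankcs.hanlp.corpus.io.IIOAdapter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// IIOAdapter that lets HanLP read and write its dictionary files through a Hadoop FileSystem (e.g. HDFS).
class HadoopFileIoAdapter extends IIOAdapter {

  override def create(path: String): java.io.OutputStream = {
    val conf: Configuration = new Configuration()
    val fs: FileSystem = FileSystem.get(URI.create(path), conf)
    fs.create(new Path(path))
  }

  override def open(path: String): java.io.InputStream = {
    val conf: Configuration = new Configuration()
    val fs: FileSystem = FileSystem.get(URI.create(path), conf)
    fs.open(new Path(path))
  }
}

// stopChar, white_single_word and stopWords are collections defined elsewhere in the original project.
// Each input line is expected to have the form "<text>,<key>".
def myfuncPerPartition_(iter: Iterator[String]): Iterator[(Int, mutable.Buffer[String])] = {
  println("run in partition")
  val keyWordNum = 6
  // Set the adapter on the executor before HanLP loads any dictionary.
  HanLP.Config.IOAdapter = new HadoopFileIoAdapter
  val ret = iter.filter(_.split(",", 2).length == 2)
    // extractKeyword returns a java.util.List, so convert it with asScala before using Scala collection methods.
    .map(line => (line.split(",", 2)(1).trim.hashCode, HanLP.extractKeyword(line.split(",", 2)(0), keyWordNum).asScala
      .map(str => str.filterNot(stopChar.contains(_))).filter(w => (w.length > 1 || (w.length == 1 && white_single_word.contains(w(0)))))
      .filterNot(stopWords.contains(_)).take(keyWordNum).distinct))
  ret
}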

// Usage

raw_data.repartition(100).mapPartitions(myfuncPerPartition_)
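The per-line logic inside `myfuncPerPartition_` can also be factored into small pure functions, which makes it possible to check the parsing and filtering rules without Spark or HanLP. The sketch below is an illustration under stated assumptions, not the author's code: the object `KeywordCleaning`, the functions `parseLine` and `cleanKeywords`, and the contents of the three sets are hypothetical stand-ins for `stopChar`, `stopWords`, and `white_single_word`.

```scala
object KeywordCleaning {
  // Hypothetical stand-ins for the project's real stop lists.
  val stopChar: Set[Char] = Set('!', ',', '.')
  val stopWords: Set[String] = Set("the")
  val whiteSingleWord: Set[Char] = Set('水')

  // Split "<text>,<key>" once; lines without a comma are dropped (None).
  def parseLine(line: String): Option[(Int, String)] =
    line.split(",", 2) match {
      case Array(text, key) => Some((key.trim.hashCode, text))
      case _                => None
    }

  // Same filter chain as in the reference code: strip stop characters,
  // keep words longer than one character or whitelisted single characters,
  // drop stop words, then truncate and de-duplicate.
  def cleanKeywords(raw: Seq[String], limit: Int): Seq[String] =
    raw.map(_.filterNot(stopChar.contains))
      .filter(w => w.length > 1 || (w.length == 1 && whiteSingleWord.contains(w(0))))
      .filterNot(stopWords.contains)
      .take(limit)
      .distinct
}
```

Because `take` runs before `distinct`, as in the reference code, the result can contain fewer than `limit` keywords when duplicates are present.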
