Learning Lucene 5: Using the MMSeg4j Tokenizer

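mmseg4j is a Chinese word segmenter (tokenizer). Details follow:

The MMSeg algorithm has two segmentation methods, Simple and Complex, both based on forward maximum matching. Complex adds four rules that filter the candidate segmentations. The official claim is a word identification accuracy of 98.41%. mmseg4j implements both methods.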
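To make "forward maximum matching" concrete, here is a minimal sketch of the greedy idea behind the Simple method. It is not mmseg4j's implementation: the class name, the `maxWordLen` parameter and the toy English dictionary are all assumptions made for illustration, and ASCII letters stand in for Chinese characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: a toy greedy matcher, not mmseg4j's code.
class ForwardMaxMatchSketch {

    /** At each position, take the longest dictionary word; fall back to one character. */
    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxWordLen);
            // Shrink the candidate from the longest allowed length down to two characters.
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            // If nothing matched, end == i + 1 and a single character is emitted.
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList(
                "the", "theta", "table", "down", "there", "bled", "own"));
        // Prints theta|bled|own|there
        System.out.println(String.join("|", segment("thetabledownthere", dict, 5)));
    }
}
```

The greedy pass returns `theta|bled|own|there` although `the|table|down|there` is the more plausible reading. Ambiguities of this kind are what the extra filtering rules in Complex are meant to reduce.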

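In version 1.5, segmentation speed is about 1100 KB/s with Simple and about 700 KB/s with Complex (test machine: AMD Athlon 64 2800+, 1 GB RAM, Windows XP).

Version 1.6 adds max-word segmentation on top of Complex: “很好聽” -> "很好|好聽"; “中華人民共和國” -> "中華|華人|共和|國"; “中國人民銀行” -> "中國|人民|銀行".

In version 1.7-beta, Complex currently runs at about 1200 KB/s and Simple at about 1900 KB/s, but memory overhead is about 50 MB, whereas the previous versions used about 10 MB.

Unfortunately, the latest mmseg4j release, 1.9.1, does not support Lucene 5.0, so I modified its source code to make it work with Lucene 5.x. I won't go through every change here. I have uploaded my modified mmseg4j source to my Baidu netdisk and am sharing it with you:

Below is a simple usage example of the mmseg4j analyzers: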

package com.chenlb.mmseg4j.analysis;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Ignore;
import org.junit.Test;

/**
 * Tests for the MMSeg analyzers
 * @author lanxiaowei
 */
public class MMSegAnalyzerTest {

    String txt = "";

    @Before
    public void before() throws Exception {
        // Alternative sample texts; only the last assignment takes effect.
        txt = "京華時報２００９年1月23日報道昨天，受一股來自中西伯利亞的強冷空氣影響，本市出現大風降溫天氣，白天最高氣溫隻有零下7攝氏度，同時伴有6到7級的偏北風。";
        txt = "２００９年ゥスぁま是中ＡＢｃｃ國абвгαβγδ首次,我的ⅠⅡⅢ在chenёlbēū全國ㄦ範圍ㄚㄞㄢ内①ē②㈠㈩⒈⒑發行地方政府債券，";
        txt = "大s小3u盤浙bu盤t恤t台a股牛b";
    }

    @Test
    //@Ignore
    public void testSimple() throws IOException {
        Analyzer analyzer = new SimpleAnalyzer();
        displayTokens(analyzer,txt);
    }

    @Test
    @Ignore
    public void testComplex() throws IOException {
        //txt = "1999年12345日報道了一條新聞,2000年中法國足球比賽";
        /*txt = "第一卷雲天落日圓第一節偷歡不成倒大黴";
        txt = "中國人民銀行";
        txt = "我們";
        txt = "工信處女幹事每月經過下屬科室都要親口交代24口交換機等技術性器件的安裝工作";*/
        //ComplexSeg.setShowChunk(true);
        Analyzer analyzer = new ComplexAnalyzer();
        displayTokens(analyzer,txt);
    }

    @Test
    @Ignore
    public void testMaxWord() throws IOException {
        //txt = "第一卷雲天落日圓第一節偷歡不成倒大黴";
        //txt = "中國人民銀行";
        //txt = "下一個為什麼";
        //txt = "我們家門前的大水溝很難過";
        Analyzer analyzer = new MaxWordAnalyzer();
        displayTokens(analyzer,txt);
    }

    /*@Test
    public void testCutLeeterDigitFilter() {
        String myTxt = "mb991ch cq40-519tx mmseg4j ";
        List<String> words = toWords(myTxt, new MMSegAnalyzer("") {
            @Override
            protected TokenStreamComponents createComponents(String text) {
                Reader reader = new BufferedReader(new StringReader(text));
                Tokenizer t = new MMSegTokenizer(newSeg(), reader);
                return new TokenStreamComponents(t, new CutLetterDigitFilter(t));
            }
        });
        //Assert.assertArrayEquals("CutLeeterDigitFilter fail", words.toArray(new String[words.size()]), "mb 991 ch cq 40 519 tx mmseg 4 j".split(" "));
        for(String word : words) {
            System.out.println(word);
        }
    }*/

    public static void displayTokens(Analyzer analyzer,String text) throws IOException {
        TokenStream tokenStream = analyzer.tokenStream("text", text);
        displayTokens(tokenStream);
    }

    public static void displayTokens(TokenStream tokenStream) throws IOException {
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        PositionIncrementAttribute positionIncrementAttribute = tokenStream.addAttribute(PositionIncrementAttribute.class);
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        TypeAttribute typeAttribute = tokenStream.addAttribute(TypeAttribute.class);
        tokenStream.reset();
        int position = 0;
        while (tokenStream.incrementToken()) {
            int increment = positionIncrementAttribute.getPositionIncrement();
            if(increment > 0) {
                position = position + increment;
                System.out.print(position + ":");
            }
            int startOffset = offsetAttribute.startOffset();
            int endOffset = offsetAttribute.endOffset();
            String term = charTermAttribute.toString();
            System.out.println("[" + term + "]" + ":(" + startOffset + "-->" + endOffset + "):" + typeAttribute.type());
        }
        // Finish and release the stream so the analyzer can be reused.
        tokenStream.end();
        tokenStream.close();
    }

    /**
     * Asserts the tokenization result
     * @param analyzer
     * @param text source string
     * @param expecteds expected tokens
     * @throws IOException
     */
    public static void assertAnalyzerTo(Analyzer analyzer,String text,String[] expecteds) throws IOException {
        TokenStream tokenStream = analyzer.tokenStream("text", text);
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset();
        for(String expected : expecteds) {
            Assert.assertTrue(tokenStream.incrementToken());
            Assert.assertEquals(expected, charTermAttribute.toString());
        }
        Assert.assertFalse(tokenStream.incrementToken());
        tokenStream.close();
    }
}

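The mmseg4j tokenizer uses three dictionary files:

chars.dic is the Chinese character dictionary; it contains 12,638 characters.

units.dic holds Chinese unit words such as 小時 (hour), 分鐘 (minute), 米 (meter) and 厘米 (centimeter). Open the file to see the full list.

words.dic is the word dictionary, which is also where user-defined words go. Put words such as 麼麼哒, t恤 or 牛b in it and the tokenizer will treat each one as a single word.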

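When we use the mmseg4j tokenizer, we create an analyzer like this:

Analyzer analyzer = new SimpleAnalyzer();

Look at the SimpleAnalyzer constructor:

public SimpleAnalyzer() {
    super();
}

It calls the no-argument constructor of the parent class MMSegAnalyzer. Next, look at that constructor:

public MMSegAnalyzer() {
    dic = Dictionary.getInstance();
}

You can see that the dictionary files are loaded through the singleton accessor Dictionary.getInstance(). Next, look at the getInstance method: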

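The comments in that code are clear and explain how the dictionary files are loaded.

File path = getDefalutPath(); obtains the default dictionary path.

getInstance(path) is then called with that path to load the dictionary files. Next, look at that method:

It first looks the dictionary up in the cache dics. Only if the cache has no entry does it load the dictionary from the given path, and it then stores the loaded dictionary in the cache with dics.put().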
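The cache-then-load behaviour described above can be sketched as follows. This is a simplified illustration under stated assumptions, not mmseg4j's code: `Dict` is a hypothetical stand-in for `Dictionary`, the load counter exists only to make the caching visible, and normalising the key with `getAbsoluteFile()` is a choice made for this sketch.

```java
import java.io.File;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: one cached instance per dictionary path.
class DictionaryCacheSketch {

    /** Counts how many times a dictionary was actually "loaded". */
    static final AtomicInteger LOADS = new AtomicInteger();

    /** Hypothetical stand-in for Dictionary; it only remembers its path. */
    static final class Dict {
        final File path;

        Dict(File path) {
            this.path = path;
            LOADS.incrementAndGet(); // a real loader would read the .dic files here
        }
    }

    private static final ConcurrentHashMap<File, Dict> DICS = new ConcurrentHashMap<>();

    /** Returns the cached instance for the path, loading it only on the first request. */
    static Dict getInstance(File path) {
        File key = path.getAbsoluteFile(); // assumption: treat equal absolute paths as the same dictionary
        return DICS.computeIfAbsent(key, Dict::new);
    }

    public static void main(String[] args) {
        Dict a = getInstance(new File("data"));
        Dict b = getInstance(new File("data"));
        System.out.println(a == b);      // true: the second call hits the cache
        System.out.println(LOADS.get()); // 1: the dictionary was loaded once
    }
}
```

`computeIfAbsent` performs the lookup and the insertion as one step, so two threads asking for the same path do not both load it.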

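Next, let's see how the Dictionary is initialized. Look at the Dictionary constructor:

You will find that it initializes the dictionary by calling init(path). Continue to the init method:

It in turn calls reload to load the dictionaries. Follow the call into reload:

reload uses loadDic to load the words and chars dictionary files and loadUnit to load the units dictionary file. The loadDic source is quoted below, after a note on wordsLastTime. wordsLastTime stores the last modification time of each dictionary file. The map exists to support reloading: a file's last modification time is used to decide whether the file has changed, and if the map holds no modification time for a dictionary file, that file is new and has to be loaded into memory.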
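The bookkeeping that wordsLastTime is described as doing can be pictured with a small sketch. The class, the `needsReload` method and the temp-file helper are assumptions made for illustration; only the name `addLastTime` is borrowed from the quoted source, and its real body is not shown in this post.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: remember each file's modification time and compare it later.
class LastModifiedTrackerSketch {

    private final Map<File, Long> lastTimes = new HashMap<>();

    /** Records the file's current modification time. */
    void addLastTime(File file) {
        lastTimes.put(file, file.lastModified());
    }

    /** True if the file was never recorded or its modification time has changed. */
    boolean needsReload(File file) {
        Long recorded = lastTimes.get(file);
        return recorded == null || recorded != file.lastModified();
    }

    /** Helper for the demo: creates an empty temporary dictionary file. */
    static File newTempFile() {
        try {
            File f = Files.createTempFile("words", ".dic").toFile();
            f.deleteOnExit();
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        LastModifiedTrackerSketch tracker = new LastModifiedTrackerSketch();
        File f = newTempFile();
        System.out.println(tracker.needsReload(f)); // true: the file has not been seen yet
        tracker.addLastTime(f);
        System.out.println(tracker.needsReload(f)); // false: nothing changed since it was recorded
        // Simulate an edit by moving the timestamp forward, then check again.
        boolean touched = f.setLastModified(f.lastModified() + 10_000);
        System.out.println(touched + " " + tracker.needsReload(f));
    }
}
```

The loadDic listing follows.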

private Map<Character, CharNode> loadDic(File wordsPath) throws IOException {
    InputStream charsIn = null;
    File charsFile = new File(wordsPath, "chars.dic");
    if(charsFile.exists()) {
        charsIn = new FileInputStream(charsFile);
        addLastTime(charsFile); //chars.dic is also checked for changes
    } else { //load from the jar
        charsIn = this.getClass().getResourceAsStream("/data/chars.dic");
        charsFile = new File(this.getClass().getResource("/data/chars.dic").getFile()); //only for log
    }
    final Map<Character, CharNode> dic = new HashMap<Character, CharNode>();
    int lineNum = 0;
    long s = now();
    long ss = s;
    lineNum = load(charsIn, new FileLoading() { //single characters

        public void row(String line, int n) {
            if(line.length() < 1) {
                return;
            }
            String[] w = line.split(" ");
            CharNode cn = new CharNode();
            switch(w.length) {
            case 2:
                try {
                    cn.setFreq((int)(Math.log(Integer.parseInt(w[1]))*100));//derive the degree of freedom from the character frequency
                } catch(NumberFormatException e) {
                    //eat...
                }
                //falls through so that the character is stored in both cases
            case 1:
                dic.put(w[0].charAt(0), cn);
            }
        }
    });
    log.info("chars loaded time="+(now()-s)+"ms, line="+lineNum+", on file="+charsFile);

    //try load words.dic in jar
    InputStream wordsDicIn = this.getClass().getResourceAsStream("/data/words.dic");
    if(wordsDicIn != null) {
        File wordsDic = new File(this.getClass().getResource("/data/words.dic").getFile());
        loadWord(wordsDicIn, dic, wordsDic);
    }

    File[] words = listWordsFiles(); //only files named wordsXXX.dic
    if(words != null) { //extension dictionary directory
        for(File wordsFile : words) {
            loadWord(new FileInputStream(wordsFile), dic, wordsFile);
            addLastTime(wordsFile); //used to detect modifications
        }
    }

    log.info("load all dic use time="+(now()-ss)+"ms");
    return dic;
}
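The per-line handling of chars.dic in the listing (a character, optionally followed by a space and a frequency that is converted with `(int)(Math.log(freq) * 100)`) can be tried in isolation. The sketch below is a simplification for illustration: it uses a plain map instead of `CharNode`, and `NO_FREQ` is a placeholder chosen here, not a value taken from mmseg4j. ASCII letters stand in for the Chinese characters the real file contains.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: mirrors the row() logic of the quoted loader on in-memory lines.
class CharsDicParseSketch {

    static final int NO_FREQ = -1; // placeholder for "no usable frequency" in this sketch

    static Map<Character, Integer> parse(List<String> lines) {
        Map<Character, Integer> dic = new HashMap<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue;
            }
            String[] w = line.split(" ");
            // Like the switch in the listing, only one- and two-field lines are accepted.
            if (w.length < 1 || w.length > 2 || w[0].isEmpty()) {
                continue;
            }
            int freq = NO_FREQ;
            if (w.length == 2) {
                try {
                    freq = (int) (Math.log(Integer.parseInt(w[1])) * 100);
                } catch (NumberFormatException e) {
                    // Bad numbers are ignored, as in the listing; the character is still stored.
                }
            }
            dic.put(w[0].charAt(0), freq);
        }
        return dic;
    }

    public static void main(String[] args) {
        Map<Character, Integer> dic = parse(Arrays.asList("a 1000", "b", "c x"));
        System.out.println(dic.get('a')); // 690, since ln(1000) * 100 is about 690.78
        System.out.println(dic.get('b')); // -1: no frequency given
        System.out.println(dic.get('c')); // -1: frequency not a number
    }
}
```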

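In short, chars.dic is loaded first, then words.dic, and finally the user-defined dictionary files. Note that the name of a user-defined dictionary file must start with words and end with the .dic extension. This is the line that finds all user-defined dictionary files:

File[] words = listWordsFiles();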
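A lookup of this kind can be sketched as follows, assuming it is a plain `File.listFiles` call with a name filter. The class name, the `dicPath` parameter and the sample directory are made up for the demonstration and are not mmseg4j's API.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: find files named words*.dic directly inside one directory.
class ListWordsFilesSketch {

    static File[] listWordsFiles(File dicPath) {
        // The two-argument lambda is a FilenameFilter: accept(dir, name).
        return dicPath.listFiles((dir, name) -> name.startsWith("words") && name.endsWith(".dic"));
    }

    /** Sorted file names; listFiles returns null when the path is not a directory. */
    static List<String> names(File[] files) {
        List<String> result = new ArrayList<>();
        if (files != null) {
            for (File f : files) {
                result.add(f.getName());
            }
        }
        Collections.sort(result);
        return result;
    }

    /** Helper for the demo: a temp directory with matching and non-matching files. */
    static File sampleDir() {
        try {
            Path dir = Files.createTempDirectory("dic");
            Files.createFile(dir.resolve("words.dic"));
            Files.createFile(dir.resolve("words-custom.dic"));
            Files.createFile(dir.resolve("chars.dic"));
            Files.createFile(dir.resolve("words.txt"));
            Path sub = Files.createDirectory(dir.resolve("sub"));
            Files.createFile(sub.resolve("words-nested.dic"));
            return dir.toFile();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Prints [words-custom.dic, words.dic]
        System.out.println(names(listWordsFiles(sampleDir())));
    }
}
```

In this sketch `sub/words-nested.dic` is not returned, because `File.listFiles` only looks at the directory's immediate entries.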

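Note: dicPath.listFiles lists the files in the dicPath directory, where dicPath is the directory that contains our words.dic dictionary file, and the overridden accept method is the filter applied to each file name. The key line is the filter condition: it selects the .dic dictionary files whose names start with words in the folder that contains words.dic. Because File.listFiles only returns a directory's immediate entries, dictionary files in subfolders are not picked up. By now it should be clear how to define a custom user dictionary file. For beginners, I'll spell it out anyway; the steps are as follows: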

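That's all for mmseg4j. The Baidu netdisk download link for my modified source is given above; download it yourself. The jar is in the target directory.

The source package from my download link already contains a prebuilt jar in that directory. To make things easier, I will also upload the packaged jar as an attachment below.

OK, that's a wrap! If you still have questions, contact me on QQ (qq:7-3-6-0-3-1-3-0-5), or join my Java tech group to exchange ideas and learn with us. You are very welcome. Group number:

Reposted from: http://iamyida.iteye.com/blog/2207633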
