Lucene 5 Study Notes: Using the MMSeg4j Tokenizer

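mmseg4j is a Chinese word segmenter (tokenizer). Details follow.

The MMSeg algorithm has two segmentation methods, simple and complex, both based on forward maximum matching. Complex adds four disambiguation rules as filters. According to the official description, the word identification accuracy reaches 98.41%. mmseg4j implements both algorithms.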
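To make "forward maximum matching" concrete, here is a minimal sketch of the simple method. It is not mmseg4j's implementation: the class name `ForwardMaxMatch`, the toy dictionary, and the `maxWordLen` parameter are all assumptions made for illustration, and the four complex rules are not covered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy forward maximum matching (illustrative only, not mmseg4j code). */
final class ForwardMaxMatch {

    /** Splits text by always taking the longest dictionary word at the current position. */
    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> out = new ArrayList<String>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxWordLen);
            // Shrink the candidate until it is a dictionary word or a single character.
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            out.add(text.substring(i, end));
            i = end;
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary; a real one would be loaded from words.dic.
        Set<String> dict = new HashSet<String>(
                Arrays.asList("中国", "人民", "银行", "中国人民银行"));
        System.out.println(segment("中国人民银行", dict, 6));
        System.out.println(segment("中国人民很好", dict, 6));
    }
}
```

Because the longest match always wins, a dictionary that contains 中国人民银行 yields that whole string as one word, which is the kind of case the max-word mode described further down splits into shorter words.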

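In version 1.5, segmentation speed was about 1100 KB/s for the simple algorithm and about 700 KB/s for the complex algorithm (test machine: AMD Athlon 64 2800+, 1 GB RAM, Windows XP).

Version 1.6 added max-word segmentation on top of complex: “很好听” -> "很好|好听"; “中华人民共和国” -> "中华|华人|共和|国"; “中国人民银行” -> "中国|人民|银行".

Version 1.7-beta: complex currently runs at about 1200 KB/s and simple at about 1900 KB/s, but memory use is about 50 MB. Earlier versions used about 10 MB.

Unfortunately, the latest mmseg4j release, 1.9.1, does not support Lucene 5.0, so I modified its source to make it work with Lucene 5.x. I won't go through every change here. I have uploaded the modified mmseg4j source to my Baidu Netdisk and am sharing it with you:

Here is a simple example of using the mmseg4j analyzers: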

package com.chenlb.mmseg4j.analysis;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Ignore;
import org.junit.Test;

/**
 * Tests for the MMSegAnalyzer tokenizers.
 * @author Lanxiaowei
 */
public class MMSegAnalyzerTest {

    String txt = "";

    @Before
    public void before() throws Exception {
        // Only the last assignment takes effect; the earlier strings are alternative inputs.
        txt = "京华时报２００９年1月23日报道昨天，受一股来自中西伯利亚的强冷空气影响，本市出现大风降温天气，白天最高气温只有零下7摄氏度，同时伴有6到7级的偏北风。";
        txt = "２００９年ゥスぁま是中ＡＢｃｃ国абвгαβγδ首次,我的ⅠⅡⅢ在chenёlbēū全国ㄦ范围ㄚㄞㄢ内①ē②㈠㈩⒈⒑发行地方政府债券，";
        txt = "大s小3u盘浙bu盘t恤t台a股牛b";
    }

    @Test
    //@Ignore
    public void testSimple() throws IOException {
        Analyzer analyzer = new SimpleAnalyzer();
        displayTokens(analyzer, txt);
    }

    @Test
    @Ignore
    public void testComplex() throws IOException {
        //txt = "1999年12345日报道了一条新闻,2000年中法国足球比赛";
        /*txt = "第一卷云天落日圆第一节偷欢不成倒大霉";
        txt = "中国人民银行";
        txt = "我们";
        txt = "工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作";*/
        //ComplexSeg.setShowChunk(true);
        Analyzer analyzer = new ComplexAnalyzer();
        displayTokens(analyzer, txt);
    }

    @Test
    @Ignore
    public void testMaxWord() throws IOException {
        //txt = "第一卷云天落日圆第一节偷欢不成倒大霉";
        //txt = "中国人民银行";
        //txt = "下一个为什么";
        //txt = "我们家门前的大水沟很难过";
        Analyzer analyzer = new MaxWordAnalyzer();
        displayTokens(analyzer, txt);
    }

    /*@Test
    public void testCutLeeterDigitFilter() {
        String myTxt = "mb991ch cq40-519tx mmseg4j ";
        List<String> words = toWords(myTxt, new MMSegAnalyzer("") {
            @Override
            protected TokenStreamComponents createComponents(String text) {
                Reader reader = new BufferedReader(new StringReader(text));
                Tokenizer t = new MMSegTokenizer(newSeg(), reader);
                return new TokenStreamComponents(t, new CutLetterDigitFilter(t));
            }
        });
        //Assert.assertArrayEquals("CutLeeterDigitFilter fail", words.toArray(new String[words.size()]), "mb 991 ch cq 40 519 tx mmseg 4 j".split(" "));
        for(String word : words) {
            System.out.println(word);
        }
    }*/

    public static void displayTokens(Analyzer analyzer, String text) throws IOException {
        TokenStream tokenStream = analyzer.tokenStream("text", text);
        displayTokens(tokenStream);
    }

    /** Prints each token with its position, character offsets and type. */
    public static void displayTokens(TokenStream tokenStream) throws IOException {
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        PositionIncrementAttribute positionIncrementAttribute = tokenStream.addAttribute(PositionIncrementAttribute.class);
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        TypeAttribute typeAttribute = tokenStream.addAttribute(TypeAttribute.class);
        tokenStream.reset(); // required before the first incrementToken()
        int position = 0;
        while (tokenStream.incrementToken()) {
            int increment = positionIncrementAttribute.getPositionIncrement();
            if(increment > 0) {
                position = position + increment;
                System.out.print(position + ":");
            }
            int startOffset = offsetAttribute.startOffset();
            int endOffset = offsetAttribute.endOffset();
            String term = charTermAttribute.toString();
            System.out.println("[" + term + "]" + ":(" + startOffset + "-->" + endOffset + "):" + typeAttribute.type());
        }
        tokenStream.end();
        tokenStream.close(); // release the stream so the analyzer can be reused
    }

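    /**
     * Asserts that the analyzer produces exactly the expected tokens.
     * @param analyzer the analyzer under test
     * @param text the source string
     * @param expecteds the expected tokens, in order
     * @throws IOException
     */
    public static void assertAnalyzerTo(Analyzer analyzer, String text, String[] expecteds) throws IOException {
        TokenStream tokenStream = analyzer.tokenStream("text", text);
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset(); // required before the first incrementToken()
        for(String expected : expecteds) {
            Assert.assertTrue(tokenStream.incrementToken());
            Assert.assertEquals(expected, charTermAttribute.toString());
        }
        Assert.assertFalse(tokenStream.incrementToken());
        tokenStream.end();
        tokenStream.close();
    }
}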

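The mmseg4j tokenizer uses three dictionary files:

chars.dic is the Chinese character dictionary. It contains 12,638 characters.

units.dic holds Chinese unit words such as 小时 (hour), 分钟 (minute), 米 (metre) and 厘米 (centimetre). Open the file and it will be obvious.

words.dic is the word dictionary, and it is where custom words go. For example, once 么么哒, t恤 and 牛b are listed in it, the tokenizer treats each of them as a single word.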

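When we use the mmseg4j tokenizer, we write:

Analyzer analyzer = new SimpleAnalyzer();

Looking at the SimpleAnalyzer constructor:

public SimpleAnalyzer() {
    super();
}

It calls the no-argument constructor of its parent class MMSegAnalyzer, so look at that constructor next:

public MMSegAnalyzer() {
    dic = Dictionary.getInstance();
}

The dictionary files are loaded through Dictionary.getInstance(), a singleton-style accessor. Next, look at the getInstance method.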

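The comments in that code explain clearly how the dictionary files are located and loaded.

File path = getDefalutPath(); obtains the default dictionary directory.

getInstance(path) is then called with that directory to load the dictionary files. Looking at that method:

It first looks for the dictionary in the dics cache. Only when nothing is cached does it load from the given path, after which it stores the loaded dictionary in the cache with dics.put().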
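The cache-then-load pattern described here can be sketched as follows. This is not mmseg4j's source: `CachedDictionary` and its fields are hypothetical names, and the choice of `ConcurrentHashMap` with `putIfAbsent` is my own assumption about how to keep one instance per directory.

```java
import java.io.File;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a dictionary; it only remembers its source directory. */
final class CachedDictionary {
    // One instance per normalized directory (the role the text gives to "dics").
    private static final ConcurrentHashMap<File, CachedDictionary> CACHE =
            new ConcurrentHashMap<File, CachedDictionary>();

    final File path;

    private CachedDictionary(File path) {
        this.path = path; // a real dictionary would read its files here
    }

    /** Returns the cached instance for this directory, creating it on first use. */
    static CachedDictionary getInstance(File path) {
        File key = path.getAbsoluteFile();
        CachedDictionary dic = CACHE.get(key);
        if (dic == null) {
            CachedDictionary created = new CachedDictionary(key);
            // putIfAbsent keeps the first instance if two threads race.
            CachedDictionary previous = CACHE.putIfAbsent(key, created);
            dic = (previous != null) ? previous : created;
        }
        return dic;
    }
}
```

Keying the cache by directory means that analyzers created for the same dictionary path share one in-memory dictionary instead of each loading their own copy.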

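Next, how a Dictionary is initialized. Looking at the Dictionary constructor source:

Initialization is delegated to the init(path); method. Following init:

It in turn calls reload to load the dictionaries. Following reload:

It loads the words and chars dictionary files through loadDic and the units dictionary file through loadUnit. wordsLastTime stores the last-modified time of each dictionary file. The map exists to support reloading: comparing a file's current last-modified time with the stored one tells whether the file has changed, and a file with no entry in the map is newly added and must be loaded into memory. This is the source of loadDic: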
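A minimal sketch of change detection by last-modified time, under the assumption that it works as the paragraph above describes. `DicChangeTracker`, `remember` and `needsReload` are illustrative names, not mmseg4j methods.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

/** Illustrative change tracker keyed by file; not taken from mmseg4j. */
final class DicChangeTracker {
    // File -> modification time seen when the file was last loaded.
    private final Map<File, Long> lastTimes = new HashMap<File, Long>();

    /** Records the file's current modification time, typically right after loading it. */
    void remember(File f) {
        lastTimes.put(f, f.lastModified());
    }

    /** True if the file was never recorded, or its modification time differs from the record. */
    boolean needsReload(File f) {
        Long recorded = lastTimes.get(f);
        return recorded == null || recorded.longValue() != f.lastModified();
    }
}
```

A loader would call `remember` after reading each file and poll `needsReload` later to decide whether to read it again.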

private Map<Character, CharNode> loadDic(File wordsPath) throws IOException {
    InputStream charsIn = null;
    File charsFile = new File(wordsPath, "chars.dic");
    if(charsFile.exists()) {
        charsIn = new FileInputStream(charsFile);
        addLastTime(charsFile); // chars.dic is also checked for changes
    } else { // load from the jar
        charsIn = this.getClass().getResourceAsStream("/data/chars.dic");
        charsFile = new File(this.getClass().getResource("/data/chars.dic").getFile()); //only for log
    }
    final Map<Character, CharNode> dic = new HashMap<Character, CharNode>();
    int lineNum = 0;
    long s = now();
    long ss = s;
    lineNum = load(charsIn, new FileLoading() { // single characters
        public void row(String line, int n) {
            if(line.length() < 1) {
                return;
            }
            String[] w = line.split(" ");
            CharNode cn = new CharNode();
            switch(w.length) {
            case 2:
                try {
                    cn.setFreq((int)(Math.log(Integer.parseInt(w[1]))*100)); // degree of freedom derived from the character frequency
                } catch(NumberFormatException e) {
                    //eat...
                }
                // falls through to case 1 so the character is stored either way
            case 1:
                dic.put(w[0].charAt(0), cn);
            }
        }
    });
    log.info("chars loaded time="+(now()-s)+"ms, line="+lineNum+", on file="+charsFile);

    //try load words.dic in jar
    InputStream wordsDicIn = this.getClass().getResourceAsStream("/data/words.dic");
    if(wordsDicIn != null) {
        File wordsDic = new File(this.getClass().getResource("/data/words.dic").getFile());
        loadWord(wordsDicIn, dic, wordsDic);
    }

    File[] words = listWordsFiles(); // only files named wordsXXX.dic
    if(words != null) { // extension dictionary directory
        for(File wordsFile : words) {
            loadWord(new FileInputStream(wordsFile), dic, wordsFile);
            addLastTime(wordsFile); // used to detect modification
        }
    }

    log.info("load all dic use time="+(now()-ss)+"ms");
    return dic;
}
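The row callback in loadDic treats each chars.dic line as a character optionally followed by a count, and stores (int)(Math.log(count) * 100) as the character's weight. Here is a standalone sketch of that parsing step. `CharsLineParser`, the `Map<Character, Integer>` result type, and the default weight of 0 for lines without a usable count are my assumptions; the original stores CharNode objects, whose default is not shown.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative parser for lines shaped like "的 1000" or "的"; not mmseg4j code. */
final class CharsLineParser {

    /** Adds the line's character to freqs and returns true, or returns false if the line is unusable. */
    static boolean parseLine(String line, Map<Character, Integer> freqs) {
        if (line.length() < 1) {
            return false;
        }
        String[] w = line.split(" ");
        if (w.length < 1 || w.length > 2 || w[0].isEmpty()) {
            return false;
        }
        int weight = 0; // assumed default when no valid count is present
        if (w.length == 2) {
            try {
                // Same formula as the row callback: natural log of the count, scaled by 100.
                weight = (int) (Math.log(Integer.parseInt(w[1])) * 100);
            } catch (NumberFormatException e) {
                // Keep the default weight, mirroring the "eat" in the original.
            }
        }
        freqs.put(w[0].charAt(0), weight);
        return true;
    }

    public static void main(String[] args) {
        Map<Character, Integer> freqs = new HashMap<Character, Integer>();
        parseLine("的 1000", freqs);
        parseLine("中", freqs);
        System.out.println(freqs);
    }
}
```

Taking the logarithm compresses the very wide range of raw character counts into a small integer range, which is presumably why the weight is not stored as the raw count.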

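The overall logic is: load chars.dic first, then words.dic, and finally the user-defined dictionary files. Note that a user-defined dictionary file must have a name that starts with words and ends with the .dic suffix. The statement that finds all user-defined dictionary files is:

File[] words = listWordsFiles();

Note: dicPath.listFiles lists the files in the dicPath directory, where dicPath is the directory that contains our words.dic file, and the overridden accept method decides which of them are kept. The key line filters for .dic files whose names start with words. File.listFiles only looks at the direct children of that directory, so dictionary files in subfolders are not picked up. With that, how to add your own dictionary should be clear. For newcomers, the essentials are: name the file so that it starts with words and ends with .dic, and put it in the same directory as words.dic.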
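The filter can be sketched with java.io.File.listFiles(FilenameFilter). The code below is written from the description (prefix words, suffix .dic) rather than copied from mmseg4j, and `WordsFileLister` is a hypothetical name.

```java
import java.io.File;
import java.io.FilenameFilter;

/** Illustrative lister for words*.dic files; not taken from mmseg4j. */
final class WordsFileLister {

    /**
     * Returns the direct children of dir named words*.dic.
     * Returns null if dir does not exist or is not a directory, as File.listFiles does.
     */
    static File[] listWordsFiles(File dir) {
        return dir.listFiles(new FilenameFilter() {
            @Override
            public boolean accept(File parent, String name) {
                return name.startsWith("words") && name.endsWith(".dic");
            }
        });
    }

    public static void main(String[] args) {
        // Lists matching files in the directory given on the command line, or "data" by default.
        File[] found = listWordsFiles(new File(args.length > 0 ? args[0] : "data"));
        if (found != null) {
            for (File f : found) {
                System.out.println(f.getName());
            }
        }
    }
}
```

Because listFiles can return null, callers need the same null check that loadDic performs before iterating over the result.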

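That is all for mmseg4j. The Baidu Netdisk download address for my modified source is given above, so download it from there. The jar is in the target directory.

The source package from that download address already contains the built jar, in the target directory just mentioned. For convenience, I will also upload the built jar as an attachment below.

OK, that's a wrap! If you have any other questions, contact me on QQ (qq:7-3-6-0-3-1-3-0-5), or join my Java technical group to learn together with us. You are very welcome. Group number: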

Reposted from: http://iamyida.iteye.com/blog/2207633
