Integrating the Chinese Word Segmentation Library IKAnalyzer with Apache Lucene 5.x

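An earlier post walked through an Apache Lucene 5.x example. To support Chinese word segmentation, we can use the Chinese word segmentation library IKAnalyzer.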

IKAnalyzer is built against the 4.x version of the Analyzer API, which is incompatible with 5.x. To use IKAnalyzer with 5.x, we therefore have to implement the 5.x API ourselves.

Reading the source code shows that two classes need to be changed.

The first is the Tokenizer. We write an IKTokenizer5x:

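import java.io.IOException;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.AttributeFactory;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

/**
 * IKTokenizer with support for Lucene 5.x
 *
 * @author
 */
// final: Lucene asserts that TokenStream implementations are final
// (or at least have a final incrementToken()).
public final class IKTokenizer5x extends Tokenizer {
    // The IK segmenter that does the actual word segmentation
    private IKSegmenter _IKImplement;
    // Term text, character offsets and lexeme type of the current token
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
    // End position of the last lexeme, reported in end()
    private int endPosition;

    // 5.x constructors take no Reader; the real input is bound in reset()
    public IKTokenizer5x() {
        this._IKImplement = new IKSegmenter(this.input, true);
    }

    public IKTokenizer5x(boolean useSmart) {
        this._IKImplement = new IKSegmenter(this.input, useSmart);
    }

    public IKTokenizer5x(AttributeFactory factory) {
        super(factory);
        this._IKImplement = new IKSegmenter(this.input, true);
    }

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        Lexeme nextLexeme = this._IKImplement.next();
        if (nextLexeme != null) {
            // Copy the lexeme into the token attributes
            this.termAtt.append(nextLexeme.getLexemeText());
            this.termAtt.setLength(nextLexeme.getLength());
            this.offsetAtt.setOffset(nextLexeme.getBeginPosition(), nextLexeme.getEndPosition());
            this.endPosition = nextLexeme.getEndPosition();
            this.typeAtt.setType(nextLexeme.getLexemeTypeString());
            return true;
        } else {
            // No more lexemes
            return false;
        }
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        // Hand the reader supplied through setReader() to the segmenter
        this._IKImplement.reset(this.input);
    }

    @Override
    public void end() throws IOException {
        // Let the base class set the end-of-stream state first
        super.end();
        int finalOffset = correctOffset(this.endPosition);
        this.offsetAtt.setOffset(finalOffset, finalOffset);
    }
}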

This class is only a lightly modified copy of IKTokenizer. The change from the original is in the constructor public IKTokenizer(Reader in, boolean useSmart), which no longer takes a Reader parameter.
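To make the effect of dropping the Reader parameter concrete, here is a minimal sketch of the lifecycle using only the JDK. It is not Lucene code: ReaderLifecycleSketch, WhitespaceSketchTokenizer and tokenize are hypothetical names made up for illustration, and whitespace splitting stands in for real segmentation. The sketch assumes the 5.x contract in which the tokenizer is constructed without input, a reader is handed over with setReader(), and reset() makes it the active input.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a 5.x-style tokenizer lifecycle; not a Lucene class. */
public class ReaderLifecycleSketch {

    /** Splits on whitespace; the reader is supplied after construction. */
    static class WhitespaceSketchTokenizer {
        private Reader pending; // handed over by setReader(), not yet active
        private Reader input;   // active reader, bound in reset()
        private String term;    // text of the current token

        // 5.x style: the constructor takes no Reader.
        WhitespaceSketchTokenizer() { }

        void setReader(Reader reader) {
            this.pending = reader;
        }

        void reset() {
            if (pending == null) {
                throw new IllegalStateException("setReader() must be called before reset()");
            }
            input = pending;
            pending = null;
        }

        boolean incrementToken() throws IOException {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = input.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (sb.length() > 0) {
                        break; // end of the current token
                    }
                } else {
                    sb.append((char) c);
                }
            }
            if (sb.length() == 0) {
                return false; // input exhausted
            }
            term = sb.toString();
            return true;
        }

        String term() {
            return term;
        }
    }

    /** Runs one setReader/reset/incrementToken cycle and collects the tokens. */
    static List<String> tokenize(WhitespaceSketchTokenizer tokenizer, String text) {
        try {
            tokenizer.setReader(new StringReader(text));
            tokenizer.reset();
            List<String> tokens = new ArrayList<>();
            while (tokenizer.incrementToken()) {
                tokens.add(tokenizer.term());
            }
            return tokens;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // One instance serves several inputs because the reader is not fixed at construction.
        WhitespaceSketchTokenizer tokenizer = new WhitespaceSketchTokenizer();
        System.out.println(tokenize(tokenizer, "IK Analyzer")); // prints [IK, Analyzer]
        System.out.println(tokenize(tokenizer, "Lucene 5.x"));  // prints [Lucene, 5.x]
    }
}
```

This is also why IKTokenizer5x calls _IKImplement.reset(this.input) inside reset(): at construction time there is no real input yet, so the segmenter can only be pointed at the actual reader once reset() runs.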

The other class is IKAnalyzer5x, which extends Analyzer:

import org.apache.lucene.analysis.Analyzer;

/**
 * IKAnalyzer with support for Lucene 5.x
 *
 * @author
 */
public class IKAnalyzer5x extends Analyzer {

    // true selects IK's smart segmentation mode
    private boolean useSmart;

    public boolean useSmart() {
        return this.useSmart;
    }

    public void setUseSmart(boolean useSmart) {
        this.useSmart = useSmart;
    }

    public IKAnalyzer5x() {
        this(false);
    }

    public IKAnalyzer5x(boolean useSmart) {
        this.useSmart = useSmart;
    }

    // 5.x signature: the Reader is no longer passed in here
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        IKTokenizer5x _IKTokenizer = new IKTokenizer5x(this.useSmart);
        return new TokenStreamComponents(_IKTokenizer);
    }
}

The method this class implements changed from protected TokenStreamComponents createComponents(String fieldName, Reader in) to protected TokenStreamComponents createComponents(String fieldName). Its implementation uses the IKTokenizer5x created above.
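The following JDK-only sketch models why createComponents can work without a Reader: the components are built once and only the reader is swapped on later calls. It is a simplified model, not Lucene code. ComponentReuseSketch, SketchComponents and SketchAnalyzer are hypothetical names, and the per-field HashMap cache is an assumption made to keep the example short; Lucene itself uses a pluggable reuse strategy with per-thread storage.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of analyzer component reuse; not Lucene code. */
public class ComponentReuseSketch {

    /** Stand-in for token stream components whose reader is attached after construction. */
    static class SketchComponents {
        Reader reader;

        void setReader(Reader reader) {
            this.reader = reader;
        }
    }

    /** Stand-in for an analyzer with a 5.x-style createComponents(fieldName). */
    static class SketchAnalyzer {
        // Simplification: cache per field name in a plain map.
        private final Map<String, SketchComponents> cache = new HashMap<>();
        int created = 0; // counts how often components were actually built

        // No Reader parameter: components are built without any input.
        protected SketchComponents createComponents(String fieldName) {
            created++;
            return new SketchComponents();
        }

        // Build components on first use, afterwards only swap the reader.
        SketchComponents tokenStream(String fieldName, Reader reader) {
            SketchComponents components = cache.get(fieldName);
            if (components == null) {
                components = createComponents(fieldName);
                cache.put(fieldName, components);
            }
            components.setReader(reader);
            return components;
        }
    }

    public static void main(String[] args) {
        SketchAnalyzer analyzer = new SketchAnalyzer();
        SketchComponents first = analyzer.tokenStream("content", new StringReader("one"));
        SketchComponents second = analyzer.tokenStream("content", new StringReader("two"));
        System.out.println(first == second);  // prints true
        System.out.println(analyzer.created); // prints 1
    }
}
```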

Once the classes above are defined, simply use IKAnalyzer5x in Lucene.

Here is a simple test for IKAnalyzer5x:

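import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

/**
 * IKAnalyzer5x test
 *
 * @author
 */
public class IKAnalyzer5xTest {
    public static void main(String[] args) throws IOException {
        // true: use IK's smart segmentation mode
        Analyzer analyzer = new IKAnalyzer5x(true);
        // The Chinese sample text is the input under test
        TokenStream ts = analyzer.tokenStream("field",
                new StringReader(
                    "IK Analyzer 是一個開源的，基于java語言開發的輕量級的中文分詞工具包。" +
                    "從2006年12月推出1.0版開始， IKAnalyzer已經推出了4個大版本。" +
                    "最初，它是以開源項目Luence為應用主體的，" +
                    "結合詞典分詞和文法分析算法的中文分詞元件。從3.0版本開始，" +
                    "IK發展為面向Java的公用分詞元件，獨立于Lucene項目，" +
                    "同時提供了對Lucene的預設優化實作。在2012版本中，" +
                    "IK實作了簡單的分詞歧義排除算法，" +
                    "标志着IK分詞器從單純的詞典分詞向模拟語義分詞衍化。"));

        OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
        try {
            // reset() must be called before the first incrementToken()
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(offsetAtt.toString());
            }
            ts.end();
        } finally {
            // Release the stream so the analyzer can reuse its components
            ts.close();
        }
    }
}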

Output:

[Two screenshots of the console output appeared here in the original post.]
