1. Fundamentals
(1) Key concepts
Analysis, in Lucene, is the process of converting field (Field) text into the most basic unit of index representation, the term (Term). During searching, these terms are used to determine which documents match the query conditions.
An analyzer encapsulates this analysis operation. It runs a series of steps that turn text into tokens, a process also called tokenization; the chunks of text extracted from the text stream are the tokens (token). A token combined with its field name forms a term.
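The minimal sketch below illustrates these ideas. It assumes the Lucene 3.x API (Version.LUCENE_36, the field name "contents", and the sample sentence are illustrative choices, not taken from the original text): an analyzer breaks a string into tokens, and each token plus its field name yields a Term.

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.index.Term;
import org.apache.lucene.util.Version;

public class AnalysisBasics {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new SimpleAnalyzer(Version.LUCENE_36);
        // Tokenization: break the field text into tokens.
        TokenStream stream = analyzer.tokenStream("contents",
                new StringReader("Analysis turns TEXT into terms"));
        CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            // A token combined with its field name is a Term, the unit the index stores.
            Term term = new Term("contents", termAttr.toString());
            System.out.println(term);   // prints e.g. contents:analysis
        }
        stream.close();
    }
}
```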
(2) When analyzers are used
During indexing, when documents are added to the index
When searching with a QueryParser object (see the sketch after this list)
When highlighting matching text in search results
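The QueryParser case deserves a small illustration: QueryParser analyzes the user's query text with the analyzer it is given, so the same (or a compatible) analyzer used at indexing time should normally be passed to it. A sketch assuming the Lucene 3.x QueryParser API; the field name "contents" and the query string are arbitrary examples:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class QueryParsing {
    public static void main(String[] args) throws Exception {
        // Use the same analyzer at search time as at indexing time, so that
        // query terms are normalized the same way as the indexed terms.
        QueryParser parser = new QueryParser(Version.LUCENE_36, "contents",
                new StandardAnalyzer(Version.LUCENE_36));
        Query query = parser.parse("Quick AND Foxes");
        System.out.println(query);   // likely prints: +contents:quick +contents:foxes
    }
}
```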
(3) The four most commonly used analyzers (a sketch comparing their output follows this list):
WhitespaceAnalyzer, as the name implies, simply splits text into tokens on whitespace characters and makes no other effort to normalize the tokens.
SimpleAnalyzer first splits tokens at non-letter characters, then lowercases each token. Be careful! This analyzer quietly discards numeric characters.
StopAnalyzer is the same as SimpleAnalyzer, except it removes common words (called stop words, described more in section XXX). By default it removes common words in the English language (the, a, etc.), though you can pass in your own set.
StandardAnalyzer is Lucene’s most sophisticated core analyzer. It has quite a bit of logic to identify certain kinds of tokens, such as company names, e-mail addresses, and host names. It also lowercases each token, removes stop words, and discards punctuation.
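To make the differences concrete, the sketch below runs one sample sentence through all four analyzers and prints the tokens each one produces. It is written against the Lucene 3.x API (constructors taking a Version constant; signatures differ in other Lucene versions), and the sample text is arbitrary:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerComparison {
    private static final String TEXT = "The quick brown FOX jumped over 2 lazy dogs";

    public static void main(String[] args) throws Exception {
        Analyzer[] analyzers = {
            new WhitespaceAnalyzer(Version.LUCENE_36),
            new SimpleAnalyzer(Version.LUCENE_36),
            new StopAnalyzer(Version.LUCENE_36),
            new StandardAnalyzer(Version.LUCENE_36)
        };
        for (Analyzer analyzer : analyzers) {
            System.out.print(analyzer.getClass().getSimpleName() + ": ");
            TokenStream stream = analyzer.tokenStream("contents", new StringReader(TEXT));
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.print("[" + term + "] ");
            }
            stream.close();
            System.out.println();
        }
    }
}
```

WhitespaceAnalyzer should keep "The", "FOX", and "2" untouched; SimpleAnalyzer lowercases everything and drops the "2"; StopAnalyzer additionally removes "the" and "over"; StandardAnalyzer lowercases, removes stop words, and keeps the "2".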
2. Other usage notes
When creating an IndexWriter, an analyzer must be specified, for example:
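A minimal sketch, assuming the Lucene 3.x IndexWriterConfig API (RAMDirectory, the field name "contents", and StandardAnalyzer are illustrative choices):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class IndexingExample {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        // The analyzer configured here is applied to every analyzed field
        // of every document added through this writer.
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36,
                new StandardAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(dir, config);

        Document doc = new Document();
        doc.add(new Field("contents", "Lucene analysis example",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();
    }
}
```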
Alternatively, each time a document is added to the writer, an analyzer can be specified for that particular document, for example:
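A minimal sketch of the per-document overload, again assuming the Lucene 3.x API, where IndexWriter.addDocument(Document, Analyzer) lets one document be analyzed differently from the writer's default (the analyzers and field name are illustrative):

```java
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class PerDocumentAnalyzer {
    public static void main(String[] args) throws Exception {
        // Writer-level default analyzer.
        IndexWriter writer = new IndexWriter(new RAMDirectory(),
                new IndexWriterConfig(Version.LUCENE_36,
                        new StandardAnalyzer(Version.LUCENE_36)));

        Document doc = new Document();
        doc.add(new Field("contents", "per-document analysis",
                Field.Store.YES, Field.Index.ANALYZED));
        // The second argument overrides the writer's default analyzer
        // for this document only.
        writer.addDocument(doc, new WhitespaceAnalyzer(Version.LUCENE_36));
        writer.close();
    }
}
```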