Analysis, in Lucene, is the process of converting field text into its most fundamental indexed representation: terms.
The exact steps of analysis vary, but they typically include: extracting words, discarding punctuation, removing accents from characters, lowercasing (also called normalizing), removing common words, reducing words to a root form (stemming), or changing words into their basic form (lemmatization).
Choosing the right analyzer is a crucial development decision with Lucene, and one size doesn't fit all. Language is one factor in choosing an analyzer, because each language has its own unique features. Another factor to consider is the domain of the text being analyzed.
So how does Google analyze queries? If you search Google for to be or not to be without quotes, only "not" is actually taken into account; the other words are dropped as common (stop) words. Yet Google does not drop those stop words at index time, because if you put the phrase in quotes you can still find documents containing it. An interesting question: if such an enormous volume of stop words is indexed, what can Google do to save space and avoid running out of storage? This is discussed later, in the Nutch section.
An example: analyzing the same string with different analyzers gives different results:
analyzing "xy&z corporation - [email protected]"
whitespaceanalyzer:
[xy&z] [corporation] [-] [[email protected]]
simpleanalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
stopanalyzer:
standardanalyzer:
[xy&z] [corporation] [[email protected]]
I. Using analyzers
1. Indexing analysis
At indexing time, an analyzer is used like this:
Analyzer analyzer = new StandardAnalyzer();
IndexWriter writer = new IndexWriter(directory, analyzer, true);
You can also specify a particular analyzer per document:
writer.addDocument(doc, analyzer);
2. QueryParser analysis
QueryParser also needs an analyzer when parsing the query expression entered by the user:
QueryParser parser = new QueryParser("contents", analyzer);
Query query = parser.parse(expression);
or:
Query query = QueryParser.parse(expression, "contents", analyzer);
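Putting the two halves together, a minimal end-to-end sketch (assuming the Lucene 1.4-era API used throughout this post; the class name and the in-memory RAMDirectory are just for illustration):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

public class AnalyzerUsageSketch {
  public static void main(String[] args) throws Exception {
    RAMDirectory directory = new RAMDirectory();
    Analyzer analyzer = new StandardAnalyzer();

    // The same analyzer is used at index time...
    IndexWriter writer = new IndexWriter(directory, analyzer, true);
    Document doc = new Document();
    doc.add(Field.Text("contents", "The quick brown fox jumps over the lazy dogs"));
    writer.addDocument(doc);
    writer.close();

    // ...and again at query-parsing time, so both sides produce the same terms.
    Query query = QueryParser.parse("quick fox", "contents", analyzer);
    IndexSearcher searcher = new IndexSearcher(directory);
    Hits hits = searcher.search(query);
    System.out.println("hits: " + hits.length());
    searcher.close();
  }
}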
II. Analyzing the analyzer
So what does an analyzer actually do? As stated above, it turns text into a sequence of tokens, a TokenStream, so every analyzer exposes this method:
public TokenStream tokenStream(String fieldName, Reader reader)
For the simplest one, SimpleAnalyzer, the implementation looks like this:
public final class SimpleAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseTokenizer(reader);
  }
}
1. What's in a token?
A stream of tokens is the fundamental output of the analysis process.
In other words, the text is chopped into small pieces (word segmentation, if you like), and each piece is a token.
So what is the difference between a token and a term? First, the token.
A token consists of:
Text value (the word itself)
Start and end offsets in the original text
Token type: the default is "word"
Position increment: the position relative to the previous token, i.e. how far this token is from the previous one; usually 1 by default
And what is a term?
After text is analyzed during indexing, each token is posted to the index as a term.
A term, however, records only the text value and the position increment; all the other token information is thrown away.
Why is everything else thrown away while the position increment is kept?
Position increments factor directly into performing phrase queries (see section 3.4.5) and span queries (see section 5.4), which rely on knowing how far terms are from one another within a field.
In short, position increments are needed for phrase queries and span queries.
The position increment defaults to 1. A value greater than 1 means some words were skipped over, words that for some reason did not become tokens, such as stop words.
A value of 0 is more interesting: it means this token overlaps the previous one, which is handy for injecting word aliases; the SynonymAnalyzer discussed later uses exactly this.
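A minimal sketch of the old Token API (the words, offsets, and class name here are hypothetical, chosen only to illustrate position increments):

import org.apache.lucene.analysis.Token;

public class TokenSketch {
  public static void main(String[] args) {
    // A token for "jumps" at some hypothetical offsets in the original text,
    // one position after the previous token (the default increment of 1).
    Token jumps = new Token("jumps", 20, 25, "word");

    // A synonym injected at the same position: a position increment of 0
    // means it overlaps the previous token, so a phrase query can match
    // either word at that slot.
    Token hops = new Token("hops", 20, 25, "word");
    hops.setPositionIncrement(0);

    System.out.println(jumps.termText() + " increment=" + jumps.getPositionIncrement());
    System.out.println(hops.termText() + " increment=" + hops.getPositionIncrement());
  }
}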
2. TokenStreams
There are two different styles of TokenStreams: Tokenizer and TokenFilter.
What is the difference?
A Tokenizer takes raw text as input and produces the initial tokens; examples are CharTokenizer, WhitespaceTokenizer, and StandardTokenizer.
A TokenFilter takes another TokenStream as input, so it post-processes tokens that have already been produced; it sits one level up. Examples are LowerCaseFilter, StopFilter, and PorterStemFilter.
Take StopAnalyzer as an example:
public TokenStream tokenStream(String fieldName, Reader reader) {
  return new StopFilter(
      new LowerCaseTokenizer(reader), stopTable);
}
So StopAnalyzer works in two steps: LowerCaseTokenizer does the initial tokenization, and then StopFilter removes the stop words.
Layering analyzers this way keeps things flexible: by chaining tokenizers and filters you can assemble analyzers for all kinds of purposes, as the sketch below illustrates.
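For instance, a hypothetical analyzer (my own example, not a Lucene built-in) that lowercases, removes stop words, and then stems, written against the same 1.4-era API:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;

public class LowercaseStopStemAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Tokenizer first, then filters wrapped around it, innermost to outermost.
    return new PorterStemFilter(
        new StopFilter(
            new LowerCaseTokenizer(reader),
            StopAnalyzer.ENGLISH_STOP_WORDS));
  }
}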
3. Visualizing analyzers
The author provides a small utility that prints an analyzer's output; with it, the behavior of each analyzer is much easier to understand (a sketch of such a utility is given after the examples below).
Analyzing "the quick brown fox...." produces tokens like this:
1: [the:0->3:word]
2: [quick:4->9:word]
3: [brown:10->15:word]
4: [fox:16->19:word]
Earlier we said that the start/end offsets and the token type are discarded when a token becomes a term. Does that mean they are useless?
No. For example, search engines now commonly offer term highlighting, which relies on the start and end offsets.
As for what the token type is good for, look at the following example:
"I'll e-mail you at [email protected]" with StandardAnalyzer:
1: [i'll:0->4:<APOSTROPHE>]
2: [e:5->6:<ALPHANUM>]
3: [mail:7->11:<ALPHANUM>]
4: [you:12->15:<ALPHANUM>]
5: [[email protected]:19->34:<EMAIL>]
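A minimal sketch of such a token-printing utility (loosely modeled on the book's AnalyzerUtils; the class name DisplayTokens is my own):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class DisplayTokens {
  // Print each token as position: [text:start->end:type].
  public static void displayTokens(Analyzer analyzer, String text) throws Exception {
    TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
    int position = 0;
    for (Token token = stream.next(); token != null; token = stream.next()) {
      position += token.getPositionIncrement();
      System.out.println(position + ": [" + token.termText() + ":"
          + token.startOffset() + "->" + token.endOffset()
          + ":" + token.type() + "]");
    }
  }

  public static void main(String[] args) throws Exception {
    displayTokens(new StandardAnalyzer(), "I'll e-mail you at [email protected]");
  }
}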
4. Filtering order can be important
As mentioned above, the layered design makes analyzers easy to combine flexibly, but the order of the filters matters.
For example, StopFilter is case sensitive, so if you apply it before any case handling, capitalized stop words slip through, and so on; the sketch below shows the pitfall.
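A small illustration of this ordering pitfall (the analyzer names are hypothetical, again using the 1.4-era API):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Wrong order: stop words are removed before lowercasing, so "The" is not in
// the (lowercase) stop list and survives, only to be lowercased afterwards.
class StopThenLowercaseAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseFilter(
        new StopFilter(new WhitespaceTokenizer(reader),
            StopAnalyzer.ENGLISH_STOP_WORDS));
  }
}

// Right order: lowercase first, then remove stop words, so "The" becomes
// "the" and is correctly dropped.
class LowercaseThenStopAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new StopFilter(
        new LowerCaseFilter(new WhitespaceTokenizer(reader)),
        StopAnalyzer.ENGLISH_STOP_WORDS);
  }
}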
III. Using the built-in analyzers
First, a look at the following built-in analyzers:
WhitespaceAnalyzer: splits tokens at whitespace
SimpleAnalyzer: divides text at nonletter characters and lowercases
StopAnalyzer: divides text at nonletter characters, lowercases, and removes stop words
StandardAnalyzer: tokenizes based on a sophisticated grammar that recognizes e-mail addresses, acronyms, Chinese-Japanese-Korean characters, alphanumerics, and more; lowercases; and removes stop words
WhitespaceAnalyzer and SimpleAnalyzer are both trivial, so they are not covered further here.
1. StopAnalyzer
There is really not much to say here either: StopAnalyzer is simply SimpleAnalyzer plus stop-word removal.
The stop-word list can be the default one, or you can supply your own (see the sketch below).
Both at index time and at query time, all stop words are filtered out; no matter how many stop words a document contains, none of them make it into the index.
Choosing the stop-word list therefore matters: too large a list loses a lot of semantics, too small a list hurts efficiency.
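A minimal sketch of supplying a custom stop list (the class name and the word list are made up for illustration):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class CustomStopWords {
  public static void main(String[] args) throws Exception {
    // Hypothetical domain-specific stop list; anything not listed is kept.
    String[] stopWords = {"the", "a", "an", "and", "or"};
    Analyzer analyzer = new StopAnalyzer(stopWords);

    TokenStream stream = analyzer.tokenStream("contents",
        new StringReader("The quick brown fox and the lazy dogs"));
    for (Token t = stream.next(); t != null; t = stream.next()) {
      System.out.print("[" + t.termText() + "] ");
    }
    System.out.println();   // prints: [quick] [brown] [fox] [lazy] [dogs]
  }
}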
2. StandardAnalyzer
This is arguably Lucene's flagship analyzer and the most general-purpose one. It can intelligently recognize:
alphanumerics, acronyms, company names, e-mail addresses, computer host names, numbers, words with an interior apostrophe, serial numbers, IP addresses, and CJK (Chinese, Japanese, Korean) characters.
Quite powerful, in other words.
IV. Dealing with keyword fields
What is special about a Keyword field? A Keyword field is not analyzed at all; the entire value is indexed as a single term.
So if the query also treats the value as one term, as below, there is no problem and the document is found.
Document doc = new Document();
doc.add(Field.Keyword("partnum", "Q36"));
doc.add(Field.Text("description", "Illidium Space Modulator"));
writer.addDocument(doc);
Query query = new TermQuery(new Term("partnum", "Q36"));
But if you are unlucky enough to use QueryParser, like this:
query query = queryparser.parse("partnum:q36 and space", "description", new simpleanalyzer());
the document is no longer found. Why?
Because QueryParser analyzes each term and phrase. It is diligent: every term and phrase is run through the analyzer, which here is SimpleAnalyzer.
SimpleAnalyzer strips nonletter characters and lowercases, so Q36 becomes q.
This shows that indexing and analysis are intimately tied to searching; get one step wrong and you will find nothing at all.
Arguably, though, this is a QueryParser issue: it insists on analyzing text in a place where no analysis should happen.
How can it be fixed?
a. Don't let users type free-form queries; have the UI offer choices instead. This hardly counts as a solution.
b. Find a way to do field-specific analysis.
c. Create a custom domain-specific analyzer, one that analyzes "Q36" and still yields "Q36" rather than "q".
d. Subclass QueryParser and override getFieldQuery to provide field-specific handling.
Here we take approach (b) to solve the problem.
In the indexing chapter we saw that an analyzer can be specified per IndexWriter and per Document, but ordinary analyzers and methods cannot assign a special analyzer to an individual field.
So we need something special: PerFieldAnalyzerWrapper. As the name says, it works per field.
The built-in PerFieldAnalyzerWrapper constructor requires the default analyzer as a parameter:
PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new SimpleAnalyzer());
To assign a different analyzer to a field, use the addAnalyzer method:
analyzer.addAnalyzer("partnum", new KeywordAnalyzer());
During tokenization, the analyzer specific to the field name is used; the default is used if no field-specific analyzer has been assigned.
Query query = QueryParser.parse("partnum:Q36 AND SPACE", "description", analyzer);
That solves the problem above. The KeywordAnalyzer used here is given in two implementations:
The more elaborate, more general implementation:
public class KeywordAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, final Reader reader) {
    return new TokenStream() {
      private boolean done;
      private final char[] buffer = new char[1024];
      public Token next() throws IOException {
        if (!done) {
          done = true;
          // The local StringBuffer shadows the char[] read buffer (this.buffer).
          StringBuffer buffer = new StringBuffer();
          int length = 0;
          while (true) {
            length = reader.read(this.buffer);
            if (length == -1) break;
            buffer.append(this.buffer, 0, length);
          }
          String text = buffer.toString();
          return new Token(text, 0, text.length());
        }
        return null;
      }
    };
  }
}
In plain terms, it overrides the token iterator's next() method: the first call reads the entire input and returns it as a single token, and the second call returns null.
Note that the 1024-char array is just a read buffer; because the contents accumulate in a StringBuffer, this version places no fixed limit on the keyword length.
The simpler approach, though the keyword is then limited to 255 characters:
public class SimpleKeywordAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new CharTokenizer(reader) {
      protected boolean isTokenChar(char c) {
        return true;
      }
    };
  }
}
This uses CharTokenizer, a tokenizer that keeps reading characters as long as isTokenChar() returns true; only when it returns false does the current token end.
But it has a cap: a token can be at most 255 characters. Here isTokenChar() always returns true, so the whole input is read as a single token.
CharTokenizer is actually quite important: WhitespaceTokenizer and LetterTokenizer are both built on it. WhitespaceTokenizer returns false only for whitespace, while LetterTokenizer returns Character.isLetter(c).
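The essence of those two tokenizers is just their isTokenChar() predicate; a sketch of the idea (my own illustration, not the exact library source):

import java.io.Reader;
import org.apache.lucene.analysis.CharTokenizer;

// A token continues as long as the character is not whitespace.
class WhitespaceTokenizerSketch extends CharTokenizer {
  public WhitespaceTokenizerSketch(Reader reader) { super(reader); }
  protected boolean isTokenChar(char c) {
    return !Character.isWhitespace(c);
  }
}

// A token continues as long as the character is a letter.
class LetterTokenizerSketch extends CharTokenizer {
  public LetterTokenizerSketch(Reader reader) { super(reader); }
  protected boolean isTokenChar(char c) {
    return Character.isLetter(c);
  }
}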
V. "Sounds like" querying
This one is fun. What does it mean? Go straight to the example.
It uses a Metaphone-based analyzer, which replaces each word with its phonetic encoding:
Analyzer analyzer = new MetaphoneReplacementAnalyzer();
IndexWriter writer = new IndexWriter(directory, analyzer, true);
Document doc = new Document();
doc.add(Field.Text("contents", "cool cat"));
writer.addDocument(doc);
writer.close();
IndexSearcher searcher = new IndexSearcher(directory);
Query query = QueryParser.parse("kool kat", "contents", analyzer);
Hits hits = searcher.search(query);
The result: the document above is found. Why? Because "cool cat" and "kool kat" sound the same. The full MetaphoneReplacementAnalyzer implementation is in the book and is not repeated here; a rough sketch of the idea follows.
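A rough sketch of how such an analyzer can be put together (not the book's exact code; it assumes Apache Commons Codec's Metaphone encoder, and the "Sketch" class names are my own):

import java.io.IOException;
import java.io.Reader;
import org.apache.commons.codec.language.Metaphone;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LetterTokenizer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class MetaphoneReplacementAnalyzerSketch extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new MetaphoneReplacementFilterSketch(new LetterTokenizer(reader));
  }
}

// Replaces each token's text with its Metaphone (phonetic) encoding, keeping
// the original offsets so the token still points at the original word.
class MetaphoneReplacementFilterSketch extends TokenFilter {
  private final Metaphone metaphoner = new Metaphone();

  public MetaphoneReplacementFilterSketch(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    Token token = input.next();
    if (token == null) return null;
    String encoded = metaphoner.metaphone(token.termText());
    return new Token(encoded, token.startOffset(), token.endOffset(), token.type());
  }
}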
For instance, analyzing the two sentences below:
"the quick brown fox jumped over the lazy dogs"
"tha quik brown phox jumpd ovvar tha lazi dogz"
produces exactly the same phonetic output for both:
[0] [kk] [brn] [fks] [jmpt] [ofr] [0] [ls] [tks]
What is this good for?
A sounds-like feature would be great for situations where a user misspelled every word and no documents were found, but alternative words could be suggested.
One implementation approach to this idea could be to run all text through a sounds-like analysis and build a cross-reference lookup to consult when a correction is needed.
It is genuinely useful: the "did you mean" suggestions a search engine shows when you misspell something may well be implemented this way. Such correction can also be done with Bayesian classification or other machine-learning methods, though that seems more involved.
VI. Synonyms, aliases, and words that mean the same
Synonyms are also heavily used. As an example, analyze "jumps" with a SynonymAnalyzer:
Token[] tokens = AnalyzerUtils.tokensFromAnalysis(synonymAnalyzer, "jumps");
Normally the only token would be "jumps", but now we get {"jumps", "hops", "leaps"}:
AnalyzerUtils.assertTokensEqual(tokens, new String[] {"jumps", "hops", "leaps"});
Moreover, "hops" and "leaps" have a position increment of 0, as discussed earlier:
assertequals("jumps", 1, tokens[0].getpositionincrement());
assertequals("hops", 0, tokens[1].getpositionincrement());
assertequals("leaps", 0, tokens[2].getpositionincrement());
So how is this SynonymAnalyzer implemented?
TokenStream result = new SynonymFilter(
    new StopFilter(
        new LowerCaseFilter(
            new StandardFilter(
                new StandardTokenizer(reader))),
        StandardAnalyzer.STOP_WORDS),
    engine
);
This example shows off the flexibility of Lucene analyzers nicely: starting from StandardTokenizer and wrapping filter after filter, it ends with SynonymFilter, which injects synonym tokens into the stream according to the engine you supply.
SynonymFilter's implementation is in the book and is quite simple: iterate over the tokens, look each one up in the engine, and for every synonym found wrap it in a token and insert it with a position increment of 0:
String[] synonyms = engine.getSynonyms(token.termText());
if (synonyms == null) return;
for (int i = 0; i < synonyms.length; i++) {
  Token synToken = new Token(synonyms[i],
      token.startOffset(),
      token.endOffset(),
      TOKEN_TYPE_SYNONYM);
  synToken.setPositionIncrement(0);
  // ...each synonym token is then buffered and handed out by a later call to next()
}
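A fuller sketch of how such a filter might hang together (my own simplification, not the book's exact code), assuming a SynonymEngine interface whose getSynonyms(String) returns the aliases of a word or null:

import java.io.IOException;
import java.util.LinkedList;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

interface SynonymEngine {
  String[] getSynonyms(String word) throws IOException;
}

public class SynonymFilterSketch extends TokenFilter {
  public static final String TOKEN_TYPE_SYNONYM = "SYNONYM";

  private final SynonymEngine engine;
  private final LinkedList buffered = new LinkedList();  // synonym tokens waiting to be emitted

  public SynonymFilterSketch(TokenStream input, SynonymEngine engine) {
    super(input);
    this.engine = engine;
  }

  public Token next() throws IOException {
    // First hand out any synonyms queued up for the previously returned token.
    if (!buffered.isEmpty()) {
      return (Token) buffered.removeFirst();
    }

    Token token = input.next();
    if (token == null) return null;

    // Queue one zero-increment token per synonym of the current word.
    String[] synonyms = engine.getSynonyms(token.termText());
    if (synonyms != null) {
      for (int i = 0; i < synonyms.length; i++) {
        Token synToken = new Token(synonyms[i],
            token.startOffset(), token.endOffset(), TOKEN_TYPE_SYNONYM);
        synToken.setPositionIncrement(0);
        buffered.addLast(synToken);
      }
    }
    return token;
  }
}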
An example to show the effect of SynonymAnalyzer:
IndexWriter writer = new IndexWriter(directory, synonymAnalyzer, true);
doc.add(Field.Text("content", "The quick brown fox jumps over the lazy dogs"));
Search for a synonym ("hops" is a synonym of "jumps"):
TermQuery tq = new TermQuery(new Term("content", "hops"));
Search for a phrase:
PhraseQuery pq = new PhraseQuery();
pq.add(new Term("content", "fox"));
pq.add(new Term("content", "hops"));
Both match.
But again, if you are unlucky enough to use QueryParser, strange things can happen... good grief, avoid QueryParser unless you really need it:
public void testWithQueryParser() throws Exception {
  Query query = QueryParser.parse("\"fox jumps\"", "content", synonymAnalyzer);
  Hits hits = searcher.search(query);
  assertEquals("!!!! what?!", 0, hits.length());   // not found

  query = QueryParser.parse("\"fox jumps\"", "content", new StandardAnalyzer());
  hits = searcher.search(query);
  assertEquals("*whew*", 1, hits.length());        // found
}
Why does QueryParser fail to find the right document when it uses the very same SynonymAnalyzer that was used for indexing?
Print the query and it becomes clear:
system.out.println("/"fox jumps/" parses to " + query.tostring("content"));
"fox jumps" parses to "fox jumps hops leaps"
It glues all terms from analysis together to form a PhraseQuery and ignores the token position-increment information.
Ah, so as long as SynonymAnalyzer is used it injects synonyms, which is why it must not be used again on the query side here.
Is this yet another issue? The position-increment information is kept for both tokens and terms, yet here QueryParser crudely lumps all the words together...
You have another option with synonyms: expanding them into each query rather than indexing them.
In other words, there are two ways to handle synonyms: the one described above, injecting the synonyms at index time; or doing nothing special at index time and instead expanding the query with all the synonyms, so that documents containing any of them are found as well.
A somewhat clumsy approach is to use PhrasePrefixQuery (see section 5.2), which is one option to consider.
Alternatively, it can be done through an overridden QueryParser.getFieldQuery method; a rough query-time expansion sketch follows.
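A rough sketch of query-time expansion using a BooleanQuery instead of subclassing QueryParser (my own simplification, reusing the hypothetical SynonymEngine interface from the sketch above; the old three-argument add() marks each clause as optional):

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryTimeSynonyms {
  // Build "word OR synonym1 OR synonym2 ..." for a single-term query.
  public static Query expand(String field, String word, SynonymEngine engine)
      throws IOException {
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term(field, word)), false, false);  // optional clause
    String[] synonyms = engine.getSynonyms(word);
    if (synonyms != null) {
      for (int i = 0; i < synonyms.length; i++) {
        query.add(new TermQuery(new Term(field, synonyms[i])), false, false);
      }
    }
    return query;
  }
}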
VII. Stemming analysis
Stemming means reducing the various forms of a word to a common root form.
The analyzer introduced here is PositionalPorterStopAnalyzer: it removes stop words, leaving positional holes where the words were removed, and also applies a stemming filter.
It is based on the Porter stemming algorithm, invented by Dr. Martin Porter; there are of course other stemming algorithms, such as the Snowball algorithms and KStem.
This analyzer not only reduces derived words to their base form but also removes stop words, so the positions where the stop words used to be are left empty as holes.
For example, "the quick brown fox jumps over the lazy dogs" analyzes to:
2: [quick]
3: [brown]
4: [fox]
5: [jump]
6: [over]
8: [lazi]
9: [dog]
Positions 1 and 7 are empty.
This is implemented with a PositionalStopFilter:
public class PositionalStopFilter extends TokenFilter {
  private Set stopWords;

  public PositionalStopFilter(TokenStream in, Set stopWords) {
    super(in);
    this.stopWords = stopWords;
  }

  public final Token next() throws IOException {
    int increment = 0;
    for (Token token = input.next(); token != null; token = input.next()) {
      if (!stopWords.contains(token.termText())) {
        token.setPositionIncrement(token.getPositionIncrement() + increment);
        return token;
      }
      increment++;
    }
    return null;
  }
}
In other words, for every stop word skipped, the position increment of the next real token is bumped by one.
PositionalPorterStopAnalyzer's tokenStream method is then simply:
return new PorterStemFilter(
    new PositionalStopFilter(new LowerCaseTokenizer(reader), stopWords)
);
That is, a PorterStemFilter wrapped around the PositionalStopFilter.
...and misfortune strikes again: combining it with PhraseQuery and QueryParser causes trouble. It seems every analyzer introduced here comes with its own little issue...
IndexWriter writer =
    new IndexWriter(directory, porterAnalyzer, true);
doc.add(Field.Text("contents", "The quick brown fox jumps over the lazy dogs"));
Query query = QueryParser.parse("\"over the lazy\"", "contents", porterAnalyzer);
Nothing is found. The phrase is clearly there, so why can't it be found?
The difficulty lies deeper inside PhraseQuery and its current inability to deal with positional gaps.
That is, PhraseQuery cannot handle positional gaps: "the" is simply dropped, the query becomes "over lazy", and nothing matches.
Unless you set the slop, the allowed gap, by hand:
queryparser parser = new queryparser("contents", porteranalyzer);
parser.setphraseslop(1);
query query = parser.parse("/"over the lazy/"");
Now it is found. Sigh, what a tragedy...
So only inexact phrases like "over lazy" can be searched for. With stop-word removal in analysis, doing exact phrase matches is, by definition, not possible: the words removed aren't there, so you can't know what they were.
Indeed, with the stop words gone you cannot even know what slop value to set.
Apart from this small problem the analyzer works well; searches for "laziness" or "fox jumped", for example, find the document correctly.
VIII. Language analysis issues
I18n is a perennial headache for developers, especially in web development, where you run into absolutely everything...
As far as Lucene is concerned, it stores all characters in the standard UTF-8 encoding.
So as a developer you must make sure you provide the correct encoding of the documents being indexed, so that Lucene can convert them to UTF-8 correctly; otherwise indexing either fails or produces mojibake.
Beyond the built-in analyzers we've discussed, the core Lucene distribution provides two language-specific analyzers: GermanAnalyzer and RussianAnalyzer.
For Asian languages, this is where it really matters.
The only built-in analyzer capable of doing anything useful with Asian text is the StandardAnalyzer.
However, two analyzers in the Lucene Sandbox are suitable for Asian-language analysis: CJKAnalyzer and ChineseAnalyzer.
Running "道德經" through the different analyzers produces the following terms:
SimpleAnalyzer   [道德經]
StandardAnalyzer [道] [德] [經]
CJKAnalyzer      [道德] [德經]
ChineseAnalyzer  [道] [德] [經]
The CJKAnalyzer pairs characters in overlapping windows of two characters each. Many CJK words are two characters. By pairing characters in this manner, words are likely to be kept together (as well as disconnected characters, increasing the index size).
IX. Nutch analysis
Finally, a brief look at Nutch's analysis. As mentioned earlier, Google does not discard stop words at index time, because they are needed for exact phrase queries. But how should such an enormous number of stop words be handled at index time?
How Google does it is anyone's guess, but the open-source Nutch offers its own solution.
Nutch combines an index-time analysis bigram (grouping two consecutive words as a single token) technique with a query-time optimization of phrases.
For example, for "the quick brown…" it indexes the bigram "the-quick" rather than just "the", because such a bigram occurs far less often than the bare stop word.
1: [the:<word>] [the-quick:gram]
2: [quick:<word>]
3: [brown:<word>]
4: [fox:<word>]
The indexed tokens look like the above: the bigram sits at the same position as the stop word. A toy sketch of the idea follows.
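A toy illustration of the bigram idea (nothing like Nutch's real implementation; the class name and behavior are my own simplification): when a stop word is followed by another word, also emit a "stopword-next" gram token at the same position as the stop word.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class StopBigramFilterSketch extends TokenFilter {
  private final Set stopWords;
  private Token pending;      // next token already read ahead from the input
  private Token gramToEmit;   // gram token queued right behind the stop word

  public StopBigramFilterSketch(TokenStream input, String[] stopWords) {
    super(input);
    this.stopWords = new HashSet(Arrays.asList(stopWords));
  }

  public Token next() throws IOException {
    if (gramToEmit != null) {           // emit the queued gram just after the stop word
      Token gram = gramToEmit;
      gramToEmit = null;
      return gram;
    }
    Token current = (pending != null) ? pending : input.next();
    pending = null;
    if (current == null) return null;

    if (stopWords.contains(current.termText())) {
      pending = input.next();           // peek at the following word
      if (pending != null) {
        gramToEmit = new Token(current.termText() + "-" + pending.termText(),
            current.startOffset(), pending.endOffset(), "gram");
        gramToEmit.setPositionIncrement(0);  // same position as the stop word
      }
    }
    return current;
  }
}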
This article was excerpted from cnblogs (博客园); original publication date: 2011-07-04.