1. Fundamentals
(1) Key concepts
Analysis, in Lucene, is the process of converting field (Field) text into the most basic unit of index representation, the term (Term). During searching, these terms are used to determine which documents match the query conditions.
An analyzer encapsulates this analysis operation. It runs a series of steps that turn text into tokens, a process also called tokenization; the chunks of text extracted from the text stream are the tokens (token). A token combined with its field name forms a term.
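The minimal sketch below illustrates these ideas. It assumes the Lucene 3.x API (Version.LUCENE_36, the field name "contents", and the sample sentence are illustrative choices, not taken from the original text): an analyzer breaks a string into tokens, and each token plus its field name yields a Term.

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.index.Term;
import org.apache.lucene.util.Version;

public class AnalysisBasics {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new SimpleAnalyzer(Version.LUCENE_36);
        // Tokenization: break the field text into tokens.
        TokenStream stream = analyzer.tokenStream("contents",
                new StringReader("Analysis turns TEXT into terms"));
        CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            // A token combined with its field name is a Term, the unit the index stores.
            Term term = new Term("contents", termAttr.toString());
            System.out.println(term);   // prints e.g. contents:analysis
        }
        stream.close();
    }
}
```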
(2) When analyzers are used
During indexing, when documents are added to the index
When searching with a QueryParser object (see the sketch after this list)
When highlighting matching text in search results
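The QueryParser case deserves a small illustration: QueryParser analyzes the user's query text with the analyzer it is given, so the same (or a compatible) analyzer used at indexing time should normally be passed to it. A sketch assuming the Lucene 3.x QueryParser API; the field name "contents" and the query string are arbitrary examples:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class QueryParsing {
    public static void main(String[] args) throws Exception {
        // Use the same analyzer at search time as at indexing time, so that
        // query terms are normalized the same way as the indexed terms.
        QueryParser parser = new QueryParser(Version.LUCENE_36, "contents",
                new StandardAnalyzer(Version.LUCENE_36));
        Query query = parser.parse("Quick AND Foxes");
        System.out.println(query);   // likely prints: +contents:quick +contents:foxes
    }
}
```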
(3) The four most commonly used analyzers (a sketch comparing their output follows this list):
WhitespaceAnalyzer, as the name implies, simply splits text into tokens on whitespace characters and makes no other effort to normalize the tokens.
SimpleAnalyzer first splits tokens at non-letter characters, then lowercases each token. Be careful! This analyzer quietly discards numeric characters.
StopAnalyzer is the same as SimpleAnalyzer, except it removes common words (called stop words, described more in section XXX). By default it removes common words in the English language (the, a, etc.), though you can pass in your own set.
StandardAnalyzer is Lucene’s most sophisticated core analyzer. It has quite a bit of logic to identify certain kinds of tokens, such as company names, e-mail addresses, and host names. It also lowercases each token, removes stop words, and discards punctuation.
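To make the differences concrete, the sketch below runs one sample sentence through all four analyzers and prints the tokens each one produces. It is written against the Lucene 3.x API (constructors taking a Version constant; signatures differ in other Lucene versions), and the sample text is arbitrary:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerComparison {
    private static final String TEXT = "The quick brown FOX jumped over 2 lazy dogs";

    public static void main(String[] args) throws Exception {
        Analyzer[] analyzers = {
            new WhitespaceAnalyzer(Version.LUCENE_36),
            new SimpleAnalyzer(Version.LUCENE_36),
            new StopAnalyzer(Version.LUCENE_36),
            new StandardAnalyzer(Version.LUCENE_36)
        };
        for (Analyzer analyzer : analyzers) {
            System.out.print(analyzer.getClass().getSimpleName() + ": ");
            TokenStream stream = analyzer.tokenStream("contents", new StringReader(TEXT));
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.print("[" + term + "] ");
            }
            stream.close();
            System.out.println();
        }
    }
}
```

WhitespaceAnalyzer should keep "The", "FOX", and "2" untouched; SimpleAnalyzer lowercases everything and drops the "2"; StopAnalyzer additionally removes "the" and "over"; StandardAnalyzer lowercases, removes stop words, and keeps the "2".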
2. Other usage notes
When creating an IndexWriter, an analyzer must be specified, for example:
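A minimal sketch, assuming the Lucene 3.x IndexWriterConfig API (RAMDirectory, the field name "contents", and StandardAnalyzer are illustrative choices):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class IndexingExample {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        // The analyzer configured here is applied to every analyzed field
        // of every document added through this writer.
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36,
                new StandardAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(dir, config);

        Document doc = new Document();
        doc.add(new Field("contents", "Lucene analysis example",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();
    }
}
```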
Alternatively, each time a document is added to the writer, an analyzer can be specified for that particular document, for example:
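A minimal sketch of the per-document overload, again assuming the Lucene 3.x API, where IndexWriter.addDocument(Document, Analyzer) lets one document be analyzed differently from the writer's default (the analyzers and field name are illustrative):

```java
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class PerDocumentAnalyzer {
    public static void main(String[] args) throws Exception {
        // Writer-level default analyzer.
        IndexWriter writer = new IndexWriter(new RAMDirectory(),
                new IndexWriterConfig(Version.LUCENE_36,
                        new StandardAnalyzer(Version.LUCENE_36)));

        Document doc = new Document();
        doc.add(new Field("contents", "per-document analysis",
                Field.Store.YES, Field.Index.ANALYZED));
        // The second argument overrides the writer's default analyzer
        // for this document only.
        writer.addDocument(doc, new WhitespaceAnalyzer(Version.LUCENE_36));
        writer.close();
    }
}
```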