Lucene: A Getting-Started Example

This example is based mainly on the official Lucene demo.


Environment: Windows 7 + JDK 1.6 + Eclipse 3.7

Lucene version: 3.5

Official site: http://www.apache.org/dyn/closer.cgi

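Basic Concepts of Retrieval

I. Information retrieval: finding the information relevant to a user within a collection of information.

1 Types of information retrieval

Full-text retrieval: the user's query is compared against every word in the full text, without considering whether the query matches the meaning of the text.

Data retrieval: both the query and the data in the information system have a defined structure; the ability to match on meaning is weak.

Knowledge retrieval: emphasizes matching based on knowledge and semantics.

Note: the introduction below comes from Baidu Baike, http://baike.baidu.com/view/371811.htm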
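
As a rough illustration of the full-text retrieval idea above, the following sketch compares each query word against every word of a text, with no regard for meaning. The class name NaiveFullTextMatch and its methods are made up for this example and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical toy class: literal word-by-word matching, no semantics involved.
final class NaiveFullTextMatch {

    // Split text into lowercase words on anything that is not a letter or digit.
    static List<String> words(String text) {
        List<String> result = new ArrayList<String>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (w.length() > 0) {
                result.add(w);
            }
        }
        return result;
    }

    // True only if every query word occurs literally somewhere in the text.
    static boolean matches(String text, String query) {
        List<String> textWords = words(text);
        for (String q : words(query)) {
            if (!textWords.contains(q)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String text = "Lucene is a full-text search library.";
        // Both query words appear literally in the text.
        System.out.println(matches(text, "lucene search"));
        // Related in meaning, but neither word appears in the text.
        System.out.println(matches(text, "retrieval engine"));
    }
}
```

Scanning every word of every text for each query does not scale, which is one reason engines such as Lucene build an index first.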

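II. Introduction to Lucene

Lucene began as a subproject of the Apache Software Foundation's Jakarta project (it has been a top-level Apache project since 2005). It is an open-source full-text search toolkit: not a complete full-text search engine, but a framework for building one, providing a complete query engine and indexing engine. Lucene's original author is Doug Cutting, a veteran expert in full-text indexing and retrieval.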

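Its advantages are as follows:

1 The index file format is independent of the application platform. Lucene defines an index file format based on 8-bit bytes, so that compatible systems or applications on different platforms can share the index files they create.

2 Building on the inverted index of traditional full-text search engines, it implements segmented indexing: small index files can be created for new documents, which speeds up indexing.

3 It provides a text analysis interface that is independent of language and file format. The indexer builds the index files by consuming a Token stream, so supporting a new language or file format only requires implementing the text analysis interface.

4 The query implementation supports Boolean operations, fuzzy queries (Fuzzy Search), grouped queries and more by default.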
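
To make the inverted index and Boolean query ideas in the list above more concrete, here is a minimal in-memory sketch. It is a toy under stated assumptions, not Lucene's implementation: the class ToyInvertedIndex, its tokenize method (a stand-in for an analyzer producing a token stream) and searchAll are hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical toy class: maps each term to the ids of the documents containing it.
final class ToyInvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<String, TreeSet<Integer>>();
    private int nextDocId = 0;

    // Stand-in for an analyzer: turn text into a list of lowercase tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (t.length() > 0) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Index one document and return the id assigned to it.
    int add(String text) {
        int docId = nextDocId++;
        for (String token : tokenize(text)) {
            TreeSet<Integer> ids = postings.get(token);
            if (ids == null) {
                ids = new TreeSet<Integer>();
                postings.put(token, ids);
            }
            ids.add(docId);
        }
        return docId;
    }

    // Boolean AND: ids of the documents that contain every term of the query.
    TreeSet<Integer> searchAll(String query) {
        TreeSet<Integer> result = null;
        for (String term : tokenize(query)) {
            TreeSet<Integer> ids = postings.get(term);
            if (ids == null) {
                return new TreeSet<Integer>(); // an unknown term rules out every document
            }
            if (result == null) {
                result = new TreeSet<Integer>(ids);
            } else {
                result.retainAll(ids);
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("Lucene is a search library");                 // doc 0
        index.add("An inverted index maps terms to documents");  // doc 1
        index.add("Lucene builds an inverted index");            // doc 2
        // Only doc 2 contains both "lucene" and "index".
        System.out.println(index.searchAll("lucene index"));
    }
}
```

A lookup touches only the posting lists of the query terms instead of scanning every document, which is the point of inverting the index.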

III. The project layout is shown in the figure below. The jar files used are lucene-core-3.5.0.jar and lucene-analyzers-3.5.0.jar.

[Figure: project layout]

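IV. To search any content, you must first collect the data and build an index; only then can you search.

The implementation class is as follows: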

Java code


package net.liuzd.lucene.test;

import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.junit.Test;

public class IndeSearchFiles {

    @Test
    public void createIndex() throws Exception {
        // The IndexWriter adds, deletes and updates documents in the index
        IndexWriter writer = LuceneUtils.createIndexWriter(OpenMode.CREATE);
        // Location of the data source
        File sourceFile = LuceneUtils.createSourceFile();
        System.out.println("File path: " + sourceFile.getAbsolutePath());
        // Build the document to be written
        Document doc = new Document();
        doc.add(new Field("name", sourceFile.getName(), Field.Store.YES, Field.Index.ANALYZED_NO_NORMS));
        // File path: indexed as a single untokenized term (as in the official demo),
        // so that the Term("path", ...) passed to updateDocument below can match it
        Field pathField = new Field("path", sourceFile.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
        pathField.setIndexOptions(org.apache.lucene.index.FieldInfo.IndexOptions.DOCS_ONLY);
        doc.add(pathField);
        // Last-modified time of the file (stored only, not searchable)
        doc.add(new Field("modified", String.valueOf(sourceFile.lastModified()), Field.Store.YES, Field.Index.NO));
        // File contents
        String content = LuceneUtils.readFileContext(sourceFile);
        System.out.println("content: " + content);
        doc.add(new Field("contents", content, Field.Store.YES, Field.Index.ANALYZED));
        // The following mirrors the official demo
        if (writer.getConfig().getOpenMode() == IndexWriterConfig.OpenMode.CREATE) {
            // New index: simply add the document
            writer.addDocument(doc);
        } else {
            // Existing index: replace any older copy of this file, matched by path
            writer.updateDocument(new Term("path", sourceFile.getPath()), doc);
        }
        // Release resources
        writer.close();
    }

    @Test
    public void search() throws Exception {
        // Query string; a term that is not in the index (for example 中國) returns no hits
        String queryString = "Lucene";
        // Fields to search
        String[] queryFields = {"contents"};
        IndexSearcher searcher = LuceneUtils.createIndexSearcher();
        Query query = LuceneUtils.createQuery(queryFields, queryString);
        // Optional filter applied to the results (none here)
        Filter filter = null;
        // Maximum number of hits to fetch in one search
        int queryCount = 100;
        // Run the query
        TopDocs results = searcher.search(query, filter, queryCount);
        System.out.println("Total hits: " + results.totalHits);
        // Print each hit
        for (ScoreDoc sr : results.scoreDocs) {
            // Document number
            int docID = sr.doc;
            // Load the stored fields of the document
            Document doc = searcher.doc(docID);
            System.out.println("name = " + doc.get("name"));
            System.out.println("path = " + doc.get("path"));
            System.out.println("modified = " + doc.get("modified"));
            System.out.println("contents = " + doc.get("contents"));
        }
        // Release resources; closing the searcher does not close a reader passed to its constructor
        searcher.close();
        searcher.getIndexReader().close();
    }
}
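
The add-or-update branch at the end of createIndex relies on the documented behavior of IndexWriter.updateDocument(Term, Document): it first deletes every document containing the given term and then adds the new document. For the deletion step to find anything, the field named in the Term has to be indexed. The sketch below models that behavior with plain collections only; ToyWriter and its methods are hypothetical and merely imitate the two Lucene calls.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of add versus update; this is not Lucene code.
final class ToyWriter {
    // Each "document" is a map from field name to field value.
    private final List<Map<String, String>> docs = new ArrayList<Map<String, String>>();

    // Like addDocument: always appends, so indexing the same file twice leaves two copies.
    void addDocument(Map<String, String> doc) {
        docs.add(doc);
    }

    // Like updateDocument: delete every document whose field equals the value, then add.
    void updateDocument(String field, String value, Map<String, String> doc) {
        for (Iterator<Map<String, String>> it = docs.iterator(); it.hasNext();) {
            if (value.equals(it.next().get(field))) {
                it.remove();
            }
        }
        docs.add(doc);
    }

    int size() {
        return docs.size();
    }

    // Convenience factory for a document with a path and contents field.
    static Map<String, String> doc(String path, String contents) {
        Map<String, String> d = new HashMap<String, String>();
        d.put("path", path);
        d.put("contents", contents);
        return d;
    }

    public static void main(String[] args) {
        ToyWriter writer = new ToyWriter();
        writer.addDocument(doc("source/lucene.txt", "first version"));
        writer.addDocument(doc("source/lucene.txt", "second version"));
        System.out.println(writer.size()); // the same path is now stored twice
        writer.updateDocument("path", "source/lucene.txt", doc("source/lucene.txt", "third version"));
        System.out.println(writer.size()); // older copies were replaced by the new one
    }
}
```

This is why the official demo uses addDocument only when the index is created from scratch and updateDocument when an existing index is reopened.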

The utility class is as follows:

Java code


package net.liuzd.lucene.test;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class LuceneUtils {

    // Current working directory
    public static final String USERDIR = System.getProperty("user.dir");
    // Directory where the index is stored
    private static final String INDEXPATH = USERDIR + File.separator + "index";
    // Data source to be indexed
    private static final String INDEXSOURCE = USERDIR + File.separator
            + "source" + File.separator + "lucene.txt";
    // Lucene version in use
    public static final Version version = Version.LUCENE_35;

    public static Analyzer getAnalyzer() {
        // Analyzer (tokenizer); the same one is used for indexing and for query parsing
        Analyzer analyzer = new StandardAnalyzer(version);
        return analyzer;
    }

    public static IndexWriter createIndexWriter(OpenMode openMode)
            throws Exception {
        // Where the index is stored
        Directory dir = FSDirectory.open(new File(INDEXPATH));
        // Index writer configuration
        IndexWriterConfig iwc = new IndexWriterConfig(version,
                getAnalyzer());
        iwc.setOpenMode(openMode);
        IndexWriter writer = new IndexWriter(dir, iwc);
        return writer;
    }

    public static IndexSearcher createIndexSearcher() throws CorruptIndexException, IOException {
        // The caller is responsible for closing the searcher and its reader
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(INDEXPATH)));
        IndexSearcher searcher = new IndexSearcher(reader);
        return searcher;
    }

    public static Query createQuery(String[] queryFields, String queryString) throws ParseException {
        // Parse the query string against all of the given fields
        QueryParser parser = new MultiFieldQueryParser(version, queryFields, getAnalyzer());
        Query query = parser.parse(queryString);
        return query;
    }

    public static String readFileContext(File file) {
        BufferedReader br = null;
        try {
            // Note: the file is decoded with the platform default charset
            br = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
            StringBuilder content = new StringBuilder();
            for (String line; (line = br.readLine()) != null;) {
                content.append(line).append("\n");
            }
            return content.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // Always close the reader, even if reading failed
            if (br != null) {
                try {
                    br.close();
                } catch (IOException ignored) {
                    // nothing useful can be done here
                }
            }
        }
    }

    // Debugging aid: prints several ways of locating the classpath and working directory
    public static void main(String[] args) {
        System.out.println(Thread.currentThread().getContextClassLoader()
                .getResource(""));
        System.out.println(LuceneUtils.class.getClassLoader().getResource(""));
        System.out.println(ClassLoader.getSystemResource(""));
        System.out.println(LuceneUtils.class.getResource(""));
        System.out.println(LuceneUtils.class.getResource("/")); // directory containing the class files
        System.out.println(new File("/").getAbsolutePath());
        System.out.println(System.getProperty("user.dir"));
    }

    public static File createSourceFile() {
        File file = new File(INDEXSOURCE);
        return file;
    }
}
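
One thing to watch in readFileContext is that it decodes the file with the platform default charset, which can garble non-ASCII text (Chinese, for instance) when the file was saved in a different encoding. A possible variant is sketched below, assuming the source file is UTF-8; the original does not state the encoding, and the class name Utf8FileReader and the method readAll are hypothetical.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Hypothetical helper: reads a whole text file with an explicit charset.
final class Utf8FileReader {
    // Assumption: the files to be indexed are stored as UTF-8.
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // Returns the file contents with every line terminated by '\n'.
    static String readAll(File file) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), UTF8));
        try {
            StringBuilder content = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                content.append(line).append('\n');
            }
            return content.toString();
        } finally {
            reader.close(); // closed on success and on failure
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a small UTF-8 file to a temporary location, then read it back.
        File tmp = File.createTempFile("lucene-demo", ".txt");
        FileOutputStream out = new FileOutputStream(tmp);
        try {
            out.write("Lucene \u4e2d\u6587".getBytes("UTF-8"));
        } finally {
            out.close();
        }
        System.out.print(readAll(tmp));
        tmp.delete();
    }
}
```

The try/finally form is used instead of try-with-resources so that the sketch still compiles on the JDK 1.6 environment listed at the top.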

The project source code and the jar files are included in the attachment.
