Lucene

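Introduction to Lucene
Lucene getting-started demo
- Creating an index
- Using the index
- Building the index
- Document field boosting
- Term search
- Combined queries
Lucene utility class
- Chinese analyzer and highlighting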

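Introduction to Lucene

Traditionally, data is queried straight from the database: a click in the browser UI sends a request to the backend, the backend runs an SQL statement against the database, and the result set is returned and rendered on the page.

A search engine works differently: the table data is not displayed on the page directly. Instead, an IndexWriter indexes the data source into Documents (one record is one Document) and stores the resulting index files in a designated index directory. At query time, an IndexSearcher quickly retrieves the matching data from that directory and returns it in a TopDocs object.

Note:

Lucene embedded this way is not meant for very large data sets, where problems tend to appear, but a hundred thousand records are handled with ease. O(∩_∩)O

On the other hand, Lucene is very convenient: add the jar to the project and it is ready to use. (●’◡’●)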

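Overview

Home page: http://lucene.apache.org/

1.1 What is Lucene?

Lucene is an open-source full-text search toolkit written in Java. It can be embedded in your own project to add indexing and search; in essence, it is a high-performance, extensible information-retrieval library.

Like other open-source software, it has built-in advantages: its features and structure are transparent, it is powerful and highly extensible, and it is backed by a strong technical community, which also makes it easy to exchange knowledge.

Lucene is only a core search library, not a complete, ready-to-run search application, but it is very widely used: Solr, Elasticsearch, and Katta are all built on top of Lucene.

Its main characteristic: the API is simple and easy to learn (although it differs considerably between versions).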

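How Lucene works: the inverted index

**What is an inverted index, and what is a forward index?**

My understanding:

Inverted index: after Lucene analyzes (tokenizes) the text, it maintains a mapping of the form "term → document IDs". Searching for a term looks up that mapping and returns the matching document IDs directly.

Forward index: no "term → document IDs" mapping is maintained, so a search has to scan the full content of every document and collect the IDs of the documents that match. This is necessarily slow, much like querying a database table that has no index.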
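
To make the contrast concrete, here is a minimal sketch in plain Java. It is only an illustration built on JDK collections, not Lucene's actual data structures; the class name `InvertedIndexSketch`, its methods, and the simple lower-casing tokenizer are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Conceptual sketch only: plain JDK collections, not Lucene's real index format.
class InvertedIndexSketch {
    // Forward view: document ID -> raw text (a scan has to read all of it).
    private final Map<Integer, String> documents = new LinkedHashMap<>();
    // Inverted view: term -> sorted IDs of the documents that contain it.
    private final Map<String, TreeSet<Integer>> postings = new LinkedHashMap<>();

    void add(int docId, String text) {
        documents.put(docId, text);
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Hypothetical tokenizer: lower-case, then split on anything that is not a-z or 0-9.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) terms.add(t);
        }
        return terms;
    }

    // Inverted lookup: a single map access, however many documents exist.
    List<Integer> searchInverted(String term) {
        TreeSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    // Scan without a term mapping: re-tokenizes every document on every query.
    List<Integer> searchByScan(String term) {
        List<Integer> hits = new ArrayList<>();
        String wanted = term.toLowerCase(Locale.ROOT);
        for (Map.Entry<Integer, String> e : documents.entrySet()) {
            if (tokenize(e.getValue()).contains(wanted)) hits.add(e.getKey());
        }
        return hits;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Qingdao is a beautiful city.");
        index.add(2, "Nanjing is a city of culture.");
        index.add(3, "Shanghai is a bustling city.");
        System.out.println(index.searchInverted("culture")); // [2]
        System.out.println(index.searchInverted("city"));    // [1, 2, 3]
        System.out.println(index.searchByScan("city"));      // [1, 2, 3]
    }
}
```

Both searches return the same IDs; the difference is that the scan's cost grows with the total amount of text, while the inverted lookup only touches the entry for the requested term.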

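Lucene core API

2.1 Core classes used during indexing
```
Document, Field, IndexWriter, Directory, Analyzer
```
Document: a document is a collection of fields (Field). It is the entity that carries the data; it is an abstract concept, not a Word or .txt file.

A Document is the basic unit that gets indexed.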

Constructor:
```
org.apache.lucene.document.Document.Document()
```
Commonly used API:
```
org.apache.lucene.document.Document.add(IndexableField) // add a field
```
Field: every Document in the index contains one or more fields (Field). A field is a name/value pair, and each field holds one piece of the document's data.

Commonly used constructor:
```
org.apache.lucene.document.Field.Field(String, String, IndexableFieldType)
```
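IndexWriter: the central component of the indexing process. An IndexWriter is created from a Directory such as FSDirectory (where the index files are stored) and an IndexWriterConfig that wraps the Analyzer.

This class creates a new index or adds documents to an existing one. It provides write access to the index but cannot read or search it.

Constructor:
```
org.apache.lucene.index.IndexWriter.IndexWriter(Directory, IndexWriterConfig)
```
Its core API:
```
org.apache.lucene.index.IndexWriter.addDocument(Iterable<? extends IndexableField>) // add a document
org.apache.lucene.index.IndexWriter.updateDocuments(Term, Iterable<? extends Iterable<? extends IndexableField>>) // replace the documents matching the term with new ones
org.apache.lucene.index.IndexWriter.tryDeleteDocument(IndexReader, int) // delete one document by its ID in the given reader
org.apache.lucene.index.IndexWriter.deleteDocuments(Term...) // delete every document containing any of the terms
```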
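Directory: where the index is stored. It is an abstract class; concrete subclasses provide specific storage locations.

FSDirectory stores the index on disk at a given path; RAMDirectory keeps the index in memory.

Creating an FSDirectory:
```
org.apache.lucene.store.FSDirectory.open(Path) // open the index directory
org.apache.lucene.store.FSDirectory.listAll(Path) // list the files in the index directory
```
Creating a RAMDirectory:
```
org.apache.lucene.store.RAMDirectory.RAMDirectory()
```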
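Analyzer: before text is indexed it has to pass through an analyzer, which extracts the tokens from the text being indexed and discards everything else, such as stop words. The choice of analyzer is critical, because different analyzers can produce very different results for the same document. The default is the standard analyzer, which targets English.

Analyzer is an abstract class and the base class of all analyzers. It turns text into a TokenStream, which hands out the tokens one at a time.

Commonly used analyzers:
```
org.apache.lucene.analysis.standard.StandardAnalyzer // standard analyzer
org.apache.lucene.analysis.core.SimpleAnalyzer // simple analyzer
org.wltea.analyzer.lucene.IKAnalyzer // IK analyzer
```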
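
As a rough illustration of what an analyzer does, the following plain-Java sketch lower-cases text, splits it into tokens, and drops stop words. It is a toy stand-in and not the implementation of any Lucene analyzer; the class `AnalyzerSketch`, the stop-word list, and both methods are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-in for an analyzer: NOT Lucene's StandardAnalyzer.
class AnalyzerSketch {
    // Hypothetical stop-word list chosen for this example.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "is", "of", "the"));

    // Lower-case, split on non-letters/non-digits, drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) tokens.add(raw);
        }
        return tokens;
    }

    // A second, cruder "analyzer": split on whitespace and keep everything.
    static List<String> whitespaceOnly(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String text = "Nanjing is a city of culture.";
        System.out.println(analyze(text));        // [nanjing, city, culture]
        System.out.println(whitespaceOnly(text)); // [Nanjing, is, a, city, of, culture.]
    }
}
```

The two methods produce different token lists for the same sentence, which is why the same analyzer should be used at index time and at query time.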
2.2 Core classes used during search

IndexSearcher, Term, Query, TermQuery, TopDocs

IndexSearcher: its search methods search the index that IndexWriter built.

Constructor:
```
org.apache.lucene.search.IndexSearcher.IndexSearcher(IndexReader)
```
Commonly used API:
```
org.apache.lucene.search.IndexSearcher.search(Query, int) // search and return the n highest-scoring documents
org.apache.lucene.search.IndexSearcher.searchAfter(ScoreDoc, Query, int)
org.apache.lucene.search.IndexSearcher.search(Query, int, Sort)
```
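Term: the basic unit used for searching.

Query: Lucene ships many Query subclasses.

Examples are TermQuery (single-term query), BooleanQuery (boolean query), PhraseQuery (phrase search), and PrefixQuery (prefix search). Each one expresses a constraint on the search. TermQuery is the most basic and simplest query type Lucene provides; it matches documents that contain a specific term (Term) in a specified field (Field).

TopDocs: a simple container holding pointers to the ranked search results, where a result is one of the documents that match the query.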
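
The idea of "pointers to ranked results" can be sketched in plain Java as follows. This is a simplified model and not Lucene's code; `TopDocsSketch`, `Hit`, `Top`, and `topN` are names invented for this example. It shows that the container keeps only document IDs and scores for the best n hits, while the total match count may be larger.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of a ranked result container; not Lucene's TopDocs.
class TopDocsSketch {
    // Hypothetical stand-in for one hit: a document ID plus a relevance score.
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Total number of matches plus only the best n hits.
    static final class Top {
        final int totalHits;
        final List<Hit> hits;
        Top(int totalHits, List<Hit> hits) { this.totalHits = totalHits; this.hits = hits; }
    }

    static Top topN(List<Hit> allMatches, int n) {
        List<Hit> sorted = new ArrayList<>(allMatches);
        // Highest score first; equal scores are ordered by lower document ID.
        sorted.sort((a, b) -> a.score != b.score
                ? Float.compare(b.score, a.score)
                : Integer.compare(a.doc, b.doc));
        int keep = Math.max(0, Math.min(n, sorted.size()));
        return new Top(allMatches.size(), new ArrayList<>(sorted.subList(0, keep)));
    }

    public static void main(String[] args) {
        List<Hit> matches = Arrays.asList(new Hit(7, 0.2f), new Hit(3, 0.9f), new Hit(5, 0.5f));
        Top top = topN(matches, 2);
        System.out.println("total matches: " + top.totalHits);
        for (Hit h : top.hits) {
            // Only the ID is held here; the document itself is fetched separately.
            System.out.println("doc=" + h.doc + " score=" + h.score);
        }
    }
}
```

In the demos further down, the same pattern appears as a loop over `scoreDocs` followed by a lookup of each document by its ID.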

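Luke: a tool for inspecting Lucene's analysis results

Luke releases on GitHub: https://github.com/DmitryKey/luke/releases

Note: do not place the tool under a directory whose path contains Chinese characters.

Typical use: open the index directory, browse the terms, and run searches.

Lucene getting-started demo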

Maven dependencies:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>5.3.1</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>5.3.1</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>5.3.1</version>
</dependency>

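Goal: index a data directory and generate the index files in a specified directory.

1. Instantiate the IndexWriter in the constructor
   - get the Directory object for the index location
   - get the index writer
     - set the writer's configuration
     - assign the analyzer to the writer
2. Close the index writer
3. Index all files under a given path
4. Index a single given file
5. Build the Document (the important information kept in the index, as key-value pairs)
6. Test

Creating an index

IndexCreate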

package com.dj.lucene;

import java.io.File;
import java.io.FileReader;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

/**
 * Creates an index.
 * 
 * Used together with Demo1.java for a Lucene hello-world.
 * @author Administrator
 *
 */
public class IndexCreate {
	private IndexWriter indexWriter;
	
	/**
	 * 1. Constructor: instantiates the IndexWriter
	 * @param indexDir
	 * @throws Exception
	 */
	public IndexCreate(String indexDir) throws Exception{
//		Directory object for the location of the index files
		FSDirectory dir = FSDirectory.open(Paths.get(indexDir));
//		standard analyzer (targets English)
		Analyzer analyzer = new StandardAnalyzer();
//		writer configuration
		IndexWriterConfig conf = new IndexWriterConfig(analyzer); // wraps the analyzer
		indexWriter = new IndexWriter(dir, conf);// create the IndexWriter
	}
	
	/**
	 * 2. Closes the index writer
	 * @throws Exception
	 */
	public void closeIndexWriter()  throws Exception{
		indexWriter.close();// closing the writer commits the index to the index directory
	}
	
	/**
	 * 3. Indexes all files under the given path
	 * @param dataDir
	 * @return
	 * @throws Exception
	 */
	public int index(String dataDir) throws Exception{
		File[] files = new File(dataDir).listFiles();// all files in the data directory
		if (files == null) {// listFiles() returns null when the path is not a readable directory
			throw new IllegalArgumentException("Not a readable directory: " + dataDir);
		}
		for (File file : files) {// iterate
			indexFile(file);// index the files one by one
		}
		return indexWriter.numDocs();
	}
	
	/**
	 * 4. Indexes a single given file
	 * @param file
	 * @throws Exception
	 */
	private void indexFile(File file) throws Exception{
		System.out.println("Full path of the indexed file: "+file.getCanonicalPath());
		Document doc = getDocument(file);// one file maps to one Document
		indexWriter.addDocument(doc);// add it to the writer
	}
	
	/**
	 * 5. Builds the Document (the important information kept in the index, as key-value pairs)
	 * @param file
	 * @return
	 * @throws Exception
	 */
	private Document getDocument(File file) throws Exception{
		Document doc = new Document();// the document to index
		doc.add(new TextField("contents", new FileReader(file)));// file contents: the field that keyword searches run against
//		Field.Store.YES: also store the original value in the index so it can be read back
		doc.add(new TextField("fullPath", file.getCanonicalPath(),Field.Store.YES));
		doc.add(new TextField("fileName", file.getName(),Field.Store.YES));
		return doc;
	}
}

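Using the index

Reading data back from the index files

1. Get the index reader (opened through DirectoryReader)
2. Get the IndexSearcher (built from the reader)
3. Get the Query (produced by a QueryParser, which is built with an analyzer)
4. Get the top-ranked documents that contain the keyword
5. Read the content of the matching documents

IndexUse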

package com.dj.lucene;

import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

/**
 * Uses the index.
 * 
 * Used together with Demo2.java for a Lucene hello-world.
 * @author Administrator
 *
 */
public class IndexUse {
	/**
	 * Searches the index directory for a keyword
	 * @param indexDir	directory that holds the index files
	 * @param q	keyword
	 */
	public static void search(String indexDir, String q) throws Exception{
		FSDirectory indexDirectory = FSDirectory.open(Paths.get(indexDir));// directory object
//		Note: the index reader is not created with new; it is opened through DirectoryReader
		IndexReader indexReader = DirectoryReader.open(indexDirectory);// open the index reader on the directory
//		create the index searcher
		IndexSearcher indexSearcher = new IndexSearcher(indexReader);
		Analyzer analyzer = new StandardAnalyzer();// standard analyzer
		QueryParser queryParser = new QueryParser("contents", analyzer);// query parser; "contents" is the default field to search
//		parse the keyword into a Query
		Query query = queryParser.parse(q);
		
		long start=System.currentTimeMillis();
//		fetch the 10 best-scoring matches
		TopDocs topDocs = indexSearcher.search(query , 10);// result set
		long end=System.currentTimeMillis();
		System.out.println("Matching "+q+" took "+(end-start)+" ms and found "+topDocs.totalHits+" records");
		
		for (ScoreDoc scoreDoc : topDocs.scoreDocs) {// iterate over the hits
			int docID = scoreDoc.doc;
//			the searcher fetches the document by its ID
			Document doc = indexSearcher.doc(docID);// get the Document
			System.out.println("Found through the index, file: "+doc.get("fullPath"));
		}
		
		indexReader.close();
	}
}

Building the index

package com.dj.lucene;

import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.junit.Before;
import org.junit.Test;

/**
 * Building the index
 * 
 * 	Adding, deleting and updating documents in the index (key topic)
 * @author Administrator
 *
 */
public class Demo3 {
	private String ids[]={"1","2","3"};
	private String citys[]={"qingdao","nanjing","shanghai"};
	private String descs[]={
			"Qingdao is a beautiful city.",
			"Nanjing is a city of culture.",
			"Shanghai is a bustling city."
	};
	private FSDirectory dir;// index directory
	
	/**
	 * Writes the sample documents into the index before every test
	 * @throws Exception
	 */
	@Before
	public void setUp() throws Exception {
		dir  = FSDirectory.open(Paths.get("D:\\Software\\Softwarepath\\lucene\\demo\\demo2\\indexDir"));
		// obtain the IndexWriter
		IndexWriter indexWriter = getIndexWriter();
		for (int i = 0; i < ids.length; i++) {
			Document doc = new Document();
			doc.add(new StringField("id", ids[i], Field.Store.YES));
			doc.add(new StringField("city", citys[i], Field.Store.YES));
			doc.add(new TextField("desc", descs[i], Field.Store.NO));
			indexWriter.addDocument(doc);
		}
		indexWriter.close();// close the writer
	}

	/**
	 * Creates the index writer
	 * @return
	 * @throws Exception
	 */
	private IndexWriter getIndexWriter()  throws Exception{
		Analyzer analyzer = new StandardAnalyzer();// analyzer
		IndexWriterConfig conf = new IndexWriterConfig(analyzer);
		return new IndexWriter(dir, conf );
	}
	
	/**
	 * Prints how many documents the index contains
	 * @throws Exception
	 */
	@Test
	public void getWriteDocNum() throws Exception {
		IndexWriter indexWriter = getIndexWriter();
		System.out.println("The index contains "+indexWriter.numDocs()+" documents");
		indexWriter.close();// release the write lock so that later tests can open a writer
	}
	
	/**
	 * Delete before merge: the document is only marked as deleted
	 * 
	 * It is still physically present in the index
	 * @throws Exception
	 */
	@Test
	public void deleteDocBeforeMerge() throws Exception {
		IndexWriter indexWriter = getIndexWriter();
		System.out.println("maxDoc: "+indexWriter.maxDoc());
		indexWriter.deleteDocuments(new Term("id", "1"));
		indexWriter.commit();
		
		System.out.println("maxDoc: "+indexWriter.maxDoc());
		System.out.println("numDocs: "+indexWriter.numDocs());
		indexWriter.close();
	}
	
	/**
	 * Delete after merge
	 * 
	 * The deleted document is physically removed from the index, although in this version its terms are still kept
	 * @throws Exception
	 */
	@Test
	public void deleteDocAfterMerge() throws Exception {
//		https://blog.csdn.net/asdfsadfasdfsa/article/details/78820030
//		org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: an IndexWriter is thread-safe and meant to be shared as a single instance; a second writer cannot be opened on the same index directory.
		IndexWriter indexWriter = getIndexWriter();
		System.out.println("maxDoc: "+indexWriter.maxDoc());
		indexWriter.deleteDocuments(new Term("id", "1"));
		indexWriter.forceMergeDeletes(); // force the deleted documents to be merged away
		indexWriter.commit();
		
		System.out.println("maxDoc: "+indexWriter.maxDoc());
		System.out.println("numDocs: "+indexWriter.numDocs());
		indexWriter.close();
	}
	
	/**
	 * Tests updating a document
	 * @throws Exception
	 */
	@Test
	public void testUpdate()throws Exception{
		IndexWriter writer=getIndexWriter();
		Document doc=new Document();
		doc.add(new StringField("id", "1", Field.Store.YES));
		doc.add(new StringField("city","qingdao",Field.Store.YES));
		doc.add(new TextField("desc", "dsss is a city.", Field.Store.NO));
		writer.updateDocument(new Term("id","1"), doc);
		writer.close();
	}
}
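
The difference between the two delete tests, and the way an update works, can be modelled with a small plain-Java sketch. It is a conceptual model only and not how Lucene stores segments; `SoftDeleteSketch` and its methods are names invented for this example, chosen to mirror `maxDoc()`, `numDocs()`, `deleteDocuments`, `forceMergeDeletes` and `updateDocument`.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual model of "mark as deleted now, remove physically on merge".
class SoftDeleteSketch {
    private static final class Slot {
        final String id;
        boolean deleted;
        Slot(String id) { this.id = id; }
    }

    private final List<Slot> slots = new ArrayList<>();

    void addDocument(String id) { slots.add(new Slot(id)); }

    // Like deleting by term: only flips a flag, nothing is removed yet.
    void deleteById(String id) {
        for (Slot s : slots) {
            if (s.id.equals(id)) s.deleted = true;
        }
    }

    // Like a forced merge of deletes: flagged entries are physically dropped.
    void mergeDeletes() { slots.removeIf(s -> s.deleted); }

    // An update is modelled as "delete by id, then add the new version".
    void updateDocument(String id) {
        deleteById(id);
        addDocument(id);
    }

    // Counts every slot, including those flagged as deleted.
    int maxDoc() { return slots.size(); }

    // Counts only live documents.
    int numDocs() {
        int live = 0;
        for (Slot s : slots) {
            if (!s.deleted) live++;
        }
        return live;
    }

    public static void main(String[] args) {
        SoftDeleteSketch index = new SoftDeleteSketch();
        index.addDocument("1");
        index.addDocument("2");
        index.addDocument("3");
        index.deleteById("1");
        System.out.println(index.maxDoc() + " / " + index.numDocs()); // 3 / 2
        index.mergeDeletes();
        System.out.println(index.maxDoc() + " / " + index.numDocs()); // 2 / 2
    }
}
```

Before the merge the two counters disagree; after it they match again, which is the behaviour the two test methods print.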

Document field boosting

package com.dj.lucene;

import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.Before;
import org.junit.Test;

/**
 * Document field boosting
 * 
 * @author Administrator
 *
 */
public class Demo4 {
	private String ids[]={"1","2","3","4"};
	private String authors[]={"Jack","Marry","John","Json"};
	private String positions[]={"accounting","technician","salesperson","boss"};
	private String titles[]={"Java is a good language.","Java is a cross platform language","Java powerful","You should learn java"};
	private String contents[]={
			"If possible, use the same JRE major version at both index and search time.",
			"When upgrading to a different JRE major version, consider re-indexing. ",
			"Different JRE major versions may implement different versions of Unicode,",
			"For example: with Java 1.4, `LetterTokenizer` will split around the character U+02C6,"
	};
	
	private Directory dir;// index directory

	/**
	 * Writes the sample documents into the index before every test
	 * @throws Exception
	 */
	@Before
	public void setUp()throws Exception {
		dir = FSDirectory.open(Paths.get("D:\\Software\\Softwarepath\\lucene\\demo\\demo3\\indexDir"));
		// obtain the IndexWriter
		IndexWriter writer = getIndexWriter();
		for (int i = 0; i < authors.length; i++) {
			Document doc = new Document();
			doc.add(new StringField("id", ids[i], Field.Store.YES));
			doc.add(new StringField("author", authors[i], Field.Store.YES));
			doc.add(new StringField("position", positions[i], Field.Store.YES));
			
			TextField textField = new TextField("title", titles[i], Field.Store.YES);
			
//			Json pays for advertising and is boosted to first place
			//if("boss".equals(positions[i])) {
			//	textField.setBoost(2f);// set the boost; the default is 1
			//}
			
			doc.add(textField);
//			TextField is tokenized, StringField is not
			doc.add(new TextField("content", contents[i], Field.Store.NO));
			writer.addDocument(doc);
		}
		writer.close();// close the writer
		
	}

	/**
	 * Creates the index writer
	 * @return
	 * @throws Exception
	 */
	private IndexWriter getIndexWriter() throws Exception{
		Analyzer analyzer = new StandardAnalyzer();// analyzer
		IndexWriterConfig conf = new IndexWriterConfig(analyzer);
		return new IndexWriter(dir, conf);
	}
	
	@Test
	public void index() throws Exception{
		IndexReader reader = DirectoryReader.open(dir);
		IndexSearcher searcher = new IndexSearcher(reader);
		String fieldName = "title";
		String keyWord = "java";
		Term t = new Term(fieldName, keyWord);
		Query query = new TermQuery(t);
		TopDocs hits = searcher.search(query, 10);
		System.out.println("Keyword '"+keyWord+"' matched "+hits.totalHits+" documents");
		for (ScoreDoc scoreDoc : hits.scoreDocs) {
			Document doc = searcher.doc(scoreDoc.doc);
			System.out.println(doc.get("author"));
		}
	}
	
}
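
The effect of the commented-out `setBoost(2f)` call in `setUp()` can be pictured with a toy model in which a hit's final score is its base score multiplied by a boost. This is a deliberate simplification and not Lucene's scoring formula; the class `BoostSketch`, the `rank` method, and the base scores below are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy ranking model: final score = base score * boost (default boost is 1).
class BoostSketch {
    // Returns the names ordered by boosted score, highest first.
    static List<String> rank(Map<String, Float> baseScores, Map<String, Float> boosts) {
        List<String> names = new ArrayList<>(baseScores.keySet());
        names.sort((a, b) -> Float.compare(
                baseScores.get(b) * boosts.getOrDefault(b, 1f),
                baseScores.get(a) * boosts.getOrDefault(a, 1f)));
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical base scores for the four authors in Demo4.
        Map<String, Float> base = new LinkedHashMap<>();
        base.put("Jack", 0.40f);
        base.put("Marry", 0.35f);
        base.put("John", 0.50f);
        base.put("Json", 0.30f);

        Map<String, Float> boosts = new HashMap<>();
        System.out.println(rank(base, boosts)); // [John, Jack, Marry, Json]

        boosts.put("Json", 2f); // the "boss" entry gets a boost of 2
        System.out.println(rank(base, boosts)); // [Json, John, Jack, Marry]
    }
}
```

With a boost of 2 the lowest-scoring entry overtakes the others, which is the kind of reordering the demo is meant to show.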

With the boost lines commented out, the results come back in their normal ranking. With the boost enabled, Json, who paid for advertising, is ranked first. (The two result screenshots are not preserved.)

Term search

package com.dj.lucene;

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.junit.Before;
import org.junit.Test;

/**
 * Term search
 * 
 * Query expressions (QueryParser)
 * @author Administrator
 *
 */
public class Demo5 {
	@Before
	public void setUp() {
		// location where the index files will be stored
		String indexDir = "D:\\Software\\Softwarepath\\lucene\\demo\\demo4";
		// location of the source data
		String dataDir = "D:\\Software\\Softwarepath\\lucene\\demo\\demo4\\data";
		IndexCreate ic = null;// index builder
		try {
			ic = new IndexCreate(indexDir);
			long start = System.currentTimeMillis();
			int num = ic.index(dataDir);
			long end = System.currentTimeMillis();
			System.out.println("Indexed " + num + " files under the given path in " + (end - start) + " ms");
			
		} catch (Exception e) {
			e.printStackTrace();
		} finally {
			try {
				if (ic != null) {// the constructor may have failed
					ic.closeIndexWriter();
				}
			} catch (Exception e) {
				e.printStackTrace();
			}
		}
	}
	
	/**
	 * Term search
	 */
	@Test
	public void testTermQuery() {
		// location of the index files
		String indexDir = "D:\\Software\\Softwarepath\\lucene\\demo\\demo4";
		
		String fld = "contents";// name of the field to search
		String text = "indexformattoooldexception";// keyword to search for
//		a Term pairs the field name with the keyword
		Term t  = new Term(fld , text);
		TermQuery tq = new TermQuery(t  );
		try {
			FSDirectory indexDirectory = FSDirectory.open(Paths.get(indexDir));
//			Note: the index reader is not created with new; it is opened through DirectoryReader
			IndexReader indexReader = DirectoryReader.open(indexDirectory);
//			create the index searcher
			IndexSearcher is = new IndexSearcher(indexReader);
			
			TopDocs hits = is.search(tq, 100);
//			System.out.println(hits.totalHits);
			for(ScoreDoc scoreDoc: hits.scoreDocs) {
				Document doc = is.doc(scoreDoc.doc);
				System.out.println("File "+doc.get("fullPath")+" contains the keyword");
				
			}
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
	
	/**
	 * Query expressions (QueryParser)
	 */
	@Test
	public void testQueryParser() {
		// location of the index files
		String indexDir = "D:\\Software\\Softwarepath\\lucene\\demo\\demo4";
//		create the query parser (which analyzer parses which field)
		QueryParser queryParser = new QueryParser("contents", new StandardAnalyzer());
		try {
			FSDirectory indexDirectory = FSDirectory.open(Paths.get(indexDir));// open the index directory
//			Note: the index reader is not created with new; it is opened through DirectoryReader
			IndexReader indexReader = DirectoryReader.open(indexDirectory);// open the index reader on the directory
//			create the index searcher
			IndexSearcher is = new IndexSearcher(indexReader);
			
//			let the parser turn the keyword into a query
			TopDocs hits = is.search(queryParser.parse("indexformattoooldexception") , 100);// keyword: indexformattoooldexception
			for(ScoreDoc scoreDoc: hits.scoreDocs) {
				Document doc = is.doc(scoreDoc.doc);
				System.out.println("File "+doc.get("fullPath")+" contains the keyword");
				
			}
		} catch (IOException e) {
			e.printStackTrace();
		} catch (ParseException e) {
			e.printStackTrace();
		}
	}
	
}

Combined queries

package com.dj.lucene;

import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.junit.Before;
import org.junit.Test;

/**
 * Numeric range query
 * 
 * Prefix query on a string field (PrefixQuery)
 * 
 * @author Administrator
 *
 */
public class Demo6 {
	private int ids[]={1,2,3};
	private String citys[]={"qingdao","nanjing","shanghai"};
	private String descs[]={
			"Qingdao is a beautiful city.",
			"Nanjing is a city of culture.",
			"Shanghai is a bustling city."
	};
	private FSDirectory dir;
	
	/**
	 * Writes the sample documents into the index before every test
	 * @throws Exception
	 */
	@Before
	public void setUp() throws Exception {
		dir  = FSDirectory.open(Paths.get("D:\\Software\\Softwarepath\\lucene\\demo\\demo2\\indexDir"));
		IndexWriter indexWriter = getIndexWriter();
		for (int i = 0; i < ids.length; i++) {
			Document doc = new Document();
			doc.add(new IntField("id", ids[i], Field.Store.YES));
			doc.add(new StringField("city", citys[i], Field.Store.YES));
			doc.add(new TextField("desc", descs[i], Field.Store.YES));
			indexWriter.addDocument(doc);
		}
		indexWriter.close();
	}
	
	/**
	 * Creates the index writer
	 * @return
	 * @throws Exception
	 */
	private IndexWriter getIndexWriter()  throws Exception{
		Analyzer analyzer = new StandardAnalyzer();
		IndexWriterConfig conf = new IndexWriterConfig(analyzer);
		return new IndexWriter(dir, conf );
	}
	
	/**
	 * Numeric range query
	 * @throws Exception
	 */
	@Test
	public void testNumericRangeQuery()throws Exception{
		IndexReader reader = DirectoryReader.open(dir);
		IndexSearcher is = new IndexSearcher(reader);
		
		NumericRangeQuery<Integer> query=NumericRangeQuery.newIntRange("id", 1, 2, true, true);// closed interval [1,2]
		TopDocs hits=is.search(query, 10);
		for(ScoreDoc scoreDoc:hits.scoreDocs){
			Document doc=is.doc(scoreDoc.doc);
			System.out.println(doc.get("id"));
			System.out.println(doc.get("city"));
			System.out.println(doc.get("desc"));
		}		
	}
	
	/**
	 * Prefix query on a string field (PrefixQuery)
	 * @throws Exception
	 */
	@Test
	public void testPrefixQuery()throws Exception{
		IndexReader reader = DirectoryReader.open(dir);
		IndexSearcher is = new IndexSearcher(reader);
		
		PrefixQuery query=new PrefixQuery(new Term("city","n"));
		TopDocs hits=is.search(query, 10);
		for(ScoreDoc scoreDoc:hits.scoreDocs){
			Document doc=is.doc(scoreDoc.doc);
			System.out.println(doc.get("id"));
			System.out.println(doc.get("city"));
			System.out.println(doc.get("desc"));
		}	
	}
	
	/**
	 * Combined query
	 * 
	 * This is the query type used most often
	 * @throws Exception
	 */
	@Test
	public void testBooleanQuery()throws Exception{
		IndexReader reader = DirectoryReader.open(dir);
		IndexSearcher is = new IndexSearcher(reader);
		
		NumericRangeQuery<Integer> query1=NumericRangeQuery.newIntRange("id", 1, 2, true, true);
		PrefixQuery query2=new PrefixQuery(new Term("city","n"));// case-sensitive
		
		BooleanQuery.Builder booleanQuery=new BooleanQuery.Builder();// build the combined query
		booleanQuery.add(query1,BooleanClause.Occur.MUST);
		booleanQuery.add(query2,BooleanClause.Occur.MUST);
		
		TopDocs hits=is.search(booleanQuery.build(), 10);// run the search
		for(ScoreDoc scoreDoc:hits.scoreDocs){
			Document doc=is.doc(scoreDoc.doc);
			System.out.println(doc.get("id"));
			System.out.println(doc.get("city"));
			System.out.println(doc.get("desc"));
		}	
	}
}
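
What the two MUST clauses in `testBooleanQuery` amount to can be sketched in plain Java by treating each sub-query as the set of document IDs it matches. This is a conceptual model, not Lucene's execution strategy; `BooleanMustSketch` and its helper methods are names invented for this example, and the data mirrors the three cities in Demo6.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Conceptual sketch only: each "query" is modelled as the set of matching document IDs.
class BooleanMustSketch {
    // Range "query": IDs within [min, max], both ends inclusive.
    static Set<Integer> idRange(Set<Integer> ids, int min, int max) {
        Set<Integer> hits = new TreeSet<>();
        for (int id : ids) {
            if (id >= min && id <= max) hits.add(id);
        }
        return hits;
    }

    // Prefix "query": IDs whose city starts with the prefix (case-sensitive).
    static Set<Integer> cityPrefix(Map<Integer, String> cities, String prefix) {
        Set<Integer> hits = new TreeSet<>();
        for (Map.Entry<Integer, String> e : cities.entrySet()) {
            if (e.getValue().startsWith(prefix)) hits.add(e.getKey());
        }
        return hits;
    }

    // Two MUST clauses: a document has to satisfy both, so intersect the sets.
    static Set<Integer> must(Set<Integer> a, Set<Integer> b) {
        Set<Integer> both = new TreeSet<>(a);
        both.retainAll(b);
        return both;
    }

    // Two SHOULD clauses: matching either one is enough, so take the union.
    static Set<Integer> should(Set<Integer> a, Set<Integer> b) {
        Set<Integer> either = new TreeSet<>(a);
        either.addAll(b);
        return either;
    }

    public static void main(String[] args) {
        Map<Integer, String> cities = new LinkedHashMap<>();
        cities.put(1, "qingdao");
        cities.put(2, "nanjing");
        cities.put(3, "shanghai");

        Set<Integer> byId = idRange(cities.keySet(), 1, 2);
        Set<Integer> byCity = cityPrefix(cities, "n");
        System.out.println(must(byId, byCity));      // [2]
        System.out.println(should(byId, byCity));    // [1, 2]
        System.out.println(cityPrefix(cities, "N")); // []
    }
}
```

The last line shows why the case-sensitivity comment matters: an upper-case prefix matches nothing because the stored city names are lower-case.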

Lucene utility class

LuceneUtil

package com.dj.blog.util;

import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Formatter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryTermScorer;
import org.apache.lucene.search.highlight.Scorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

/**
 * Lucene utility class
 * 
 * @author Administrator
 *
 */
public class LuceneUtil {

	/**
	 * Opens the directory that holds the index files
	 * 
	 * @param path
	 * @return
	 */
	public static Directory getDirectory(String path) {
		Directory directory = null;
		try {
			directory = FSDirectory.open(Paths.get(path));
		} catch (IOException e) {
			e.printStackTrace();
		}
		return directory;
	}

	/**
	 * Keeps the index in memory
	 * 
	 * Rarely used
	 * @return
	 */
	public static Directory getRAMDirectory() {
		Directory directory = new RAMDirectory();
		return directory;
	}

	/**
	 * Opens a reader on the index directory
	 * 
	 * @param directory
	 * @return
	 */
	public static DirectoryReader getDirectoryReader(Directory directory) {
		DirectoryReader reader = null;
		try {
			reader = DirectoryReader.open(directory);
		} catch (IOException e) {
			e.printStackTrace();
		}
		return reader;
	}

	/**
	 * Creates the index searcher
	 * 
	 * @param reader
	 * @return
	 */
	public static IndexSearcher getIndexSearcher(DirectoryReader reader) {
		IndexSearcher indexSearcher = new IndexSearcher(reader);
		return indexSearcher;
	}

	/**
	 * Creates the index writer
	 * 
	 * @param directory
	 * @param analyzer
	 * @return
	 */
	public static IndexWriter getIndexWriter(Directory directory, Analyzer analyzer)

	{
		IndexWriter iwriter = null;
		try {
			IndexWriterConfig config = new IndexWriterConfig(analyzer);
			config.setOpenMode(OpenMode.CREATE_OR_APPEND);
			// Sort sort=new Sort(new SortField("content", Type.STRING));
			// config.setIndexSort(sort);// index sorting
			// commit automatically when the writer is closed
			config.setCommitOnClose(true);
			// config.setMergeScheduler(new ConcurrentMergeScheduler());
			// config.setIndexDeletionPolicy(new
			// SnapshotDeletionPolicy(NoDeletionPolicy.INSTANCE));
			iwriter = new IndexWriter(directory, config);
		} catch (IOException e) {
			e.printStackTrace();
		}
		return iwriter;
	}

	/**
	 * Closes the index writer and the directory
	 * 
	 * @param indexWriter
	 * @param directory
	 */
	public static void close(IndexWriter indexWriter, Directory directory) {
		if (indexWriter != null) {
			try {
				indexWriter.close();
			} catch (IOException e) {
				indexWriter = null;
			}
		}
		if (directory != null) {
			try {
				directory.close();
			} catch (IOException e) {
				directory = null;
			}
		}
	}

	/**
	 * Closes the index reader and the directory
	 * 
	 * @param reader
	 * @param directory
	 */
	public static void close(DirectoryReader reader, Directory directory) {
		if (reader != null) {
			try {
				reader.close();
			} catch (IOException e) {
				reader = null;
			}
		}
		if (directory != null) {
			try {
				directory.close();
			} catch (IOException e) {
				directory = null;
			}
		}

	}

	/**
	 * Builds a Highlighter that wraps matched terms in a red span
	 * 
	 * @param query
	 * @param fieldName
	 * @return
	 */

	public static Highlighter getHighlighter(Query query, String fieldName)

	{
		Formatter formatter = new SimpleHTMLFormatter("<span style='color:red'>", "</span>");
		Scorer fragmentScorer = new QueryTermScorer(query, fieldName);
		Highlighter highlighter = new Highlighter(formatter, fragmentScorer);
		highlighter.setTextFragmenter(new SimpleFragmenter(200));
		return highlighter;
	}
}

Building the Lucene index:

package com.dj.blog.web;

import java.io.IOException;
import java.nio.file.Paths;
import java.sql.SQLException;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import com.dj.blog.dao.BlogDao;
import com.dj.blog.util.PropertiesUtil;


/**
 * Builds the Lucene index
 * @author Administrator
 * 1. Build the index with IndexWriter
 * 2. Read the index and get the matching fragments
 * 3. Highlight the matching fragments
 *
 */
public class IndexStarter {
	private static BlogDao blogDao = new BlogDao();
	public static void main(String[] args) {
		IndexWriterConfig conf = new IndexWriterConfig(new SmartChineseAnalyzer());
		Directory d;
		IndexWriter indexWriter = null;
		try {
			d = FSDirectory.open(Paths.get(PropertiesUtil.getValue("indexPath")));
			indexWriter = new IndexWriter(d , conf );
			
//			build the index for every row in the database
			List<Map<String, Object>> list = blogDao.list(null, null);
			for (Map<String, Object> map : list) {
				Document doc = new Document();
				doc.add(new StringField("id", (String) map.get("id"), Field.Store.YES));
//				TextField tokenizes a whole phrase, e.g. "Java training institute"
				doc.add(new TextField("title", (String) map.get("title"), Field.Store.YES));
				doc.add(new StringField("url", (String) map.get("url"), Field.Store.YES));
				indexWriter.addDocument(doc);
			}
			
		} catch (IOException e) {
			e.printStackTrace();
		} catch (InstantiationException e) {
			e.printStackTrace();
		} catch (IllegalAccessException e) {
			e.printStackTrace();
		} catch (SQLException e) {
			e.printStackTrace();
		}finally {
			try {
				if(indexWriter!= null) {
					indexWriter.close();
				}
			} catch (IOException e) {
				// TODO Auto-generated catch block
				e.printStackTrace();
			}
		}
	}
}

Chinese analyzer and highlighting

package com.dj.lucene;

import java.io.StringReader;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.SimpleSpanFragmenter;
import org.apache.lucene.store.FSDirectory;
import org.junit.Before;
import org.junit.Test;

/**
 * Chinese analyzer
 * 
 * Highlighting
 * 
 * @author 86182
 *
 */
public class Demo7 {
	private Integer ids[] = { 1, 2, 3 };
	private String citys[] = { "青島", "南京", "上海" };
	// private String descs[]={
	// "青島是個美麗的城市。",
	// "南京是個有文化的城市。",
	// "上海市個繁華的城市。"
	// };
	private String descs[] = { "青島是個美麗的城市。",
			"南京是一個文化的城市南京，簡稱甯，是江蘇省會，地處中國東部地區，長江下遊，瀕江近海。全市下轄11個區，總面積6597平方公裡，2013年建成區面積752.83平方公裡，常住人口818.78萬，其中城鎮人口659.1萬人。[1-4] “江南佳麗地，金陵帝王州”，南京擁有着6000多年文明史、近2600年建城史和近500年的建都史，是中國四大古都之一，有“六朝古都”、“十朝都會”之稱，是中華文明的重要發祥地，曆史上曾數次庇佑華夏之正朔，長期是中國南方的政治、經濟、文化中心，擁有厚重的文化底蘊和豐富的曆史遺存。[5-7] 南京是國家重要的科教中心，自古以來就是一座崇文重教的城市，有“天下文樞”、“東南第一學”的美譽。截至2013年，南京有高等院校75所，其中211高校8所，僅次于北京上海；國家重點實驗室25所、國家重點學科169個、兩院院士83人，均居中國第三。[8-10]",
			"上海市個繁華的城市。" };

	private FSDirectory dir;

	/**
	 * Writes the sample documents into the index before every test
	 * 
	 * @throws Exception
	 */
	@Before
	public void setUp() throws Exception {
		dir = FSDirectory.open(Paths.get("D:\\Software\\Softwarepath\\lucene\\demo\\demo2\\indexDir"));
		IndexWriter indexWriter = getIndexWriter();
		for (int i = 0; i < ids.length; i++) {
			Document doc = new Document();
			doc.add(new IntField("id", ids[i], Field.Store.YES));
			doc.add(new StringField("city", citys[i], Field.Store.YES));
			doc.add(new TextField("desc", descs[i], Field.Store.YES));
			indexWriter.addDocument(doc);
		}
		indexWriter.close();
	}

	/**
	 * Creates the index writer
	 * 
	 * @return
	 * @throws Exception
	 */
	private IndexWriter getIndexWriter() throws Exception {
//		Analyzer analyzer = new StandardAnalyzer();// default analyzer
		Analyzer analyzer = new SmartChineseAnalyzer();// Chinese analyzer
		IndexWriterConfig conf = new IndexWriterConfig(analyzer);
		return new IndexWriter(dir, conf);
	}

	/**
	 * Empty test: running it lets setUp() build the index, which can then be inspected with Luke
	 * 
	 * @throws Exception
	 */
	@Test
	public void testIndexCreate() throws Exception {

	}

	/**
	 * Tests highlighting
	 * 
	 * @throws Exception
	 */
	@Test
	public void testHighlight() throws Exception {
		IndexReader reader = DirectoryReader.open(dir);
		IndexSearcher searcher = new IndexSearcher(reader);

		SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer();
		QueryParser parser = new QueryParser("desc", analyzer);
		// Query query = parser.parse("南京文化");
		Query query = parser.parse("南京文明");
		TopDocs hits = searcher.search(query, 100);

		// scorer that rates fragments against the query
		QueryScorer queryScorer = new QueryScorer(query);
		// fragmenter that cuts the text into fragments around the scored terms
		SimpleSpanFragmenter fragmenter = new SimpleSpanFragmenter(queryScorer);
		// markup used for highlighting
		SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter("<span style='color:red'><b>", "</b></span>");
		// the highlighter itself
		Highlighter highlighter = new Highlighter(htmlFormatter, queryScorer);
		// tell the highlighter which fragmenter to use
		highlighter.setTextFragmenter(fragmenter);

		for (ScoreDoc scoreDoc : hits.scoreDocs) {
			// the searcher fetches the document by its ID
			Document doc = searcher.doc(scoreDoc.doc);
			String desc = doc.get("desc");
			if (desc != null) {
				// a TokenStream is the stream of tokens extracted from a document field; the highlighter reads it to locate the matches
				TokenStream tokenStream = analyzer.tokenStream("desc", new StringReader(desc));
				System.out.println("Highlighted fragment: " + highlighter.getBestFragment(tokenStream, desc));
			}
			System.out.println("Full content: " + desc);
		}
		}

	}

}
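
At its core, highlighting means wrapping every occurrence of a query term in markup. The plain-Java sketch below shows just that step under simplifying assumptions: it matches literal substrings instead of analyzed tokens and does no fragment selection. `HighlightSketch` and `highlight` are names invented for this example and are not part of Lucene's highlighter API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of term highlighting; not Lucene's Highlighter.
class HighlightSketch {
    // Wraps every literal occurrence of any term in pre/post markup.
    static String highlight(String text, List<String> terms, String pre, String post) {
        StringBuilder alternation = new StringBuilder();
        for (String term : terms) {
            if (term.isEmpty()) continue;
            if (alternation.length() > 0) alternation.append('|');
            alternation.append(Pattern.quote(term)); // treat the term as plain text
        }
        if (alternation.length() == 0) return text; // nothing to highlight

        Matcher m = Pattern.compile(alternation.toString()).matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in the match from being interpreted
            m.appendReplacement(out, Matcher.quoteReplacement(pre + m.group() + post));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Nanjing is a city of culture";
        System.out.println(highlight(text, Arrays.asList("city", "culture"),
                "<span style='color:red'>", "</span>"));
    }
}
```

The real highlighter goes further: it works on the analyzer's tokens, scores candidate fragments against the query, and returns only the best one.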

When the title is empty, the data is looked up in the database.

When the title is not empty, the lookup goes through the index directory.

Note: the index directory must already contain index files, otherwise an error is raised!

no segments* file found in [email protected]:\Software\Softwarepath\lucene\demo\text lockFactory=[email protected]: files: []
