Lucene
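- Introduction to Lucene
- Lucene getting-started demo
- Creating the index
- Using the index
- Building and updating the index
- Document field boosting
- Term search
- Combined queries
- Lucene utility class
- Chinese analyzer and highlighting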
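Introduction to Lucene
Traditionally, data is looked up through the database: a click in the browser sends a request to the back end, the back end runs a SQL query against the database, and the result set is returned and rendered on the page.
With a search engine, the table data is not served to the page directly. Each record of the data source is wrapped in a Document (one record per Document) and indexed through an IndexWriter, and the resulting index files are stored in a designated index directory. At query time an IndexSearcher retrieves the matching records quickly from that index directory and returns them in a TopDocs object.
Note:
It is best not to query too large a data set with Lucene used this way, since problems become more likely, but a hundred thousand records is handled with ease. O(∩_∩)O
That said, Lucene is very convenient: add the jar to the project and it is ready to use. (●’◡’●)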
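1. Overview
Home page: http://lucene.apache.org/
1.1 What is Lucene?
Lucene is an open-source full-text search library written in Java. It can be embedded in your own project to add indexing and search; in short, it is a high-performance, extensible information-retrieval toolkit.
Like other open-source software it has some built-in advantages: its features and structure are transparent, it is powerful and extensible, and it is backed by a strong technical community, which also makes it easy to exchange knowledge.
Lucene is only the core search library and does not provide a ready-made search application, yet it is very widely used:
Solr, ElasticSearch, Katta and others are all built on Lucene.
Its main characteristic is a simple, easy-to-learn API (although the API differs considerably between versions).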
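How Lucene works: the inverted index
** What is an inverted index, and what is a forward index? **
My understanding:
Inverted index: after Lucene has tokenized the text, it maintains a mapping of the form "term -> document IDs". Searching for a term looks the term up in that mapping and returns the matching document IDs directly.
Forward index: no "term -> document IDs" mapping is maintained, so a search has to go through the content of every document and collect the IDs of the ones that match. This is bound to be slow, much like querying a database table that has no index.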
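The difference between the two lookups can be sketched in plain Java. This is a toy illustration and not Lucene's implementation: the class name ToyInvertedIndex, its methods and the simple lowercase/split tokenizer are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy comparison of an inverted-index lookup with a full scan (hypothetical, not Lucene code). */
final class ToyInvertedIndex {
    // Documents in insertion order; the list position serves as the document ID.
    private final List<String> docs = new ArrayList<>();
    // term -> sorted IDs of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    /** Stores the text, indexes its terms and returns the new document ID. */
    int add(String text) {
        int docId = docs.size();
        docs.add(text);
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    /** Inverted lookup: a single map access, however many documents are stored. */
    Set<Integer> search(String term) {
        TreeSet<Integer> ids = postings.get(term.toLowerCase());
        if (ids == null) {
            return Collections.emptySet();
        }
        return Collections.unmodifiableSet(ids);
    }

    /** Full scan: every document is tokenized again for every query. */
    Set<Integer> scan(String term) {
        Set<Integer> ids = new TreeSet<>();
        for (int id = 0; id < docs.size(); id++) {
            if (tokenize(docs.get(id)).contains(term.toLowerCase())) {
                ids.add(id);
            }
        }
        return ids;
    }

    // Simplistic tokenizer: lowercase, then split on anything that is not a-z or 0-9.
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("Qingdao is a beautiful city.");
        index.add("Nanjing is a city of culture.");
        index.add("Shanghai is a bustling city.");
        System.out.println(index.search("culture")); // prints [1]
        System.out.println(index.search("city"));    // prints [0, 1, 2]
    }
}
```

Both methods return the same IDs; the point of the sketch is that search() does its work once at indexing time, while scan() repeats it on every query.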
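2. Lucene core API
2.1 Core classes used when indexing
Document, Field, IndexWriter, Directory, Analyzer
Document: a document is a collection of fields (Field). It is the entity that carries the data and is an abstract concept, not a Word or TXT file.
A Document is the basic unit that gets indexed.
Constructor:
org.apache.lucene.document.Document.Document()
Common API:
org.apache.lucene.document.Document.add(IndexableField) // add a field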
Field: every Document in the index contains one or more fields (Field). A field is a name/value pair, and each field holds one piece of the document's data.
Common constructor:
org.apache.lucene.document.Field.Field(String, String, IndexableFieldType)
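IndexWriter is the central component of the indexing process. It is created from an Analyzer and a Directory (for example an FSDirectory that points at the index location).
It creates a new index or adds documents to an existing one. It provides write access to the index but cannot read or search it.
Constructor:
org.apache.lucene.index.IndexWriter.IndexWriter(Directory, IndexWriterConfig)
Core API:
org.apache.lucene.index.IndexWriter.addDocument(Iterable<? extends IndexableField>) // add a document
org.apache.lucene.index.IndexWriter.updateDocuments(Term, Iterable<? extends Iterable<? extends IndexableField>>) // replace the documents that match the term
org.apache.lucene.index.IndexWriter.tryDeleteDocument(IndexReader, int) // delete a document by its ID in the given reader
org.apache.lucene.index.IndexWriter.deleteDocuments(Term...) // delete the documents that contain the term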
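Directory is where the index is stored. It is an abstract class; concrete subclasses provide specific storage locations.
FSDirectory stores the index on disk at the given path; RAMDirectory stores the index in memory.
Creating an FSDirectory:
org.apache.lucene.store.FSDirectory.open(Path) // open the index directory
org.apache.lucene.store.FSDirectory.listAll(Path) // list the files in the index directory
Creating a RAMDirectory:
org.apache.lucene.store.RAMDirectory.RAMDirectory()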
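Analyzer: before text is indexed it passes through an analyzer, which extracts the terms (tokens) from the text to be indexed and discards the rest, such as stop words. The choice of analyzer matters a great deal, because different analyzers can produce very different terms from the same document. The default is the standard analyzer, which targets English.
Analyzer is an abstract class and the base class of all analyzers. It exposes the text as a TokenStream, which yields the terms one at a time.
Common analyzers:
org.apache.lucene.analysis.standard.StandardAnalyzer // standard analyzer
org.apache.lucene.analysis.core.SimpleAnalyzer // simple analyzer
org.wltea.analyzer.lucene.IKAnalyzer // IK analyzer (third-party, for Chinese)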
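What an analyzer does to a piece of text can be imitated in a few lines of plain Java. The sketch below is only an illustration under stated assumptions: ToyAnalyzer, its analyze method and the five-word stop list are hypothetical and are not Lucene's StandardAnalyzer or its default stop-word set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analysis chain: split, lowercase, drop stop words (hypothetical, not Lucene code). */
final class ToyAnalyzer {
    // Small illustrative stop-word list; an assumption made for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "is", "of", "the"));

    /** Returns the terms that would be indexed for the given text. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Split on every run of characters that are neither letters nor digits.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String term = raw.toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Nanjing is a city of culture."));
        // prints [nanjing, city, culture]
    }
}
```

A splitter like this depends on spaces and punctuation between words, so an unbroken Chinese sentence would come back as a single term, which is why a dedicated Chinese analyzer is needed for Chinese text.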
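2.2 Core classes used when searching
IndexSearcher, Term, Query, TermQuery, TopDocs
IndexSearcher: its search methods query the index that IndexWriter created.
Constructor:
org.apache.lucene.search.IndexSearcher.IndexSearcher(IndexReader)
Common API:
org.apache.lucene.search.IndexSearcher.search(Query, int) // search and return the n highest-scoring documents
org.apache.lucene.search.IndexSearcher.searchAfter(ScoreDoc, Query, int)
org.apache.lucene.search.IndexSearcher.search(Query, int, Sort)
Term is the basic unit of a search: a field name together with the text to look for in that field.
Query: Lucene ships with many Query subclasses,
such as TermQuery (single-term query), BooleanQuery, PhraseQuery and PrefixQuery, which express the search condition. TermQuery is the most basic and simplest query type: it matches the documents whose given field contains a specific Term.
TopDocs is a simple container of pointers to the ranked search results, that is, to the documents that match a query.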
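3. Luke, a tool for inspecting Lucene's analysis results
Git download page for the Luke releases: https://github.com/DmitryKey/luke/releases
Note: do not put the tool in a directory whose path contains Chinese characters.
Typical use:
Open the index directory, browse the terms, and run searches.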

Lucene getting-started demo
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-core</artifactId>
<version>5.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queryparser</artifactId>
<version>5.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-analyzers-common</artifactId>
<version>5.3.1</version>
</dependency>
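Goal: index a data directory and write the index files to a specified directory.
- 1. Constructor: instantiate the IndexWriter
  - Get the Directory object for the index location
  - Create the index writer
  - Build the writer's configuration
  - Give the configuration an analyzer
- 2. Close the index writer
- 3. Index all files under a given path
- 4. Index a single file
- 5. Build the Document (the key information stored in the index, as key-value pairs)
- 6. Test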
Creating the index
IndexCreate
package com.dj.lucene;
import java.io.File;
import java.io.FileReader;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
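/**
 * Creates the index.
 *
 * Used together with Demo1.java for the Lucene hello-world example.
 * @author Administrator
 */
public class IndexCreate {
    private IndexWriter indexWriter;

    /**
     * 1. Constructor: instantiates the IndexWriter.
     * @param indexDir directory in which the index files are stored
     * @throws Exception if the index directory cannot be opened
     */
    public IndexCreate(String indexDir) throws Exception {
        // Directory object for the location of the index files
        FSDirectory dir = FSDirectory.open(Paths.get(indexDir));
        // Standard analyzer (targets English)
        Analyzer analyzer = new StandardAnalyzer();
        // Writer configuration, wrapping the analyzer
        IndexWriterConfig conf = new IndexWriterConfig(analyzer);
        indexWriter = new IndexWriter(dir, conf);
    }

    /**
     * 2. Closes the index writer.
     * @throws Exception if closing fails
     */
    public void closeIndexWriter() throws Exception {
        indexWriter.close(); // closing writes the pending changes to the index directory
    }

    /**
     * 3. Indexes all files under the given path.
     * @param dataDir directory containing the files to index
     * @return number of documents in the index
     * @throws Exception if a file cannot be read or indexed
     */
    public int index(String dataDir) throws Exception {
        File[] files = new File(dataDir).listFiles(); // all entries in the data directory
        if (files == null) {
            // listFiles() returns null when the path is missing or is not a directory
            throw new IllegalArgumentException("Not a readable directory: " + dataDir);
        }
        for (File file : files) {
            if (file.isFile()) { // skip subdirectories, which FileReader cannot open
                indexFile(file); // index the files one by one
            }
        }
        return indexWriter.numDocs();
    }

    /**
     * 4. Indexes a single file.
     * @param file file to index
     * @throws Exception if the file cannot be read or indexed
     */
    private void indexFile(File file) throws Exception {
        System.out.println("Indexing file: " + file.getCanonicalPath());
        Document doc = getDocument(file); // one Document per file
        indexWriter.addDocument(doc);
    }

    /**
     * 5. Builds the Document (the key information stored in the index, as key-value pairs).
     * @param file file to wrap
     * @return the Document for the file
     * @throws Exception if the file cannot be read
     */
    private Document getDocument(File file) throws Exception {
        Document doc = new Document();
        // File content: tokenized and indexed but not stored; keyword searches run against this field
        doc.add(new TextField("contents", new FileReader(file)));
        // Field.Store.YES also stores the original value so it can be read back from a search hit
        doc.add(new TextField("fullPath", file.getCanonicalPath(), Field.Store.YES));
        doc.add(new TextField("fileName", file.getName(), Field.Store.YES));
        return doc;
    }
}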
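Using the index
Reading data back from the index files
- 1. Open the index reader (through DirectoryReader)
- 2. Create the index searcher (from the reader)
- 3. Build the query object (through a query parser, which is created with an analyzer)
- 4. Get the top-ranked documents that contain the keyword
- 5. Read the content of each matching document
IndexUse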
package com.dj.lucene;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
/**
 * Uses the index.
 *
 * Used together with Demo2.java for the Lucene hello-world example.
 * @author Administrator
 */
public class IndexUse {
    /**
     * Searches the index directory for a keyword.
     * @param indexDir directory that contains the index files
     * @param q keyword (query string)
     */
    public static void search(String indexDir, String q) throws Exception {
        FSDirectory indexDirectory = FSDirectory.open(Paths.get(indexDir));
        // Note: the index reader is not created with "new"; it is opened through DirectoryReader
        IndexReader indexReader = DirectoryReader.open(indexDirectory);
        // Index searcher
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        Analyzer analyzer = new StandardAnalyzer();
        // Query parser: "contents" is the default field that query terms are matched against
        QueryParser queryParser = new QueryParser("contents", analyzer);
        // Query object for the keyword
        Query query = queryParser.parse(q);
        long start = System.currentTimeMillis();
        // The 10 highest-scoring matching documents
        TopDocs topDocs = indexSearcher.search(query, 10);
        long end = System.currentTimeMillis();
        System.out.println("Matched \"" + q + "\" in " + (end - start) + " ms, " + topDocs.totalHits + " hits");
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            int docID = scoreDoc.doc;
            // Fetch the stored document by its document ID
            Document doc = indexSearcher.doc(docID);
            System.out.println("Found in file: " + doc.get("fullPath"));
        }
        indexReader.close();
    }
}
Building and updating the index
package com.dj.lucene;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.junit.Before;
import org.junit.Test;
/**
 * Building and updating the index.
 *
 * Adding, deleting and updating indexed documents (the key part).
 * @author Administrator
 */
public class Demo3 {
    private String ids[] = {"1", "2", "3"};
    private String citys[] = {"qingdao", "nanjing", "shanghai"};
    private String descs[] = {
        "Qingdao is a beautiful city.",
        "Nanjing is a city of culture.",
        "Shanghai is a bustling city."
    };
    private FSDirectory dir; // index directory

    /**
     * Rebuilds the index before every test.
     * @throws Exception
     */
    @Before
    public void setUp() throws Exception {
        dir = FSDirectory.open(Paths.get("D:\\Software\\Softwarepath\\lucene\\demo\\demo2\\indexDir"));
        IndexWriter indexWriter = getIndexWriter();
        // Start from an empty index so that repeated runs do not pile up duplicate documents
        indexWriter.deleteAll();
        for (int i = 0; i < ids.length; i++) {
            Document doc = new Document();
            doc.add(new StringField("id", ids[i], Field.Store.YES));
            doc.add(new StringField("city", citys[i], Field.Store.YES));
            doc.add(new TextField("desc", descs[i], Field.Store.NO));
            indexWriter.addDocument(doc);
        }
        indexWriter.close(); // closing writes the changes and releases the write lock
    }

    /**
     * Opens an index writer.
     * @return a new IndexWriter on the index directory
     * @throws Exception
     */
    private IndexWriter getIndexWriter() throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig conf = new IndexWriterConfig(analyzer);
        return new IndexWriter(dir, conf);
    }

    /**
     * Prints how many documents were written to the index.
     * @throws Exception
     */
    @Test
    public void getWriteDocNum() throws Exception {
        IndexWriter indexWriter = getIndexWriter();
        System.out.println("The index contains " + indexWriter.numDocs() + " documents");
        indexWriter.close(); // release the write lock so that later writers can open the index
    }

    /**
     * Deletes a document before merging.
     *
     * The document is only marked as deleted; it is still physically present in the index.
     * @throws Exception
     */
    @Test
    public void deleteDocBeforeMerge() throws Exception {
        IndexWriter indexWriter = getIndexWriter();
        System.out.println("maxDoc: " + indexWriter.maxDoc());
        indexWriter.deleteDocuments(new Term("id", "1"));
        indexWriter.commit();
        System.out.println("maxDoc: " + indexWriter.maxDoc());
        System.out.println("numDocs: " + indexWriter.numDocs());
        indexWriter.close();
    }

    /**
     * Deletes a document and merges the deletion away.
     *
     * The document is physically removed, although in this Lucene version its terms are still kept.
     * @throws Exception
     */
    @Test
    public void deleteDocAfterMerge() throws Exception {
        // See https://blog.csdn.net/asdfsadfasdfsa/article/details/78820030
        // org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine:
        // only one IndexWriter may be open on an index directory at a time (a single writer is thread-safe and can be shared).
        IndexWriter indexWriter = getIndexWriter();
        System.out.println("maxDoc: " + indexWriter.maxDoc());
        indexWriter.deleteDocuments(new Term("id", "1"));
        indexWriter.forceMergeDeletes(); // force the deleted documents to be merged away
        indexWriter.commit();
        System.out.println("maxDoc: " + indexWriter.maxDoc());
        System.out.println("numDocs: " + indexWriter.numDocs());
        indexWriter.close();
    }

    /**
     * Updates an indexed document.
     * @throws Exception
     */
    @Test
    public void testUpdate() throws Exception {
        IndexWriter writer = getIndexWriter();
        Document doc = new Document();
        doc.add(new StringField("id", "1", Field.Store.YES));
        doc.add(new StringField("city", "qingdao", Field.Store.YES));
        doc.add(new TextField("desc", "dsss is a city.", Field.Store.NO));
        // Replaces the documents whose id is "1" with the new document
        writer.updateDocument(new Term("id", "1"), doc);
        writer.close();
    }
}
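The difference between the two delete tests, marking a document as deleted versus physically removing it on merge, can be pictured with a small plain-Java model. It is a sketch under stated assumptions, not Lucene's data structures: ToySegment and its methods are hypothetical names, and only the maxDoc/numDocs naming is borrowed from the code above.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy model of "mark as deleted, reclaim on merge" (hypothetical, not Lucene code). */
final class ToySegment {
    private List<String> ids = new ArrayList<>();
    private BitSet deleted = new BitSet();

    void add(String id) {
        ids.add(id);
    }

    /** Marks every entry with the given id as deleted without removing it. */
    void delete(String id) {
        for (int i = 0; i < ids.size(); i++) {
            if (ids.get(i).equals(id)) {
                deleted.set(i);
            }
        }
    }

    /** Rewrites the segment without the marked entries. */
    void mergeDeletes() {
        List<String> live = new ArrayList<>();
        for (int i = 0; i < ids.size(); i++) {
            if (!deleted.get(i)) {
                live.add(ids.get(i));
            }
        }
        ids = live;
        deleted = new BitSet();
    }

    /** Number of slots in use, including entries that are only marked as deleted. */
    int maxDoc() {
        return ids.size();
    }

    /** Number of live entries. */
    int numDocs() {
        return ids.size() - deleted.cardinality();
    }

    public static void main(String[] args) {
        ToySegment segment = new ToySegment();
        segment.add("1");
        segment.add("2");
        segment.add("3");
        segment.delete("1");
        System.out.println(segment.maxDoc() + " / " + segment.numDocs()); // prints 3 / 2
        segment.mergeDeletes();
        System.out.println(segment.maxDoc() + " / " + segment.numDocs()); // prints 2 / 2
    }
}
```

Marking is cheap because nothing is rewritten; the space is only reclaimed when the entries are rewritten during the merge step.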
Document field boosting
package com.dj.lucene;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.Before;
import org.junit.Test;
/**
 * Document field boosting.
 *
 * @author Administrator
 */
public class Demo4 {
    private String ids[] = {"1", "2", "3", "4"};
    private String authors[] = {"Jack", "Marry", "John", "Json"};
    private String positions[] = {"accounting", "technician", "salesperson", "boss"};
    private String titles[] = {"Java is a good language.", "Java is a cross platform language", "Java powerful", "You should learn java"};
    private String contents[] = {
        "If possible, use the same JRE major version at both index and search time.",
        "When upgrading to a different JRE major version, consider re-indexing. ",
        "Different JRE major versions may implement different versions of Unicode,",
        "For example: with Java 1.4, `LetterTokenizer` will split around the character U+02C6,"
    };
    private Directory dir; // index directory

    /**
     * Rebuilds the index before every test.
     * @throws Exception
     */
    @Before
    public void setUp() throws Exception {
        dir = FSDirectory.open(Paths.get("D:\\Software\\Softwarepath\\lucene\\demo\\demo3\\indexDir"));
        IndexWriter writer = getIndexWriter();
        // Start from an empty index so that repeated runs do not pile up duplicate documents
        writer.deleteAll();
        for (int i = 0; i < authors.length; i++) {
            Document doc = new Document();
            doc.add(new StringField("id", ids[i], Field.Store.YES));
            doc.add(new StringField("author", authors[i], Field.Store.YES));
            doc.add(new StringField("position", positions[i], Field.Store.YES));
            TextField textField = new TextField("title", titles[i], Field.Store.YES);
            // Json pays for promotion, which pushes his title to the top of the ranking
            //if ("boss".equals(positions[i])) {
            //    textField.setBoost(2f); // set the boost; the default is 1
            //}
            doc.add(textField);
            // TextField is tokenized, StringField is not
            doc.add(new TextField("content", contents[i], Field.Store.NO));
            writer.addDocument(doc);
        }
        writer.close();
    }

    /**
     * Opens an index writer.
     * @return a new IndexWriter on the index directory
     * @throws Exception
     */
    private IndexWriter getIndexWriter() throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig conf = new IndexWriterConfig(analyzer);
        return new IndexWriter(dir, conf);
    }

    @Test
    public void index() throws Exception {
        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        String fieldName = "title";
        String keyWord = "java";
        Term t = new Term(fieldName, keyWord);
        Query query = new TermQuery(t);
        TopDocs hits = searcher.search(query, 10);
        System.out.println("Keyword '" + keyWord + "' matched " + hits.totalHits + " documents");
        // Authors are printed in ranking order, highest score first
        for (ScoreDoc scoreDoc : hits.scoreDocs) {
            Document doc = searcher.doc(scoreDoc.doc);
            System.out.println(doc.get("author"));
        }
        reader.close();
    }
}
Normal ranking (boost code commented out):
Ranking after Json pays for promotion (boost code enabled):
Term search
package com.dj.lucene;
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.junit.Before;
import org.junit.Test;
/**
 * Term search.
 *
 * Query expressions (QueryParser).
 * @author Administrator
 */
public class Demo5 {
    @Before
    public void setUp() {
        // Directory where the index files will be stored
        String indexDir = "D:\\Software\\Softwarepath\\lucene\\demo\\demo4";
        // Directory containing the source data
        String dataDir = "D:\\Software\\Softwarepath\\lucene\\demo\\demo4\\data";
        IndexCreate ic = null;
        try {
            ic = new IndexCreate(indexDir);
            long start = System.currentTimeMillis();
            int num = ic.index(dataDir);
            long end = System.currentTimeMillis();
            System.out.println("Indexed the given path; the index holds " + num + " documents, took " + (end - start) + " ms");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (ic != null) { // the constructor may have failed, leaving ic null
                    ic.closeIndexWriter();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * Term search.
     */
    @Test
    public void testTermQuery() {
        // Directory where the index files are stored
        String indexDir = "D:\\Software\\Softwarepath\\lucene\\demo\\demo4";
        String fld = "contents"; // name of the field to search
        String text = "indexformattoooldexception"; // term to look for
        // A Term is a field name plus the text to find in that field
        Term t = new Term(fld, text);
        TermQuery tq = new TermQuery(t);
        try {
            FSDirectory indexDirectory = FSDirectory.open(Paths.get(indexDir));
            // Note: the index reader is not created with "new"; it is opened through DirectoryReader
            IndexReader indexReader = DirectoryReader.open(indexDirectory);
            // Index searcher
            IndexSearcher is = new IndexSearcher(indexReader);
            TopDocs hits = is.search(tq, 100);
            // System.out.println(hits.totalHits);
            for (ScoreDoc scoreDoc : hits.scoreDocs) {
                Document doc = is.doc(scoreDoc.doc);
                System.out.println("File " + doc.get("fullPath") + " contains the keyword");
            }
            indexReader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Query expressions (QueryParser).
     */
    @Test
    public void testQueryParser() {
        // Directory where the index files are stored
        String indexDir = "D:\\Software\\Softwarepath\\lucene\\demo\\demo4";
        // Query parser: which analyzer to use and which field to search by default
        QueryParser queryParser = new QueryParser("contents", new StandardAnalyzer());
        try {
            FSDirectory indexDirectory = FSDirectory.open(Paths.get(indexDir));
            // Note: the index reader is not created with "new"; it is opened through DirectoryReader
            IndexReader indexReader = DirectoryReader.open(indexDirectory);
            // Index searcher
            IndexSearcher is = new IndexSearcher(indexReader);
            // The parser turns the keyword "indexformattoooldexception" into a query
            TopDocs hits = is.search(queryParser.parse("indexformattoooldexception"), 100);
            for (ScoreDoc scoreDoc : hits.scoreDocs) {
                Document doc = is.doc(scoreDoc.doc);
                System.out.println("File " + doc.get("fullPath") + " contains the keyword");
            }
            indexReader.close();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }
}
Combined queries
package com.dj.lucene;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.junit.Before;
import org.junit.Test;
/**
 * Numeric range query.
 *
 * Prefix query (PrefixQuery): match strings that start with given letters.
 *
 * @author Administrator
 */
public class Demo6 {
    private int ids[] = {1, 2, 3};
    private String citys[] = {"qingdao", "nanjing", "shanghai"};
    private String descs[] = {
        "Qingdao is a beautiful city.",
        "Nanjing is a city of culture.",
        "Shanghai is a bustling city."
    };
    private FSDirectory dir;

    /**
     * Rebuilds the index before every test.
     * @throws Exception
     */
    @Before
    public void setUp() throws Exception {
        dir = FSDirectory.open(Paths.get("D:\\Software\\Softwarepath\\lucene\\demo\\demo2\\indexDir"));
        IndexWriter indexWriter = getIndexWriter();
        // Start from an empty index so that repeated runs do not pile up duplicate documents
        indexWriter.deleteAll();
        for (int i = 0; i < ids.length; i++) {
            Document doc = new Document();
            doc.add(new IntField("id", ids[i], Field.Store.YES));
            doc.add(new StringField("city", citys[i], Field.Store.YES));
            doc.add(new TextField("desc", descs[i], Field.Store.YES));
            indexWriter.addDocument(doc);
        }
        indexWriter.close();
    }

    /**
     * Opens an index writer.
     * @return a new IndexWriter on the index directory
     * @throws Exception
     */
    private IndexWriter getIndexWriter() throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig conf = new IndexWriterConfig(analyzer);
        return new IndexWriter(dir, conf);
    }

    /**
     * Numeric range query.
     * @throws Exception
     */
    @Test
    public void testNumericRangeQuery() throws Exception {
        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher is = new IndexSearcher(reader);
        // Closed interval [1, 2]
        NumericRangeQuery<Integer> query = NumericRangeQuery.newIntRange("id", 1, 2, true, true);
        TopDocs hits = is.search(query, 10);
        for (ScoreDoc scoreDoc : hits.scoreDocs) {
            Document doc = is.doc(scoreDoc.doc);
            System.out.println(doc.get("id"));
            System.out.println(doc.get("city"));
            System.out.println(doc.get("desc"));
        }
        reader.close();
    }

    /**
     * Prefix query (PrefixQuery): match strings that start with given letters.
     * @throws Exception
     */
    @Test
    public void testPrefixQuery() throws Exception {
        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher is = new IndexSearcher(reader);
        PrefixQuery query = new PrefixQuery(new Term("city", "n"));
        TopDocs hits = is.search(query, 10);
        for (ScoreDoc scoreDoc : hits.scoreDocs) {
            Document doc = is.doc(scoreDoc.doc);
            System.out.println(doc.get("id"));
            System.out.println(doc.get("city"));
            System.out.println(doc.get("desc"));
        }
        reader.close();
    }

    /**
     * Combined query.
     *
     * This is the query type used most often.
     * @throws Exception
     */
    @Test
    public void testBooleanQuery() throws Exception {
        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher is = new IndexSearcher(reader);
        NumericRangeQuery<Integer> query1 = NumericRangeQuery.newIntRange("id", 1, 2, true, true);
        PrefixQuery query2 = new PrefixQuery(new Term("city", "n")); // case-sensitive
        BooleanQuery.Builder booleanQuery = new BooleanQuery.Builder(); // collects the clauses
        // MUST: a document has to match both clauses
        booleanQuery.add(query1, BooleanClause.Occur.MUST);
        booleanQuery.add(query2, BooleanClause.Occur.MUST);
        TopDocs hits = is.search(booleanQuery.build(), 10);
        for (ScoreDoc scoreDoc : hits.scoreDocs) {
            Document doc = is.doc(scoreDoc.doc);
            System.out.println(doc.get("id"));
            System.out.println(doc.get("city"));
            System.out.println(doc.get("desc"));
        }
        reader.close();
    }
}
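The effect of combining the range clause and the prefix clause with MUST can be reproduced on the same three cities without Lucene. This is a minimal sketch and not how Lucene evaluates a BooleanQuery: ToyBooleanQuery and mustBoth are hypothetical names, and the data is copied from the demo above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy version of "range MUST match AND prefix MUST match" (hypothetical, not Lucene code). */
final class ToyBooleanQuery {
    // Same sample data as the demo: the id and city arrays are aligned by position.
    static final int[] IDS = {1, 2, 3};
    static final String[] CITIES = {"qingdao", "nanjing", "shanghai"};

    /** Returns the cities whose id lies in [minId, maxId] and whose name starts with prefix. */
    static List<String> mustBoth(int minId, int maxId, String prefix) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < IDS.length; i++) {
            boolean inRange = IDS[i] >= minId && IDS[i] <= maxId; // closed interval
            boolean hasPrefix = CITIES[i].startsWith(prefix);     // case-sensitive
            if (inRange && hasPrefix) {                           // both clauses are required
                hits.add(CITIES[i]);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(mustBoth(1, 2, "n")); // prints [nanjing]
        System.out.println(mustBoth(1, 2, "N")); // prints []
        System.out.println(mustBoth(3, 3, "n")); // prints []
    }
}
```

The second call returns nothing because the comparison is case-sensitive, which mirrors the comment on the prefix clause in the demo: the city field is a StringField and is stored without lowercasing.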
Lucene utility class
LuceneUtil
package com.dj.blog.util;
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Formatter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryTermScorer;
import org.apache.lucene.search.highlight.Scorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;
/**
 * Lucene utility class.
 *
 * @author Administrator
 */
public class LuceneUtil {
    /**
     * Opens the Directory in which the index files are stored.
     *
     * @param path path of the index directory
     * @return the Directory, or null if it could not be opened
     */
    public static Directory getDirectory(String path) {
        Directory directory = null;
        try {
            directory = FSDirectory.open(Paths.get(path));
        } catch (IOException e) {
            e.printStackTrace();
        }
        return directory;
    }

    /**
     * Keeps the index in memory.
     *
     * Rarely used.
     * @return an in-memory Directory
     */
    public static Directory getRAMDirectory() {
        Directory directory = new RAMDirectory();
        return directory;
    }

    /**
     * Opens a reader on the index directory.
     *
     * @param directory index directory
     * @return the reader, or null if it could not be opened
     */
    public static DirectoryReader getDirectoryReader(Directory directory) {
        DirectoryReader reader = null;
        try {
            reader = DirectoryReader.open(directory);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return reader;
    }

    /**
     * Creates an index searcher.
     *
     * @param reader reader on the index directory
     * @return the searcher
     */
    public static IndexSearcher getIndexSearcher(DirectoryReader reader) {
        IndexSearcher indexSearcher = new IndexSearcher(reader);
        return indexSearcher;
    }

    /**
     * Creates an index writer.
     *
     * @param directory index directory
     * @param analyzer analyzer used when indexing
     * @return the writer, or null if it could not be created
     */
    public static IndexWriter getIndexWriter(Directory directory, Analyzer analyzer) {
        IndexWriter iwriter = null;
        try {
            IndexWriterConfig config = new IndexWriterConfig(analyzer);
            // Create the index if it does not exist yet, otherwise append to it
            config.setOpenMode(OpenMode.CREATE_OR_APPEND);
            // Sort sort=new Sort(new SortField("content", Type.STRING));
            // config.setIndexSort(sort); // sorting
            // Commit automatically when the writer is closed
            config.setCommitOnClose(true);
            // config.setMergeScheduler(new ConcurrentMergeScheduler());
            // config.setIndexDeletionPolicy(new
            // SnapshotDeletionPolicy(NoDeletionPolicy.INSTANCE));
            iwriter = new IndexWriter(directory, config);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return iwriter;
    }

    /**
     * Closes the index writer and the directory.
     *
     * @param indexWriter writer to close, may be null
     * @param directory directory to close, may be null
     */
    public static void close(IndexWriter indexWriter, Directory directory) {
        if (indexWriter != null) {
            try {
                indexWriter.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        if (directory != null) {
            try {
                directory.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * Closes the index reader and the directory.
     *
     * @param reader reader to close, may be null
     * @param directory directory to close, may be null
     */
    public static void close(DirectoryReader reader, Directory directory) {
        if (reader != null) {
            try {
                reader.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        if (directory != null) {
            try {
                directory.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * Creates a highlighter that wraps matched terms in a red span.
     *
     * @param query query whose terms are highlighted
     * @param fieldName field the highlighted text comes from
     * @return the highlighter
     */
    public static Highlighter getHighlighter(Query query, String fieldName) {
        Formatter formatter = new SimpleHTMLFormatter("<span style='color:red'>", "</span>");
        Scorer fragmentScorer = new QueryTermScorer(query, fieldName);
        Highlighter highlighter = new Highlighter(formatter, fragmentScorer);
        // Fragments of about 200 characters
        highlighter.setTextFragmenter(new SimpleFragmenter(200));
        return highlighter;
    }
}
Building the Lucene index:
package com.dj.blog.web;
import java.io.IOException;
import java.nio.file.Paths;
import java.sql.SQLException;
import java.util.List;
import java.util.Map;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import com.dj.blog.dao.BlogDao;
import com.dj.blog.util.PropertiesUtil;
/**
 * Builds the Lucene index.
 * @author Administrator
 * 1. Build the index with IndexWriter
 * 2. Read the index files and get the matching fragments
 * 3. Highlight the matching fragments
 */
public class IndexStarter {
    private static BlogDao blogDao = new BlogDao();

    public static void main(String[] args) {
        IndexWriterConfig conf = new IndexWriterConfig(new SmartChineseAnalyzer());
        Directory d;
        IndexWriter indexWriter = null;
        try {
            d = FSDirectory.open(Paths.get(PropertiesUtil.getValue("indexPath")));
            indexWriter = new IndexWriter(d, conf);
            // Index every record in the database
            List<Map<String, Object>> list = blogDao.list(null, null);
            for (Map<String, Object> map : list) {
                Document doc = new Document();
                doc.add(new StringField("id", (String) map.get("id"), Field.Store.YES));
                // TextField tokenizes a sentence, for example a title such as "java培訓機構"
                doc.add(new TextField("title", (String) map.get("title"), Field.Store.YES));
                doc.add(new StringField("url", (String) map.get("url"), Field.Store.YES));
                indexWriter.addDocument(doc);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InstantiationException e) {
            e.printStackTrace();
        } catch (IllegalAccessException e) {
            e.printStackTrace();
        } catch (SQLException e) {
            e.printStackTrace();
        } finally {
            try {
                if (indexWriter != null) {
                    indexWriter.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
Chinese analyzer and highlighting
package com.dj.lucene;
import java.io.StringReader;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.SimpleSpanFragmenter;
import org.apache.lucene.store.FSDirectory;
import org.junit.Before;
import org.junit.Test;
/**
 * Chinese analyzer.
 *
 * Highlighting.
 *
 * @author 86182
 */
public class Demo7 {
private Integer ids[] = { 1, 2, 3 };
private String citys[] = { "青島", "南京", "上海" };
// private String descs[]={
// "青島是個美麗的城市。",
// "南京是個有文化的城市。",
// "上海市個繁華的城市。"
// };
private String descs[] = { "青島是個美麗的城市。",
"南京是一個文化的城市南京,簡稱甯,是江蘇省會,地處中國東部地區,長江下遊,瀕江近海。全市下轄11個區,總面積6597平方公裡,2013年建成區面積752.83平方公裡,常住人口818.78萬,其中城鎮人口659.1萬人。[1-4] “江南佳麗地,金陵帝王州”,南京擁有着6000多年文明史、近2600年建城史和近500年的建都史,是中國四大古都之一,有“六朝古都”、“十朝都會”之稱,是中華文明的重要發祥地,曆史上曾數次庇佑華夏之正朔,長期是中國南方的政治、經濟、文化中心,擁有厚重的文化底蘊和豐富的曆史遺存。[5-7] 南京是國家重要的科教中心,自古以來就是一座崇文重教的城市,有“天下文樞”、“東南第一學”的美譽。截至2013年,南京有高等院校75所,其中211高校8所,僅次于北京上海;國家重點實驗室25所、國家重點學科169個、兩院院士83人,均居中國第三。[8-10]",
"上海市個繁華的城市。" };
    private FSDirectory dir;

    /**
     * Rebuilds the index before every test.
     *
     * @throws Exception
     */
    @Before
    public void setUp() throws Exception {
        dir = FSDirectory.open(Paths.get("D:\\Software\\Softwarepath\\lucene\\demo\\demo2\\indexDir"));
        IndexWriter indexWriter = getIndexWriter();
        // Start from an empty index so that repeated runs do not pile up duplicate documents
        indexWriter.deleteAll();
        for (int i = 0; i < ids.length; i++) {
            Document doc = new Document();
            doc.add(new IntField("id", ids[i], Field.Store.YES));
            doc.add(new StringField("city", citys[i], Field.Store.YES));
            doc.add(new TextField("desc", descs[i], Field.Store.YES));
            indexWriter.addDocument(doc);
        }
        indexWriter.close();
    }

    /**
     * Opens an index writer.
     *
     * @return a new IndexWriter on the index directory
     * @throws Exception
     */
    private IndexWriter getIndexWriter() throws Exception {
        // Analyzer analyzer = new StandardAnalyzer(); // default analyzer
        Analyzer analyzer = new SmartChineseAnalyzer(); // Chinese analyzer
        IndexWriterConfig conf = new IndexWriterConfig(analyzer);
        return new IndexWriter(dir, conf);
    }

    /**
     * Only runs setUp so that the generated index can be inspected with Luke.
     *
     * @throws Exception
     */
    @Test
    public void testIndexCreate() throws Exception {
    }

    /**
     * Tests highlighting.
     *
     * @throws Exception
     */
    @Test
    public void testHighlight() throws Exception {
        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer();
        QueryParser parser = new QueryParser("desc", analyzer);
        // Query query = parser.parse("南京文化");
        Query query = parser.parse("南京文明");
        TopDocs hits = searcher.search(query, 100);
        // Scores text fragments by the query terms they contain
        QueryScorer queryScorer = new QueryScorer(query);
        // Splits the text into fragments around the scored terms
        SimpleSpanFragmenter fragmenter = new SimpleSpanFragmenter(queryScorer);
        // Markup placed around each highlighted term
        SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter("<span style='color:red'><b>", "</b></span>");
        // The highlighter itself
        Highlighter highlighter = new Highlighter(htmlFormatter, queryScorer);
        // Tell the highlighter how to cut the text into fragments
        highlighter.setTextFragmenter(fragmenter);
        for (ScoreDoc scoreDoc : hits.scoreDocs) {
            // Fetch the stored document by its document ID
            Document doc = searcher.doc(scoreDoc.doc);
            String desc = doc.get("desc");
            if (desc != null) {
                // A TokenStream is the stream of tokens the analyzer produces from a document field
                TokenStream tokenStream = analyzer.tokenStream("desc", new StringReader(desc));
                System.out.println("Highlighted fragment: " + highlighter.getBestFragment(tokenStream, desc));
            }
            System.out.println("Full content: " + desc);
        }
        reader.close();
    }
}
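At its core, highlighting means wrapping each matched term in markup. The plain-Java sketch below shows only that step, under stated assumptions: ToyHighlighter and its highlight method are hypothetical, it matches one literal term, and it does none of the fragment selection or scoring that Lucene's Highlighter performs.

```java
/** Toy highlighter that wraps every occurrence of one term in tags (hypothetical, not Lucene code). */
final class ToyHighlighter {
    /** Returns text with each occurrence of term surrounded by pre and post. */
    static String highlight(String text, String term, String pre, String post) {
        if (term.isEmpty()) {
            return text; // nothing to highlight
        }
        StringBuilder out = new StringBuilder();
        int from = 0;
        int hit;
        while ((hit = text.indexOf(term, from)) >= 0) {
            // Copy the text before the hit, then the hit wrapped in the tags
            out.append(text, from, hit).append(pre).append(term).append(post);
            from = hit + term.length();
        }
        // Copy whatever follows the last hit
        return out.append(text.substring(from)).toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("Nanjing is a city of culture", "city", "<b>", "</b>"));
        // prints Nanjing is a <b>city</b> of culture
    }
}
```

The demo above gets the terms to wrap from the parsed query and the analyzer's token stream rather than from a literal string, which is what lets it highlight Chinese words that the analyzer has segmented.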
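When the title is empty: the data is read from the database.
When the title is not empty: the data is looked up in the index directory.
Note: the index directory must already contain index files, otherwise the search fails with an error such as: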
no segments* file found in [email protected]:\Software\Softwarepath\lucene\demo\text lockFactory=[email protected]: files: []