Technology Selection
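This space is essentially monopolized by Lucene, which has almost no competitors.
Using Lucene directly is quite complex, though, so two components have been built on top of it: Solr and Elasticsearch (ES). Elasticsearch is the more popular of the two, but it does not win in every scenario. When the index is already built, for example full-text search over all the documents on someone's office PC, Solr performs noticeably better than ES. When dynamic data is continuously inserted into the index, as in Internet applications, ES performs noticeably better than Solr.
For an enterprise document management system, documents do change, but less often than data in Internet applications, so either Solr or ES would work. Given its popularity, ES was the final choice.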
Choosing an Integration Approach
With ES chosen, there are several technical options for integrating it with Spring Boot:
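1. transport-api: the transport-api differs between Spring Boot versions and cannot adapt to different ES versions; it is not recommended on 7.x and is dropped from 8 onward;
2. JestClient: unofficial and slow to update;
3. RestTemplate: simulates sending HTTP requests; many ES operations have to be wrapped by hand, which is tedious;
4. HttpClient: same as above;
5. ElasticSearch-Rest-Client: the official RestClient, which wraps many ES operations;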
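Pros: the API is clearly layered and easy to pick up;
Cons:
- JSON strings have to be concatenated in many places, and anyone who has concatenated strings in Java knows how unpleasant that is
- Objects have to be serialized to JSON by hand before they are stored
- Query results also have to be deserialized back into objects by hand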
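6. Spring Data Elasticsearch
Supports Spring's @Configuration-based Java configuration as well as XML configuration
Provides ElasticsearchTemplate, a convenient utility class for working with ES, including automatic mapping between documents and POJOs
Offers feature-rich object mapping built on Spring's data conversion service
Uses annotation-based metadata mapping, which can be extended to support more data formats
Generates implementations automatically from persistence-layer interfaces, so basic operations need no hand-written code (similar to MyBatis, where the implementation is derived from the interface). Custom queries are supported as well.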
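Approaches 1-4 have obvious drawbacks and were ruled out. Approach 5 is the official ES client library and is fairly easy to use, but it still has some drawbacks. Approach 6 is provided by Spring; given how well it fits into the wider Spring family, Spring Data Elasticsearch was the final choice.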
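Requirements Overview
In terms of requirements, the objects that need full-text search are documents (searching folder names is of limited value and is not considered). Two pieces of information take part in the search: the document name and the document content. Results can be sorted by relevance, creation time, and update time.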
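Reading File Content
Creating the index requires reading the file content.
File Types
As a rough cut, the formats worth full-text searching are the common ones: Office, text, code, PDF, and Markdown. Images, video, audio, and archives are not processed. The types need to be defined explicitly in the configuration file so that a suitable reader can be chosen for each.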
Office: "docx", "wps", "doc", "xls", "xlsx", "ppt", "pptx"
Images: "jpg", "jpeg", "png", "gif", "bmp", "ico", "raw"
Text: txt,html,htm,asp,jsp,xml,json,properties,md,gitignore,log,java,py,c,cpp,sql,sh,bat,m,bas,prg,cmd
Code: "java", "c", "php", "go", "python", "py", "js", "html", "ftl", "css", "lua", "sh", "rb", "yml", "json", "h", "cpp", "cs", "aspx", "jsp"
Archives: "rar", "zip", "jar", "7-zip", "tar", "gzip", "7z"
Multimedia: mp3,wav,mp4
markdown: md
xml: xml
pdf: pdf
flv: flv
cad: cad
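A minimal sketch of how such a type list might be consulted when picking a reader. The class name `FileTypeRegistry`, the category names, and the hard-coded map are assumptions made for illustration; in a real system the extension lists would presumably be loaded from the configuration file mentioned above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical lookup from file extension to the reader category that should handle it. */
final class FileTypeRegistry {

    enum Category { OFFICE, TEXT, PDF, UNSUPPORTED }

    private static final Map<String, Category> BY_EXTENSION = new HashMap<>();

    static {
        // Illustrative subset of the lists above; not the full configuration.
        register(Category.OFFICE, "doc", "docx", "xls", "xlsx", "ppt", "pptx");
        register(Category.TEXT, "txt", "md", "xml", "json", "java", "py", "sql", "sh");
        register(Category.PDF, "pdf");
    }

    private static void register(Category category, String... extensions) {
        for (String ext : extensions) {
            BY_EXTENSION.put(ext, category);
        }
    }

    /** Returns the category for a file name; unknown types (images, archives, media) are UNSUPPORTED. */
    static Category categoryOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Category.UNSUPPORTED; // no extension at all
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, Category.UNSUPPORTED);
    }
}
```

Treating everything unknown as unsupported keeps images, archives, and media out of the indexing path without listing them one by one.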
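File Encoding
Text files may come in several encodings, such as UTF-8, UTF-16, GB2312, GBK, and GB18030. The content has to be read with the correct encoding, otherwise it comes out garbled.
Files fall into two groups: those with a BOM, where the first few bytes of the file indicate the encoding, and those without a BOM, where the encoding has to be inferred from the byte content.
Taking txt files produced by Windows Notepad as a typical example, the following code reads them correctly (verified):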
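import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

/**
 * Guesses a file's encoding from its leading byte order mark (BOM).
 *
 * @param file the file to inspect
 * @return the charset name; "GBK" when no known BOM is found
 * @throws IOException if the file cannot be read
 */
public static String codeString(File file) throws IOException {
    // try-with-resources closes the stream even if read() throws
    try (BufferedInputStream bin = new BufferedInputStream(new FileInputStream(file))) {
        // Combine the first two bytes into one value for the switch
        int p = (bin.read() << 8) + bin.read();
        switch (p) {
            case 0xefbb: // EF BB (BF): UTF-8 BOM
                return "UTF-8";
            case 0xfffe: // FF FE: UTF-16 little-endian BOM (Notepad's "Unicode")
                return "UTF-16LE";
            case 0xfeff: // FE FF: UTF-16 big-endian BOM
                return "UTF-16BE";
            default:     // no BOM recognized: assume GBK
                return "GBK";
        }
    }
}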
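In principle, this kind of check can do nothing for files without a BOM.
It looks like a small matter, but writing it yourself turns out to be a considerable amount of work...
A search online turned up the following third-party libraries:
cpdetector: based on statistical principles, not guaranteed to be fully correct, and its API is rather cumbersome to use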
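import info.monitorenter.cpdetector.io.ASCIIDetector;
import info.monitorenter.cpdetector.io.CodepageDetectorProxy;
import info.monitorenter.cpdetector.io.JChardetFacade;
import info.monitorenter.cpdetector.io.ParsingDetector;
import info.monitorenter.cpdetector.io.UnicodeDetector;
import java.io.File;
import java.nio.charset.Charset;

public static String getFileEncode(String filePath) {
    try {
        File file = new File(filePath);
        // The proxy delegates to the detectors registered below
        CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();
        detector.add(new ParsingDetector(false));
        detector.add(JChardetFacade.getInstance());
        detector.add(ASCIIDetector.getInstance());
        detector.add(UnicodeDetector.getInstance());
        Charset charset = detector.detectCodepage(file.toURI().toURL());
        // Fall back to UTF-8 when nothing was detected
        return charset != null ? charset.name() : "UTF-8";
    } catch (Exception ex) {
        ex.printStackTrace();
        return null;
    }
}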
icu4j: from IBM. The latest version I found was updated in December 2020, so it should be reasonably reliable
<dependency>
    <groupId>com.ibm.icu</groupId>
    <artifactId>icu4j</artifactId>
    <version>59.2</version>
</dependency>
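import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public static String getFileEncoding(File file) {
    // Default to utf-8
    String encoding = "utf-8";
    try {
        byte[] data = Files.readAllBytes(file.toPath());
        CharsetDetector detector = new CharsetDetector();
        detector.setText(data);
        // detect() returns the best match, or null if nothing matched
        CharsetMatch match = detector.detect();
        if (match != null) {
            encoding = match.getName();
        }
    } catch (IOException exception) {
        // Reading the file failed; the exception is not handled and the default is kept
    }
    return encoding;
}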
I tried it with Notepad, saving a file in several encodings, and the library's detection turned out better than the simple BOM-based check above, so this library was the final choice.
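Once a detector has reported a charset name, the bytes still have to be decoded with it. The sketch below shows one way this step might look using only the JDK. `TextDecoder` and its method names are hypothetical, and the UTF-8 fallback is an assumption chosen to match the default used in the icu4j snippet; a detector may report a name that the running JVM does not support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that turns raw bytes plus a detected charset name into text. */
final class TextDecoder {

    /** Resolves the detected name, falling back to UTF-8 when the JVM does not know it. */
    static Charset resolve(String detectedName) {
        try {
            return Charset.forName(detectedName);
        } catch (IllegalArgumentException e) {
            // Covers illegal names, unsupported charsets, and a null name.
            return StandardCharsets.UTF_8;
        }
    }

    /** Decodes the bytes and drops a leading BOM character if the decoder left one in. */
    static String decode(byte[] data, String detectedName) {
        String text = new String(data, resolve(detectedName));
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            text = text.substring(1);
        }
        return text;
    }
}
```

Stripping the BOM here keeps it out of the indexed content, where it would otherwise appear as an invisible first character of the text.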
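Content Parsing
Text content can be read directly once the encoding is known. Content parsing is mainly needed for binary formats: PDF and the Office family of documents.
Apache POI is used to read Word, Excel, and PowerPoint files. Note that the 2003-and-earlier formats and the 2007-and-later formats have to be handled separately.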
<!-- Read Word/Excel/PowerPoint file content -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>3.17</version>
</dependency>
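The split between the 2003-and-earlier and 2007-and-later formats can be made from the file's first bytes rather than from its extension, which helps with misnamed files. Legacy Office files are OLE2 compound documents, while the newer OOXML files are ZIP containers. The sketch below uses only the JDK; `OfficeFormatSniffer` is a hypothetical name, and a ZIP signature only shows that the file is a ZIP container, not that it is a valid Office document.

```java
/** Hypothetical helper that tells legacy (OLE2) Office files from ZIP-based (OOXML) ones. */
final class OfficeFormatSniffer {

    enum Format { LEGACY_OLE2, ZIP_CONTAINER, UNKNOWN }

    // Signature of an OLE2 compound document (.doc, .xls, .ppt)
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Local file header signature of a ZIP archive (.docx, .xlsx, .pptx are ZIP files)
    private static final byte[] ZIP_MAGIC = {'P', 'K', 0x03, 0x04};

    /** Classifies a file from its leading bytes; pass at least the first 8 bytes. */
    static Format sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Format.LEGACY_OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Format.ZIP_CONTAINER;
        }
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With a result in hand, a reader could route legacy files to POI's HWPF/HSSF/HSLF classes and ZIP-based files to XWPF/XSSF/XSLF.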
Apache PDFBox is used to read PDF content
<!-- Read PDF file content -->
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.3</version>
</dependency>
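Index Creation
Reading file content and creating an index are routine operations, and there is nothing technically difficult here. From a workflow point of view, though, the main flow is that the user uploads a file and the system stores it and writes the database records. That flow must complete, and errors must be reported to the user. Creating the index is not part of the main flow and should not take up its processing time, so it should be handled asynchronously.
For this reason, a new thread is used to read the file content and create the index.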
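A minimal sketch of keeping indexing off the upload path. It uses a small single-thread executor with `CompletableFuture` rather than a bare `new Thread`, which is a substitution made here for illustration and not necessarily what the system does. `AsyncIndexer` and the `indexTask` callback are hypothetical names; the callback stands in for "read the file content and write it to ES".

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

/** Hypothetical wrapper that runs indexing work on a background thread. */
final class AsyncIndexer {

    // One daemon worker thread: indexing never blocks the caller or JVM shutdown.
    private final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "doc-indexer");
        t.setDaemon(true);
        return t;
    });

    private final Consumer<String> indexTask;

    AsyncIndexer(Consumer<String> indexTask) {
        this.indexTask = indexTask;
    }

    /** Queues indexing for one document and returns immediately. */
    CompletableFuture<Void> submit(String documentId) {
        return CompletableFuture
            .runAsync(() -> indexTask.accept(documentId), executor)
            .exceptionally(e -> {
                // An indexing failure is reported but must not fail the upload itself.
                System.err.println("Indexing failed for " + documentId + ": " + e);
                return null;
            });
    }

    void shutdown() {
        executor.shutdown();
    }
}
```

The upload handler would call `submit(...)` after the file is stored and the database rows are written, and would return to the user without waiting on the result.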
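Permission Control
Elasticsearch, the full-text search component, does not support searching with data permissions applied, so the problem can only be tackled from the query results. The approach is as follows:
First fetch one page of data (for example, with a page size of 10), then check record by record whether the current user has preview permission; if so, add the record to the result list.
Check whether the result list already holds 10 records. If not, continue with the next page, and stop once the number of records read exceeds the total number of results.
By default the front end shows a single page of data and does not show the hit count; the next page is loaded by clicking "load more".
It also has to be determined whether any data remains after the current page, so that the front end can decide whether to show the "load more" prompt.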
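There is one more thorny issue here: stitching together permission-filtered data. The first query is fine, but where a follow-up query should start is a problem, because it means recording which page the previous query reached. If every query returns a fixed number of records (say 10), permission filtering makes it easy to lose some data. For example, suppose the page size is 10 and the query matches 28 records in total. The first 10 are fetched and 8 remain after permission filtering, so the second page is read. Suppose records 11-20 leave 5 records after filtering; 2 of them are taken to fill the result, which is then returned. When the user clicks "load more", the back end faces a problem: where should it start reading so that nothing is repeated and nothing is missed? The example above has clearly produced "half a page" of data.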
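Looking further into the ES API, it turns out that ES provides scroll for this kind of search-engine-style scenario: the query result automatically comes with a scrollId, and passing that parameter to the next query continues the search from that record onward.
However, if the same number of records is loaded every time, the half-page problem remains. The approach was therefore adjusted: when there are enough query results, each load returns at least 10 records and at most 20 (the exact number depends on permission filtering). For business users this makes little practical difference.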
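The adjusted approach can be sketched as a loop that always consumes whole raw pages, so the next request can resume on a page boundary and no half page is left over. This is a simplified model under stated assumptions, not the actual implementation: `PermissionFilteredLoader`, `Batch`, and the `pageFetcher` callback are hypothetical, and a plain page index stands in for the ES scrollId.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;
import java.util.function.Predicate;

/** Hypothetical loader that filters raw search pages by permission without splitting a page. */
final class PermissionFilteredLoader<T> {

    /** One "load more" response: permitted hits, where to resume, and whether more may exist. */
    static final class Batch<E> {
        final List<E> items;
        final int nextPage;    // raw page to request on the next call (stand-in for scrollId)
        final boolean hasMore; // false only when the raw results are known to be exhausted

        Batch(List<E> items, int nextPage, boolean hasMore) {
            this.items = items;
            this.nextPage = nextPage;
            this.hasMore = hasMore;
        }
    }

    private final IntFunction<List<T>> pageFetcher; // returns raw page n (0-based), empty when exhausted
    private final Predicate<T> canPreview;          // the per-record permission check
    private final int pageSize;

    PermissionFilteredLoader(IntFunction<List<T>> pageFetcher, Predicate<T> canPreview, int pageSize) {
        this.pageFetcher = pageFetcher;
        this.canPreview = canPreview;
        this.pageSize = pageSize;
    }

    /** Reads whole raw pages from startPage until at least pageSize permitted hits are collected. */
    Batch<T> load(int startPage) {
        List<T> permitted = new ArrayList<>();
        int page = startPage;
        boolean exhausted = false;
        while (permitted.size() < pageSize && !exhausted) {
            List<T> raw = pageFetcher.apply(page);
            if (raw.isEmpty()) {
                exhausted = true;
                break;
            }
            page++;
            for (T hit : raw) {
                if (canPreview.test(hit)) {
                    permitted.add(hit); // keep every permitted hit of the page, never a partial page
                }
            }
            if (raw.size() < pageSize) {
                exhausted = true; // a short page is the last one
            }
        }
        return new Batch<>(permitted, page, !exhausted);
    }
}
```

In the 28-record example above, the first call reads raw pages 1 and 2 and returns all 13 permitted records, and the next call starts cleanly at raw page 3. In this sketch a call returns fewer than two pages' worth of records.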
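Resolved Issues
- At what stage should the index and mapping be created? At system startup, by checking whether the index exists and creating it if it does not?
In practice there is no need to create the index yourself. Once the entity class is annotated, the index and its mapping are created automatically
- Should folders and documents each have their own index, or share one index and be merged at query time?
One each. ES 7.x has deprecated types; each index has only one default type, _doc, and holds the mapping of a single data structure
The requirement was adjusted to cover full-text search of documents only, so this issue no longer arises
- Should the annotations be added to the existing entity objects, or should separate ones be defined?
So far there has been no need to define separate ones, so the existing entity objects are reused for now
- The IK analyzer configured through annotations has no effect?
This is an issue in the Spring Data Elasticsearch component itself: it simply does not read the attributes of the ES Field annotation. The workaround is to create the index manually through ElasticsearchRestTemplate after the system has finished starting.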