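Tika: a content analysis toolkit
Official website: https://tika.apache.org/
Get the latest version of the dependency from the Maven repository: https://mvnrepository.com/artifact/org.apache.tika/tika-parsers
If you would rather not look it up, here is the dependency I am currently using: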
<!-- https://mvnrepository.com/artifact/org.apache.tika/tika-parsers -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.18</version>
</dependency>
Extracting text from a URL
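import java.io.IOException;
import java.net.URL;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TikaDemo {
    public static void main(String[] args) throws IOException, TikaException {
        // The Tika facade detects the content type and picks a suitable parser
        Tika tika = new Tika();
        String s = tika.parseToString(new URL("https://www.baidu.com"));
        System.out.println(s);
    }
}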
Output: the plain text extracted from the page is printed to the console.
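The same facade also accepts an InputStream, which is handy for trying things out without network access. The following is a minimal sketch, assuming the tika-parsers dependency above is on the classpath. The class name InMemoryHtmlDemo and the helper extractText are hypothetical names chosen for illustration, and the HTML snippet is made-up sample data.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class InMemoryHtmlDemo {
    // Hypothetical helper: extracts plain text from an HTML string held in memory
    public static String extractText(String html) throws IOException, TikaException {
        Tika tika = new Tika();
        InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        // parseToString(InputStream) closes the stream once parsing is finished
        return tika.parseToString(in);
    }

    public static void main(String[] args) throws IOException, TikaException {
        // Made-up sample page; the markup is dropped and only the body text remains
        String html = "<html><head><title>Demo</title></head>"
                + "<body><p>Hello Tika</p></body></html>";
        System.out.println(extractText(html));
    }
}
```

Parsing from memory like this makes it easy to check what Tika keeps from a page before pointing it at a live URL.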
Extracting text from a Word document
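import java.io.File;
import java.io.IOException;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TikaDemo {
    public static void main(String[] args) throws IOException, TikaException {
        Tika tika = new Tika();
        // Path to the Word document, relative to the working directory
        File file = new File("文檔.docx");
        String s = tika.parseToString(file);
        System.out.println(s);
    }
}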
Output: the text of the Word document is printed to the console.
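The code for each file type is identical because Tika works out the format by itself. To see which type it would assume, the facade offers detect. The sketch below uses the detect(String) overload, which looks only at the file name, so no file needs to exist. DetectDemo and detectByName are hypothetical names for illustration.

```java
import org.apache.tika.Tika;

public class DetectDemo {
    // Hypothetical helper: guesses the media type from the file name alone
    public static String detectByName(String fileName) {
        return new Tika().detect(fileName);
    }

    public static void main(String[] args) {
        // The file names from this article; only their extensions matter here
        String[] names = {"文檔.docx", "工作簿.xlsx", "pdf檔案.pdf"};
        for (String name : names) {
            System.out.println(name + " -> " + detectByName(name));
        }
    }
}
```

Name-based detection is only a guess from the extension. The detect(File) and detect(InputStream) overloads also inspect the leading bytes of the content, which is more reliable when a file has been renamed.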
Extracting text from an Excel workbook
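import java.io.File;
import java.io.IOException;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TikaDemo {
    public static void main(String[] args) throws IOException, TikaException {
        Tika tika = new Tika();
        // Path to the Excel workbook, relative to the working directory
        File file = new File("工作簿.xlsx");
        String s = tika.parseToString(file);
        System.out.println(s);
    }
}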
Output: the cell contents of the workbook are printed to the console as text.
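parseToString returns only the text. If the document metadata is also of interest, the lower-level parser API can be used instead of the facade. The following is a rough sketch, assuming tika-parsers is on the classpath. TikaMetadataDemo and its parse helper are hypothetical names, and the in-memory sample text stands in for a real workbook when no path is passed on the command line.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class TikaMetadataDemo {
    // Hypothetical helper: appends the body text to textOut and returns the metadata
    public static Metadata parse(InputStream in, StringBuilder textOut)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        // -1 disables the handler's write limit
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        parser.parse(in, handler, metadata, new ParseContext());
        textOut.append(handler.toString());
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        // Parse the file given as the first argument, or a small made-up sample
        try (InputStream in = args.length > 0
                ? Files.newInputStream(Paths.get(args[0]))
                : new ByteArrayInputStream(
                        "name,qty\nwidget,3\n".getBytes(StandardCharsets.UTF_8))) {
            StringBuilder text = new StringBuilder();
            Metadata metadata = parse(in, text);
            for (String name : metadata.names()) {
                System.out.println(name + " = " + metadata.get(name));
            }
            System.out.println(text);
        }
    }
}
```

Which metadata keys appear depends on the file format and on the Tika version, so the sketch prints whatever names the parser reports instead of assuming specific ones.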
Extracting text from a PDF file
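import java.io.File;
import java.io.IOException;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TikaDemo {
    public static void main(String[] args) throws IOException, TikaException {
        Tika tika = new Tika();
        // Path to the PDF file, relative to the working directory
        File file = new File("pdf檔案.pdf");
        String s = tika.parseToString(file);
        System.out.println(s);
    }
}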
Output: the text of the PDF file is printed to the console.
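One thing to watch with long documents such as large PDFs: as far as I know, parseToString caps the returned string (100,000 characters by default), so the tail of a long file can be cut off silently. The limit can be changed with setMaxStringLength, where -1 switches it off. The sketch below shows the effect on a short in-memory text. MaxLengthDemo and extract are hypothetical names, and the sample data is made up.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class MaxLengthDemo {
    // Hypothetical helper: extracts text with an explicit length limit (-1 = no limit)
    public static String extract(byte[] data, int maxLength)
            throws IOException, TikaException {
        Tika tika = new Tika();
        tika.setMaxStringLength(maxLength);
        return tika.parseToString(new ByteArrayInputStream(data));
    }

    public static void main(String[] args) throws IOException, TikaException {
        // Build a made-up sample text of a few hundred characters
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20; i++) {
            sb.append("line ").append(i).append(" of the sample text\n");
        }
        byte[] data = sb.toString().getBytes(StandardCharsets.UTF_8);

        // Compare a tight limit with no limit at all
        System.out.println("limited:   " + extract(data, 50).length() + " chars");
        System.out.println("unlimited: " + extract(data, -1).length() + " chars");
    }
}
```

Removing the limit means the whole text is held in memory at once, so for very large files it may be safer to pick a generous fixed limit.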