Tika is a toolkit for parsing documents: it detects the document type on its own and then uses the appropriate parser (jar) to handle that format.
Today I ran into a requirement: parse the content of files on the network into data.
Step one is a crawler: fetch the page, process it, and extract the file URLs.
The next step would be to download each file locally and parse it following the Tika demo.
But downloading to disk not only takes up storage, it also costs disk I/O.
Since all we need is the data, it is enough to process everything in memory.
So take the stream straight from the HTTP response:

InputStream in = response.getEntity().getContent();
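For context, a minimal sketch of how that response might be obtained, assuming Apache HttpClient 4.x (which the getEntity() call suggests); fileUrl is a placeholder for a URL produced by the crawler step:

import java.io.InputStream;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

CloseableHttpClient client = HttpClients.createDefault();
// fileUrl is a placeholder taken from the crawler step
CloseableHttpResponse response = client.execute(new HttpGet(fileUrl));
InputStream in = response.getEntity().getContent();   // file bytes, never written to disk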
The idea is to have Tika extract both the file's properties (metadata) and its body text, but Tika does this in two passes, and either pass consumes the InputStream, so the same stream cannot simply be handed to both.
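The original Tika calls are not shown here, but the failure mode looks roughly like this; a sketch assuming AutoDetectParser is used for both passes:

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

AutoDetectParser parser = new AutoDetectParser();
Metadata metadata = new Metadata();

// Pass 1: parse once just to collect the metadata (file properties)
parser.parse(in, new BodyContentHandler(-1), metadata, new ParseContext());

// Pass 2: parse again for the body text -- this yields nothing useful,
// because pass 1 has already consumed the InputStream
BodyContentHandler handler = new BodyContentHandler(-1);
parser.parse(in, handler, new Metadata(), new ParseContext());
String bodyText = handler.toString();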
Solution one: copy the InputStream
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

// Copy the whole response stream into memory
// (plain byte copy; NIO can do better if throughput matters;
// exception handling omitted for brevity)
ByteArrayOutputStream baos = new ByteArrayOutputStream();
byte[] buffer = new byte[1024];
int len;
while ((len = in.read(buffer)) > -1) {
    baos.write(buffer, 0, len);
}
baos.flush();

// Open new InputStreams over the recorded bytes;
// this can be repeated as many times as needed
InputStream is1 = new ByteArrayInputStream(baos.toByteArray());
InputStream is2 = new ByteArrayInputStream(baos.toByteArray());
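With two independent copies, each Tika pass can be given its own fresh stream; continuing the AutoDetectParser sketch from above:

AutoDetectParser parser = new AutoDetectParser();
Metadata metadata = new Metadata();

// Pass 1: metadata from the first copy
parser.parse(is1, new BodyContentHandler(-1), metadata, new ParseContext());

// Pass 2: body text from the second, untouched copy
BodyContentHandler handler = new BodyContentHandler(-1);
parser.parse(is2, handler, new Metadata(), new ParseContext());
String bodyText = handler.toString();

The trade-off is that the whole file is buffered in memory (once in baos, plus one more copy per toByteArray() call), which is fine for modest documents but worth keeping in mind for very large ones.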
Solution two: wrap the InputStream so it cannot be closed
import java.io.InputStream;
import org.apache.commons.io.input.CloseShieldInputStream;  // Apache Commons IO

InputStream is = getStream();                    // obtain the stream
CloseShieldInputStream csis = new CloseShieldInputStream(is);

// call the bad function that does things it shouldn't (e.g. close the stream)
badFunction(csis);

// happiness follows: the original stream is still open and usable
is.read();
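One caveat: CloseShieldInputStream only swallows the close() call made by the callee; it does not rewind bytes that have already been read. So if the first Tika pass reads the stream to the end, the byte-array copy from solution one (or a mark()/reset()-capable stream) is still needed to get a second pass.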