To make it easy to add documents to a Solr index, Solr ships with a small tool, post.jar: you run it from the command line with a few parameters and it performs the add/update/delete operations for you. To be clear, it is merely a convenience tool for testing Solr. Its usage help reads as follows:

SimplePostTool version 5.1.0
Usage: java [SystemProperties] -jar post.jar [-h|-] [<file|folder|url|arg> [<file|folder|url|arg>...]]
Supported System Properties and their defaults:
  -Dc=<core/collection>
  -Durl=<base Solr update URL> (overrides -Dc option if specified)
  -Ddata=files|web|args|stdin (default=files)
  -Dtype=<content-type> (default=application/xml)
  -Dhost=<host> (default: localhost)
  -Dport=<port> (default: 8983)
  -Dauto=yes|no (default=no)
  -Drecursive=yes|no|<depth> (default=0)
  -Ddelay=<seconds> (default=0 for files, 10 for web)
  -Dfiletypes=<type>[,<type>,...] (default=xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log)
  -Dparams="<key>=<value>[&<key>=<value>...]" (values must be URL-encoded)
  -Dcommit=yes|no (default=yes)
  -Doptimize=yes|no (default=no)
  -Dout=yes|no (default=no)
This is a simple command line tool for POSTing raw data to a Solr port.
NOTE: Specifying the url/core/collection name is mandatory.
Data can be read from files specified as commandline args,
URLs specified as args, as raw commandline arg strings or via stdin.
Examples:
  java -Dc=gettingstarted -jar post.jar *.xml
  java -Ddata=args -Dc=gettingstarted -jar post.jar '<delete><id>42</id></delete>'
  java -Ddata=stdin -Dc=gettingstarted -jar post.jar < hd.xml
  java -Ddata=web -Dc=gettingstarted -jar post.jar http://example.com/
  java -Dtype=text/csv -Dc=gettingstarted -jar post.jar *.csv
  java -Dtype=application/json -Dc=gettingstarted -jar post.jar *.json
  java -Durl=http://localhost:8983/solr/techproducts/update/extract -Dparams=literal.id=pdf1 -jar post.jar solr-word.pdf
  java -Dauto -Dc=gettingstarted -jar post.jar *
  java -Dauto -Dc=gettingstarted -Drecursive -jar post.jar afolder
  java -Dauto -Dc=gettingstarted -Dfiletypes=ppt,html -jar post.jar afolder
The options controlled by System Properties include the Solr
URL to POST to, the Content-Type of the data, whether a commit
or optimize should be executed, and whether the response should
be written to stdout. If auto=yes the tool will try to set type
automatically from file name. When posting rich documents the
file name will be propagated as "resource.name" and also used
as "literal.id". You may override these or any other request parameter
through the -Dparams property. To do a commit only, use "-" as argument.
The web mode is a simple crawler following links within domain, default delay=10s.
The key part is this line:

java [SystemProperties] -jar post.jar [-h|-] [<file|folder|url|arg> [<file|folder|url|arg>...]]
To read this usage line, you need to know that parameters in square brackets are optional (they may be given or omitted), that | means "or", and that SystemProperties are JVM system properties, i.e. the kind of values you set with System.setProperty(), for example:

System.setProperty(key, value);
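The -D flags on the java command line set exactly these system properties. A minimal sketch of the equivalence (the property name "c" and value "core-test" mirror the example used later in this post):

```java
public class SysPropDemo {
    public static void main(String[] args) {
        // Passing -Dc=core-test on the java command line is equivalent to calling:
        System.setProperty("c", "core-test");
        // post.jar later reads the property back the same way:
        String core = System.getProperty("c");
        System.out.println("core = " + core);
    }
}
```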
Supported System Properties and their defaults:
Below that line the tool lists the system properties it supports. Let me go through them one by one:

-Dc=<core/collection>
-Durl=<base Solr update URL> (overrides -Dc option if specified)
-Ddata=files|web|args|stdin (default=files)
-Dtype=<content-type> (default=application/xml)
-Dhost=<host> (default: localhost)
-Dport=<port> (default: 8983)
-Dauto=yes|no (default=no)
-Drecursive=yes|no|<depth> (default=0)
-Ddelay=<seconds> (default=0 for files, 10 for web)
-Dfiletypes=<type>[,<type>,...] (default=xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log)
-Dparams="<key>=<value>[&<key>=<value>...]" (values must be URL-encoded)
-Dcommit=yes|no (default=yes)
-Doptimize=yes|no (default=no)
-Dout=yes|no (default=no)
-D is the fixed prefix for specifying a system property on the command line.
c is the core name, i.e. which core in Solr you want to add/update/delete index data in.
url is the index-update request URL of the Solr server. It follows a fixed format, generally http://host:port/solr/${corename}/update, where ${corename} matches the value of the c property above.
data selects where the data you post comes from. files mode means the data lives in files.
web means the data lives in a resource on the internet identified by a URL.
args means you type the data directly on the command line after the post.jar command.
stdin means the data is read from standard input (System.in) at the prompt. It is similar to args,
but in stdin mode you pass nothing after post.jar: just press Enter, and the program waits for your input;
once you finish typing and press Enter again, post.jar receives the input and resumes execution. With args the data
is given inline with no interruption, while in stdin mode post.jar blocks until you actually provide input.
type is the MIME type of the data you post; the default is application/xml, i.e. the data is treated as XML by default.
host is the hostname or IP address of the server Solr is deployed on; the default is localhost.
port is the port that the web container running Solr listens on; post.jar defaults it to 8983.
The actual value depends on your deployment environment.
auto controls whether the file type is guessed automatically.
recursive controls recursion, which covers two cases: when posting a folder it means whether to recurse into
subfolders when collecting files; in web mode it means whether to crawl URLs recursively. no disables recursion,
and a number sets the recursion depth.
delay likewise covers two cases. When posting files, the interval between posts is 0, i.e. no delay.
When posting a web resource, you are subject to the remote server's access limits, so you must throttle the
crawl rate, sleeping after each fetch; crawling too fast and too often can easily get your IP blocked.
filetypes lists the file types post.jar is willing to post; the defaults are listed above. If you want to override
the defaults, set this property.
params are extra request parameters appended to the Solr request URL, e.g. id=1&name=yida.
commit controls whether a commit is issued so the documents are written into the index. Setting it to no means no commit is sent to Solr; but even yes does not necessarily
mean the index reaches disk: that depends on the Directory implementation configured in solrconfig.xml. With a RAMDirectory, everything happens in memory only.
optimize controls whether the index is optimized after posting; the default is no, i.e. no optimization.
out refers to the output stream: when you ask Solr to add index data, Solr sends a response back, and this parameter decides where that response goes. When enabled it goes to System.out, i.e. the response is printed to the console.
With the explanations above, the official post.jar command examples should now look easy, right?
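One detail worth checking is the -Dparams requirement that values be URL-encoded. A minimal sketch (the id/name parameters echo the id=1&name=yida example above; buildParams is a made-up helper, not part of post.jar):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class ParamsDemo {
    // URL-encode one value (spaces become '+', '&' becomes %26, etc.)
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always available
        }
    }

    // Build a -Dparams value; each value must be URL-encoded
    static String buildParams(String id, String name) {
        return "id=" + enc(id) + "&name=" + enc(name);
    }

    public static void main(String[] args) {
        System.out.println(buildParams("1", "yida"));
        System.out.println(buildParams("doc 1", "a&b"));
    }
}
```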

OK, now that we know how post.jar works, shouldn't we try it out? Before you can add index data to Solr you need a core, and you can create one through the Solr Admin web UI, as shown:
instanceDir is the core's root directory. solr-home is your SOLR_HOME; you can create multiple core directories under it. dataDir is the core's data directory: the core's index data is stored under dataDir in data\index. You must create all of these folders by hand (except the index folder under data, which Solr creates automatically), as shown:
The SOLR_HOME directory needs a solr.xml; you can copy this config file from the Solr zip distribution, as shown:
Find solr.xml as shown and copy it into your solr-home root directory. Your core directory then needs a conf directory holding the core's Solr configuration; these config files can be found in the Solr examples, as shown:
solrconfig.xml is a mandatory config file for every core and applies only to that core. schema.xml defines each field of the index: the field name, the field type, whether the field is indexed, stored, or tokenized, whether term vectors are stored, which analyzer to use, where the synonyms dictionary file is, where the stopwords dictionary file is, and so on. All of this is defined in schema.xml. If you have some Lucene background, writing schema.xml is painless: in Lucene you used to define these field properties directly through the Lucene API; now you express the same thing in XML. Note there is also a protwords.txt dictionary file, which has no counterpart in plain Lucene. Here is an explanation of the protwords.txt dictionary file:

Protwords are the words which you do not want to be stemmed (in stemming
case manager/managing/managed/manageable all are indexed as ---> manag). Same
thing goes in case of searching. In case you do not want a particular word
to be stemmed at index/search time, just put it in protwords.txt of Solr.
The gist: protwords are words that you do not want stemmed. For example, under stemming, manager/managing/managed/manageable
are all indexed as manag; if you don't want a particular word reduced to its stem, put it in the protwords.txt dictionary file and it will be left untouched at index/search time.
prot is short for protected, i.e. these words are protected from stemming; stemming really only applies to languages such as English.
With that, your core directory structure is in place. If you don't follow this layout, core creation will fail; for example you may run into an exception like this:
Once the core is created successfully, the Solr Admin UI shows a screen like this:
Of course, you can also create the core by entering a URL directly in the browser:
http://localhost:8080/solr/admin/cores?action=CREATE&name=core2&instanceDir=/opt/solr/core2&config=solrconfig.xml&schema=schema.xml&dataDir=data
name is your core's name.
instanceDir is the core's root directory, e.g. /opt/solr/core2 on Linux or C:/solr/core2 on Windows.
config and schema are the names of the core's two key configuration files; as long as your core directory follows the standard layout, Solr will look for them under the conf directory by the names you give. dataDir is the core's data directory, used mainly to store the core's index data.
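The default update URL the tool derives from host, port, and core name can be sketched like this (it mirrors the http://host:port/solr/${corename}/update format described earlier; the host/port/core values below are placeholders for your own deployment):

```java
public class CoreUrlDemo {
    // Mirrors the default update-URL format post.jar builds from host/port/core
    static String updateUrl(String host, String port, String core) {
        return String.format("http://%s:%s/solr/%s/update", host, port, core);
    }

    public static void main(String[] args) {
        // Hypothetical values; substitute your own host, port, and core name
        System.out.println(updateUrl("localhost", "8983", "core2"));
    }
}
```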
With the core created, you can run post.jar from the command line to add index data to Solr. First, cd into the directory containing post.jar, as shown:
Before running the post.jar command we need a test XML file; here I use one of the XML files shipped in Solr's examples directory, as shown:
Then refresh the Solr Admin web UI and check whether the document count of core-c has changed, as shown:
Note, however, that not just any XML file can be indexed: the posted XML must follow a fixed format. Open the XML file we just posted, as shown:
<add> means add documents. Each <doc></doc> pair corresponds to one Lucene Document. field is a field; name is, unsurprisingly, the field name, and the text between the field tags is the field value. There is only one <add> tag, and it may contain multiple <doc> tags; multiple <doc>s mean adding multiple Documents in one batch.
The <add> tag also takes two optional attributes:
commitWithin: the document must be committed within the given number of milliseconds. (The other is overwrite, which controls whether a document with the same uniqueKey value replaces the existing one; it defaults to true.)
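For example, a sketch of an add message asking Solr to commit within five seconds (the field names here are made up for illustration):

```xml
<add commitWithin="5000" overwrite="true">
  <doc>
    <field name="id">1</field>
    <field name="name">yida</field>
  </doc>
</add>
```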
You can also set an index-time boost on a document or a field, for example:

<add>
  <doc boost="2.5">
    <field name="employeeId">05991</field>
    <field name="office" boost="2.0">Bridgewater</field>
  </doc>
</add>
How do you add a multi-valued field? Repeat the <field> element once per value:

<doc>
  <field name="skills" update="set">Python</field>
  <field name="skills" update="set">Java</field>
  <field name="skills" update="set">Jython</field>
</doc>
How do you set a field's value to null?

<field name="skills" update="set" null="true" />
你還可以在<add>标簽下添加

<commit/>
<optimize/>
類似于你在lucene裡顯式的調用writer.commit();writer.optimize();
How do you delete a document by id? (Note that the id here refers to the field designated as the uniqueKey, which is defined in schema.xml; don't confuse it with Lucene's internal document id.)

<delete><id>05991</id></delete>

How do you delete documents matching a query?

<delete><query>office:Bridgewater</query></delete>

office is the field name and Bridgewater the field value. By default this builds a TermQuery; the value may contain wildcards, may be a regular expression, or may be any QueryParser expression, you get the idea.
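Under the hood the tool turns commits and optimizes into plain GET requests by appending key=value parameters (commit=true, optimize=true) to the update URL, choosing '?' or '&' as appropriate. A minimal sketch of that appending logic, modeled loosely on the tool's appendParam (not the exact implementation):

```java
public class AppendParamDemo {
    // Append key=value pairs from "k1=v1&k2=v2" to a URL,
    // using '?' for the first parameter and '&' afterwards
    static String appendParam(String url, String param) {
        for (String p : param.split("&")) {
            String[] kv = p.split("=");
            if (kv.length == 2) {
                url = url + (url.indexOf('?') > 0 ? "&" : "?") + kv[0] + "=" + kv[1];
            }
        }
        return url;
    }

    public static void main(String[] args) {
        String base = "http://localhost:8983/solr/core-test/update";
        System.out.println(appendParam(base, "commit=true"));
    }
}
```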
Everything above is done on the command line. If the command line feels painful, you can also drive it from Eclipse. Decompiling post.jar shows that it contains just a single class, SimplePostTool. I spent some time reading the SimplePostTool source and added comments at the key points; the source follows:

package com.yida.framework.solr5.test;
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileFilter;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.ProtocolException;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TimeZone;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;
import java.util.zip.GZIPInputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;
import javax.xml.bind.DatatypeConverter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
/**
 * A small test tool for posting index data to Solr
 * @author Lanxiaowei
 */
@SuppressWarnings("unused")
public class SimplePostTool {
    /** Hostname or IP address of the server Solr is deployed on, default localhost */
    private static final String DEFAULT_POST_HOST = "localhost";
    /** Port the container running Solr listens on, default 8983 */
    private static final String DEFAULT_POST_PORT = "8983";
    /** Version of this tool */
    private static final String VERSION_OF_THIS_TOOL = "5.1.0";
    /** Whether to commit the index */
    private static final String DEFAULT_COMMIT = "yes";
    /** Whether to optimize the index */
    private static final String DEFAULT_OPTIMIZE = "no";
    /** Whether to use System.out (the console) as the output stream */
    private static final String DEFAULT_OUT = "no";
    /** Whether to guess the file MIME type automatically, judged by file extension */
    private static final String DEFAULT_AUTO = "no";
    /** Whether to recurse: 0 means no recursion, 1 means recurse */
    private static final String DEFAULT_RECURSIVE = "0";
    /** Crawl delay, i.e. how long to sleep after each URL fetched, in seconds */
    private static final int DEFAULT_WEB_DELAY = 10;
    /** Default post delay, i.e. how long to sleep after each file posted, in milliseconds */
    private static final int DEFAULT_POST_DELAY = 10;
    /** Maximum depth: crawl depth for URLs, directory depth for folders; the current depth starts at 0 */
    private static final int MAX_WEB_DEPTH = 10;
    /** Default content MIME type */
    private static final String DEFAULT_CONTENT_TYPE = "application/xml";
    /** File types supported for posting by default */
    private static final String DEFAULT_FILE_TYPES = "xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log";
    /** File posting mode */
    static final String DATA_MODE_FILES = "files";
    /** Args mode: the data is passed as command-line arguments */
    static final String DATA_MODE_ARGS = "args";
    /** Stdin mode: the program blocks and waits for user input as the data to post */
    static final String DATA_MODE_STDIN = "stdin";
    /** Web crawl mode: the user supplies a page URL; the tool fetches the page content and posts it */
    static final String DATA_MODE_WEB = "web";
    /** Default posting mode: files */
    static final String DEFAULT_DATA_MODE = "files";
    boolean auto = false;
    int recursive = 0;
    int delay = 0;
    String fileTypes;
    URL solrUrl;
    OutputStream out = null;
    String type;
    String mode;
    boolean commit;
    boolean optimize;
    String[] args;
    private int currentDepth;
    static HashMap<String, String> mimeMap;
    GlobFileFilter globFileFilter;
    // URL set per depth; the list index is the crawl depth
    List<LinkedHashSet<URL>> backlog = new ArrayList<LinkedHashSet<URL>>();
    // URLs already crawled
    Set<URL> visited = new HashSet<URL>();
    static final Set<String> DATA_MODES = new HashSet<String>();
    static final String USAGE_STRING_SHORT = "Usage: java [SystemProperties] -jar post.jar [-h|-] [<file|folder|url|arg> [<file|folder|url|arg>...]]";
    static boolean mockMode = false;
    static PageFetcher pageFetcher;
    public static void main(String[] args) {
        String coreName = "core-test";
        System.setProperty("c", coreName);
        info("SimplePostTool version 5.1.0");
        if ((0 < args.length)
                && (("-help".equals(args[0])) || ("--help".equals(args[0])) || ("-h"
                        .equals(args[0])))) {
            // Print the post.jar usage help
            usage();
        } else {
            SimplePostTool t = parseArgsAndInit(args);
            t.execute();
        }
    }
    public void execute() {
        long startTime = System.currentTimeMillis();
        if (("files".equals(this.mode)) && (this.args.length > 0)) {
            doFilesMode();
        } else if (("args".equals(this.mode)) && (this.args.length > 0)) {
            doArgsMode();
        } else if (("web".equals(this.mode)) && (this.args.length > 0)) {
            doWebMode();
        } else if ("stdin".equals(this.mode)) {
            doStdinMode();
        } else {
            usageShort();
            return;
        }
        if (this.commit)
            commit();
        if (this.optimize)
            optimize();
        long endTime = System.currentTimeMillis();
        displayTiming(endTime - startTime);
    }
    private void displayTiming(long millis) {
        SimpleDateFormat df = new SimpleDateFormat("H:mm:ss.SSS",
                Locale.getDefault());
        df.setTimeZone(TimeZone.getTimeZone("UTC"));
        System.out.println(new StringBuilder().append("Time spent: ")
                .append(df.format(new Date(millis))).toString());
    }
    protected static SimplePostTool parseArgsAndInit(String[] args) {
        String urlStr = null;
        try {
            String mode = System.getProperty("data", "files");
            if (!DATA_MODES.contains(mode)) {
                fatal(new StringBuilder()
                        .append("System Property 'data' is not valid for this tool: ")
                        .append(mode).toString());
            }
            // Extra request parameters appended to the Solr request URL
            String params = System.getProperty("params", "");
            String host = System.getProperty("host", DEFAULT_POST_HOST);
            String port = System.getProperty("port", DEFAULT_POST_PORT);
            String core = System.getProperty("c");
            urlStr = System.getProperty("url");
            if ((urlStr == null) && (core == null)) {
                fatal("Specifying either url or core/collection is mandatory.\nUsage: java [SystemProperties] -jar post.jar [-h|-] [<file|folder|url|arg> [<file|folder|url|arg>...]]");
            }
            // If no Solr request URL was given, build the default one
            if (urlStr == null) {
                urlStr = String.format(Locale.ROOT,
                        "http://%s:%s/solr/%s/update", new Object[] { host,
                                port, core });
            }
            urlStr = appendParam(urlStr, params);
            URL url = new URL(urlStr);
            boolean auto = isOn(System.getProperty("auto", DEFAULT_AUTO));
            String type = System.getProperty("type");
            int recursive = 0;
            String r = System.getProperty("recursive", DEFAULT_RECURSIVE);
            try {
                recursive = Integer.parseInt(r);
            } catch (Exception e) {
                if (isOn(r)) {
                    recursive = "web".equals(mode) ? 1 : 999;
                }
            }
            int delay = "web".equals(mode) ? DEFAULT_WEB_DELAY : 0;
            delay = Integer.parseInt(System
                    .getProperty("delay", delay + ""));
            OutputStream out = isOn(System.getProperty("out", DEFAULT_OUT)) ? System.out
                    : null;
            String fileTypes = System.getProperty("filetypes", DEFAULT_FILE_TYPES);
            boolean commit = isOn(System.getProperty("commit", DEFAULT_COMMIT));
            boolean optimize = isOn(System.getProperty("optimize", DEFAULT_OPTIMIZE));
            return new SimplePostTool(mode, url, auto, type, recursive, delay,
                    fileTypes, out, commit, optimize, args);
        } catch (MalformedURLException e) {
            fatal(new StringBuilder()
                    .append("System Property 'url' is not a valid URL: ")
                    .append(urlStr).toString());
            return null;
        }
    }
    public SimplePostTool(String mode, URL url, boolean auto, String type,
            int recursive, int delay, String fileTypes, OutputStream out,
            boolean commit, boolean optimize, String[] args) {
        this.mode = mode;
        this.solrUrl = url;
        this.auto = auto;
        this.type = type;
        this.recursive = recursive;
        this.delay = delay;
        this.fileTypes = fileTypes;
        this.globFileFilter = getFileFilterFromFileTypes(fileTypes);
        this.out = out;
        this.commit = commit;
        this.optimize = optimize;
        this.args = args;
        pageFetcher = new PageFetcher();
    }
    public SimplePostTool() {
    }
    /**
     * The index data to post lives in files; args may be a folder, a file path, or a glob such as xxxx\*.xml
     */
    private void doFilesMode() {
        this.currentDepth = 0;
        if (!this.args[0].equals("-")) {
            info(new StringBuilder()
                    .append("Posting files to [base] url ")
                    .append(this.solrUrl)
                    .append(!this.auto ? new StringBuilder()
                            .append(" using content-type ")
                            .append(this.type == null ? DEFAULT_CONTENT_TYPE
                                    : this.type).toString() : "").append("...")
                    .toString());
            if (this.auto)
                info(new StringBuilder()
                        .append("Entering auto mode. File endings considered are ")
                        .append(this.fileTypes).toString());
            if (this.recursive > 0)
                info(new StringBuilder()
                        .append("Entering recursive mode, max depth=")
                        .append(this.recursive).append(", delay=")
                        .append(this.delay).append("s").toString());
            int numFilesPosted = postFiles(this.args, 0, this.out, this.type);
            info(new StringBuilder().append(numFilesPosted)
                    .append(" files indexed.").toString());
        }
    }
    /**
     * The index data to post is given directly as args and POSTed to Solr
     */
    private void doArgsMode() {
        info(new StringBuilder().append("POSTing args to ")
                .append(this.solrUrl).append("...").toString());
        for (String a : this.args) {
            postData(stringToStream(a), null, this.out, this.type, this.solrUrl);
        }
    }
    /**
     * The data to post lives on the web: fetch the page content on the fly, then post it
     * @return number of pages posted
     */
    private int doWebMode() {
        reset();
        int numPagesPosted = 0;
        try {
            if (this.type != null) {
                fatal("Specifying content-type with \"-Ddata=web\" is not supported");
            }
            if (this.args[0].equals("-")) {
                return 0;
            }
            this.solrUrl = appendUrlPath(this.solrUrl, "/extract");
            info(new StringBuilder().append("Posting web pages to Solr url ")
                    .append(this.solrUrl).toString());
            this.auto = true;
            info(new StringBuilder()
                    .append("Entering auto mode. Indexing pages with content-types corresponding to file endings ")
                    .append(this.fileTypes).toString());
            if (this.recursive > 0) {
                if (this.recursive > MAX_WEB_DEPTH) {
                    this.recursive = MAX_WEB_DEPTH;
                    warn("Too large recursion depth for web mode, limiting to 10...");
                }
                if (this.delay < DEFAULT_WEB_DELAY)
                    warn("Never crawl an external web site faster than every " + DEFAULT_WEB_DELAY + " seconds, your IP will probably be blocked");
                info(new StringBuilder()
                        .append("Entering recursive mode, depth=")
                        .append(this.recursive).append(", delay=")
                        .append(this.delay).append("s").toString());
            }
            numPagesPosted = postWebPages(this.args, 0, this.out);
            info(new StringBuilder().append(numPagesPosted)
                    .append(" web pages indexed.").toString());
        } catch (MalformedURLException e) {
            fatal(new StringBuilder()
                    .append("Wrong URL trying to append /extract to ")
                    .append(this.solrUrl).toString());
        }
        return numPagesPosted;
    }
    private void doStdinMode() {
        info(new StringBuilder().append("POSTing stdin to ")
                .append(this.solrUrl).append("...").toString());
        postData(System.in, null, this.out, this.type, this.solrUrl);
    }
    private void reset() {
        this.fileTypes = "xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log";
        this.globFileFilter = getFileFilterFromFileTypes(this.fileTypes);
        this.backlog = new ArrayList<LinkedHashSet<URL>>();
        this.visited = new HashSet<URL>();
    }
    /**
     * Print the short post.jar usage message
     */
    private static void usageShort() {
        System.out
                .println("Usage: java [SystemProperties] -jar post.jar [-h|-] [<file|folder|url|arg> [<file|folder|url|arg>...]]\n       Please invoke with -h option for extended usage help.");
    }
    /**
     * Print the full post.jar usage help
     */
    private static void usage() {
        System.out
                .println("Usage: java [SystemProperties] -jar post.jar [-h|-] [<file|folder|url|arg> [<file|folder|url|arg>...]]\n\nSupported System Properties and their defaults:\n  -Dc=<core/collection>\n  -Durl=<base Solr update URL> (overrides -Dc option if specified)\n  -Ddata=files|web|args|stdin (default=files)\n  -Dtype=<content-type> (default=application/xml)\n  -Dhost=<host> (default: localhost)\n  -Dport=<port> (default: " + DEFAULT_POST_PORT + ")\n  -Dauto=yes|no (default=no)\n  -Drecursive=yes|no|<depth> (default=0)\n  -Ddelay=<seconds> (default=0 for files, 10 for web)\n  -Dfiletypes=<type>[,<type>,...] (default=xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log)\n  -Dparams=\"<key>=<value>[&<key>=<value>...]\" (values must be URL-encoded)\n  -Dcommit=yes|no (default=yes)\n  -Doptimize=yes|no (default=no)\n  -Dout=yes|no (default=no)\n\nThis is a simple command line tool for POSTing raw data to a Solr port.\nNOTE: Specifying the url/core/collection name is mandatory.\nData can be read from files specified as commandline args,\nURLs specified as args, as raw commandline arg strings or via stdin.\nExamples:\n  java -Dc=gettingstarted -jar post.jar *.xml\n  java -Ddata=args -Dc=gettingstarted -jar post.jar '<delete><id>42</id></delete>'\n  java -Ddata=stdin -Dc=gettingstarted -jar post.jar < hd.xml\n  java -Ddata=web -Dc=gettingstarted -jar post.jar http://example.com/\n  java -Dtype=text/csv -Dc=gettingstarted -jar post.jar *.csv\n  java -Dtype=application/json -Dc=gettingstarted -jar post.jar *.json\n  java -Durl=http://localhost:8983/solr/techproducts/update/extract -Dparams=literal.id=pdf1 -jar post.jar solr-word.pdf\n  java -Dauto -Dc=gettingstarted -jar post.jar *\n  java -Dauto -Dc=gettingstarted -Drecursive -jar post.jar afolder\n  java -Dauto -Dc=gettingstarted -Dfiletypes=ppt,html -jar post.jar afolder\nThe options controlled by System Properties include the Solr\nURL to POST to, the Content-Type of the data, whether a commit\nor optimize should be executed, and whether the response should\nbe written to stdout. If auto=yes the tool will try to set type\nautomatically from file name. When posting rich documents the\nfile name will be propagated as \"resource.name\" and also used\nas \"literal.id\". You may override these or any other request parameter\nthrough the -Dparams property. To do a commit only, use \"-\" as argument.\nThe web mode is a simple crawler following links within domain, default delay=" + DEFAULT_WEB_DELAY + "s.");
    }
    /**
     * Post files
     * @param args
     * @param startIndexInArgs
     * @param out
     * @param type
     */
    public int postFiles(String[] args, int startIndexInArgs, OutputStream out,
            String type) {
        int filesPosted = 0;
        for (int j = startIndexInArgs; j < args.length; j++) {
            File srcFile = new File(args[j]);
            if ((srcFile.isDirectory()) && (srcFile.canRead())) {
                filesPosted += postDirectory(srcFile, out, type);
            } else if ((srcFile.isFile()) && (srcFile.canRead())) {
                filesPosted += postFiles(new File[] { srcFile }, out, type);
            } else {
                File parent = srcFile.getParentFile();
                if (parent == null)
                    parent = new File(".");
                String fileGlob = srcFile.getName();
                GlobFileFilter ff = new GlobFileFilter(fileGlob, false);
                File[] files = parent.listFiles(ff);
                if ((files == null) || (files.length == 0)) {
                    warn(new StringBuilder()
                            .append("No files or directories matching ")
                            .append(srcFile).toString());
                } else
                    filesPosted += postFiles(parent.listFiles(ff), out, type);
            }
        }
        return filesPosted;
    }
    /**
     * @param files
     */
    public int postFiles(File[] files, int startIndexInArgs, OutputStream out,
            String type) {
        int filesPosted = 0;
        for (File srcFile : files) {
            if ((srcFile.isDirectory()) && (srcFile.canRead())) {
                filesPosted += postDirectory(srcFile, out, type);
            } else if ((srcFile.isFile()) && (srcFile.canRead())) {
                filesPosted += postFiles(new File[] { srcFile }, out, type);
            } else {
                File parent = srcFile.getParentFile();
                if (parent == null)
                    parent = new File(".");
                String fileGlob = srcFile.getName();
                GlobFileFilter ff = new GlobFileFilter(fileGlob, false);
                File[] fileList = parent.listFiles(ff);
                if ((fileList == null) || (fileList.length == 0)) {
                    warn(new StringBuilder()
                            .append("No files or directories matching ")
                            .append(srcFile).toString());
                } else
                    filesPosted += postFiles(fileList, out, type);
            }
        }
        return filesPosted;
    }
    /**
     * Post all files under a directory
     * @param dir
     * @return the number of files posted
     */
    private int postDirectory(File dir, OutputStream out, String type) {
        if ((dir.isHidden()) && (!dir.getName().equals(".")))
            return 0;
        info(new StringBuilder().append("Indexing directory ")
                .append(dir.getPath()).append(" (")
                .append(dir.listFiles(this.globFileFilter).length)
                .append(" files, depth=").append(this.currentDepth).append(")")
                .toString());
        int posted = 0;
        posted += postFiles(dir.listFiles(this.globFileFilter), out, type);
        if (this.recursive > this.currentDepth) {
            for (File d : dir.listFiles()) {
                if (d.isDirectory()) {
                    this.currentDepth += 1;
                    posted += postDirectory(d, out, type);
                    this.currentDepth -= 1;
                }
            }
        }
        return posted;
    }
    public int postFiles(File[] files, OutputStream out, String type) {
        int filesPosted = 0;
        for (File srcFile : files) {
            try {
                if ((!srcFile.isFile()) || (srcFile.isHidden())) {
                    continue;
                }
                postFile(srcFile, out, type);
                Thread.sleep(DEFAULT_POST_DELAY);
                filesPosted++;
            } catch (InterruptedException e) {
                throw new RuntimeException();
            }
        }
        return filesPosted;
    }
    /**
     * Post index data in web mode from the URLs the user supplied
     */
    public int postWebPages(String[] args, int startIndexInArgs,
            OutputStream out) {
        LinkedHashSet<URL> s = new LinkedHashSet<URL>();
        for (int j = startIndexInArgs; j < args.length; j++) {
            try {
                URL u = new URL(normalizeUrlEnding(args[j]));
                s.add(u);
            } catch (MalformedURLException e) {
                warn(new StringBuilder()
                        .append("Skipping malformed input URL: ")
                        .append(args[j]).toString());
            }
        }
        // Seed the backlog with the URL set
        this.backlog.add(s);
        // The 0 here is the crawl depth; crawling starts at depth 0
        return webCrawl(0, out);
    }
    /**
     * Normalize a non-canonical URL ending
     * @param link
     */
    protected static String normalizeUrlEnding(String link) {
        // If the URL contains a '#', keep only the part before it and drop the fragment
        if (link.indexOf("#") > -1) {
            link = link.substring(0, link.indexOf("#"));
        }
        // If the URL ends with a '?', drop the trailing '?'
        if (link.endsWith("?")) {
            link = link.substring(0, link.length() - 1);
        }
        // If the URL ends with a '/', drop the trailing '/'
        if (link.endsWith("/")) {
            link = link.substring(0, link.length() - 1);
        }
        return link;
    }
    /**
     * Crawl pages
     * @param level the current crawl depth
     */
    protected int webCrawl(int level, OutputStream out) {
        int numPages = 0;
        LinkedHashSet<URL> stack = (LinkedHashSet<URL>) this.backlog.get(level);
        int rawStackSize = stack.size();
        stack.removeAll(this.visited);
        int stackSize = stack.size();
        LinkedHashSet<URL> subStack = new LinkedHashSet<URL>();
        info(new StringBuilder().append("Entering crawl at level ")
                .append(level).append(" (").append(rawStackSize)
                .append(" links total, ").append(stackSize).append(" new)")
                .toString());
        for (URL u : stack) {
            try {
                // Record the URL as visited so the same URL is not fetched twice
                this.visited.add(u);
                // Fetch the page content as a PageFetcherResult
                PageFetcherResult result = pageFetcher.readPageFromUrl(u);
                // HTTP status 200 means the page was fetched successfully
                if (result.httpStatus == 200) {
                    // If the page redirected, use the redirect target instead
                    u = result.redirectUrl != null ? result.redirectUrl : u;
                    URL postUrl = new URL(appendParam(
                            this.solrUrl.toString(),
                            new StringBuilder()
                                    .append("literal.id=")
                                    .append(URLEncoder.encode(u.toString(),
                                            "UTF-8"))
                                    .append("&literal.url=")
                                    .append(URLEncoder.encode(u.toString(),
                                            "UTF-8")).toString()));
                    boolean success = postData(
                            new ByteArrayInputStream(result.content.array(),
                                    result.content.arrayOffset(),
                                    result.content.limit()), null, out,
                            result.contentType, postUrl);
                    if (success) {
                        info(new StringBuilder().append("POSTed web resource ")
                                .append(u).append(" (depth: ").append(level)
                                .append(")").toString());
                        Thread.sleep(this.delay * 1000);
                        numPages++;
                        // If the crawl depth limit has not been reached yet
                        if ((this.recursive > level)
                                && (result.contentType.equals("text/html"))) {
                            // Extract the links from the fetched page
                            Set<URL> children = pageFetcher.getLinksFromWebPage(
                                    u,
                                    new ByteArrayInputStream(result.content
                                            .array(), result.content
                                            .arrayOffset(), result.content
                                            .limit()), result.contentType,
                                    postUrl);
                            // Queue the extracted links in the sub-stack
                            subStack.addAll(children);
                        }
                    } else {
                        warn(new StringBuilder()
                                .append("An error occurred while posting ")
                                .append(u).toString());
                    }
                } else {
                    warn(new StringBuilder().append("The URL ").append(u)
                            .append(" returned a HTTP result status of ")
                            .append(result.httpStatus).toString());
                }
            } catch (IOException e) {
                warn(new StringBuilder()
                        .append("Caught exception when trying to open connection to ")
                        .append(u).append(": ").append(e.getMessage())
                        .toString());
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        if (!subStack.isEmpty()) {
            this.backlog.add(subStack);
            numPages += webCrawl(level + 1, out);
        }
        return numPages;
    }
    public static ByteBuffer inputStreamToByteArray(BAOS bos, InputStream is)
            throws IOException {
        return inputStreamToByteArray(bos, is, 2147483647L);
    }
    /**
     * Copy the page input stream into the output stream, which buffers the received bytes in a ByteBuffer
     * @param bos
     * @param is
     * @param maxSize
     * @throws IOException
     */
    public static ByteBuffer inputStreamToByteArray(BAOS bos, InputStream is, long maxSize)
            throws IOException {
        long sz = 0L;
        int next = is.read();
        while (next > -1) {
            if (++sz > maxSize) {
                throw new BufferOverflowException();
            }
            bos.write(next);
            next = is.read();
        }
        bos.flush();
        is.close();
        return bos.getByteBuffer();
    }
    /**
     * Compute the full URL: the href attribute of an <a> tag may be a relative path,
     * so the base URL has to be prepended, you get the idea
     * @param baseUrl the site's base URL
     * @param link the value extracted from the <a> tag's href attribute
     */
    protected String computeFullUrl(URL baseUrl, String link) {
        if ((link == null) || (link.length() == 0)) {
            return null;
        }
        if (!link.startsWith("http")) {
            if (link.startsWith("/")) {
                link = new StringBuilder().append(baseUrl.getProtocol())
                        .append("://").append(baseUrl.getAuthority())
                        .append(link).toString();
            } else {
                if (link.contains(":")) {
                    return null;
                }
                String path = baseUrl.getPath();
                if (!path.endsWith("/")) {
                    int sep = path.lastIndexOf("/");
                    String file = path.substring(sep + 1);
                    if ((file.contains(".")) || (file.contains("?")))
                        path = path.substring(0, sep);
                }
                link = new StringBuilder().append(baseUrl.getProtocol())
                        .append("://").append(baseUrl.getAuthority())
                        .append(path).append("/").append(link).toString();
            }
        }
        link = normalizeUrlEnding(link);
        String l = link.toLowerCase(Locale.ROOT);
        // Filter out image links
        if ((l.endsWith(".jpg")) || (l.endsWith(".jpeg"))
                || (l.endsWith(".png")) || (l.endsWith(".gif"))) {
            return null;
        }
        return link;
    }
    /**
     * Check whether a content type is supported by the tool; the supported set is defined by the mimeMap variable
     */
    protected boolean typeSupported(String type) {
        for (String key : mimeMap.keySet()) {
            if ((((String) mimeMap.get(key)).equals(type))
                    && (this.fileTypes.contains(key))) {
                return true;
            }
        }
        return false;
    }
    /**
     * Returns true for any of the inputs true, on, yes, 1
     * @param property
     */
    protected static boolean isOn(String property) {
        return "true,on,yes,1".indexOf(property) > -1;
    }
    /**
     * Print a warning message
     * @param msg
     */
    static void warn(String msg) {
        System.err.println(new StringBuilder()
                .append("SimplePostTool: WARNING: ").append(msg).toString());
    }
    /**
     * Print an informational message
     */
    static void info(String msg) {
        System.out.println(msg);
    }
    /**
     * Print a fatal error message and exit
     */
    static void fatal(String msg) {
        System.err.println(new StringBuilder()
                .append("SimplePostTool: FATAL: ").append(msg).toString());
        System.exit(2);
    }
    /**
     * Send a commit request to Solr
     */
    public void commit() {
        info(new StringBuilder().append("COMMITting Solr index changes to ")
                .append(this.solrUrl).append("...").toString());
        doGet(appendParam(this.solrUrl.toString(), "commit=true"));
    }
    /**
     * Send an index-optimize request to Solr
     */
    public void optimize() {
        info(new StringBuilder().append("Performing an OPTIMIZE to ")
                .append(this.solrUrl).append("...").toString());
        doGet(appendParam(this.solrUrl.toString(), "optimize=true"));
    }
    /**
     * Append parameters in id=1&mode=files form to a URL
     * @param url
     * @param param
     */
    public static String appendParam(String url, String param) {
        String[] pa = param.split("&");
        for (String p : pa) {
            if (p.trim().length() != 0) {
                String[] kv = p.split("=");
                if (kv.length == 2) {
                    url = new StringBuilder().append(url)
                            .append(url.indexOf(63) > 0 ? "&" : "?")
                            .append(kv[0]).append("=").append(kv[1]).toString();
                } else {
                    warn(new StringBuilder().append("Skipping param ")
                            .append(p)
                            .append(" which is not on form key=value")
                            .toString());
                }
            }
        }
        return url;
    }
    public void postFile(File file, OutputStream output, String type) {
        InputStream is = null;
        try {
            URL url = this.solrUrl;
            String suffix = "";
            if (this.auto) {
                if (type == null) {
                    type = guessType(file);
                }
                if (type != null) {
                    if ((!type.equals("application/xml"))
                            && (!type.equals("text/csv"))
                            && (!type.equals("application/json"))) {
                        suffix = "/extract";
                        String urlStr = appendUrlPath(this.solrUrl, suffix)
                                .toString();
                        if (urlStr.indexOf("resource.name") == -1) {
                            // Append the resource.name parameter (the file's absolute path) to the post URL
                            urlStr = appendParam(
                                    urlStr,
                                    new StringBuilder()
                                            .append("resource.name=")
                                            .append(URLEncoder.encode(
                                                    file.getAbsolutePath(),
                                                    "UTF-8")).toString());
                        }
                        if (urlStr.indexOf("literal.id") == -1) {
                            // Append the literal.id parameter (the file's absolute path) to the post URL
                            urlStr = appendParam(
                                    urlStr,
                                    new StringBuilder()
                                            .append("literal.id=")
                                            .append(URLEncoder.encode(
                                                    file.getAbsolutePath(),
                                                    "UTF-8")).toString());
                        }
                        url = new URL(urlStr);
                    }
                } else {
                    // Unknown file types are simply skipped, with only a warning printed
                    warn(new StringBuilder().append("Skipping ")
                            .append(file.getName())
                            .append(". Unsupported file type for auto mode.")
                            .toString());
                    return;
                }
            } else if (type == null) {
                // Auto type guessing is off and no type was given, so fall back to DEFAULT_CONTENT_TYPE
                type = DEFAULT_CONTENT_TYPE;
            }
            info(new StringBuilder()
                    .append("POSTing file ")
                    .append(file.getName())
                    .append(this.auto ? new StringBuilder().append(" (")
                            .append(type).append(")").toString() : "")
                    .append(" to [base]").append(suffix).toString());
            is = new FileInputStream(file);
            // Post the file
            postData(is, Integer.valueOf((int) file.length()), output, type,
                    url);
        } catch (IOException e) {
            e.printStackTrace();
            warn(new StringBuilder().append("Can't open/read file: ")
                    .append(file).toString());
        } finally {
            try {
                if (is != null) {
                    is.close();
                }
            } catch (IOException e) {
                fatal(new StringBuilder()
                        .append("IOException while closing file: ").append(e)
                        .toString());
            }
        }
    }
    /**
     * Append a path segment to a request URL, e.g.
     * http://localhost:8080/solr/core1?param1=value1&param2=value2 with /update appended becomes
     * http://localhost:8080/solr/core1/update?param1=value1&param2=value2
     * @param append
     * @throws MalformedURLException
     */
    protected static URL appendUrlPath(URL url, String append)
            throws MalformedURLException {
        return new URL(new StringBuilder()
                .append(url.getProtocol())
                .append("://")
                .append(url.getAuthority())
                .append(url.getPath())
                .append(append)
                .append(url.getQuery() != null ? new StringBuilder()
                        .append("?").append(url.getQuery()).toString() : "")
                .toString());
    }
    /**
     * Guess a file's MIME type from its extension
     * @param file
     */
    protected static String guessType(File file) {
        String name = file.getName();
        String suffix = name.substring(name.lastIndexOf(".") + 1);
        return (String) mimeMap.get(suffix.toLowerCase(Locale.ROOT));
    }
    /**
     * Send a GET request
     */
    public static void doGet(String url) {
        try {
            doGet(new URL(url));
        } catch (MalformedURLException e) {
            warn(new StringBuilder().append("The specified URL ").append(url)
                    .append(" is not a valid URL. Please check").toString());
        }
    }
    public static void doGet(URL url) {
        if (mockMode) {
            return;
        }
        try {
            HttpURLConnection urlc = (HttpURLConnection) url.openConnection();
            if (url.getUserInfo() != null) {
                String encoding = DatatypeConverter.printBase64Binary(url
                        .getUserInfo().getBytes(StandardCharsets.US_ASCII));
                urlc.setRequestProperty("Authorization", new StringBuilder()
                        .append("Basic ").append(encoding).toString());
            }
            // Open the connection to Solr
            urlc.connect();
            // Check whether the request succeeded
            checkResponseCode(urlc);
        } catch (IOException e) {
            warn(new StringBuilder()
                    .append("An error occurred posting data to ").append(url)
                    .append(". Please check that Solr is running.").toString());
        }
    }
/**
 * Submits data via HTTP POST
 * @param data
 * @param length
 * @param output
 */
public boolean postData(InputStream data, Integer length,
        OutputStream output, String type, URL url) {
    if (mockMode) {
        return true;
    }
    boolean success = true;
    if (type == null)
        type = DEFAULT_CONTENT_TYPE;
    HttpURLConnection urlc = null;
    try {
        urlc = (HttpURLConnection) url.openConnection();
        try {
            // Set the HTTP method to POST
            urlc.setRequestMethod("POST");
        } catch (ProtocolException e) {
            // Every HttpURLConnection supports POST, so this should never fire
            fatal(new StringBuilder()
                    .append("Shouldn't happen: HttpURLConnection doesn't support POST??")
                    .append(e).toString());
        }
        urlc.setDoOutput(true);
        urlc.setDoInput(true);
        urlc.setUseCaches(false);
        urlc.setAllowUserInteraction(false);
        urlc.setRequestProperty("Content-Type", type);
        if (url.getUserInfo() != null) {
            String encoding = DatatypeConverter.printBase64Binary(url
                    .getUserInfo().getBytes(StandardCharsets.US_ASCII));
            urlc.setRequestProperty(
                    "Authorization",
                    new StringBuilder().append("Basic ")
                            .append(encoding).toString());
        }
        if (null != length)
            urlc.setFixedLengthStreamingMode(length.intValue());
        urlc.connect();
    } catch (IOException e) {
        fatal(new StringBuilder()
                .append("Connection error (is Solr running at ")
                .append(this.solrUrl).append(" ?): ").append(e)
                .toString());
        success = false;
    }
    try (OutputStream out = urlc.getOutputStream()) {
        pipe(data, out);
    } catch (IOException e) {
        fatal(new StringBuilder()
                .append("IOException while posting data: ").append(e)
                .toString());
        success = false;
    }
    try {
        success &= checkResponseCode(urlc);
        try (InputStream in = urlc.getInputStream()) {
            pipe(in, output);
        }
    } catch (IOException e) {
        warn(new StringBuilder()
                .append("IOException while reading response: ")
                .append(e).toString());
        success = false;
    } finally {
        if (urlc != null) {
            urlc.disconnect();
        }
    }
    return success;
}
/**
 * Decides from the response status code whether the submission succeeded
 * @param urlc
 */
private static boolean checkResponseCode(HttpURLConnection urlc)
        throws IOException {
    // A status code of 400 or above means the request failed
    if (urlc.getResponseCode() >= 400) {
        warn(new StringBuilder().append("Solr returned an error #")
                .append(urlc.getResponseCode()).append(" (")
                .append(urlc.getResponseMessage()).append(") for url: ")
                .append(urlc.getURL()).toString());
        Charset charset = StandardCharsets.ISO_8859_1;
        String contentType = urlc.getContentType();
        if (contentType != null) {
            int idx = contentType.toLowerCase(Locale.ROOT).indexOf(
                    "charset=");
            if (idx > 0) {
                charset = Charset.forName(contentType.substring(
                        idx + "charset=".length()).trim());
            }
        }
        // Read the error stream, if any, and log its contents
        try (InputStream errStream = urlc.getErrorStream()) {
            if (errStream != null) {
                BufferedReader br = new BufferedReader(
                        new InputStreamReader(errStream, charset));
                StringBuilder response = new StringBuilder("Response: ");
                int ch;
                while ((ch = br.read()) != -1) {
                    response.append((char) ch);
                }
                warn(response.toString().trim());
            }
        }
        return false;
    }
    return true;
}
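The charset handling above is worth isolating: the error body is decoded as ISO-8859-1 unless the Content-Type header carries an explicit `charset=` parameter. A minimal sketch of just that parsing step (the class and method names are illustrative):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class CharsetFromContentType {

    // Mirrors the header parsing in checkResponseCode(): default to
    // ISO-8859-1, but honour an explicit charset= parameter if present
    static Charset pick(String contentType) {
        Charset charset = StandardCharsets.ISO_8859_1;
        if (contentType != null) {
            int idx = contentType.toLowerCase(Locale.ROOT).indexOf("charset=");
            if (idx > 0) {
                charset = Charset.forName(contentType.substring(
                        idx + "charset=".length()).trim());
            }
        }
        return charset;
    }

    public static void main(String[] args) {
        System.out.println(pick("text/html; charset=UTF-8")); // UTF-8
        System.out.println(pick("application/json"));         // ISO-8859-1
    }
}
```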
/**
 * Converts a string into a byte input stream
 * @param s
 */
public static InputStream stringToStream(String s) {
    return new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8));
}
/**
 * Pipes an input stream into an output stream
 * @param source
 * @param dest
 */
private static void pipe(InputStream source, OutputStream dest)
        throws IOException {
    byte[] buf = new byte[1024];
    int read = 0;
    while ((read = source.read(buf)) >= 0) {
        if (null != dest) {
            dest.write(buf, 0, read);
        }
    }
    if (null != dest) {
        dest.flush();
    }
}
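Note that `pipe` tolerates a null destination, which lets the tool drain a response stream without keeping its contents. A runnable sketch of the same loop against in-memory streams (class name and the `copy` helper are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class PipeDemo {

    // Same copy loop as pipe(): read into a 1 KB buffer until EOF,
    // writing only when a destination is actually supplied
    static void pipe(InputStream source, OutputStream dest)
            throws IOException {
        byte[] buf = new byte[1024];
        int read;
        while ((read = source.read(buf)) >= 0) {
            if (dest != null) {
                dest.write(buf, 0, read);
            }
        }
        if (dest != null) {
            dest.flush();
        }
    }

    // Convenience wrapper used for the demo below
    static String copy(String s) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        pipe(new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8)), out);
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(copy("<add><doc/></add>")); // round-trips unchanged
    }
}
```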
/**
 * Builds a file filter from the given fileTypes list
 * @param fileTypes
 */
public GlobFileFilter getFileFilterFromFileTypes(String fileTypes) {
    String glob;
    if (fileTypes.equals("*")) {
        glob = ".*";
    } else {
        glob = new StringBuilder().append("^.*\\.(")
                .append(fileTypes.replace(",", "|")).append(")$")
                .toString();
    }
    return new GlobFileFilter(glob, true);
}
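The transformation is easy to miss: a comma-separated type list such as `xml,json` becomes the regular expression `^.*\.(xml|json)$`, i.e. "any name ending in one of these extensions". A standalone sketch of just that conversion (class name illustrative):

```java
public class FileTypesGlobDemo {

    // Mirrors getFileFilterFromFileTypes(): "*" matches everything,
    // otherwise "xml,json" becomes the regex ^.*\.(xml|json)$
    static String toGlob(String fileTypes) {
        if (fileTypes.equals("*")) {
            return ".*";
        }
        return "^.*\\.(" + fileTypes.replace(",", "|") + ")$";
    }

    public static void main(String[] args) {
        String glob = toGlob("xml,json,csv");
        System.out.println(glob);
        System.out.println("books.json".matches(glob)); // true
        System.out.println("books.pdf".matches(glob));  // false
    }
}
```

This is the regex that backs the `-Dfiletypes=...` option from the usage text above.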
/**
 * Selects XML nodes matching an XPath expression
 * @param n
 * @param xpath
 * @throws XPathExpressionException
 */
public static NodeList getNodesFromXP(Node n, String xpath)
        throws XPathExpressionException {
    XPathFactory factory = XPathFactory.newInstance();
    XPath xp = factory.newXPath();
    XPathExpression expr = xp.compile(xpath);
    return (NodeList) expr.evaluate(n, XPathConstants.NODESET);
}
/**
 * @param concatAll whether to concatenate all matching nodes, or take only the first
 */
public static String getXP(Node n, String xpath, boolean concatAll)
        throws XPathExpressionException {
    NodeList nodes = getNodesFromXP(n, xpath);
    StringBuilder sb = new StringBuilder();
    if (nodes.getLength() > 0) {
        for (int i = 0; i < nodes.getLength(); i++) {
            sb.append(new StringBuilder()
                    .append(nodes.item(i).getNodeValue()).append(" ")
                    .toString());
            if (!concatAll) {
                break;
            }
        }
        return sb.toString().trim();
    }
    return "";
}
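These two helpers are plain JDK XPath usage, so they are easy to exercise in isolation. The sketch below parses a small XML snippet and pulls out an attribute value the same way `getNodesFromXP()`/`getXP()` do (class and method names are illustrative; the XPath is the link-extraction expression used later in the crawler):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathDemo {

    // Parse an XML snippet and evaluate an XPath expression against it
    static String firstHref(String xml) throws Exception {
        Document d = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .compile("/html/body//a/@href")
                .evaluate(d, XPathConstants.NODESET);
        // Attribute nodes report the attribute value as their node value
        return nodes.getLength() > 0 ? nodes.item(0).getNodeValue() : "";
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html><body><p><a href=\"/docs\">docs</a></p></body></html>";
        System.out.println(firstHref(xml)); // /docs
    }
}
```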
/**
 * Converts raw bytes into a Document object, ready for XML parsing
 * @param in
 * @throws SAXException
 * @throws ParserConfigurationException
 */
public static Document makeDom(byte[] in) throws SAXException, IOException,
        ParserConfigurationException {
    InputStream is = new ByteArrayInputStream(in);
    Document dom = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().parse(is);
    return dom;
}
static {
    DATA_MODES.add("files");
    DATA_MODES.add("args");
    DATA_MODES.add("stdin");
    DATA_MODES.add("web");
    mimeMap = new HashMap<String, String>();
mimemap.put("xml", "application/xml");
mimemap.put("csv", "text/csv");
mimemap.put("json", "application/json");
mimemap.put("pdf", "application/pdf");
mimemap.put("rtf", "text/rtf");
mimemap.put("html", "text/html");
mimemap.put("htm", "text/html");
mimemap.put("doc", "application/msword");
mimemap.put("docx",
"application/vnd.openxmlformats-officedocument.wordprocessingml.document");
mimemap.put("ppt", "application/vnd.ms-powerpoint");
mimemap.put("pptx",
"application/vnd.openxmlformats-officedocument.presentationml.presentation");
mimemap.put("xls", "application/vnd.ms-excel");
mimemap.put("xlsx",
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
mimemap.put("odt", "application/vnd.oasis.opendocument.text");
mimemap.put("ott", "application/vnd.oasis.opendocument.text");
mimemap.put("odp", "application/vnd.oasis.opendocument.presentation");
mimemap.put("otp", "application/vnd.oasis.opendocument.presentation");
mimemap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
mimemap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
mimemap.put("txt", "text/plain");
mimemap.put("log", "text/plain");
public class PageFetcherResult {
    int httpStatus = 200;
    String contentType = "text/html";
    URL redirectUrl = null;
    ByteBuffer content;

    public PageFetcherResult() {
    }
}

/**
 * Page-crawling class
 * @author lanxiaowei
 *
 */
class PageFetcher {
    Map<String, List<String>> robotsCache;
    final String DISALLOW = "Disallow:";

    public PageFetcher() {
        this.robotsCache = new HashMap<String, List<String>>();
    }
/**
 * Fetches the page at the given URL, wrapping its content in a
 * PageFetcherResult object
 * @param u
 * @return
 */
public PageFetcherResult readPageFromUrl(URL u) {
    PageFetcherResult res = new PageFetcherResult();
    // If the URL is on the robots.txt disallow list, skip it
    if (isDisallowedByRobots(u)) {
        SimplePostTool
                .warn("The URL "
                        + u
                        + " is disallowed by robots.txt and will not be crawled.");
        res.httpStatus = 403;
        SimplePostTool.this.visited.add(u);
        return res;
    }
    res.httpStatus = 404;
    try {
        HttpURLConnection conn = (HttpURLConnection) u.openConnection();
        conn.setRequestProperty("User-Agent",
                "SimplePostTool-crawler/5.1.0 (http://lucene.apache.org/solr/)");
        conn.setRequestProperty("Accept-Encoding", "gzip, deflate");
        conn.connect();
        res.httpStatus = conn.getResponseCode();
        if (!SimplePostTool
                .normalizeUrlEnding(conn.getURL().toString())
                .equals(SimplePostTool.normalizeUrlEnding(u.toString()))) {
            SimplePostTool.info("The URL " + u
                    + " caused a redirect to " + conn.getURL());
            u = conn.getURL();
            res.redirectUrl = u;
        }
        if (res.httpStatus == 200) {
            String rawContentType = conn.getContentType();
            String type = rawContentType.split(";")[0];
            if (SimplePostTool.this.typeSupported(type)) {
                String encoding = conn.getContentEncoding();
                InputStream is = null;
                if ((encoding != null)
                        && (encoding.equalsIgnoreCase("gzip"))) {
                    is = new GZIPInputStream(conn.getInputStream());
                } else if ((encoding != null)
                        && (encoding.equalsIgnoreCase("deflate"))) {
                    is = new InflaterInputStream(
                            conn.getInputStream(), new Inflater(true));
                } else {
                    is = conn.getInputStream();
                }
                BAOS bos = new BAOS();
                res.content = SimplePostTool.inputStreamToByteArray(bos, is);
                is.close();
                bos.close();
            } else {
                SimplePostTool
                        .warn("Skipping URL with unsupported type "
                                + type);
                res.httpStatus = 415;
            }
        }
    } catch (IOException e) {
        SimplePostTool.warn("IOException when reading page from url "
                + u + ": " + e.getMessage());
    }
    return res;
}
/**
 * Decides from robots.txt whether the given URL may be crawled
 * @param url
 */
public boolean isDisallowedByRobots(URL url) {
    String host = url.getHost();
    // Build the address of the site's robots.txt
    String strRobot = url.getProtocol() + "://" + host + "/robots.txt";
    // First check the cache for this host's robots rules
    List<String> disallows = (List<String>) this.robotsCache.get(host);
    // Not in the cache
    if (disallows == null) {
        disallows = new ArrayList<String>();
        try {
            // Fetch robots.txt from the address built above
            URL urlRobot = new URL(strRobot);
            // Parse the robots rules
            disallows = parseRobotsTxt(urlRobot.openStream());
        } catch (MalformedURLException e) {
            return true;
        } catch (IOException e) {
            // Missing or unreadable robots.txt: treat as unrestricted
        }
        // Cache the rules in the map
        this.robotsCache.put(host, disallows);
    }
    // Check whether the URL is on the disallow list
    String strURL = url.getFile();
    for (String path : disallows) {
        if ((path.equals("/")) || (strURL.indexOf(path) == 0)) {
            return true;
        }
    }
    // Returning false means the URL is not disallowed
    return false;
}
/**
 * Parses robots.txt rules from an input stream into a list,
 * normally one rule per line
 * @param is
 * @throws IOException
 */
protected List<String> parseRobotsTxt(InputStream is)
        throws IOException {
    List<String> disallows = new ArrayList<String>();
    BufferedReader r = new BufferedReader(new InputStreamReader(is,
            StandardCharsets.UTF_8));
    String l;
    while ((l = r.readLine()) != null) {
        String[] arr = l.split("#");
        if (arr.length != 0) {
            l = arr[0].trim();
            // We only care about disallowed URLs; a "Disallow:" line
            // marks a path that must not be crawled
            if (l.toLowerCase(Locale.ROOT).startsWith("disallow:")) {
                l = l.substring("disallow:".length()).trim();
                if (l.length() != 0) {
                    disallows.add(l);
                }
            }
        }
    }
    is.close();
    return disallows;
}
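The parsing rules are simple: strip `#` comments, then keep the path from every non-empty `Disallow:` line. The sketch below runs the same logic against an in-memory string instead of a network stream (class name and the `StringReader` harness are illustrative):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class RobotsTxtDemo {

    // Same parsing rules as parseRobotsTxt(): strip # comments, keep
    // only non-empty Disallow: paths
    static List<String> parse(String robotsTxt) throws IOException {
        List<String> disallows = new ArrayList<String>();
        BufferedReader r = new BufferedReader(new StringReader(robotsTxt));
        String l;
        while ((l = r.readLine()) != null) {
            String[] arr = l.split("#");
            if (arr.length != 0) {
                l = arr[0].trim();
                if (l.toLowerCase(Locale.ROOT).startsWith("disallow:")) {
                    l = l.substring("disallow:".length()).trim();
                    if (l.length() != 0) {
                        disallows.add(l);
                    }
                }
            }
        }
        return disallows;
    }

    public static void main(String[] args) throws IOException {
        String robots = "User-agent: *\n"
                + "Disallow: /private/ # staff only\n"
                + "Disallow:\n"
                + "Disallow: /tmp/";
        System.out.println(parse(robots)); // [/private/, /tmp/]
    }
}
```

The empty `Disallow:` line is deliberately dropped, matching the loop above.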
/**
 * Extracts URLs from fetched page content
 * @param type
 * @param postUrl
 */
protected Set<URL> getLinksFromWebPage(URL u, InputStream is,
        String type, URL postUrl) {
    Set<URL> l = new HashSet<URL>();
    URL url = null;
    try {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        URL extractUrl = new URL(SimplePostTool.appendParam(
                postUrl.toString(), "extractOnly=true"));
        boolean success = SimplePostTool.this.postData(is, null, os,
                type, extractUrl);
        if (success) {
            Document d = SimplePostTool.makeDom(os.toByteArray());
            String innerXml = SimplePostTool.getXP(d,
                    "/response/str/text()[1]", false);
            d = SimplePostTool.makeDom(innerXml
                    .getBytes(StandardCharsets.UTF_8));
            // This XPath expression selects the href attribute of every
            // <a> tag under <body> under <html>
            NodeList links = SimplePostTool.getNodesFromXP(d,
                    "/html/body//a/@href");
            for (int i = 0; i < links.getLength(); i++) {
                String link = links.item(i).getTextContent();
                link = SimplePostTool.this.computeFullUrl(u, link);
                if (link != null) {
                    url = new URL(link);
                    // Only follow links on the same host
                    if ((url.getAuthority() != null)
                            && (url.getAuthority().equals(u
                                    .getAuthority()))) {
                        l.add(url);
                    }
                }
            }
        }
    } catch (MalformedURLException e) {
        SimplePostTool.warn("Malformed URL " + url);
    } catch (IOException e) {
        SimplePostTool.warn("IOException opening URL " + url + ": "
                + e.getMessage());
    }
    return l;
}
/**
 * Custom file filter
 */
class GlobFileFilter implements FileFilter {
    private String _pattern;
    private Pattern p;

    /**
     * isRegex indicates whether the pattern parameter is already a
     * regular expression
     * @param pattern
     * @param isRegex
     */
    public GlobFileFilter(String pattern, boolean isRegex) {
        this._pattern = pattern;
        // If the pattern parameter is not a regular expression
        if (!isRegex) {
            // escape the regex metacharacters in it, so the replacements
            // below are self-explanatory
            this._pattern = this._pattern.replace("^", "\\^")
                    .replace("$", "\\$").replace(".", "\\.")
                    .replace("(", "\\(").replace(")", "\\)")
                    .replace("+", "\\+").replace("*", ".*")
                    .replace("?", ".");
            // After that step this._pattern is treated as a plain file
            // name; anchor it with ^ and $ to form a regular expression
            this._pattern = ("^" + this._pattern + "$");
        }
        try {
            // The flag value 2 is Pattern.CASE_INSENSITIVE, i.e. ignore case
            this.p = Pattern.compile(this._pattern, Pattern.CASE_INSENSITIVE);
        } catch (PatternSyntaxException e) {
            SimplePostTool.fatal("Invalid type list " + pattern + ". "
                    + e.getDescription());
        }
    }

    /** Accepts a file if its name matches the regular expression */
    public boolean accept(File file) {
        return this.p.matcher(file.getName()).find();
    }
}
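The escaping chain above also quietly maps the glob wildcards `*` and `?` onto their regex equivalents `.*` and `.`. Isolated as a standalone sketch (class name illustrative):

```java
public class GlobToRegexDemo {

    // The same character mapping as GlobFileFilter: escape regex
    // metacharacters, then turn the glob wildcards * and ? into .* and .
    static String globToRegex(String pattern) {
        String p = pattern.replace("^", "\\^")
                .replace("$", "\\$").replace(".", "\\.")
                .replace("(", "\\(").replace(")", "\\)")
                .replace("+", "\\+").replace("*", ".*")
                .replace("?", ".");
        return "^" + p + "$";
    }

    public static void main(String[] args) {
        String regex = globToRegex("report-*.xml");
        System.out.println(regex);
        System.out.println("report-2015.xml".matches(regex)); // true
        System.out.println("report.csv".matches(regex));      // false
    }
}
```

Ordering matters here: literal dots are escaped before `*` expands to `.*`, so the dot introduced by the wildcard replacement is not itself escaped.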
/**
 * Custom byte output stream
 */
public static class BAOS extends ByteArrayOutputStream {
    // Expose the buffer as a ByteBuffer, which is more efficient to
    // read and write than a raw byte[]
    public ByteBuffer getByteBuffer() {
        return ByteBuffer.wrap(this.buf, 0, this.count);
    }
}
Understanding the post.jar source will help you use the tool more fluently for adding, deleting, and otherwise maintaining your index. The screenshots below show how to run the SimplePostTool class under Eclipse to test index operations:
Reposted from: http://iamyida.iteye.com/blog/2207920