Nutch 2.1: developing a crawl that stores data in sharded tables
1. The Nutch source reads the storage.schema.webpage property from the configuration file nutch-default.xml (value: webpage):
Class: org.apache.nutch.storage.StorageUtils.java
@SuppressWarnings("unchecked")
public static <K, V extends Persistent> DataStore<K, V> createWebStore(Configuration conf,
        Class<K> keyClass, Class<V> persistentClass) throws ClassNotFoundException, GoraException {
    String schema = null;
    if (WebPage.class.equals(persistentClass)) {
        schema = conf.get("storage.schema.webpage", "webpage");
    } else if (Host.class.equals(persistentClass)) {
        schema = conf.get("storage.schema.host", "host");
    } else {
        throw new UnsupportedOperationException("Unable to create store for class " + persistentClass);
    }
    String crawlId = conf.get(Nutch.CRAWL_ID_KEY, "");
    if (!crawlId.isEmpty()) {
        conf.set("schema.prefix", crawlId + "_");
    } else {
        conf.set("schema.prefix", "");
    }
    Class<? extends DataStore<K, V>> dataStoreClass =
            (Class<? extends DataStore<K, V>>) getDataStoreClass(conf);
    return DataStoreFactory.createDataStore(dataStoreClass,
            keyClass, persistentClass, conf, schema);
}
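The method above makes two string decisions: the schema name falls back to webpage (or host), and schema.prefix becomes the crawl id plus an underscore when a crawl id is set. Below is a minimal, self-contained sketch of just that logic, using a plain Map in place of Hadoop's Configuration. The class SchemaResolutionSketch, its method names, and the key string "storage.crawl.id" (standing in for Nutch.CRAWL_ID_KEY) are assumptions made for illustration, not part of the Nutch API.

```java
import java.util.HashMap;
import java.util.Map;

public class SchemaResolutionSketch {
    // Assumed stand-in for the value of Nutch.CRAWL_ID_KEY.
    static final String CRAWL_ID_KEY = "storage.crawl.id";

    // Mirrors the if/else chain: pick the schema name for the persistent type.
    static String resolveSchema(Map<String, String> conf, String kind) {
        if ("webpage".equals(kind)) {
            return conf.getOrDefault("storage.schema.webpage", "webpage");
        } else if ("host".equals(kind)) {
            return conf.getOrDefault("storage.schema.host", "host");
        }
        throw new UnsupportedOperationException("Unable to create store for " + kind);
    }

    // Mirrors the crawl id handling: an empty crawl id means no prefix.
    static String resolvePrefix(Map<String, String> conf) {
        String crawlId = conf.getOrDefault(CRAWL_ID_KEY, "");
        return crawlId.isEmpty() ? "" : crawlId + "_";
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println("schema=" + resolveSchema(conf, "webpage")
                + " prefix='" + resolvePrefix(conf) + "'");
        conf.put(CRAWL_ID_KEY, "39");
        System.out.println("schema=" + resolveSchema(conf, "webpage")
                + " prefix='" + resolvePrefix(conf) + "'");
    }
}
```

Note that the original method produces a prefix (crawl id first), so whether the final table name matches the suffix form described in step 4 depends on how the mapping and configuration files are changed.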
2. Restrict the crawl to the target addresses only:
Append to the end of regex-urlfilter.txt: +^http://([a-z0-9]*\.)zhangzhou.gov.cn/
Append to the end of automaton-urlfilter.txt: +^http://([a-z0-9]*\.)zhangzhou.gov.cn/
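To see which URLs the rule above lets through, the pattern can be tried directly with java.util.regex. This is a rough sketch that assumes the rule is applied with find() semantics; the class UrlFilterSketch and the sample URLs are made up for illustration. The leading + of the rule is the accept marker and is not part of the regex.

```java
import java.util.regex.Pattern;

public class UrlFilterSketch {
    // Rule body copied from the filter line, without the leading '+' marker.
    static final Pattern RULE =
            Pattern.compile("^http://([a-z0-9]*\\.)zhangzhou.gov.cn/");

    // True when the rule would match the URL (assumed find() semantics).
    static boolean accepts(String url) {
        return RULE.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.zhangzhou.gov.cn/index.html", // exactly one subdomain label
            "http://zhangzhou.gov.cn/",               // no subdomain label
            "http://a.www.zhangzhou.gov.cn/",         // two subdomain labels
            "https://www.zhangzhou.gov.cn/",          // different scheme
            "http://www.example.com/"                 // different host
        };
        for (String url : samples) {
            System.out.println(accepts(url) + "  " + url);
        }
    }
}
```

As written, the group ([a-z0-9]*\.) has no trailing *, so only hosts with exactly one label in front of zhangzhou.gov.cn are matched. The dots inside zhangzhou.gov.cn are also unescaped, so each one matches any character.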
3. Configuration files that must be modified for Nutch table sharding:
gora-sql-mapping.xml
nutch-default.xml
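4. Nutch table sharding rule:
name of the table that stores the crawled data + "_" + id of the crawl address (for example, webpage_39)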
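The naming rule can be written as a one-line helper. This is only a sketch: the class ShardTableNameSketch and the method shardTableName are hypothetical names, and the id 39 is taken from the webpage_39 table that appears in the trigger later in these notes.

```java
public class ShardTableNameSketch {
    // Builds "<base table>_<crawl address id>", following the rule in step 4.
    static String shardTableName(String baseTable, long siteId) {
        return baseTable + "_" + siteId;
    }

    public static void main(String[] args) {
        System.out.println(shardTableName("webpage", 39)); // prints "webpage_39"
    }
}
```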
5. The crawl may die at runtime with the exception message: Output path is null in cleanup
Workaround: wrap the failing code in class org.apache.nutch.indexer.elastic.ElasticWriter.java in a try/catch, as follows:
private void processExecute(boolean createNewBulk) {
    try {
        if (execute != null) {
            // wait for previous to finish
            long beforeWait = System.currentTimeMillis();
            BulkResponse actionGet = execute.actionGet();
            if (actionGet.hasFailures()) {
                for (BulkItemResponse item : actionGet) {
                    if (item.failed()) {
                        throw new RuntimeException("First failure in bulk: "
                                + item.getFailureMessage());
                    }
                }
            }
            long msWaited = System.currentTimeMillis() - beforeWait;
            LOG.info("Previous took in ms " + actionGet.getTookInMillis()
                    + ", including wait " + msWaited);
            execute = null;
        }
        if (bulk != null) {
            if (bulkDocs > 0) {
                // start a flush, note that this is an asynchronous call
                execute = bulk.execute();
            }
            bulk = null;
        }
        if (createNewBulk) {
            // Prepare a new bulk request
            bulk = client.prepareBulk();
            bulkDocs = 0;
            bulkLength = 0;
        }
    } catch (Exception e) {
        // Workaround: swallow the failure so the crawl keeps running.
        // Note that a failed bulk request is only printed here, not retried.
        e.printStackTrace();
    }
}
6. Workaround for occasional garbled Chinese characters when saving: wrap the call in a try/catch
Class: com.suncco.leadsite.utils.NutchJob.java
@Override
public boolean waitForCompletion(boolean verbose) {
    boolean succeeded = true;
    try {
        succeeded = super.waitForCompletion(verbose);
        if (!succeeded) {
            // check if we want to fail whenever a job fails. (expert setting)
            if (getConfiguration().getBoolean("fail.on.job.failure", true)) {
                Log.warn("job failed: " + "name=" + getJobName()
                        + ", jobid=" + getJobID());
            }
        }
    } catch (Exception e) {
        // Workaround: swallow the exception so the caller can continue.
        // succeeded keeps its initial value (true) when super.waitForCompletion throws.
        e.printStackTrace();
    }
    return succeeded;
}
Writing the trigger:
A trigger must not run a delete against the table it is defined on, otherwise an error is reported. For example:
Trigger code:
BEGIN
    SET @isLocalCount = (select count(*) from webpage_39 where id = NEW.id);
    IF (@isLocalCount > 0)
    THEN
        delete from webpage_39 where id = NEW.id;
    END IF;
    SET @count = (select count(*) from suncco_spider.webpage where id = NEW.id);
    IF (@count = 0 && NEW.id like '%fj.fj%')
    THEN
        insert into suncco_spider.webpage
            (id, baseUrl, status, prevFetchTime, fetchTime, fetchInterval, retriesSinceFetch, reprUrl,
             content, typ, protocolStatus, modifiedTime, title, text, parseStatus,
             signature, prevSignature, score, headers, inlinks, outlinks, metadata, markers, isDelete)
        values
            (NEW.id, NEW.baseUrl, NEW.status, NEW.prevFetchTime, NEW.fetchTime, NEW.fetchInterval, NEW.retriesSinceFetch, NEW.reprUrl,
             NEW.content, NEW.typ, NEW.protocolStatus, NEW.modifiedTime, NEW.title, NEW.text, NEW.parseStatus,
             NEW.signature, NEW.prevSignature, NEW.score, NEW.headers, NEW.inlinks, NEW.outlinks, NEW.metadata, NEW.markers, NEW.isDelete);
    END IF;
END
Error reported:
java.io.IOException: java.sql.BatchUpdateException: Can't update table 'webpage_39' in stored function/trigger because it is already used by statement which invoked this stored function/trigger.
Fix: remove the statement delete from webpage_39 where id = NEW.id; from the trigger.