Nutch 2.1 source code analysis

Nutch 2.1: developing crawls that store each site in its own table

1. The Nutch source reads the storage.schema.webpage property from the configuration file nutch-default.xml (default value: webpage):

Class: org.apache.nutch.storage.StorageUtils.java

@SuppressWarnings("unchecked")
  public static <K, V extends Persistent> DataStore<K, V> createWebStore(Configuration conf,
      Class<K> keyClass, Class<V> persistentClass) throws ClassNotFoundException, GoraException {
    
    String schema = null;
    if (WebPage.class.equals(persistentClass)) {
      schema = conf.get("storage.schema.webpage", "webpage");
    } else if (Host.class.equals(persistentClass)) {
      schema = conf.get("storage.schema.host", "host");
    } else {
      throw new UnsupportedOperationException("Unable to create store for class " + persistentClass);
    }
    
    String crawlId = conf.get(Nutch.CRAWL_ID_KEY, "");
    
    if (!crawlId.isEmpty()) {
      conf.set("schema.prefix", crawlId + "_");
    } else {
      conf.set("schema.prefix", "");
    }

    Class<? extends DataStore<K, V>> dataStoreClass =
      (Class<? extends DataStore<K, V>>) getDataStoreClass(conf);
    return DataStoreFactory.createDataStore(dataStoreClass,
            keyClass, persistentClass, conf, schema);
  }
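
To make the two lookups in createWebStore easier to follow, here is a minimal sketch that reproduces them with a plain Map standing in for Hadoop's Configuration. The class name SchemaNameSketch and both method names are made up for illustration, and the key "storage.crawl.id" is assumed to be the value of Nutch.CRAWL_ID_KEY. How Gora later combines the prefix with the schema name is not shown here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the lookups in StorageUtils.createWebStore.
final class SchemaNameSketch {

    /** Same lookup as in createWebStore: the property value, or "webpage" if unset. */
    static String webpageSchema(Map<String, String> conf) {
        return conf.getOrDefault("storage.schema.webpage", "webpage");
    }

    /** Same prefix logic: crawlId + "_" when a crawl id is set, otherwise empty. */
    static String schemaPrefix(Map<String, String> conf) {
        // Assumption: Nutch.CRAWL_ID_KEY resolves to "storage.crawl.id".
        String crawlId = conf.getOrDefault("storage.crawl.id", "");
        return crawlId.isEmpty() ? "" : crawlId + "_";
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println("prefix=[" + schemaPrefix(conf) + "] schema=" + webpageSchema(conf));

        conf.put("storage.crawl.id", "39");
        System.out.println("prefix=[" + schemaPrefix(conf) + "] schema=" + webpageSchema(conf));
    }
}
```

With no crawl id the prefix is empty; with crawl id 39 it becomes 39_, while the schema name stays webpage unless storage.schema.webpage is overridden.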
           

 2. Restrict the crawl to the target site only:

Append to the end of regex-urlfilter.txt: +^http://([a-z0-9]*\.)zhangzhou.gov.cn/
Append to the end of automaton-urlfilter.txt: +^http://([a-z0-9]*\.)zhangzhou.gov.cn/
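
It can help to check which URLs the expression actually lets through before starting a crawl. The sketch below is only an approximation: it applies the expression with java.util.regex and find() semantics, which is an assumption about how the regex filter evaluates rules, and it says nothing about the automaton filter, which uses a different matching library. The class name UrlFilterSketch is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical checker for the accept rule added to regex-urlfilter.txt.
final class UrlFilterSketch {

    // The rule's expression without the leading '+', which only marks it as an accept rule.
    private static final Pattern RULE =
        Pattern.compile("^http://([a-z0-9]*\\.)zhangzhou.gov.cn/");

    /** Returns true when the expression is found at the start of the URL. */
    static boolean accepts(String url) {
        return RULE.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.zhangzhou.gov.cn/index.html",
            "http://zhangzhou.gov.cn/",
            "http://a.b.zhangzhou.gov.cn/",
            "http://www.example.com/"
        };
        for (String url : urls) {
            System.out.println(accepts(url) + "  " + url);
        }
    }
}
```

Because the group ([a-z0-9]*\.) is not followed by *, the expression requires exactly one label before zhangzhou.gov.cn, so the bare domain and deeper subdomains are not matched by this rule.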
           

 3. Configuration files that must be modified to split Nutch storage across tables:

gora-sql-mapping.xml
nutch-default.xml
           

 4. Nutch table naming rule

Name of the table that stores crawled data + "_" + id of the crawl address (for example, webpage_39)
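
As a quick illustration of the naming rule, the following sketch builds the per-site table name. ShardTableNameSketch and tableFor are hypothetical names, not part of Nutch or Gora, and the id is assumed to be a positive integer.

```java
// Hypothetical helper that applies the rule: base table name + "_" + crawl address id.
final class ShardTableNameSketch {

    static String tableFor(String baseTable, int crawlAddressId) {
        if (crawlAddressId <= 0) {
            // Assumption: crawl address ids are positive.
            throw new IllegalArgumentException("crawlAddressId must be positive");
        }
        return baseTable + "_" + crawlAddressId;
    }

    public static void main(String[] args) {
        System.out.println(tableFor("webpage", 39)); // prints webpage_39
    }
}
```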
           

 5. The crawl may crash while running, with the exception message: Output path is null in cleanup

Solution: wrap the failing code in class org.apache.nutch.indexer.elastic.ElasticWriter.java in a try/catch, as follows:
  private void processExecute(boolean createNewBulk) {
    try {
      if (execute != null) {
        // wait for previous to finish
        long beforeWait = System.currentTimeMillis();
        BulkResponse actionGet = execute.actionGet();
        if (actionGet.hasFailures()) {
          for (BulkItemResponse item : actionGet) {
            if (item.failed()) {
              throw new RuntimeException("First failure in bulk: "
                  + item.getFailureMessage());
            }
          }
        }
        long msWaited = System.currentTimeMillis() - beforeWait;
        LOG.info("Previous took in ms " + actionGet.getTookInMillis()
            + ", including wait " + msWaited);
        execute = null;
      }
      if (bulk != null) {
        if (bulkDocs > 0) {
          // start a flush, note that this is an asynchronous call
          execute = bulk.execute();
        }
        bulk = null;
      }
      if (createNewBulk) {
        // Prepare a new bulk request
        bulk = client.prepareBulk();
        bulkDocs = 0;
        bulkLength = 0;
      }
    } catch (Exception e) {
      // Workaround: print the stack trace and carry on instead of
      // letting the exception abort the crawl.
      e.printStackTrace();
    }
  }
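
The workaround above keeps the crawl alive, but the caller can no longer tell whether the flush went through. One possible variation, shown here as a self-contained sketch rather than as a change to ElasticWriter, is to run the risky step through a small guard that logs the failure and reports success or failure. GuardedStepSketch and runGuarded are hypothetical names, and java.util.logging stands in for whatever logger the real class uses.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical guard: run a step, log any runtime failure, and tell the caller what happened.
final class GuardedStepSketch {

    private static final Logger LOG = Logger.getLogger(GuardedStepSketch.class.getName());

    /** Returns true if the step completed, false if it threw and was skipped. */
    static boolean runGuarded(String name, Runnable step) {
        try {
            step.run();
            return true;
        } catch (RuntimeException e) {
            LOG.log(Level.WARNING, "Step failed and was skipped: " + name, e);
            return false;
        }
    }

    public static void main(String[] args) {
        boolean ok = runGuarded("flush", () -> { });
        boolean failed = runGuarded("flush", () -> {
            throw new RuntimeException("simulated bulk failure");
        });
        System.out.println(ok + " " + failed); // prints: true false
    }
}
```

Returning a flag lets the caller count or retry failed flushes instead of losing them silently.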
           

 6. Workaround for the occasional garbled Chinese characters when saving: wrap the call in a try/catch

Class name: com.suncco.leadsite.utils.NutchJob.java
  @Override
  public boolean waitForCompletion(boolean verbose) {
    boolean succeeded = true;
    try {
      succeeded = super.waitForCompletion(verbose);
      if (!succeeded) {
        // check if we want to fail whenever a job fails. (expert setting)
        if (getConfiguration().getBoolean("fail.on.job.failure", true)) {
          Log.warn("job failed: " + "name=" + getJobName()
              + ", jobid=" + getJobID());
        }
      }
    } catch (Exception e) {
      // Workaround: print the stack trace instead of propagating. Note that
      // succeeded keeps its initial value (true) if super.waitForCompletion throws.
      e.printStackTrace();
    }
    return succeeded;
  }
           

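 Writing the trigger:

 A trigger must not delete from the table it is defined on; doing so raises the error shown after the trigger code: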

Trigger code:
BEGIN
    SET @isLocalCount = (select count(*) from webpage_39 where id = NEW.id);
    IF (@isLocalCount > 0) THEN
        delete from webpage_39 where id = NEW.id;
    END IF;
    SET @count = (select count(*) from suncco_spider.webpage where id = NEW.id);
    IF (@count = 0 && NEW.id like '%fj.fj%') THEN
        insert into suncco_spider.webpage
            (id, baseUrl, status, prevFetchTime, fetchTime, fetchInterval,
             retriesSinceFetch, reprUrl, content, typ, protocolStatus,
             modifiedTime, title, text, parseStatus, signature, prevSignature,
             score, headers, inlinks, outlinks, metadata, markers, isDelete)
        values
            (NEW.id, NEW.baseUrl, NEW.status, NEW.prevFetchTime, NEW.fetchTime, NEW.fetchInterval,
             NEW.retriesSinceFetch, NEW.reprUrl, NEW.content, NEW.typ, NEW.protocolStatus,
             NEW.modifiedTime, NEW.title, NEW.text, NEW.parseStatus, NEW.signature, NEW.prevSignature,
             NEW.score, NEW.headers, NEW.inlinks, NEW.outlinks, NEW.metadata, NEW.markers, NEW.isDelete);
    END IF;
END

java.io.IOException: java.sql.BatchUpdateException: Can't update table 'webpage_39' in stored function/trigger because it is already used by statement which invoked this stored function/trigger.


The fix is to remove the statement delete from webpage_39 where id = NEW.id; from the trigger.