
Nutch 1.2 index step in detail

First, if the crawl/index or crawl/indexes directories already exist, they are deleted.

[img]http://dl.iteye.com/upload/attachment/0070/9519/a430b9dc-5f53-30cf-8a29-9fdcfd640db8.jpg[/img]

Map: IndexerMapReduce

The map input directories are the crawl_fetch, crawl_parse, parse_data and parse_text subdirectories of every segment, plus crawl/crawldb/current and crawl/linkdb/current.
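The way the Indexer assembles this input list can be sketched standalone; the subdirectory names are the real ones, while the segment and db locations passed in below are just illustrative examples:

```java
import java.util.ArrayList;
import java.util.List;

public class IndexInputs {
  // Builds the list of input paths the index job reads: four subdirectories
  // per segment, plus the current crawldb and linkdb.
  static List<String> inputPaths(List<String> segments, String crawlDb, String linkDb) {
    List<String> paths = new ArrayList<>();
    for (String seg : segments) {
      for (String sub : new String[] {"crawl_fetch", "crawl_parse", "parse_data", "parse_text"}) {
        paths.add(seg + "/" + sub);
      }
    }
    paths.add(crawlDb + "/current");
    paths.add(linkDb + "/current");
    return paths;
  }
}
```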

1. The map task does nothing but funnel all these inputs under the same key; the body is a single line:

output.collect(key, new NutchWritable(value));

Reduce: IndexerMapReduce

1. Loop over the values and unwrap the four objects (dbDatum, fetchDatum, parseText, parseData); a key that lacks any of them was not both fetched and parsed successfully:

if (fetchDatum == null || dbDatum == null
    || parseText == null || parseData == null) {
  return; // only have inlinks
}
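The unwrap-and-collect loop that precedes this check can be sketched with hypothetical stand-in types. The real job unwraps Nutch's CrawlDatum, ParseData and ParseText from NutchWritable and tells db datums and fetch datums apart by their status; the fromDb flag below is a made-up simplification of that:

```java
// Hypothetical stand-ins for Nutch's value types, for illustration only.
class CrawlDatum { final boolean fromDb; CrawlDatum(boolean fromDb) { this.fromDb = fromDb; } }
class ParseData {}
class ParseText {}

public class ReduceSketch {
  /** True only when all four pieces arrived for this URL, i.e. it was fetched and parsed. */
  static boolean hasAllParts(Iterable<Object> values) {
    CrawlDatum dbDatum = null, fetchDatum = null;
    ParseData parseData = null;
    ParseText parseText = null;
    for (Object v : values) { // values merged from all input dirs for one URL
      if (v instanceof CrawlDatum) {
        CrawlDatum d = (CrawlDatum) v;
        if (d.fromDb) dbDatum = d; else fetchDatum = d;
      } else if (v instanceof ParseData) {
        parseData = (ParseData) v;
      } else if (v instanceof ParseText) {
        parseText = (ParseText) v;
      }
    }
    return dbDatum != null && fetchDatum != null && parseData != null && parseText != null;
  }
}
```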

2. If either the fetch or the parse failed, the record is skipped; only on success does processing continue:

if (!parseData.getStatus().isSuccess() ||
    fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
  return;
}

3. Create a NutchDocument and add the segment name, the signature (digest), and the base fields.

4. Run the document through IndexingFilters, which calls the filter method of each configured plugin, here BasicIndexingFilter and AnchorIndexingFilter.

5. BasicIndexingFilter sets the host, site, url, content and title fields (a title longer than indexer.max.title.length is truncated), plus tstamp.
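The truncation in step 5 boils down to a substring; a minimal sketch, with maxLength standing in for the indexer.max.title.length setting:

```java
public class TitleTruncate {
  // Mirrors BasicIndexingFilter's handling of over-long titles: keep the
  // first maxLength characters, pass shorter titles through unchanged.
  static String truncate(String title, int maxLength) {
    if (title != null && title.length() > maxLength) {
      return title.substring(0, maxLength);
    }
    return title;
  }
}
```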

6. AnchorIndexingFilter sets the anchor field.

7. If the document is non-null, ScoringFilters is invoked to compute the indexing boost (the document's weight).
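Step 7 chains the configured scoring plugins: each one's indexerScore sees the boost computed by the previous filter. A hypothetical minimal model of that chain (the interface below is a stand-in, not Nutch's actual ScoringFilter signature):

```java
// Hypothetical stand-in for Nutch's ScoringFilter chain, illustration only.
interface ScoringFilterSketch {
  float indexerScore(float initScore);
}

public class ScoringChain {
  static float runChain(ScoringFilterSketch[] filters, float initScore) {
    float boost = initScore;
    for (ScoringFilterSketch f : filters) {
      boost = f.indexerScore(boost); // each filter adjusts the previous result
    }
    return boost;
  }
}
```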

8. Write the document out. The output format is set with job.setOutputFormat(IndexerOutputFormat.class);

IndexerOutputFormat's getRecordWriter looks like this:

@Override
public RecordWriter<Text, NutchDocument> getRecordWriter(FileSystem ignored,
    JobConf job, String name, Progressable progress) throws IOException {
  // populate JobConf with field indexing options
  IndexingFilters filters = new IndexingFilters(job);
  final NutchIndexWriter[] writers =
      NutchIndexWriterFactory.getNutchIndexWriters(job);
  for (final NutchIndexWriter writer : writers) {
    writer.open(job, name);
  }
  return new RecordWriter<Text, NutchDocument>() {
    public void close(Reporter reporter) throws IOException {
      for (final NutchIndexWriter writer : writers) {
        writer.close();
      }
    }

    public void write(Text key, NutchDocument doc) throws IOException {
      for (final NutchIndexWriter writer : writers) {
        writer.write(doc);
      }
    }
  };
}

The getNutchIndexWriters call above resolves the configured writer classes; here that yields a LuceneWriter. The factory code:

@SuppressWarnings("unchecked")
public static NutchIndexWriter[] getNutchIndexWriters(Configuration conf) {
  final String[] classes = conf.getStrings("indexer.writer.classes");
  final NutchIndexWriter[] writers = new NutchIndexWriter[classes.length];
  for (int i = 0; i < classes.length; i++) {
    final String clazz = classes[i];
    try {
      final Class<NutchIndexWriter> implClass =
          (Class<NutchIndexWriter>) Class.forName(clazz);
      writers[i] = implClass.newInstance();
    } catch (final Exception e) {
      throw new RuntimeException("Couldn't create " + clazz, e);
    }
  }
  return writers;
}

public static void addClassToConf(Configuration conf,
    Class<? extends NutchIndexWriter> clazz) {
  final String classes = conf.get("indexer.writer.classes");
  final String newClass = clazz.getName();
  if (classes == null) {
    conf.set("indexer.writer.classes", newClass);
  } else {
    conf.set("indexer.writer.classes", classes + "," + newClass);
  }
}

The Indexer registers the Lucene backend up front with:

NutchIndexWriterFactory.addClassToConf(job, LuceneWriter.class);
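The comma-append behavior of addClassToConf can be demonstrated standalone; the Map below stands in for Hadoop's Configuration, and com.example.OtherWriter is a made-up second writer:

```java
import java.util.Map;

public class WriterClassesDemo {
  // Same append logic as addClassToConf: start the list on first use,
  // then grow it comma-separated.
  static void addClass(Map<String, String> conf, String newClass) {
    String classes = conf.get("indexer.writer.classes");
    conf.put("indexer.writer.classes",
        classes == null ? newClass : classes + "," + newClass);
  }
}
```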

The index writers are then opened through this loop:

for (final NutchIndexWriter writer : writers) {
  writer.open(job, name);
}

LuceneWriter.open is implemented as follows:

public void open(JobConf job, String name)
    throws IOException {
  this.fs = FileSystem.get(job);
  perm = new Path(FileOutputFormat.getOutputPath(job), name);
  temp = job.getLocalPath("index/_" +
      Integer.toString(new Random().nextInt()));
  fs.delete(perm, true); // delete old, if any
  analyzerFactory = new AnalyzerFactory(job);
  writer = new IndexWriter(
      FSDirectory.open(new File(fs.startLocalOutput(perm, temp).toString())),
      new NutchDocumentAnalyzer(job), true, MaxFieldLength.UNLIMITED);
  writer.setMergeFactor(job.getInt("indexer.mergeFactor", 10));
  writer.setMaxBufferedDocs(job.getInt("indexer.minMergeDocs", 100));
  writer.setMaxMergeDocs(job
      .getInt("indexer.maxMergeDocs", Integer.MAX_VALUE));
  writer.setTermIndexInterval(job.getInt("indexer.termIndexInterval", 128));
  writer.setMaxFieldLength(job.getInt("indexer.max.tokens", 10000));
  writer.setInfoStream(LogUtil.getDebugStream(Indexer.LOG));
  writer.setUseCompoundFile(false);
  writer.setSimilarity(new NutchSimilarity());
  processOptions(job);
}
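Note the fs.startLocalOutput(perm, temp) call: the index is built in a local scratch directory and only later moved to the permanent output path (fs.completeLocalOutput, called from close, does the promotion). That write-locally-then-promote pattern, sketched with plain java.nio (file names here are illustrative):

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class LocalThenPromote {
  // Write into a local scratch dir first, then move the whole dir to its
  // final home in one rename, mirroring startLocalOutput / completeLocalOutput.
  static Path buildThenPromote(Path permDir) throws Exception {
    Path temp = Files.createTempDirectory("index_");     // local scratch, like job.getLocalPath(...)
    Files.writeString(temp.resolve("stub.idx"), "data"); // the "index" gets written here
    Files.move(temp, permDir);                           // promote to the permanent location
    return permDir;
  }
}
```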

And the write method:

public void write(NutchDocument doc) throws IOException {
  final Document luceneDoc = createLuceneDoc(doc);
  final NutchAnalyzer analyzer = analyzerFactory.get(luceneDoc.get("lang"));
  if (Indexer.LOG.isDebugEnabled()) {
    Indexer.LOG.debug("Indexing [" + luceneDoc.get("url")
        + "] with analyzer " + analyzer + " (" + luceneDoc.get("lang")
        + ")");
  }
  writer.addDocument(luceneDoc, analyzer);
}

With the flow above, the index has been fully written.
