Nutch 1.x Study Notes

2023-08-02 21:53:43

Running Nutch step by step

1. <Create seed URLs>

mkdir -p urls

touch urls/seed.txt

echo "http://www.qq.com/" >> urls/seed.txt # one seed URL per line
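Re-running the echo above appends the same URL again. A minimal sketch of a helper that adds a seed only when it is not already listed; add_seed and SEED_FILE are hypothetical names used for illustration, not part of Nutch:

```shell
# add_seed and SEED_FILE are illustrative names, not Nutch commands or settings.
SEED_FILE="${SEED_FILE:-urls/seed.txt}"

add_seed() {
  # Make sure the seed directory and file exist before searching them.
  mkdir -p "$(dirname "$SEED_FILE")"
  touch "$SEED_FILE"
  # -x matches whole lines only, -F treats the URL as a literal string.
  if ! grep -qxF -e "$1" "$SEED_FILE"; then
    printf '%s\n' "$1" >> "$SEED_FILE"
  fi
}
```

Calling add_seed "http://www.qq.com/" twice leaves a single line in urls/seed.txt.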

2. <inject>

bin/nutch inject crawl/crawldb urls

3. <generate>

bin/nutch generate crawl/crawldb crawl/segments

4. <fetch>

s1=$(ls -d crawl/segments/2* | tail -1)

echo "$s1"

bin/nutch fetch "$s1"

5. <parse>

bin/nutch parse "$s1"

6. <updatedb>

bin/nutch updatedb crawl/crawldb "$s1"

7. Repeat steps 3-6 several times
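The repetition of steps 3-6 can be sketched as a small shell function. This is only an illustration: run_rounds, NUTCH and CRAWL_DIR are made-up names, and the function simply stops at the first command that exits with a non-zero status.

```shell
# run_rounds, NUTCH and CRAWL_DIR are illustrative names, not part of Nutch.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"

run_rounds() {
  rr_total="$1"
  rr_i=1
  while [ "$rr_i" -le "$rr_total" ]; do
    # Step 3: generate a new segment from the crawldb.
    "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" || return 1
    # Segment names are timestamps, so the last entry in sorted order is the newest.
    rr_segment=$(ls -d "$CRAWL_DIR"/segments/2* | tail -1)
    # Steps 4-6: fetch, parse, then fold the results back into the crawldb.
    "$NUTCH" fetch "$rr_segment" || return 1
    "$NUTCH" parse "$rr_segment" || return 1
    "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$rr_segment" || return 1
    rr_i=$((rr_i + 1))
  done
}
```

After the inject step, run_rounds 3 would perform three generate/fetch/parse/updatedb rounds against crawl/crawldb.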

8. <invertlinks>

bin/nutch invertlinks crawl/linkdb -dir crawl/segments

9. <Indexing into Apache Solr>

bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize

10. <Deleting Duplicates>

bin/nutch solrdedup http://localhost:8983/solr

11. <Cleaning Solr>

bin/nutch solrclean crawl/crawldb/ http://localhost:8983/solr
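Steps 8-11 can be chained in one function so the segment timestamp does not have to be typed by hand. A rough sketch, assuming the newest segment is the one to index; index_crawl, NUTCH, CRAWL_DIR and SOLR_URL are illustrative names, and the Nutch commands are the same ones used in the steps above:

```shell
# index_crawl, NUTCH, CRAWL_DIR and SOLR_URL are illustrative names, not part of Nutch.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

index_crawl() {
  # Step 8: build the link database from all segments.
  "$NUTCH" invertlinks "$CRAWL_DIR/linkdb" -dir "$CRAWL_DIR/segments" || return 1
  # Assumption: only the newest segment is indexed (timestamps sort chronologically).
  ic_segment=$(ls -d "$CRAWL_DIR"/segments/2* | tail -1)
  # Step 9: send the segment to Solr.
  "$NUTCH" solrindex "$SOLR_URL" "$CRAWL_DIR/crawldb/" -linkdb "$CRAWL_DIR/linkdb/" "$ic_segment/" -filter -normalize || return 1
  # Steps 10 and 11: remove duplicates, then clean Solr against the crawldb.
  "$NUTCH" solrdedup "$SOLR_URL" || return 1
  "$NUTCH" solrclean "$CRAWL_DIR/crawldb/" "$SOLR_URL" || return 1
}
```

Running index_crawl from the Nutch home directory, once the crawl rounds are finished, would issue the four commands in order and stop at the first failure.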

Running Nutch with the crawl script

Usage: crawl [-i|--index] [-D "key=value"] <Seed Dir> <Crawl Dir> <Num Rounds>

    -i|--index    Indexes crawl results into a configured indexer
    -D            A Java property to pass to Nutch calls
    Seed Dir      Directory in which to look for a seeds file
    Crawl Dir     Directory where the crawl/link/segments dirs are saved
    Num Rounds    The number of rounds to run this crawl for

Example: bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/ 2

Reference: http://wiki.apache.org/nutch/NutchTutorial
