WebMagic

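Introduction to WebMagic

WebMagic Basic Architecture

WebMagic is built from four components, Downloader, PageProcessor, Scheduler, and Pipeline, which are wired together by Spider. The four components correspond to the download, processing, management, and persistence stages of a crawler's life cycle. Spider lets them interact and run as one flow; it can be thought of as a large container and is the core of WebMagic's logic. The architecture is shown in the figure below:

[Figure: WebMagic architecture diagram]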

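The Four Components of WebMagic

Downloader: downloads pages from the Internet for later processing. By default WebMagic uses Apache HttpClient as the download tool.

PageProcessor: parses pages, extracts useful information, and discovers new links. WebMagic uses Jsoup as its HTML parser and builds Xsoup, an XPath tool, on top of it.

Scheduler: manages the URLs waiting to be crawled and handles deduplication. By default WebMagic manages URLs with an in-memory JDK queue and deduplicates them with a set. Redis is also supported for distributed management.

Pipeline: handles the extracted results, including computation and persistence to files, databases, and so on. By default WebMagic provides two result handlers: "print to console" and "save to file".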
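
The Scheduler description above (a queue for pending URLs plus a set for deduplication) can be pictured with a few lines of plain JDK code. This is only a minimal sketch of the idea, not WebMagic's own implementation: the class name DedupQueueSketch and its methods are hypothetical, the example.com URLs are placeholders, and the sketch is not thread-safe.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Hypothetical stand-in for a scheduler: a FIFO queue plus a "seen" set. */
final class DedupQueueSketch {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    /** Enqueues the URL only if it has never been offered before. */
    boolean push(String url) {
        if (seen.add(url)) {      // add() returns false for a URL already seen
            pending.add(url);
            return true;
        }
        return false;
    }

    /** Returns the next URL to crawl, or null when nothing is pending. */
    String poll() {
        return pending.poll();
    }

    public static void main(String[] args) {
        DedupQueueSketch scheduler = new DedupQueueSketch();
        scheduler.push("https://example.com/a");
        scheduler.push("https://example.com/b");
        scheduler.push("https://example.com/a"); // duplicate, silently dropped

        String url;
        while ((url = scheduler.poll()) != null) {
            System.out.println(url);
        }
    }
}
```

Keeping the "seen" set separate from the queue means a URL stays deduplicated even after it has been polled and crawled, which is what stops a crawler from looping over pages that link to each other.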

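Objects Used for Data Flow

Request: a wrapper around a URL; one Request corresponds to one URL. It is the carrier of the interaction between PageProcessor and Downloader, and the only way for PageProcessor to control Downloader.

Page: represents a page downloaded by Downloader. It may be HTML, JSON, or some other text format. Page is the core object of WebMagic's extraction process and provides methods for extraction, saving results, and so on.

ResultItems: behaves like a Map. It stores the results produced by PageProcessor for Pipeline to use. (When its skip field is set to true, the result should not be processed by Pipeline.)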
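
To make the "Map with a skip flag" idea concrete, here is a small sketch in plain Java. It is not WebMagic's ResultItems class: ResultSketch, its methods, and the consume() helper that plays the role of a pipeline are all hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical map-like result container with a skip flag. */
final class ResultSketch {
    private final Map<String, Object> fields = new LinkedHashMap<>();
    private boolean skip;

    ResultSketch put(String key, Object value) {
        fields.put(key, value);
        return this;
    }

    Object get(String key) {
        return fields.get(key);
    }

    ResultSketch setSkip(boolean skip) {
        this.skip = skip;
        return this;
    }

    boolean isSkip() {
        return skip;
    }

    /** Plays the role of a pipeline: returns the text to output, or null if the result is skipped. */
    static String consume(ResultSketch result) {
        if (result.isSkip()) {
            return null;              // skipped results are never handed to the output step
        }
        return result.fields.toString();
    }

    public static void main(String[] args) {
        ResultSketch detailPage = new ResultSketch().put("title", "Hello");
        ResultSketch listPage = new ResultSketch().setSkip(true); // e.g. a page with nothing worth saving

        System.out.println(consume(detailPage)); // prints {title=Hello}
        System.out.println(consume(listPage));   // prints null
    }
}
```

A typical use of the flag is a list page that only contributes new links: the processor marks its result as skipped so that no empty record is persisted.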

WebMagic Features

Implementing PageProcessor

Extracting Elements with Selectable

WebMagic mainly uses three extraction techniques: XPath, regular expressions, and CSS selectors. For JSON content, JsonPath can be used for parsing.

XPath:

page.getHtml().xpath("//div[@class='mt']/h1/text()")

CSS selector:

page.getHtml().css("div.p_in li.bk")

Regular expression:

html.css("div.t1 span").regex(".*釋出")
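
For readers less familiar with regex-based extraction, the following sketch shows the underlying idea using only java.util.regex from the JDK. It is not how WebMagic implements regex(); the class RegexExtractSketch, its two methods, and the sample HTML string are assumptions made for illustration. The pattern must contain at least one capture group, because the sketch returns group 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper showing capture-group extraction with the JDK regex engine. */
final class RegexExtractSketch {

    /** Returns capture group 1 of every match; the regex must define at least one group. */
    static List<String> extractAll(String text, String regex) {
        List<String> results = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            results.add(matcher.group(1));
        }
        return results;
    }

    /** Returns capture group 1 of the first match, or null when nothing matches. */
    static String extractFirst(String text, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up fragment standing in for the text of a few <span> elements.
        String html = "<span>2020-05-01 posted</span><span>2020-05-02 posted</span>";
        String regex = "<span>(.*?) posted</span>";

        System.out.println(extractFirst(html, regex)); // prints 2020-05-01
        System.out.println(extractAll(html, regex));   // prints [2020-05-01, 2020-05-02]
    }
}
```

The lazy quantifier `.*?` matters here: a greedy `.*` would run to the last " posted" in the input and return one oversized match instead of two.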

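Extraction API

Method	Description	Example
xpath(String xpath)	Select with XPath	html.xpath("//div[@class='title']")
$(String selector)	Select with a CSS selector	html.$("div.title")
$(String selector,String attr)	Select with a CSS selector	html.$("div.title","text")
css(String selector)	Same as $(); select with a CSS selector	html.css("div.title")
links()	Select all links	html.links()
regex(String regex)	Extract with a regular expression	html.regex("(.*?)")

Result API

When a chain of calls ends, we usually want a result of type String. That is what the result API is for.

Method	Description	Example
get()	Returns one result as a String	String link = html.links().get()
toString()	Returns one result as a String	String link = html.links().toString()
all()	Returns all extracted results	List<String> links = html.links().all()

☆ When there are multiple results, get() and toString() both return only the first one; all() returns every result.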

Getting Links

// Get the next URL to crawl (the link inside the second li.bk node)
String bkUrl = page.getHtml().css("div.p_in li.bk").nodes().get(1).links().toString();
// Add the URL to the task queue
page.addTargetRequest(bkUrl);

Saving Results with Pipeline

The WebMagic component that saves results is called Pipeline.

Saving to a File

public static void main(String[] args) {
    Spider.create(new JobProcessor())
            // Initial URL to visit
            .addUrl(url)
            .addPipeline(new FilePipeline("D:/webmagic/"))
            .thread(5) // Number of threads
            .run();
}

Saving to a Database

@Autowired
private SpringDataPipeline springDataPipeline;

//  initialDelay: how long to wait (in milliseconds) after startup before the first run
//  fixedDelay: how long to wait (in milliseconds) after one run finishes before the next starts
@Scheduled(initialDelay = 1000,fixedDelay = 10000)
public void process(){
    Spider.create(new JobProcess())
            .addUrl(url)
            .thread(10)
            .addPipeline(springDataPipeline)
            .run();
}

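The SpringDataPipeline Class

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;
// JobInfo and JobInfoService are the project's own classes; import them from your own packages.

@Component
public class SpringDataPipeline implements Pipeline {

    @Autowired
    private JobInfoService jobInfoService;

    @Override
    public void process(ResultItems resultItems, Task task) {
        // Get the job-detail object that the PageProcessor stored under the key "jobInfo"
        JobInfo jobInfo = resultItems.get("jobInfo");

        // Only pages that produced a job detail carry this object
        if (jobInfo != null) {
            // Save the data to the database
            this.jobInfoService.save(jobInfo);
        }
    }
}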

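Configuring, Starting, and Stopping the Crawler

Spider

Spider is the entry point for starting a crawler. Before starting the crawler, create a Spider object from a PageProcessor, then call run() to start it.

public void process(){
    Spider.create(new JobProcess())
            ...
            .run();
}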

Method	Description	Example
create(PageProcessor)	Creates a Spider	Spider.create(new GithubRepoProcessor())
addUrl(String…)	Adds initial URLs	spider.addUrl("https://www.baidu.cn/")
thread(n)	Uses n threads	spider.thread(5)
run()	Starts the crawler and blocks the current thread	spider.run()
start()/runAsync()	Starts asynchronously; the current thread continues	spider.start()
stop()	Stops the crawler	spider.stop()
addPipeline(Pipeline)	Adds a Pipeline; a Spider can have several Pipelines	spider.addPipeline(new ConsolePipeline())
setScheduler(Scheduler)	Sets the Scheduler; a Spider can have only one Scheduler	spider.setScheduler(new RedisScheduler())
setDownloader(Downloader)	Sets the Downloader; a Spider can have only one Downloader	spider.setDownloader(new SeleniumDownloader())
get(String)	Synchronous call that returns the result directly	ResultItems result = spider.get("http://webmagic.io/docs/")
getAll(String…)	Synchronous call that returns a list of results directly	List<ResultItems> results = spider.getAll("http://webmagic.io/docs/", "http://webmagic.io/xxx")
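
The difference between run() (blocking) and start()/runAsync() (asynchronous) in the table above can be sketched with a plain Runnable. This is an illustration of the two calling styles only, not WebMagic's Spider: the class RunModesSketch, its latch, and the awaitDone() helper are hypothetical.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical worker showing a blocking run() next to an asynchronous start(). */
final class RunModesSketch implements Runnable {
    private final CountDownLatch done = new CountDownLatch(1);

    /** Blocking mode: the calling thread does the work and returns only when it is finished. */
    @Override
    public void run() {
        // A real crawler would loop over its URL queue here.
        done.countDown();
    }

    /** Asynchronous mode: the work runs on a new thread and the caller continues immediately. */
    Thread start() {
        Thread worker = new Thread(this, "spider-sketch");
        worker.start();
        return worker;
    }

    boolean isDone() {
        return done.getCount() == 0;
    }

    /** Blocks until the work is done; returns false if the wait was interrupted. */
    boolean awaitDone() {
        try {
            done.await();
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        RunModesSketch blocking = new RunModesSketch();
        blocking.run();                       // returns only after the work is finished
        System.out.println("after run(): done = " + blocking.isDone());

        RunModesSketch async = new RunModesSketch();
        async.start();                        // returns at once; the work happens elsewhere
        async.awaitDone();                    // the caller decides if and when to wait
        System.out.println("after start() and awaitDone(): done = " + async.isDone());
    }
}
```

The blocking style suits a scheduled job that should not overlap with its next run; the asynchronous style suits a caller that wants to keep a handle on the crawler and stop it later.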

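Crawler Configuration: Site

Site.me() configures the crawler, including the encoding, crawl interval, timeout, retry count, and so on.

private Site site = Site.me()
        .setCharset("gbk") // Encoding
        .setTimeOut(10 * 1000) // Timeout, in milliseconds
        .setRetrySleepTime(3000) // Interval between retries, in milliseconds
        .setRetryTimes(3); // Number of retries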

Method	Description	Example
setCharset(String)	Sets the encoding	site.setCharset("utf-8")
setUserAgent(String)	Sets the User-Agent	site.setUserAgent("Spider")
setTimeOut(int)	Sets the timeout, in milliseconds	site.setTimeOut(3000)
setRetryTimes(int)	Sets the number of retries	site.setRetryTimes(3)
setCycleRetryTimes(int)	Sets the number of cycle retries	site.setCycleRetryTimes(3)
addCookie(String,String)	Adds a cookie	site.addCookie("dotcomt_user","code4craft")
setDomain(String)	Sets the domain; addCookie takes effect only after the domain is set	site.setDomain("github.com")
addHeader(String,String)	Adds a header	site.addHeader("Referer","https://github.com")
setHttpProxy(HttpHost)	Sets an HTTP proxy	site.setHttpProxy(new HttpHost("127.0.0.1",8080))
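
As a rough mental model for how a retry count and a retry sleep time work together, the sketch below retries a failing task with a pause between attempts. It is not WebMagic's downloader code: the class RetrySketch and the method withRetries are hypothetical, and the sketch assumes that "retry times" means extra attempts after the first one.

```java
import java.util.function.Supplier;

/** Hypothetical retry helper: one initial attempt plus up to retryTimes further attempts. */
final class RetrySketch {

    static <T> T withRetries(Supplier<T> task, int retryTimes, long retrySleepMillis) {
        if (retryTimes < 0) {
            throw new IllegalArgumentException("retryTimes must be >= 0");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt <= retryTimes; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e;                     // remember the failure and maybe try again
            }
            if (attempt < retryTimes) {
                try {
                    Thread.sleep(retrySleepMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;                    // stop retrying when interrupted
                }
            }
        }
        throw last;                           // every attempt failed
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated download that fails twice before it succeeds.
        String body = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated timeout");
            }
            return "<html>ok</html>";
        }, 3, 100L);
        System.out.println(body + " after " + calls[0] + " attempts");
    }
}
```

A short sleep between attempts gives a briefly overloaded server time to recover and keeps the crawler from hammering it with immediate repeats.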
