Java Crawler Study Notes: WebMagic

See the official documentation for details.

Add the Maven dependencies

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.3</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<!-- Guava, needed for the Bloom filter -->
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>16.0.1</version>
</dependency>

Implement PageProcessor

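import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.scheduler.BloomFilterDuplicateRemover;
import us.codecraft.webmagic.scheduler.QueueScheduler;

@Component
public class MyProcessor implements PageProcessor {

    // userAgent and url (the start URL) are assumed to be defined elsewhere,
    // for example as fields or configuration values; they are not shown here.

    @Override
    public void process(Page page) { // process is the core hook for custom crawl logic; write the extraction logic here
        // Extract information from the page ("xxx" stands for a CSS selector)
        String info = page.getHtml().css("xxx").toString();
        String url = page.getHtml().css("xxx").toString();
        // Store the result
        page.putField("info", info);
        // Queue a follow-up URL discovered on the page
        page.addTargetRequest(url);
    }

    private Site site = Site.me()
            .setUserAgent(userAgent)   // user agent
            .setTimeOut(10000)         // timeout in milliseconds
            .setRetrySleepTime(3000)   // interval between retries in milliseconds
            .setRetryTimes(3)          // number of retries
            .setCharset("gbk");        // page encoding

    @Override
    public Site getSite() { // site-level settings: encoding, crawl interval, retry count, and so on
        return site;
    }

    @Autowired
    SpringDataPipeline pipeline; // inject the class that persists the results

    // Scheduled task: initialDelay is how long to wait after startup before the first run,
    // fixedDelay is the pause between the end of one run and the start of the next
    @Scheduled(initialDelay = 1000, fixedDelay = 10000)
    public void process() {
//        // Create a downloader and configure a proxy server
//        HttpClientDownloader downloader = new HttpClientDownloader();
//        downloader.setProxyProvider(SimpleProxyProvider.from(new Proxy("60.167.70.70", 41380)));
        Spider.create(this) // pass in this processor
                .addUrl(url) // initial URL to request
                .setScheduler(new QueueScheduler().setDuplicateRemover(new BloomFilterDuplicateRemover(1000000))) // Bloom filter de-duplication
                .thread(10) // number of threads
                .addPipeline(this.pipeline) // class that persists the results
//              .setDownloader(downloader) // set the downloader
                .run();
    }
}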
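
The BloomFilterDuplicateRemover used above avoids keeping every visited URL in memory. The following is only a minimal sketch of the idea behind a Bloom filter, written against the standard library; the class name UrlBloomFilter, the hash function, and the sizes are assumptions made for illustration and are not WebMagic's or Guava's implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Minimal Bloom filter for URL de-duplication (illustrative, hypothetical class). */
class UrlBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    UrlBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** Returns true if the URL was possibly seen before; records it either way. */
    boolean isDuplicate(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        int h1 = hash(data, 0x9747b28c);
        int h2 = hash(data, 0x85ebca6b);
        boolean seen = true;
        for (int i = 0; i < hashCount; i++) {
            // Derive the i-th bit position from two base hashes (double hashing).
            int index = Math.floorMod(h1 + i * h2, size);
            if (!bits.get(index)) {
                seen = false;      // at least one bit was unset, so the URL is new
                bits.set(index);
            }
        }
        return seen;
    }

    // Seeded FNV-1a style hash; any reasonably well-mixed hash would do here.
    private static int hash(byte[] data, int seed) {
        int h = 0x811c9dc5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    public static void main(String[] args) {
        UrlBloomFilter filter = new UrlBloomFilter(1 << 20, 3);
        System.out.println(filter.isDuplicate("http://webmagic.io/docs/")); // false
        System.out.println(filter.isDuplicate("http://webmagic.io/docs/")); // true
    }
}
```

The trade-off is that a Bloom filter can occasionally report a URL as seen when it was not (a false positive), so a few pages may be skipped, but it never forgets a URL it has recorded.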

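Implement Pipeline

import org.springframework.stereotype.Component;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

@Component
public class SpringDataPipeline implements Pipeline {

    @Override
    public void process(ResultItems resultItems, Task task) {
        // ResultItems holds the extraction results in a Map-like structure.
        // Data stored with page.putField(key, value) can be read back with resultItems.get(key).
    }
}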

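Extracting elements with Selectable

The chained Selectable API for element extraction is a core WebMagic feature. With the Selectable interface you can extract page elements in a chain without worrying about the extraction details.
page.getHtml() returns an Html object, which implements the Selectable interface. The interface has some important methods, which I group into two categories: extraction and result retrieval.
1. Extraction API

Method	Description	Example
css(String selector)	Selects with a CSS selector	html.css("div.title")
links()	Selects all links	html.links()
replace(String regex, String replacement)	Replaces content	html.replace("","")

2. Result-retrieval API

Method	Description	Example
get()	Returns a single result as a String	String link = html.links().get()
toString()	Same as get(); returns a single result as a String	String link = html.links().toString()
all()	Returns all extraction results	List<String> links = html.links().all()
match()	Whether there is any matching result	if (html.links().match()) { xxx; }
nodes()	Returns the matched nodes as a list of Selectable	selectable.nodes()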

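For example, if we know the page yields only one result, we can use selectable.get() or selectable.toString() to get it.

selectable.toString() reuses the toString() method so that output and integration with some frameworks are more convenient, since in most cases we only need to select a single element.

selectable.all() returns all matched elements.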
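
To make the difference between get(), all(), and match() concrete, here is a small stand-alone sketch. LinkSelection is a hypothetical toy class that pulls href values out of an HTML string with a regular expression; it only mimics the result-retrieval contract described above and is not how WebMagic's Selectable is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy stand-in for a Selectable result (hypothetical class, not part of WebMagic). */
class LinkSelection {
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");
    private final List<String> results = new ArrayList<>();

    LinkSelection(String html) {
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            results.add(m.group(1));
        }
    }

    /** First result, or null when nothing matched (this toy's convention). */
    String get() {
        return results.isEmpty() ? null : results.get(0);
    }

    /** Every result, in document order. */
    List<String> all() {
        return new ArrayList<>(results);
    }

    /** Whether anything matched at all. */
    boolean match() {
        return !results.isEmpty();
    }

    public static void main(String[] args) {
        String html = "<a href=\"/docs/\">Docs</a> <a href=\"/blog/\">Blog</a>";
        LinkSelection links = new LinkSelection(html);
        System.out.println(links.get());   // /docs/
        System.out.println(links.all());   // [/docs/, /blog/]
        System.out.println(links.match()); // true
    }
}
```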

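Saving results with a Pipeline

The component WebMagic uses to save results is called a Pipeline.

Printing results to the console is done by a built-in Pipeline called ConsolePipeline.

To save the results as JSON files instead, simply swap the Pipeline implementation for JsonFilePipeline. When the JsonFilePipeline is given D:\webmagic\ as its path, the downloaded files are saved in the webmagic directory on drive D.

By customizing the Pipeline we can also save results to files, databases, and so on.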

Configuring, starting, and stopping the crawler

Spider

Spider is the entry point for starting a crawler. Before starting it, we create a Spider object from a PageProcessor and then launch it with run(). Spider's other components (Downloader, Scheduler, Pipeline) can all be set through setter methods.

Method	Description	Example
create(PageProcessor)	Creates a Spider	Spider.create(new MyProcessor())
addUrl(String…)	Adds initial URLs	spider.addUrl("http://webmagic.io/docs/")
addRequest(Request…)	Adds initial Requests	spider.addRequest(new Request("http://webmagic.io/docs/"))
thread(n)	Uses n threads	spider.thread(5)
run()	Starts the crawler and blocks the current thread	spider.run()
start()/runAsync()	Starts asynchronously; the current thread continues	spider.start()
stop()	Stops the crawler	spider.stop()
test(String)	Fetches a single page for testing	spider.test("http://webmagic.io/docs/")
addPipeline(Pipeline)	Adds a Pipeline; a Spider can have several Pipelines	spider.addPipeline(new ConsolePipeline())
setScheduler(Scheduler)	Sets the Scheduler; a Spider can have only one Scheduler	spider.setScheduler(new RedisScheduler())
setDownloader(Downloader)	Sets the Downloader; a Spider can have only one Downloader	spider.setDownloader(new SeleniumDownloader())
get(String)	Synchronous call that returns the result directly	ResultItems result = spider.get("http://webmagic.io/docs/")
getAll(String…)	Synchronous call that returns a batch of results directly	List<ResultItems> results = spider.getAll("http://webmagic.io/docs/", "http://webmagic.io/xxx")
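
How the components fit together can be sketched as a simple loop: the scheduler hands out URLs, the downloader fetches a page, the processor finds follow-up links, and the pipeline receives the result. ToySpider below is a hypothetical, single-threaded illustration of that flow using only the standard library; the fake site and all names are made up, and WebMagic's real Spider is considerably more involved.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

/** Single-threaded toy crawl loop (hypothetical class, for illustration only). */
class ToySpider {
    private final Queue<String> scheduler = new ArrayDeque<>(); // pending URLs
    private final Set<String> seen = new HashSet<>();           // duplicate removal
    private final Function<String, List<String>> downloader;    // URL -> links found on that page
    private final List<String> pipeline = new ArrayList<>();    // collected results

    ToySpider(Function<String, List<String>> downloader) {
        this.downloader = downloader;
    }

    /** Queues a URL unless it has been queued before. */
    ToySpider addUrl(String url) {
        if (seen.add(url)) {
            scheduler.add(url);
        }
        return this;
    }

    /** Blocks until the queue is empty, then returns everything the pipeline collected. */
    List<String> run() {
        String url;
        while ((url = scheduler.poll()) != null) {
            List<String> links = downloader.apply(url); // download and process the page
            pipeline.add(url);                           // hand the result to the pipeline
            for (String link : links) {
                addUrl(link);                            // schedule follow-up requests
            }
        }
        return pipeline;
    }

    public static void main(String[] args) {
        // Fake site: page -> outgoing links.
        Map<String, List<String>> site = new HashMap<>();
        site.put("/", Arrays.asList("/a", "/b"));
        site.put("/a", Arrays.asList("/b", "/"));
        site.put("/b", Collections.<String>emptyList());

        List<String> visited = new ToySpider(u -> site.getOrDefault(u, Collections.<String>emptyList()))
                .addUrl("/")
                .run();
        System.out.println(visited); // [/, /a, /b]
    }
}
```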

Site

Settings for the site itself, such as encoding, HTTP headers, timeout, retry policy, and proxy, can all be configured through the Site object.

Method	Description	Example
setCharset(String)	Sets the encoding	site.setCharset("utf-8")
setUserAgent(String)	Sets the User-Agent	site.setUserAgent("Spider")
setTimeOut(int)	Sets the timeout, in milliseconds	site.setTimeOut(3000)
setRetryTimes(int)	Sets the number of retries	site.setRetryTimes(3)
setCycleRetryTimes(int)	Sets the number of cycle retries	site.setCycleRetryTimes(3)
addCookie(String,String)	Adds a cookie	site.addCookie("dotcomt_user","code4craft")
setDomain(String)	Sets the domain; addCookie only takes effect once the domain is set	site.setDomain("github.com")
addHeader(String,String)	Adds a header	site.addHeader("Referer","https://github.com")
setHttpProxy(HttpHost)	Sets an HTTP proxy (older API; from 0.7.1 on, proxies are set on the downloader, see "Configuring a proxy")	site.setHttpProxy(new HttpHost("127.0.0.1",8080))
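
What a retry count combined with a sleep interval means in practice can be illustrated with a generic helper. This is only a sketch of the general pattern, assuming a task that signals failure by throwing a RuntimeException; the Retry class and its parameter names are hypothetical, and WebMagic's own retry handling is implemented differently.

```java
import java.util.function.Supplier;

/** Generic retry helper (hypothetical; illustrates retry count and sleep interval only). */
class Retry {
    /**
     * Calls the task once, then up to retryTimes more times on failure,
     * sleeping retrySleepMillis between attempts. Rethrows the last failure.
     */
    static <T> T call(Supplier<T> task, int retryTimes, long retrySleepMillis) {
        if (retryTimes < 0) {
            throw new IllegalArgumentException("retryTimes must be >= 0");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt <= retryTimes; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again if attempts remain
            }
            if (attempt < retryTimes) {
                sleepQuietly(retrySleepMillis);
            }
        }
        throw last;
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String page = Retry.call(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated timeout");
            }
            return "<html>ok</html>";
        }, 3, 10);
        System.out.println(page + " after " + calls[0] + " attempts"); // <html>ok</html> after 3 attempts
    }
}
```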

Jsoup

Jsoup is an HTML parsing tool for Java. It supports selecting DOM elements with CSS selectors and can also sanitize HTML text to prevent XSS attacks.

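Configuring a proxy

Starting with version 0.7.1, WebMagic uses a new proxy API, ProxyProvider. Because ProxyProvider is positioned more as a "component" than as Site "configuration", proxies are no longer set on Site but on HttpClientDownloader.

API	Description
HttpClientDownloader.setProxyProvider(ProxyProvider proxyProvider)	Sets the proxy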

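ProxyProvider has one default implementation, SimpleProxyProvider. It is a simple round-robin ProxyProvider with no failure checking. You can configure any number of candidate proxies, and each request picks the next one in order. It suits setups where you run your own, fairly stable proxies.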
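
Round-robin selection without failure checking can be sketched in a few lines. RoundRobinProxies below is a hypothetical stand-alone class that represents proxies as plain "host:port" strings; it illustrates the selection strategy only and is not SimpleProxyProvider's source code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Round-robin picker with no failure checking (hypothetical sketch). */
class RoundRobinProxies {
    private final List<String> proxies;
    private final AtomicInteger pointer = new AtomicInteger(-1);

    RoundRobinProxies(String... proxies) {
        if (proxies.length == 0) {
            throw new IllegalArgumentException("at least one proxy is required");
        }
        this.proxies = Arrays.asList(proxies.clone());
    }

    /** Returns the next proxy in order, wrapping around at the end of the list. */
    String next() {
        // floorMod keeps the index valid even after the counter overflows.
        int index = Math.floorMod(pointer.incrementAndGet(), proxies.size());
        return proxies.get(index);
    }

    public static void main(String[] args) {
        RoundRobinProxies pool = new RoundRobinProxies("101.101.101.101:8888", "102.102.102.102:8888");
        System.out.println(pool.next()); // 101.101.101.101:8888
        System.out.println(pool.next()); // 102.102.102.102:8888
        System.out.println(pool.next()); // 101.101.101.101:8888
    }
}
```

Because nothing checks whether a proxy still works, a dead proxy keeps receiving its share of requests, which is why this strategy fits stable, self-hosted proxies.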

Proxy examples:

Use a single plain HTTP proxy at 101.101.101.101, port 8888, with the username "username" and the password "password"

HttpClientDownloader httpClientDownloader = new HttpClientDownloader();
httpClientDownloader.setProxyProvider(SimpleProxyProvider.from(new Proxy("101.101.101.101", 8888, "username", "password")));
spider.setDownloader(httpClientDownloader);

Use a proxy pool containing the two IPs 101.101.101.101 and 102.102.102.102, without passwords

HttpClientDownloader httpClientDownloader = new HttpClientDownloader();
httpClientDownloader.setProxyProvider(SimpleProxyProvider.from(
        new Proxy("101.101.101.101", 8888),
        new Proxy("102.102.102.102", 8888)));

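Using and customizing Pipelines

A Pipeline is the part that runs after extraction has finished. It is mainly used to save the extraction results, and a custom Pipeline can also implement general-purpose features.

The Pipeline interface is defined as follows:

public interface Pipeline {

    // ResultItems holds the extraction results in a Map-like structure.
    // Data stored with page.putField(key, value) can be read back with resultItems.get(key).
    public void process(ResultItems resultItems, Task task);

}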
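
The contract can be tried out without WebMagic on the classpath. The sketch below is an assumption-laden stand-in: ToyPipeline replaces the real Pipeline interface, a plain Map replaces ResultItems, and InMemoryPipeline is a hypothetical implementation that keeps values in a list where a real one might write to a database. It also shows several pipelines receiving the same result.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Demo of the pipeline contract with stand-in types (all names are hypothetical). */
class PipelineDemo {

    /** Stand-in for Pipeline, using a plain Map in place of ResultItems. */
    interface ToyPipeline {
        void process(Map<String, Object> resultItems);
    }

    /** Keeps rows in memory; a real implementation might write them to a database. */
    static class InMemoryPipeline implements ToyPipeline {
        private final List<String> rows = new ArrayList<>();

        @Override
        public void process(Map<String, Object> resultItems) {
            Object info = resultItems.get("info"); // same key the processor stored
            if (info != null) {                     // skip pages that produced no result
                rows.add(info.toString());
            }
        }

        List<String> rows() {
            return rows;
        }
    }

    public static void main(String[] args) {
        Map<String, Object> resultItems = new LinkedHashMap<>();
        resultItems.put("info", "hello");

        InMemoryPipeline store = new InMemoryPipeline();
        List<ToyPipeline> pipelines = new ArrayList<>();
        pipelines.add(items -> System.out.println(items)); // console-style output
        pipelines.add(store);
        for (ToyPipeline p : pipelines) {
            p.process(resultItems); // every pipeline sees the same result
        }
        System.out.println(store.rows()); // [hello]
    }
}
```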

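Pipelines that ship with WebMagic

Class	Description	Notes
ConsolePipeline	Prints results to the console	Extracted results must implement toString
FilePipeline	Saves results to a file	Extracted results must implement toString
JsonFilePipeline	Saves results to a file in JSON format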
