Java Web Crawler Development Notes (2)

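Problem

Last time I mentioned that jsoup misreads the Location entry in the response header. I later found a more fundamental cause: inside jsoup the URL goes through URL encoding twice, so the original link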

http://example.com/你好

is converted on the first pass to

http://example.com/%e4%bd%a0%e5%a5%bd

and on the second pass to

http://example.com/%25e4%25bd%25a0%25e5%25a5%25bd

Requesting that URL gives a 404 straight away, which won't do.
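To make the double-encoding effect concrete, here is a minimal sketch that reproduces it with the JDK's own URLEncoder. It is not jsoup's internal code, only an illustration of what happens when an already encoded string is encoded again; the class name DoubleEncodeDemo is made up for this example. Note that URLEncoder emits uppercase hex digits, so the output differs from the URLs above only in letter case.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class, not part of the crawler.
final class DoubleEncodeDemo {
    /** Percent-encodes a single path segment as UTF-8. */
    static String encode(String segment) {
        try {
            return URLEncoder.encode(segment, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every JVM
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String once = encode("\u4f60\u597d"); // the two characters of the path above
        String twice = encode(once);          // every '%' becomes "%25"
        System.out.println(once);             // %E4%BD%A0%E5%A5%BD
        System.out.println(twice);            // %25E4%25BD%25A0%25E5%25A5%25BD
    }
}
```

The second pass treats each % as a literal character to escape, which is how a perfectly valid link turns into one the server has never heard of.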

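Solution

The fix is actually rather dumb: drop jsoup's HTTP part and use jsoup only for parsing HTML, and hand the work of talking HTTP to another well-known Java networking library: httpclient. (Mainly because this is not my first time with httpclient; I have used it before. Other options such as jetty or netty would of course work too. My conclusion is that jsoup's HTTP module has problems and is not mature enough, since everything I wrote came straight from the most basic tutorials. If the real cause is that I am using it wrong, please point it out.)

So the original code was (looks simple, but is full of trouble):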

public static Document parse(String url) throws IOException {
    return Jsoup.connect(url).get();
}

Now it is (looks more complicated, but is stable and bug-free):

public static Document parse(String url) throws IOException {
    // try-with-resources closes the client (and its connections) once the body is parsed
    try (CloseableHttpClient client = HttpClients.createDefault()) {
        HttpResponse response = client.execute(new HttpGet(url));
        return Jsoup.parse(response.getEntity().getContent(), "UTF-8", url);
    }
}

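I ran it once with no problems: the links that used to fail with parse errors all resolved. Very satisfying.

If the trouble I had with jsoup really was down to me using it wrong, corrections are welcome.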

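Problem

Crawling domestic (Chinese) sites is indeed noticeably faster than crawling foreign ones, but it is still not satisfactory: around 1000 pages still take about 10 minutes, which is a long way from "fast". A profiler run shows that most of the time goes to network access, and that access is blocking: a page has to load completely before the links in it can be found and pushed onto the queue.

An obvious answer is a thread pool: create several threads to consume the "pages to crawl" queue and speed things up. But this answer is not as straightforward as it looks: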

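Common thread pools, including the likes of Executor in the java.util.concurrent package, are built on a model that separates producers from consumers. That is, producers only produce, consumers only consume, and the two do not interfere with each other. That is why Executor comes with a note along these lines:

An Executor does not stop on its own; you have to call shutdown() to tell it to stop once all previously submitted tasks have finished.

This works because of one very simple fact: once the producer has stopped, I can guarantee without any worry that no new work will appear, and so I can tell the thread pool to stop.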
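For reference, the simple case where producer and consumers are separate can be sketched as follows. This is only an illustration of the usual shutdown pattern; the class name PlainShutdownDemo, the pool size of 4 and the 10-second timeout are assumptions made for the example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not part of the crawler.
final class PlainShutdownDemo {
    /** Submits taskCount independent tasks and returns how many of them ran. */
    static int run(int taskCount) {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        AtomicInteger done = new AtomicInteger();

        // The producer (this thread) is the only source of work.
        for (int i = 0; i < taskCount; i++) {
            executor.submit(() -> { done.incrementAndGet(); });
        }

        // The producer is finished, so no new work can appear: safe to shut down.
        executor.shutdown();
        try {
            if (!executor.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println(run(5)); // 5
    }
}
```

Because shutdown() lets already submitted tasks finish, every task submitted by the loop still runs; the only thing it forbids is new submissions.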

Unfortunately, our situation is not that simple.

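After loading and parsing a page, we will quite likely find new pages to parse through the <a> elements in it. In other words, the consumer is itself a producer. If we simply call the Executor's shutdown() method as soon as the queue is empty, the work created by the tasks that are still running is lost.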

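In the most extreme case, right after the first (that is, the initial) link has been taken off the queue, queue.isEmpty() is true, so the loop ends immediately and only that one page is actually crawled. That is clearly not what we want.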
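The failure mode can be reproduced in a small, deterministic sketch. It is not the crawler itself: the class name EarlyShutdownDemo is made up, and a latch stands in for a slow page load so that shutdown() is guaranteed to happen while the first task is still working. With the default rejection policy, every follow-up submission then fails with RejectedExecutionException.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not part of the crawler.
final class EarlyShutdownDemo {
    /** Returns how many follow-up tasks were rejected because shutdown() came too early. */
    static int run() {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        CountDownLatch shutdownCalled = new CountDownLatch(1);
        AtomicInteger rejected = new AtomicInteger();

        // The "first page": it finds three new links, but only after a slow load.
        executor.submit(() -> {
            try {
                shutdownCalled.await(); // stands in for the network round trip
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            for (int i = 0; i < 3; i++) {
                try {
                    executor.submit(() -> { }); // a newly discovered page
                } catch (RejectedExecutionException e) {
                    rejected.incrementAndGet(); // the pool no longer accepts work
                }
            }
        });

        // The queue looks empty at this point, so a naive loop would stop here.
        executor.shutdown();
        shutdownCalled.countDown();
        try {
            executor.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return rejected.get();
    }

    public static void main(String[] args) {
        System.out.println(run()); // 3
    }
}
```

All three discovered links are dropped, which is exactly the "only one page crawled" outcome described above.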

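So the question is: how do we make sure that all tasks have really finished? That is, the queue is currently empty, and every thread in the pool has finished its work and will not create any new work?

Solution

Hard work pays off: after many attempts I found this answer on Stack Overflow: awaitTermination of all recursively created tasks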


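I wrote InverseSemaphore.java following that answer, put an ExecutorService on top of it, and set 10 threads going at once. What a feeling! It scraped 1000 distinct pages in a minute and a half (along with a maxed-out mysql dashboard, of course).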
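Since InverseSemaphore is not shown in this post, here is a sketch of what such a class might look like. It is reconstructed only from the three method names used in the listing below (beforeSubmit, taskCompleted, awaitCompletion), so it is an assumption and not necessarily identical to the code in the Stack Overflow answer. The RecursiveTasksDemo class, its pool size and its timeout are likewise made up: each task spawns two children until depth 0, which mimics pages that discover more pages.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Assumed implementation: a counter of tasks submitted but not yet completed.
final class InverseSemaphore {
    private int pending = 0;

    synchronized void beforeSubmit() { pending++; }

    synchronized void taskCompleted() {
        pending--;
        if (pending == 0) notifyAll(); // wake up anyone in awaitCompletion()
    }

    synchronized void awaitCompletion() throws InterruptedException {
        while (pending > 0) wait();
    }
}

// Hypothetical demo class, not part of the crawler.
final class RecursiveTasksDemo {
    private final ExecutorService executor = Executors.newFixedThreadPool(4);
    private final InverseSemaphore semaphore = new InverseSemaphore();
    private final AtomicInteger visited = new AtomicInteger();

    private void submit(int depth) {
        semaphore.beforeSubmit(); // count the task before it has any chance to finish
        executor.submit(() -> { process(depth); });
    }

    private void process(int depth) {
        try {
            visited.incrementAndGet();
            if (depth > 0) {      // the consumer acts as a producer here
                submit(depth - 1);
                submit(depth - 1);
            }
        } finally {
            semaphore.taskCompleted(); // children are already counted at this point
        }
    }

    /** Runs a binary tree of tasks of the given depth and returns how many ran. */
    static int run(int depth) {
        RecursiveTasksDemo demo = new RecursiveTasksDemo();
        try {
            demo.submit(depth);
            demo.semaphore.awaitCompletion(); // returns only when no task is pending
            demo.executor.shutdown();
            demo.executor.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return demo.visited.get();
    }

    public static void main(String[] args) {
        System.out.println(run(3)); // 15 = 1 + 2 + 4 + 8 tasks
    }
}
```

The key detail is ordering: a task registers its children with beforeSubmit() before it reports its own completion, so the counter can only reach zero when no work is left anywhere.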

It's about time to post the code:

package com.std4453.crawlerlab.main;

import com.std4453.crawlerlab.db.DB;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CrawlerTest {
    private DB db;
    private ExecutorService executor;
    private InverseSemaphore semaphore;

    public CrawlerTest() {
        this.db = new DB();
        this.semaphore = new InverseSemaphore();
    }

    // ====== CRAWLING BEHAVIOR ======

    private void processPage(String url) {
        try {
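            // check whether the given url is in the database
            // (parameterized, so quotes in the url cannot break the query)
            String sql = "SELECT * FROM Record WHERE URL = ?;";
            PreparedStatement select = this.db.connection.prepareStatement(sql);
            select.setString(1, url);
            ResultSet result = select.executeQuery();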

            if (!result.next()) {
                // store url into database
                sql = "INSERT INTO Record (URL) VALUES (?);";
                PreparedStatement statement = this.db.connection.prepareStatement(sql,
                        Statement.RETURN_GENERATED_KEYS);
                statement.setString(1, url); // JDBC parameter indices start at 1
                statement.execute();

                // fetch page
                Document doc;
                try {
                    doc = this.parse(url);
                    if (this.matches(doc))
                        this.foundUrl(url);
                } catch (IOException e) {
                    System.err.println("Unable to fetch url: " + url);
                    e.printStackTrace();
                    return;
                }

                // crawl
                Elements links = doc.select("a[href]");
                for (Element link : links) {
                    String href = link.attr("abs:href");
                    if (this.inRange(href))
                        this.submit(href);
                }
            }
        } catch (SQLException e) {
            e.printStackTrace();
        } finally {
            // task completed
            this.semaphore.taskCompleted();
        }
    }

    private Document parse(String url) throws IOException {
        // try-with-resources closes the client (and its connections) once the body is parsed
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpResponse response = client.execute(new HttpGet(url));
            return Jsoup.parse(response.getEntity().getContent(), "UTF-8", url);
        }
    }

    private void submit(final String url) {
        this.semaphore.beforeSubmit();
        this.executor.submit(() -> CrawlerTest.this.processPage(url));
    }

    public void run() throws IOException, InterruptedException {
        this.beforeRun();

        // before
        try {
            db.runSQL2("TRUNCATE Record;");
        } catch (SQLException e) {
            e.printStackTrace();
        }
        this.executor = Executors.newFixedThreadPool(10); // 10 worker threads

        // crawl
        this.submit(this.startPage());

        // after
        this.semaphore.awaitCompletion();
        this.executor.shutdown();
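        // NOTE: the timeout value is missing from the original listing; 1 minute is an assumed placeholder
        this.executor.awaitTermination(1, TimeUnit.MINUTES);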

        this.afterRun();
    }

    // ====== CRAWLING LOGIC ======

    private String startPage() {
        return "http://www.zhangxinxu.com";
    }

    private boolean inRange(String url) {
        return url.contains("zhangxinxu.com");
    }

    private boolean matches(Document unused) {
        return true;
    }

    private PrintWriter out;

    private void beforeRun() throws IOException {
        this.out = new PrintWriter(new FileOutputStream(new File("output.txt")));
    }

    private void afterRun() {
        this.out.close();
    }

    private void foundUrl(String line) {
        this.out.println(line);
    }

    // ====== PROGRAM ENTRANCE ======

    public static void main(String[] args) throws Exception {
        CrawlerTest crawlerTest = new CrawlerTest();
        crawlerTest.run();
    }
}

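The DB and InverseSemaphore classes are identical to the versions in the two articles they come from, with nothing changed (apart from the package name), so I won't post them. The whole program is small and compact, under 150 lines, yet it can crawl every page of an entire site starting from the root, which is remarkable.

A Small Conclusion

Java is one of the main languages of today's web, and the completeness of its surrounding ecosystem is not to be underestimated. Anyone with some Java background can, like me, study for a little while and end up with a web crawler that actually runs.

This series, "Java Web Crawler Development Notes", is of course far from finished at this point. As I said in the first post on this blog, the blog exists to sum up lessons learned, and at this early stage I still have plenty of lessons to sum up, so I wouldn't dare talk about being finished. Tomorrow's third post in the series will cover further optimization and tuning of the crawler. I hope anyone interested in learning this can draw on the path I took, and that we can all improve together.

(How is it already past one o'clock after typing up this code... off to bed... I won't be able to get up tomorrow...)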
