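Java | Crawling Novel Chapters with Java

1 Background

The previous article, "Java | Crawling Phone Numbers with WebMagic", showed how to use a PageProcessor to crawl phone numbers. This article goes one step further: it crawls novels from Qidian (起點中文網) and saves them to disk, organized by novel title and chapter.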

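2 A Quick Look

Below are the novel pages and content to be crawled. The saved files should contain only the chapter text. Extra material, such as the header at the top of the first chapter, is not needed, so the code has to detect that case.

Figure 2.1 Qidian (起點中文網)

Figure 2.2 New xuanhuan (fantasy) books

Figure 2.3 The novel 反派強無敵

Figure 2.4 Chapter content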
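
The pages in Figures 2.1 to 2.4 fall into three kinds of URL: the category listing, a book's chapter catalog, and a single chapter page. The following is a minimal sketch of telling them apart from the URL alone, using only `java.util.regex`. It is not part of the crawler itself: the class name `QidianUrlKind` and the three labels are hypothetical, and requiring a full-string match is an assumption of this sketch. The two patterns are the ones the crawler matches on, with the dots in the host name escaped.

```java
import java.util.regex.Pattern;

public class QidianUrlKind {
    // Chapter page: two 23-character path segments that differ per chapter
    private static final Pattern CHAPTER =
            Pattern.compile("https://read\\.qidian\\.com/chapter/.{23}/.{23}");
    // Chapter catalog of one book: a 10-digit book id followed by #Catalog
    private static final Pattern CATALOG =
            Pattern.compile("https://book\\.qidian\\.com/info/\\d{10}#Catalog");

    /** Returns "CHAPTER", "CATALOG", or "LISTING" for any other URL. */
    public static String classify(String url) {
        if (CHAPTER.matcher(url).matches()) {
            return "CHAPTER";
        }
        if (CATALOG.matcher(url).matches()) {
            return "CATALOG";
        }
        return "LISTING";
    }

    public static void main(String[] args) {
        // prints LISTING
        System.out.println(classify("https://www.qidian.com/xuanhuan"));
        // prints CATALOG
        System.out.println(classify("https://book.qidian.com/info/1019251979#Catalog"));
        // prints CHAPTER
        System.out.println(classify(
                "https://read.qidian.com/chapter/SaT8jsiJD54smgY_yC2imA2/oQbX6YtwB_NOBDFlr9quQA2"));
    }
}
```

Treating everything that is neither a chapter nor a catalog as the listing page mirrors the final `else` branch of the crawler.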

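3 Code and Comments

Without further ado, here is the full code; the necessary explanations are written as comments inside it. Take particular care that the XPath expressions for the content are written correctly, otherwise the crawl may fail: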

package com.yellow.java_pachong.book;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Html;
import us.codecraft.webmagic.selector.Selectable;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

/**
 * Crawls novels from Qidian.
 */
public class GetQidianBook implements PageProcessor {

    // Configuration for the site being crawled
    private Site site = Site.me()
            .setCharset("utf-8")   // character set
            .setTimeOut(1000)      // timeout
            .setSleepTime(1000);   // sleep time between requests

    // Title of the book currently being crawled
    String bookName1 = "";

    @Override
    public Site getSite() { return site; }

    // Crawl logic
    // Level 1: https://www.qidian.com/xuanhuan  (book listing)
    // Level 2: https://book.qidian.com/info/1019251979#Catalog  (chapter catalog)
    // Level 3: https://read.qidian.com/chapter/SaT8jsiJD54smgY_yC2imA2/oQbX6YtwB_NOBDFlr9quQA2  (chapter content)

    @Override
    public void process(Page page) {
        // URL of the current page
        Selectable table = page.getUrl();
        //System.out.println(table);
        // Parsed HTML of the current page
        Html html = page.getHtml();
        //System.out.println(html);

        // .{23} matches the two path segments, which differ for every chapter
        if (table.regex("https://read.qidian.com/chapter/.{23}/.{23}").match()) { // chapter page
            // Chapter title
            String title = "";
            // Chapter paragraphs
            List<String> content = new ArrayList<String>();

            // Only the first chapter carries the book title in an <h1>
            String bookTitle = html.xpath("/html/body/div[2]/div[3]/div[2]/div[1]/div[1]/div[1]/div[1]/h1/text()").toString();
            if (bookTitle != null) { // first chapter
                // Remember the book title for the chapters that follow
                bookName1 = bookTitle;
                //System.out.println(bookName1);
                // Chapter title
                title = html.xpath("//*[@class='main-text-wrap']/div[1]/h3/span/text()").toString();
                //System.out.println(title);
                // Chapter text
                content = html.xpath("//*[@class='main-text-wrap']/div[2]/p/text()").all();
            } else { // any later chapter
                title = html.xpath("//*[@id='j_chapterBox']/div[1]/div[1]/div[1]/h3/span/text()").toString();
                // Chapter text
                content = html.xpath("//*[@id='j_chapterBox']/div[1]/div[1]/div[2]/p/text()").all();
            }
            // Save the chapter to disk
            downBook(bookName1, title, content);
        } else if (table.regex("https://book.qidian.com/info/\\d{10}#Catalog").match()) { // chapter catalog of one book
            // Collect the link of every chapter listed in the catalog
            List<String> url = html.xpath("//*[@class='volume-wrap']/div[1]/ul/li/a/@href").all();
            // Add the chapter pages to the crawl queue
            page.addTargetRequests(url);
        } else { // level-1 listing page
            // Extract the link of every book
            List<String> url = html.xpath("//*[@id='new-book-list']/div/ul/li/div[2]/h4/a/@href").all();
            // Append #Catalog so that each link points at the chapter catalog
            List<String> url2 = new ArrayList<String>();
            for (String string : url) {
                url2.add(string + "#Catalog");
            }
            // Add the catalog pages to the crawl queue
            page.addTargetRequests(url2);
        }
    }

    // Save one chapter to disk
    private void downBook(String bookName2, String title, List<String> content) {
        // Create the book directory if it does not exist yet
        File file = new File("D:/book.xuanhuan/" + bookName2);
        if (!file.exists()) {
            file.mkdirs();
        }
        PrintWriter pw = null; // writer for the chapter file
        try {
            // Write the chapter as <title>.txt inside the book directory
            FileOutputStream fos = new FileOutputStream("D:/book.xuanhuan/" + bookName2 + "/" + title + ".txt");
            pw = new PrintWriter(fos, true);
            for (String string : content) {
                pw.println(string);
            }
            // Report each finished chapter on the console
            System.out.println(title + " " + "爬取完畢");
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } finally { // close the stream; pw is still null if the file could not be opened
            if (pw != null) {
                pw.close();
            }
        }
    }

    // Create the crawler thread and start it
    public static void main(String[] args) { // crawls books in the xuanhuan (fantasy) category
        Spider.create(new GetQidianBook()).thread(1).addUrl("https://www.qidian.com/xuanhuan").run();
    }
}
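
To illustrate why the XPath expressions deserve care, here is a small sketch that evaluates a correct and a slightly wrong expression against a tiny document. It deliberately uses the JDK's own `javax.xml.xpath` rather than WebMagic, so that it runs without extra dependencies. The `PAGE` fragment is a hypothetical, well-formed stand-in for a chapter page (the real page is much larger and may be structured differently), and `XPathCheck` and `countMatches` are names made up for this example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathCheck {
    // Hypothetical stand-in for a chapter page, reduced to the nesting that matters here
    static final String PAGE =
            "<html><body><div id='j_chapterBox'><div><div>"
          + "<div><h3><span>Chapter 1</span></h3></div>"
          + "<div><p>first paragraph</p><p>second paragraph</p></div>"
          + "</div></div></div></body></html>";

    /** Returns how many nodes the expression selects in PAGE. */
    static int countMatches(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(PAGE)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            // A syntactically invalid expression ends up here
            throw new IllegalStateException("invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Correct nesting: selects both paragraphs
        System.out.println(countMatches("//*[@id='j_chapterBox']/div[1]/div[1]/div[2]/p"));
        // One div level missing: selects nothing, and no error is raised
        System.out.println(countMatches("//*[@id='j_chapterBox']/div[1]/div[2]/p"));
    }
}
```

The second call is the case to watch for: an expression that is off by one level still evaluates without error and returns an empty result, which in a crawler would show up as empty chapter files rather than as an exception.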

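4 Results

First, the console output:

Figure 4.1 Console output

Next, the paths of the saved files:

Figure 4.2 File paths

Finally, the chapter content:

Figure 4.3 Chapter content

With that, the books are crawled automatically and stored in an orderly layout.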
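
The layout described above, one folder per book and one .txt file per chapter, can also be produced with `java.nio.file`. The following is a minimal sketch under stated assumptions, not the method used in this article: `ChapterWriter`, `safeName`, and `save` are hypothetical names, the files are written as UTF-8, and characters that Windows does not allow in file names are replaced with an underscore so that a chapter title containing, say, `?` or `:` does not make the write fail.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class ChapterWriter {
    /** Replaces the characters \ / : * ? " < > | with an underscore. */
    static String safeName(String name) {
        return name.replaceAll("[\\\\/:*?\"<>|]", "_").trim();
    }

    /** Writes the lines to root/book/title.txt and returns the path of the file. */
    static Path save(Path root, String book, String title, List<String> lines) {
        try {
            // Creates the book folder (and any missing parents) if needed
            Path dir = Files.createDirectories(root.resolve(safeName(book)));
            Path file = dir.resolve(safeName(title) + ".txt");
            // Creates or overwrites the file, one list element per line
            return Files.write(file, lines, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The system temp directory is used here only to keep the example harmless
        Path root = Paths.get(System.getProperty("java.io.tmpdir"), "book.xuanhuan");
        Path written = save(root, "Some Book", "Chapter 1: Who?",
                Arrays.asList("first paragraph", "second paragraph"));
        System.out.println(written);
    }
}
```

`Files.write` opens and closes the file itself, so no explicit `finally` block is needed to release the stream.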
