Using Java to Crawl Web Page Content

Preface
Defining the crawl targets
Implementation requirements
- Novel name
- Chapter content
Source code
Conclusion

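Preface

In everyday life we often browse web pages to read something.

But not everything on a page is content we actually want.

After all, nobody wants a "Macau Chess and Cards" gambling ad popping up in the middle of a good read.

In that case, we can crawl just the content we care about.

Defining the crawl targets

Here we use the well-known 筆*閣 site as the example.

Open the 筆*閣 home page.

No, wait: open a novel.

This article uses 《進化的四十六億重奏》 as the example (I do recommend this book; also, please support the official release if you can).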


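Open the novel's index page and view its source; everything we need can be extracted from it.

So what do we need?

To answer that, we have to pin down the crawl targets:

1 The novel's name.

2 The chapter names.

3 The chapter content.

OK, with the targets defined, we can crawl each of them in turn.

Implementation requirements

Novel name

First, the novel's name.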

Looking at the page source, we can see that the novel's name and description are stored in the <meta property="og:title"> and <meta property="og:description"> tags, on lines 11 and 12 of the source.
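To make the idea concrete, here is a minimal sketch of pulling a value out of one of these tags by splitting the line on the double-quote character. The class name MetaTags, the method name content, and the sample lines are hypothetical and only serve as illustration; the sketch assumes each tag sits on a single line and that the value itself contains no double quotes.

```java
import java.util.Arrays;
import java.util.List;

final class MetaTags {
    /**
     * Returns the content="..." value of the first line that carries the given
     * og: property, or an empty string when no such line exists.
     */
    static String content(List<String> lines, String property) {
        String marker = "property=\"" + property + "\"";
        for (String line : lines) {
            if (line.contains(marker)) {
                // Splitting on " yields: [<meta property=, og:title,  content=, VALUE, ...]
                String[] parts = line.split("\"");
                return parts.length > 3 ? parts[3] : "";
            }
        }
        return "";
    }

    public static void main(String[] args) {
        // Hypothetical sample lines shaped like the tags described above.
        List<String> page = Arrays.asList(
                "<meta property=\"og:type\" content=\"novel\"/>",
                "<meta property=\"og:title\" content=\"Sample Title\"/>");
        System.out.println(content(page, "og:title"));
    }
}
```

Splitting on the quote character is crude but needs no extra library. A value that itself contains a double quote would be cut short, so treat this as a sketch rather than a general HTML parser.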

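Chapter content

The novel's table of contents is stored in the <div id="list"> tag, where every chapter is a link that carries an href and a title attribute. On each chapter page, the lines of body text begin with the &nbsp; entity.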
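As a rough illustration of reading one chapter link, the sketch below looks up the href and title attributes by name with a regular expression. This is a swapped-in technique, not the positional split on quotes described elsewhere in this post; it has the advantage that extra attributes between href and title do not shift anything. The class name ChapterLine and the sample line are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ChapterLine {
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");
    private static final Pattern TITLE = Pattern.compile("title=\"([^\"]*)\"");

    /** Returns {href, title} for a chapter line, or null when either attribute is missing. */
    static String[] parse(String line) {
        Matcher href = HREF.matcher(line);
        Matcher title = TITLE.matcher(line);
        if (href.find() && title.find()) {
            return new String[] {href.group(1), title.group(1)};
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical sample line; real pages may lay the link out differently.
        String line = "<dd><a href=\"3102496.html\" title=\"Chapter One\">Chapter One</a></dd>";
        String[] parsed = parse(line);
        System.out.println(parsed[0] + " -> " + parsed[1]);
    }
}
```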

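We can store the whole page in an array of strings, one line per element.

Then we compare each line against the markers above to find where our content sits.

Then we extract the content we need

and write it to a new file.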
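The four steps above can be sketched in isolation, without any network access. In this sketch the page text comes from an in-memory string, and the class name PageLines, the marker &nbsp;, the sample markup and the temporary output file are all assumptions chosen for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class PageLines {
    /** Step 1: read everything from the reader into a list, one entry per line. */
    static List<String> readAll(Reader source) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            // readLine() returns null once the input is exhausted.
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    /** Steps 2 and 3: keep only the lines that contain the marker text. */
    static List<String> keep(List<String> lines, String marker) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (line.contains(marker)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical page text standing in for a downloaded chapter page.
        String page = "<div>\n&nbsp;&nbsp;first paragraph<br />\n</div>\n";
        List<String> body = keep(readAll(new StringReader(page)), "&nbsp;");
        // Step 4: write the extracted lines to a new file (a temp file here).
        Path out = Files.createTempFile("chapter", ".html");
        Files.write(out, body, StandardCharsets.UTF_8);
        System.out.println("Wrote " + body.size() + " line(s) to " + out);
    }
}
```

Because readAll accepts any Reader, the same code would work on an InputStreamReader wrapped around a live connection.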

Enough talk; here is the full example:

Source code

import java.io.BufferedReader;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLException;

// The class name is arbitrary; it only wraps main so that the listing compiles.
public class NovelCrawler {

    public static void main(String[] args) {
        // Link to the novel's index page
        String link = "https://www.biquwx.la/0_376/";
        // Directory where the output files are stored
        String path = "/Users/apple/Downloads/test/";

        // Runs once by default. When the connection fails (an SSLException) or the page
        // comes back garbled, runTimes is incremented so that the loop runs one more time.
        // maxRuns caps the number of runs so that the loop cannot go on forever.
        int runTimes = 1;
        int maxRuns = 5;
        for (int runtime = 0; runtime < runTimes && runtime < maxRuns; runtime++) {
            try {
                // Open an HTTPS connection to the index page
                HttpsURLConnection connection = (HttpsURLConnection) new URL(link).openConnection();

                // A list grows with its content, so there is no risk of running past the end of an array
                List<String> al = new ArrayList<>();
                try (BufferedReader br = new BufferedReader(
                        new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
                    System.out.println("Connected!");
                    String line;
                    // readLine() returns null once the whole page has been read
                    while ((line = br.readLine()) != null) {
                        al.add(line);
                    }
                }

                // The whole page is now in the list; copy it into a string array
                String[] str = al.toArray(new String[0]);

                // If the page comes back garbled, run again.
                // Honestly this does not help much, and I do not know why the garbling happens,
                // but running again a little later usually works.
                // If you know a fix, please tell me in the comments. Thanks.
                if (str.length == 0 || !str[0].contains("<!DOCTYPE html")) {
                    if (str.length > 9) {
                        System.out.println(str[9]);
                    }
                    System.out.println("Garbled response, reconnecting...");
                    Thread.sleep(3000);
                    runTimes++;
                    continue;
                }

                // Novel title
                String name = "";
                // Novel description
                String description = "";

                /*
                Looking at the source, everything we need on a line sits inside double quotes.
                So we can split the line on " and pick out the pieces we need.
                 */
                for (int i = 0; i < str.length; i++) {
                    if (str[i].contains("property=\"og:title\"")) {
                        // Split on "; the title is the 4th element
                        String[] temp = str[i].split("\"");
                        if (temp.length > 3) {
                            name = temp[3];
                        }
                    } else if (str[i].contains("property=\"og:description\"")) {
                        // The description may span several lines: it fills the whole
                        // <meta property="og:description" content="..."/> tag.
                        // The next tag is <meta property="og:image", which we use as the end marker.

                        // i1 counts the lines from the start of the tag to the end marker
                        int i1 = 1;
                        while (i + i1 < str.length && !str[i + i1].contains("<meta property=\"og:image\"")) {
                            i1++;
                        }
                        for (int i2 = i; i2 < i + i1; i2++) {
                            // Only a few strings are joined, so plain String + is good enough
                            description += str[i2];
                        }

                        // description now holds the tag as well as the text, so split it like the title
                        String[] temp = description.split("\"");
                        // The description text is the 4th element
                        description = temp.length > 3 ? temp[3] : "";
                    }
                }
                // Pass the values as arguments so that a % in the text cannot break the format string
                System.out.printf("Novel title:%n%s%n", name);
                System.out.printf("Novel description:%n%s%n", description);
                System.out.println("--------------");
                System.out.println("Fetching chapters...");

                // We now have the title and the description; next come the chapters.
                // The chapters live in the <div id="list"> tag,
                // and every chapter line carries an href and a title, so we key on those two.
                // Example: <a href="3102496.html" title="目前細胞的一些資料及第一卷解釋">目前細胞的一些資料及第一卷解釋</a>

                // Collect the chapter lines; skip lines too short to hold both values
                List<String> al2 = new ArrayList<>();
                for (String s : str) {
                    if (s.contains("href") && s.contains("title") && s.split("\"").length > 3) {
                        al2.add(s);
                    }
                }
                String[] chapter = al2.toArray(new String[0]);

                // Create the output directory
                File directory = new File(path + name);
                directory.mkdirs();
                System.out.println("Created directory " + directory);

                for (int i = 0; i < chapter.length; i++) {
                    // Split the line on "
                    String[] temp = chapter[i].split("\"");
                    // The chapter title is the 4th element
                    String chapterName = temp[3];
                    // The chapter link is the 2nd element
                    String chapterLink = temp[1];

                    // Names of the next and previous chapter; empty at either end of the book
                    String nextChapterName = "";
                    String beforeChapterName = "";
                    if (i != chapter.length - 1) {
                        nextChapterName = chapter[i + 1].split("\"")[3];
                    }
                    if (i != 0) {
                        beforeChapterName = chapter[i - 1].split("\"")[3];
                    }

                    // Create a file named after the chapter
                    File f = new File(path + name + "/" + chapterName + ".html");
                    System.out.print("Creating file " + f + "             ");
                    if (!f.exists()) {
                        f.getParentFile().mkdirs();
                    }
                    // Write as UTF-8 so that the file matches its <meta charset> declaration
                    try (PrintWriter pw = new PrintWriter(
                            new OutputStreamWriter(new FileOutputStream(f), StandardCharsets.UTF_8))) {

                        // Open an HTTPS connection to the chapter page
                        HttpsURLConnection connection1 =
                                (HttpsURLConnection) new URL(link + chapterLink).openConnection();

                        // In a chapter's source, the body text lines start with the &nbsp; entity,
                        // so we key on that.
                        // Example:  &nbsp;&nbsp;&nbsp;&nbsp;在城市的街道上，所有的虛民都在瘋狂地奔跑着，在大地的震顫之下，它們紛紛從建築之中逃了出來……
                        // A StringBuilder is cheaper than repeated String concatenation
                        StringBuilder sb = new StringBuilder();
                        try (BufferedReader br1 = new BufferedReader(
                                new InputStreamReader(connection1.getInputStream(), StandardCharsets.UTF_8))) {
                            String line;
                            while ((line = br1.readLine()) != null) {
                                if (line.contains("&nbsp;")) {
                                    sb.append(line);
                                }
                            }
                        }

                        // Build an HTML file in which the left and right arrow keys switch chapters.
                        // A template file could be read here, but I was too lazy, so it is written inline.
                        // The && "name" checks keep the first and last chapter from jumping to a missing file.
                        pw.println("<!DOCTYPE html>");
                        pw.println("<html lang=\"en\">");
                        pw.println("<head>");
                        pw.println("    <meta charset=\"UTF-8\">");
                        pw.println("    <title>" + chapterName + "</title>");
                        pw.println("    <script>function onDocKeydown(e) {e = e || window.event;if (e.keyCode==39 && \"" + nextChapterName + "\") {");
                        pw.println("                window.location.href=\"" + directory + "/" + nextChapterName + ".html" + "\";");
                        pw.println("            }else if (e.keyCode==37 && \"" + beforeChapterName + "\"){");
                        pw.println("                window.location.href=\"" + directory + "/" + beforeChapterName + ".html" + "\";");
                        pw.println("            }}document.onkeydown = onDocKeydown;</script>");
                        pw.println("</head>");
                        pw.println("<body>");
                        pw.println("<div align=\"center\">");
                        pw.println("    <h1>" + chapterName + "</h1><br><br>");
                        pw.println(sb);
                        pw.println("</div>");
                        pw.println("</body>");
                        pw.println("</html>");
                        System.out.println("File written");

                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }

            } catch (SSLException e) {
                // On an SSLException, leave this pass and run one more time
                System.out.println("Error: connection failed");
                System.out.println("Trying again...");
                runTimes++;
                e.printStackTrace();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
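The listing retries by incrementing runTimes from inside the loop it controls. As a possible alternative, the sketch below wraps the retry in a small helper with a fixed upper bound. The names Retry, call, maxAttempts and pauseMillis are hypothetical, and the helper is only one way to structure this, not part of the original program.

```java
import java.util.concurrent.Callable;

final class Retry {
    /**
     * Runs the task until it succeeds or maxAttempts is reached, pausing
     * pauseMillis between attempts. Throws IllegalStateException, with the
     * last failure as its cause, when every attempt fails.
     */
    static <T> T call(Callable<T> task, int maxAttempts, long pauseMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException ie) {
                    // Restore the interrupt flag and stop retrying.
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        throw new IllegalStateException("all attempts failed", last);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical task that fails twice before it succeeds.
        String result = call(() -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("connection failed");
            }
            return "page content";
        }, 5, 100);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In the crawler, the task would be the code that downloads the index page, so a failed or garbled download is retried a limited number of times.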

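Conclusion

Once again, please support the official release.

If you have questions, leave them in the comments; I will reply as soon as I see them.

Corrections to the article are also welcome.

Thank you for reading.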
