Python Crawler Example 2: Scraping a Web Novel

2022-11-23 18:53:46

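Program logic: starting from the URL of the first chapter, the script fetches the HTML and uses regular expressions to extract the chapter title, the body text, and the URL of the next chapter. It then moves on to the next chapter and repeats the process. The extracted text is appended to a text file. After each chapter the script also records the URL it should fetch next, so if the network fails, restarting the program reads that URL from the file and resumes the previous crawl.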
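
The extraction step can be tried on its own, without any network access. The sketch below applies the same three patterns the script uses to a made-up HTML fragment. `parse_chapter`, `sample_html`, and the file name `1130015.html` are illustrative assumptions, not part of the original script or the real site. Unlike the script, which accepts a result only when there is exactly one match, this version simply takes the first match.

```python
import re

# Patterns copied from the script below; compiled once with re.S so '.' spans newlines.
TITLE_RE = re.compile(r'<h1>(.*?)</h1>', re.S)
CONTENT_RE = re.compile(r'<div id="content">(.*?)</div>', re.S)
NEXT_RE = re.compile(r'<div class="link">.*?傳回清單</a>→<a href="(.*?)">下一章</a>', re.S)


def parse_chapter(html):
    """Return (title, content, next_relative_url); a missing part comes back as ''."""
    def first(pattern):
        m = pattern.search(html)
        return m.group(1) if m else ''

    # Drop the <br> tags, as the script does.
    content = first(CONTENT_RE).replace('<br/>', '').replace('<br />', '')
    return first(TITLE_RE), content.strip(), first(NEXT_RE)


# Hypothetical page fragment shaped to match the patterns above.
sample_html = (
    '<h1>Chapter 1</h1>'
    '<div id="content">First line.<br/>\nSecond line.</div>'
    '<div class="link"><a href="index.html">傳回清單</a>→'
    '<a href="1130015.html">下一章</a></div>'
)

title, content, next_rel = parse_chapter(sample_html)
print(title)
print(content)
print(next_rel)
```

If the target site's markup differs, only the three patterns need to change; an empty third element signals that no next-chapter link was found.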

The regular expressions correspond to the figure below:

#!/usr/bin/python
# -*- coding: gbk -*-

# by gnolux 20190526
# email: [email protected]

from urllib import request
import re
import os
import socket

#socket.setdefaulttimeout(60)

# URL of the first chapter
url = 'https://www.23us.la/html/203/203086/1130014.html'
# URL prefix, used to build the absolute URL of the next chapter
url_prex = 'https://www.23us.la/html/203/203086/'

# Regular expression for the chapter title
title_match_str = r'<h1>(.*?)</h1>'
# Regular expression for the chapter body
content_match_str = r'<div id="content">(.*?)</div>'
# Regular expression for the next chapter's URL (a relative URL)
next_page_match_str = r'<div class="link">.*?傳回清單</a>→<a href="(.*?)">下一章</a>'

# Name of the output file
save_file = 'd:/test/tclys.txt'



# If the script has run before, the last run recorded the address of the next page; read the URL from that file
last_url_file = '%s.url.txt'%save_file
if os.path.isfile(last_url_file):
    with open(last_url_file, 'r') as f:
        url = f.read().strip()

headers = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}
#print(url)
for page in range(0,99999999):
    req=request.Request(url=url,headers=headers)
    url = ''
    # On a timeout, retry the request, up to 8 attempts
    for t in range(0,8):
        try:
            response = request.urlopen(req,timeout=4)
            html = response.read()
            #print(html)
            html = html.decode("utf-8")
            title,content= '',''

            s=re.findall(title_match_str,html,re.S)
            if len(s) == 1:
                title=s[0]
            s=re.findall(content_match_str,html,re.S)
            if len(s) == 1:
                content = s[0]
                content=content.replace('<br/>','').replace(' ','').replace('<br />','')

            #print(title,content)
            
            with open(save_file, 'a',encoding='utf-8') as f:
                f.write("%s\n%s\n"%(title,content))

            s=re.findall(next_page_match_str,html,re.S)
            if len(s) == 1:
                url = '%s%s'%(url_prex,s[0])
            else:
                print(s)

            # Record the next chapter's address in a file so the crawl can resume after a network failure
            with open(last_url_file, 'w') as f:
                f.write(url)

            #print('next url:%s'%url)
            print(title)
            #print(content)

            break  # success, so leave the retry loop

        except Exception as e:      # any failure, including a timeout, triggers a retry
            print('connection attempt %d failed:'%(t+1), str(e))
    if url == '':
        break
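
The resume and retry logic above can also be factored into small helpers. The following is a minimal sketch, not the author's code: `load_start_url`, `save_checkpoint`, and `with_retries` are hypothetical names, and the relative link `1130015.html` is made up for the demo. It differs from the script in two ways: it skips writing an empty checkpoint, so a restart never reads an empty URL, and it resolves relative links with `urllib.parse.urljoin` instead of concatenating with `url_prex`.

```python
import os
import tempfile
import time
from urllib.parse import urljoin


def load_start_url(default_url, checkpoint_path):
    """Return the saved URL if a non-empty checkpoint exists, else default_url."""
    if os.path.isfile(checkpoint_path):
        with open(checkpoint_path, 'r', encoding='utf-8') as f:
            saved = f.read().strip()
        if saved:
            return saved
    return default_url


def save_checkpoint(url, checkpoint_path):
    """Record the next URL to fetch; empty values are skipped."""
    if url:
        with open(checkpoint_path, 'w', encoding='utf-8') as f:
            f.write(url)


def with_retries(fetch, url, attempts=8, delay=0.0):
    """Call fetch(url) up to `attempts` times; re-raise the last error if all fail."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except OSError as e:  # URLError and socket timeouts are OSError subclasses
            last_error = e
            print('attempt %d failed: %s' % (attempt, e))
            time.sleep(delay)
    raise last_error


# Demo with a temporary checkpoint file; no network access is involved.
first = 'https://www.23us.la/html/203/203086/1130014.html'
checkpoint = os.path.join(tempfile.mkdtemp(), 'tclys.txt.url.txt')

print(load_start_url(first, checkpoint))    # no checkpoint yet, so this is the first-chapter URL
next_url = urljoin(first, '1130015.html')   # resolve a relative link against the current page
save_checkpoint(next_url, checkpoint)
print(load_start_url(first, checkpoint))    # now the saved next-chapter URL
```

In the main loop, `with_retries` would wrap a function that opens the request and returns the decoded HTML, and `save_checkpoint` would be called after each chapter has been written to the output file.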
