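6.1 Crawl target

Open http://seputu.com in Chrome and you will find the whole series there, from Part 1 through Part 8. What to crawl: extract each book title, chapter name, and chapter link, then fetch the text of every chapter and save it as a txt file. The book title becomes the folder name, and the chapter name becomes the txt file name.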

Day 6 | Master Python web crawling in 10 days: crawling 盜墓筆記

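6.2 Locating nodes

Press F12 to inspect the source and you will see that the HTML structure of the page is very clean. The book title sits in the node //div[@class="mulu-title"]/center/h2, and the chapter data (chapter name and link) sits in the //div[@class="box"]/ul tag. The XPath rules are quite simple.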
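The two XPath rules above can be tried out offline before touching the real site. The following is a minimal sketch, assuming a simplified, hypothetical copy of the catalogue markup (`SAMPLE`, the `example.com` links, and the helper name `list_chapters` are made up for illustration). It uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml so it runs without extra installs; ElementTree supports only a subset of XPath, which is why the paths start with `.//` here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the catalogue page (not the real markup).
SAMPLE = """
<html><body>
  <div class="mulu-title"><center><h2>Book One</h2></center></div>
  <div class="box"><ul>
    <li><a href="http://example.com/1.html">Chapter 1</a></li>
    <li><a href="http://example.com/2.html">Chapter 2</a></li>
  </ul></div>
</body></html>
"""


def list_chapters(markup):
    """Return [(book_title, [(chapter_name, href), ...]), ...]."""
    root = ET.fromstring(markup)
    # Same rules as in the text, written relative to the root element.
    titles = root.findall(".//div[@class='mulu-title']/center/h2")
    lists = root.findall(".//div[@class='box']/ul")
    books = []
    # Pair the n-th title with the n-th chapter list.
    for h2, ul in zip(titles, lists):
        chapters = [(a.text, a.get("href")) for a in ul.findall("li/a")]
        books.append((h2.text, chapters))
    return books


books = list_chapters(SAMPLE)
for title, chapters in books:
    print(title, chapters)
```

Real pages are rarely well-formed XML, so a tolerant HTML parser such as `lxml.etree.HTML` is still the right tool for the live site; this sketch only checks the selection logic.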

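Click the link for chapter 1 to open its detail page. Analyzing that page, the title node is under div.bg > h1, and the article text is in the tags under div.bg > div.content > div.single-content > div > div.content. The XPath rules here are quite simple as well.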
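The CSS-style paths above translate to XPath almost step by step: each `div.name` becomes `div[@class='name']`, and each `>` becomes `/`. Here is a minimal sketch under stated assumptions: `DETAIL` is a hypothetical, simplified detail page, the `<p>` children are an assumption (the text does not name the tag that holds the paragraphs), and `parse_detail` is an illustrative helper. The standard library's `xml.etree.ElementTree` is used as a stand-in for lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified detail page; the <p> tags are an assumption.
DETAIL = """
<html><body>
  <div class="bg">
    <h1>Chapter 1</h1>
    <div class="content">
      <div class="single-content">
        <div>
          <div class="content"><p>First paragraph.</p><p>Second paragraph.</p></div>
        </div>
      </div>
    </div>
  </div>
</body></html>
"""

# div.bg > h1
TITLE_XPATH = ".//div[@class='bg']/h1"
# div.bg > div.content > div.single-content > div > div.content
BODY_XPATH = (
    ".//div[@class='bg']/div[@class='content']"
    "/div[@class='single-content']/div/div[@class='content']"
)


def parse_detail(markup):
    """Return (title, [paragraph, ...]) for one detail page."""
    root = ET.fromstring(markup)
    heading = root.find(TITLE_XPATH)
    body = root.find(BODY_XPATH)
    paragraphs = [p.text for p in body.findall("p")]
    return heading.text, paragraphs


title, paragraphs = parse_detail(DETAIL)
print(title)
print(paragraphs)
```

Note that `[@class='content']` matches the attribute value exactly, so an element carrying several classes would not match. With lxml, which implements full XPath 1.0, `contains(@class, 'content')` is a common workaround.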

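6.3 Crawling the novel

There is a lot of content to crawl, so don't rush the code: sort out the plan first and build the features step by step. The book title is used as the folder name, and each chapter's text is saved to its own file inside that folder. When moving from one page to the next, pause for 2 seconds to mimic a person clicking links; otherwise the site may identify the traffic as a crawler.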
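The storage plan described above (one folder per book, one txt file per chapter, a pause after each page) can be sketched on its own before any network code is written. This is only an illustration: `safe_name` and `save_chapter` are hypothetical helpers, and replacing characters that Windows forbids in file names is an added precaution that the text does not mention.

```python
import os
import re
import tempfile
import time


def safe_name(name):
    """Replace characters that are not allowed in Windows file names."""
    return re.sub(r'[\\/:*?"<>|]', "_", name).strip()


def save_chapter(root_dir, book, chapter, text, delay=0.0):
    """Write one chapter to <root_dir>/<book>/<chapter>.txt and return the path."""
    folder = os.path.join(root_dir, safe_name(book))
    # exist_ok=True makes re-runs safe when the folder is already there.
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, safe_name(chapter) + ".txt")
    with open(path, "w", encoding="utf-8") as fd:
        fd.write(text)
    # Pause before the caller requests the next page (2 seconds in the text).
    time.sleep(delay)
    return path


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_chapter(tmp, "Book One", "Chapter 1: What?", "Hello", delay=0)
    print(os.path.relpath(saved, tmp))
```

In a real crawl the call would use `delay=2`, matching the pause described above.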

    import os
    import time

    import requests
    from requests.exceptions import RequestException
    from lxml import etree
    from lxml.etree import ParseError


    # Fetch the novel data
    class Storybook:
        headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                          "(KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
        }

        def __init__(self, link):
            self.link = link

        def get_detail(self, title, section, href):
            try:
                res = requests.get(href, headers=self.headers)
                # Avoid garbled characters
                res.encoding = "utf-8"
                # Parse the HTML text
                html = etree.HTML(res.text)
                contents = html.xpath('//div[@class="content-body"]/p')
                # Create the file only after the page has been fetched
                filename = os.path.join(title, section + ".txt")
                with open(filename, mode="w", encoding="utf-8") as fd:
                    for p in contents:
                        # p.text is None when a <p> starts with a child element
                        fd.write((p.text or "").strip() + "\n")
            except RequestException:
                print("Server request failed")
            except ParseError:
                print("Failed to parse data")
            except IOError:
                print("Failed to create file")

        def get_page(self):
            try:
                res = requests.get(self.link, headers=self.headers)
                # Avoid garbled characters
                res.encoding = "utf-8"
                # Parse the HTML text
                html = etree.HTML(res.text)
                # Locate the nodes we need
                titles = html.xpath('//div[@class="mulu-title"]/center/h2')
                contents = html.xpath('//div[@class="box"]/ul')
                # Pair each book title with its chapter list
                for h2, ul in zip(titles, contents):
                    # The book title is the folder name
                    title = h2.text
                    # Create the folder (no error if it already exists)
                    os.makedirs(title, exist_ok=True)
                    for a in ul.xpath("li/a"):
                        # Chapter name and link
                        section = a.text
                        href = a.get("href")
                        # Crawl the chapter text
                        self.get_detail(title, section, href)
                        # Pause 2 seconds before the next chapter
                        time.sleep(2)
                    # Pause 2 seconds before the next book
                    time.sleep(2)
            except RequestException:
                print("Server request failed")
            except ParseError:
                print("Failed to parse data")
            except IOError:
                print("Failed to create folder")


    if __name__ == "__main__":
        sb = Storybook("http://seputu.com")
        sb.get_page()
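One detail worth watching in the paragraph loop of get_detail: in both lxml and the standard library's ElementTree, an element's `.text` holds only the text that comes before its first child element, and it is `None` when the element opens with a child tag. A minimal sketch of a more forgiving alternative, using `itertext()` (available in both libraries) and a hypothetical helper named `paragraph_text`:

```python
import xml.etree.ElementTree as ET


def paragraph_text(p):
    """Join all text inside an element, including text nested in child tags."""
    return "".join(p.itertext()).strip()


# A paragraph that opens with a child element: .text is None here.
p = ET.fromstring("<p><b>Bold start</b> then plain text</p>")
print(p.text)             # None
print(paragraph_text(p))  # Bold start then plain text
```

Whether this matters depends on how the site marks up its chapters; for pages whose paragraphs contain plain text only, `.text` is sufficient.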

Output

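That's all Lao Chen (老陳) has to say about crawling the novel. If you found it helpful, please share and like the post so that more people can see it. Your shares and likes are the best encouragement for Lao Chen to keep writing and sharing.

I'm an old hand who spent 10 years as a technical director, sharing what I have learned from years of programming. If you want to learn to code, follow 老陳說程式設計 on 今日頭條 (Toutiao), where I share practical material on Python, front-end (mini programs), apps, and embedded development. Follow me, you won't regret it.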
