My First Web Crawler: The Blog Post I Had to Write About Crawlers

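Goal: crawl the complete text of one novel from a novel website.

Tools: Python 3.5, PyCharm

Time taken: 12 hours (much of it spent agonizing over details)

Result: it worked, of course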

# -*- coding: utf-8 -*-
import requests
import re
# Download the page
url = 'http://www.jingcaiyuedu.com/book/15401/list.html'
# Mimic a browser: send an HTTP GET request for the URL with requests;
# the server answers with a response carrying the status, data, and so on
response = requests.get(url)
# Set the page encoding explicitly
response.encoding = 'utf-8'
# HTML source of the target novel's index page
html = response.text
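# Novel title; findall returns a list, so [0] takes the first match
title = re.findall(r'<title>(.*?)</title>', html)[0]
# Create a file to hold the novel; open() creates and opens it, 'w' opens it for writing
fb = open('%s.txt' % title, 'w', encoding='utf-8')
# Alternatively: with open('%s.txt' % title, 'w', encoding='utf-8') as fb:
# print(html) shows what was fetched. The requested link is a web page, i.e. text,
# hence .text; without the right encoding the output comes out garbled
# Extract each chapter's info (title, URL) with a regular expression.
# This first pattern was wrong for this site; the pattern has to follow the actual markup:
#dl = re.findall(r'<dl id="list">.*?</dl>', html,re.S)[0]  # [0] pulls the match out of the list
# Nothing matched at first: '.' matches any character except a newline, so pass
# re.S (dot matches every character). .*? is a non-greedy match, and [0] pulls
# the matched string out of the list that findall returns.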

dl = re.findall(r'<dl class="panel-body panel-chapterlist">.*?</dl>', html, re.S)[0]
#print(dl)
# Chapter list: each (.*?) is a capture group, so findall returns (url, title) tuples taken from dl
chapter_info_list = re.findall(r'href="(.*?)" target="_blank" rel="external nofollow" >(.*?)<', dl)
# Loop over every chapter and download each one
for chapter_info in chapter_info_list:
    # chapter_title = chapter_info[1]
    # chapter_url = chapter_info[0]
    # The line below is equivalent to the two above
    chapter_url, chapter_title = chapter_info
    # Build the full URL
    chapter_url = "http://www.jingcaiyuedu.com%s" % chapter_url
    #print(chapter_info)
    #print(chapter_url, chapter_title)
    # Download the chapter; this gives the full HTML of the chapter page
    chapter_response = requests.get(chapter_url)
    chapter_response.encoding = 'utf-8'
    chapter_html = chapter_response.text
    # Extract the chapter body; findall returns a list, so take [0]
    chapter_content = re.findall(r'<div class="panel-body" id="htmlContent">.*?</div>',
                                 chapter_html, re.S)[0]
    # Clean the data
    chapter_content = chapter_content.replace(" ","")
    chapter_content = chapter_content.replace('&nbsp;','')
    chapter_content = chapter_content.replace('<br/>','')
    chapter_content = chapter_content.replace('<br>','')
    chapter_content = chapter_content.replace('<p>','')
    # The extraction pattern has no capture group, so findall returns the whole match,
    # wrapping <div> tags included (their spaces are already gone after the first
    # replace above). They serve no purpose here, so strip them.
    chapter_content = chapter_content.replace('<divclass="panel-body"id="htmlContent">','')
    chapter_content = chapter_content.replace('</div>','')


    # Persist the data
    fb.write(chapter_title)
    fb.write('\n')
    fb.write(chapter_content)
    fb.write('\n')

    # print(chapter_content)
    # exit()
    print(chapter_url)

# Close the file once every chapter has been written
fb.close()
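
The URL concatenation in the loop assumes every href is a site-relative path that starts with "/". As a minimal sketch of a more forgiving alternative, the standard library's urllib.parse.urljoin resolves an href against the URL of the page it was found on. The helper name absolute_url and the sample hrefs are made up for illustration and are not taken from the site.

```python
from urllib.parse import urljoin

# The chapter-list page the links were extracted from (same URL as in the script)
BASE_URL = 'http://www.jingcaiyuedu.com/book/15401/list.html'


def absolute_url(href, base=BASE_URL):
    """Resolve an href found on a page against that page's own URL."""
    return urljoin(base, href)


# Hypothetical hrefs, for illustration only
print(absolute_url('/book/15401/1.html'))   # site-relative path
print(absolute_url('2.html'))               # path relative to the current page
print(absolute_url('http://www.jingcaiyuedu.com/book/15401/3.html'))  # already absolute
```

All three forms come out as full URLs, so the loop would not need to know which form the site uses.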

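Pitfalls:

1. I originally meant to download from the URL of the novel's own page, but that page does not carry the complete table of contents, so I had to use the URL of the page that holds the full chapter list.

2. When looping over the chapters, the URL to visit and the title have to be extracted from the page, and the URL has to be joined into a complete address, otherwise the next step cannot run.

3. In the regex step, findall returns a list, and a list cannot be cleaned or written to the file directly. Adding [0] after each findall call pulls the matched string out of the list.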

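4. When extracting the chapter content, every chapter came out with the text of my regex (the wrapping div tags) attached. I did not know why at the time, so I deleted it during data cleaning to keep the output tidy and set the puzzle aside. The cause is that the pattern contains no capture group, so findall returns the entire match, tags included.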
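
Pitfalls 3 and 4 both come down to what re.findall returns. The sketch below runs the chapter pattern with and without a capture group on a made-up HTML fragment (the sample text is an assumption, not real site output), and shows one way to guard against an empty result before indexing with [0].

```python
import re

# Made-up chapter page fragment, for illustration only
sample_html = '<div class="panel-body" id="htmlContent">Chapter text<br/>more text</div>'

# No capture group: findall returns the whole match, tags included
whole = re.findall(r'<div class="panel-body" id="htmlContent">.*?</div>', sample_html, re.S)

# One capture group: findall returns only what the group matched
inner = re.findall(r'<div class="panel-body" id="htmlContent">(.*?)</div>', sample_html, re.S)

print(whole[0])
print(inner[0])

# findall always returns a list, and an empty one when nothing matches,
# so check it before indexing to avoid an IndexError
missing = re.findall(r'<dl id="list">(.*?)</dl>', sample_html, re.S)
first = missing[0] if missing else None
print(first)
```

With the capture group in place, the two replace calls that strip the div tags would no longer be needed.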

The final text output looked like this:

[Image: the resulting text file]

Takeaway: a crawler depends on making good use of regular expressions. I will review regex shortly and write up some notes.
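
As one example of putting regex to work, the chain of replace calls used for cleaning could be collapsed into a couple of re.sub passes. This is a rough sketch under assumptions, not the script's actual method: clean_chapter is a hypothetical helper, the sample fragment is made up, and unlike the original cleaning it keeps line breaks and the spaces inside a line.

```python
import re
from html import unescape


def clean_chapter(raw):
    """Turn an HTML fragment into plain text (a rough sketch, not a full HTML parser)."""
    # Treat <br>, <br/>, <p> and </p> as line breaks
    text = re.sub(r'<\s*(?:br\s*/?|/?p)\s*>', '\n', raw, flags=re.I)
    # Drop any remaining tags, such as the wrapping <div>
    text = re.sub(r'<[^>]+>', '', text)
    # Decode entities such as &nbsp; and &amp;
    text = unescape(text)
    # Trim whitespace around each line and drop empty lines
    lines = [line.strip() for line in text.splitlines()]
    return '\n'.join(line for line in lines if line)


# Made-up fragment, for illustration only
sample = ('<div class="panel-body" id="htmlContent">'
          '&nbsp;&nbsp;First line<br/><br>Second &amp; last<p></div>')
print(clean_chapter(sample))
```

For markup that is messier than this, a real HTML parser is usually a safer choice than regex.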
