7. Building a WeChat Mini Program Community Crawler with CrawlSpider -- Scrapy Beginner's Learning Notes ----- Mastering the Python Crawler Framework Scrapy

7. Building a WeChat Mini Program Community Crawler with CrawlSpider

Author: Irain

QQ: 2573396010

WeChat: 18802080892

Video resource link: https://www.bilibili.com/video/BV1P4411f7rP?p=83

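1 Creating the Project and the CrawlSpider Spider

1.1 Creating the project and the CrawlSpider spider in a DOS (command prompt) window

scrapy startproject wxapp  # create the project
cd wxapp  # enter the project folder
scrapy genspider -t crawl wxapp_spider "wxapp-union.com"  # create the spider
# A CrawlSpider is created differently from a regular spider: note the "-t crawl" template option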

1.2 Example of creating the project and the CrawlSpider spider


1.3 Opening the project and the CrawlSpider spider in PyCharm


2 LinkExtractor

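2.1 LinkExtractor and Rule parameters

allow : a LinkExtractor parameter. A regular expression (or list of them) that selects the target URLs; URLs that do not match are ignored.

follow : a Rule parameter. Whether to keep extracting links that satisfy the rules from the pages reached through this rule.

callback : a Rule parameter. The spider method called on the response of each matched URL, typically to extract URLs or page elements.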
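
To make the effect of allow concrete, here is a minimal sketch that uses only the standard re module rather than Scrapy itself. It assumes that allow patterns are applied with search semantics (a match anywhere in the URL is enough), and the sample URLs are made up for illustration; only the two patterns come from the spider in section 4.1.

```python
import re

# The two allow patterns used by the spider's rules in section 4.1.
LIST_PATTERN = re.compile(r'.+mod=list&catid=2&page=\d')
DETAIL_PATTERN = re.compile(r'.+article-.+\.html')

# Hypothetical sample URLs, invented for this illustration.
SAMPLE_URLS = [
    "http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=2",
    "http://www.wxapp-union.com/article-1234-1.html",
    "http://www.wxapp-union.com/forum.php",
]


def classify(url):
    """Return which rule's allow pattern accepts the URL, or None."""
    if LIST_PATTERN.search(url):
        return "list"
    if DETAIL_PATTERN.search(url):
        return "detail"
    return None


labels = [classify(url) for url in SAMPLE_URLS]
for url, label in zip(SAMPLE_URLS, labels):
    print(label, url)
```

Trying a pattern against a handful of real links from the site in this way is a quick check before running the whole crawl.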


2.2 Example (all of the code is given later)
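
The interplay of follow and callback can be sketched with a toy model. This is not Scrapy's implementation, only a simplified simulation under stated assumptions: SITE is a hypothetical in-memory link graph, the patterns are shortened stand-ins for the real ones, and crawl is an illustrative helper that visits each URL at most once.

```python
import re
from collections import deque

# Hypothetical in-memory "site": each page maps to the links found on it.
SITE = {
    "list?page=1": ["list?page=2", "article-1.html", "about.html"],
    "list?page=2": ["list?page=1", "article-2.html"],
    "article-1.html": ["article-3.html"],  # a "related article" link
    "article-2.html": [],
    "article-3.html": [],
    "about.html": [],
}


def crawl(start, detail_follow=False):
    """Return the detail pages that would be handed to the callback."""
    # (pattern, has_callback, follow), mirroring the intent of the two rules.
    rules = [
        (re.compile(r"list\?page=\d"), False, True),
        (re.compile(r"article-.+\.html"), True, detail_follow),
    ]
    seen = {start}
    parsed = []
    queue = deque([(start, True)])  # (url, extract links from this page?)
    while queue:
        url, follow = queue.popleft()
        if not follow:
            continue  # page was fetched, but its links are not extracted
        for link in SITE[url]:
            if link in seen:
                continue  # never schedule the same URL twice
            for pattern, has_callback, rule_follow in rules:
                if pattern.search(link):
                    seen.add(link)
                    if has_callback:
                        parsed.append(link)
                    queue.append((link, rule_follow))
                    break  # the first matching rule wins
    return parsed


print(crawl("list?page=1"))
print(crawl("list?page=1", detail_follow=True))
```

In this model, follow=False on the detail rule means article-3.html is never reached, because it is linked only from another detail page; with detail_follow=True it is parsed as well.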


3 Crawl Results


4 Code Appendix

4.1 wxapp_spider.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from wxapp.items import WxappItem


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'  # spider name
    allowed_domains = ['wxapp-union.com']  # domain the crawl is restricted to
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']  # start page

    rules = (
        # allow: pattern of the list pages to crawl; follow: keep extracting links from them (next pages)
        Rule(LinkExtractor(allow=r'.+mod=list&catid=2&page=\d'), follow=True),
        # callback: extracts the content of each detail page
        # follow=False: do not extract further links from detail pages, because they
        # contain other links that also match the rules
        Rule(LinkExtractor(allow=r'.+article-.+\.html'), callback="parse_detail", follow=False),
    )

    def parse_detail(self, response):  # parse a detail page
        title = response.xpath("//*[@id='ct']/div[1]/div/div[1]/div/div[2]/div[1]/h1/text()").get()  # extract the article title
        author = response.xpath("//p[@class='authors']//a/text()").get()  # extract the author
        time = response.xpath("//*[@id='ct']/div[1]/div/div[1]/div/div[2]/div[3]/div[1]/p/span/text()").get()  # extract the publication time
        content = response.xpath("//td//text()").getall()  # extract the article body as a list of text nodes
        content = "".join(content).strip()  # join the list into a single string
        item = WxappItem(title=title, author=author, time=time, content=content)  # build the item
        yield item  # hand the item to the pipelines
        print("=" * 40)  # visual separator in the console output
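
The join-and-strip step on the article body can be tried outside Scrapy. The sketch below assumes that getall() returns a list of text fragments; the fragments and the helper name join_text_nodes are hypothetical.

```python
def join_text_nodes(nodes):
    """Concatenate text fragments, then trim surrounding whitespace."""
    return "".join(nodes).strip()


# Hypothetical fragments, standing in for what getall() might return.
fragments = ["\r\n    ", "First paragraph.", "\n", "Second paragraph.", "\r\n"]
content = join_text_nodes(fragments)
print(repr(content))
```

Note that strip() only trims the two ends of the joined string; whitespace between fragments is kept as-is.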

4.2 pipelines.py

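from scrapy.exporters import JsonLinesItemExporter  # writes items to disk one at a time, so memory use stays low


class WxappPipeline(object):
    def __init__(self):
        # Open in binary mode: the exporter writes encoded bytes, so open() takes no encoding argument
        self.fp = open("wxjc.json", "wb")
        # ensure_ascii=False: write Chinese characters as-is instead of \uXXXX escapes
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding="utf-8")

    def process_item(self, item, spider):
        self.exporter.export_item(item)  # write one item as one line
        return item

    def close_spider(self, spider):
        self.fp.close()  # close the file when the spider finishes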
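
To show what the JSON Lines format looks like on disk, here is a sketch that imitates it with the standard json module instead of Scrapy's exporter. The helper names and the sample items are hypothetical, and the real exporter's output may differ in details such as key order or spacing.

```python
import io
import json


def write_json_lines(items, fp):
    """Write each item as one UTF-8 encoded JSON object per line."""
    for item in items:
        line = json.dumps(item, ensure_ascii=False) + "\n"
        fp.write(line.encode("utf-8"))  # fp is a binary file object


def read_json_lines(fp):
    """Read the file back, one JSON object per non-empty line."""
    return [json.loads(line.decode("utf-8")) for line in fp if line.strip()]


# Hypothetical items, invented for this illustration.
items = [
    {"title": "小程序教程", "author": "example author"},
    {"title": "Second post", "author": "example author"},
]

buffer = io.BytesIO()  # stands in for open("wxjc.json", "wb")
write_json_lines(items, buffer)
raw = buffer.getvalue()

buffer.seek(0)
restored = read_json_lines(buffer)
print(raw.decode("utf-8"))
```

Because every line is a complete JSON object, the file can be processed line by line without loading it whole, which is the memory advantage mentioned in the import comment.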

4.3 items.py

import scrapy
class WxappItem(scrapy.Item):  # container class that defines the item's fields
    title = scrapy.Field()
    author = scrapy.Field()
    time = scrapy.Field()
    content = scrapy.Field()

4.4 Parameter settings in settings.py:

Reference link: https://blog.csdn.net/weixin_42122125/article/details/105556273
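
The settings themselves are not reproduced in this post, so the excerpt below is only a guess at a typical configuration for this project, not the author's actual file. The setting names are standard Scrapy settings; every value is an assumption, apart from the pipeline path, which follows from the project and class names above.

```python
# settings.py (excerpt). All values are assumptions for illustration.

# Assumption: the tutorial crawl ignores robots.txt. Check the site's
# terms before doing this on a real crawl.
ROBOTSTXT_OBEY = False

# Assumption: wait one second between requests to go easy on the server.
DOWNLOAD_DELAY = 1

# Hypothetical User-Agent string; replace it with your own.
USER_AGENT = "Mozilla/5.0 (compatible; wxapp-tutorial-bot)"

# The pipeline must be enabled, otherwise process_item is never called.
# The number (0-1000) sets the order; lower values run first.
ITEM_PIPELINES = {
    "wxapp.pipelines.WxappPipeline": 300,
}
```

The ITEM_PIPELINES entry is the one the project cannot do without: if it is missing, the spider still runs but wxjc.json is never written.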

Published: April 17, 2020
