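Scrapy crawler beginner study notes (1)

Installing Scrapy is straightforward, and there are plenty of tutorials online, so installation is not covered here.

1. Scrapy architecture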

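- Engine (Scrapy engine)

Controls the data flow through the system and triggers events.

- Scheduler

Accepts requests sent by the engine, pushes them onto a queue, and hands them back when the engine asks for the next one. Think of it as a priority queue of URLs: it decides which URL is crawled next and filters out duplicates.

- Downloader

Downloads page content and returns it to the spider. The downloader is built on Twisted, an efficient asynchronous networking framework.

- Spiders

Extract the information you want (the items) from specific pages and submit it to the engine.

- Item pipeline

Processes the items extracted from pages, for example storing them or writing them out.

- Downloader middlewares

Process the requests and responses that pass between the engine and the downloader.

- Spider middlewares

Process the responses going into a spider and the requests coming out of it.

- Scheduler middlewares

Process the scheduling of the requests sent by the engine.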
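
The scheduler's role described in the list above (a priority queue of URLs with duplicate filtering) can be sketched in plain Python. This is only an illustration under simplifying assumptions, not Scrapy's actual scheduler: the class name ToyScheduler, its method names, and the "lowest number first" ordering are all choices made for this sketch.

```python
import heapq
import itertools


class ToyScheduler:
    """Illustrative URL priority queue with de-duplication (not Scrapy's real scheduler)."""

    def __init__(self):
        self._heap = []                    # entries are (priority, sequence, url)
        self._seen = set()                 # URLs that were already enqueued once
        self._counter = itertools.count()  # tie-breaker keeps insertion order stable

    def enqueue(self, url, priority=0):
        """Queue a URL; return False if it is a duplicate and was skipped."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, next(self._counter), url))
        return True

    def next_url(self):
        """Return the next URL to fetch (lowest priority number first), or None."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


scheduler = ToyScheduler()
scheduler.enqueue("http://example.com/b", priority=1)
scheduler.enqueue("http://example.com/a", priority=0)
duplicate_accepted = scheduler.enqueue("http://example.com/a")  # already seen
first = scheduler.next_url()
second = scheduler.next_url()
```

The set makes the duplicate check cheap, and the heap keeps "which URL comes next" independent of the order in which URLs were discovered.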

2. Scrapy project directory layout

tree

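- scrapy.cfg: the project's configuration file

- items.py: defines the project's target data, i.e. the items to collect

- pipelines.py: the project's pipeline file; its single job is to process item fields

- settings.py: the project's settings file

- spiders: the directory that holds the spider code

items.py

- Defines structured fields for storing the scraped data. An Item behaves much like a Python dict but adds extra protection against mistakes.

- An Item is defined by creating a subclass of scrapy.Item whose class attributes are of type scrapy.Field

- Here we create an ItcastItem class to build the item model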

import scrapy


class ItcastItem(scrapy.Item):
    # Teacher's name
    name = scrapy.Field()
    # Teacher's job title
    title = scrapy.Field()
    # Teacher's profile
    info = scrapy.Field()
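
To illustrate the "extra protection" mentioned above: an Item accepts only the fields that were declared, so a misspelled key fails immediately instead of silently creating a new entry. The sketch below imitates that behaviour with a plain dict subclass. StrictItem and TeacherItem are hypothetical stand-ins written for this note, not Scrapy classes, and the real scrapy.Item is implemented differently.

```python
class StrictItem(dict):
    """Dict that only accepts keys listed in `fields` (a stand-in for scrapy.Item)."""

    fields = ()  # subclasses declare their allowed keys here

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)


class TeacherItem(StrictItem):
    fields = ("name", "title", "info")


item = TeacherItem()
item["name"] = "Alice"          # declared field: accepted

try:
    item["nmae"] = "Bob"        # typo: rejected instead of silently stored
    typo_rejected = False
except KeyError:
    typo_rejected = True
```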

itcast.py

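- The spider's Python file

- Defines an ItcastSpider class that inherits from scrapy.Spider

- The class needs name, start_urls, and a parse method

- The parse method uses XPath, a language for locating information in XML documents; it can traverse XML elements and attribute values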

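import scrapy
# Package name ITcast as used for this project in settings.py
from ITcast.items import ItcastItem


class ItcastSpider(scrapy.Spider):
    # Spider name; required when launching the crawl
    name = 'itcast'
    # Domains the spider is allowed to crawl (optional)
    # allowed_domains = []
    # Start URLs; the first batch of requests is taken from this list
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']

    def parse(self, response):
        node_list = response.xpath("//div[@class='li_txt']")
        # Collects every item
        items = []
        for node in node_list:
            # Create an item to hold one teacher's data
            item = ItcastItem()
            # extract() turns the selector results into a list of Unicode strings
            name = node.xpath("./h3/text()").extract()
            title = node.xpath("./h4/text()").extract()
            info = node.xpath("./p/text()").extract()
            # Each result is a list; keep the first match
            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]
            items.append(item)
        # Return the items to the engine
        return items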
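
The XPath expressions used in parse can be tried out without running a crawl. Below is a minimal sketch using the standard library's xml.etree.ElementTree, which supports only a subset of XPath. The HTML fragment and the teacher data are made up for illustration, and Scrapy's own selectors (response.xpath) are more capable than this.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment shaped like the markup the spider expects.
page = """
<html><body>
  <div class="li_txt"><h3>Teacher A</h3><h4>Lecturer</h4><p>Teaches Python.</p></div>
  <div class="li_txt"><h3>Teacher B</h3><h4>Senior lecturer</h4><p>Teaches Java.</p></div>
</body></html>
""".strip()

root = ET.fromstring(page)
teachers = []
# ".//div[@class='li_txt']" selects every matching <div> below the root,
# playing the role of "//div[@class='li_txt']" in the spider.
for node in root.findall(".//div[@class='li_txt']"):
    teachers.append({
        "name": node.findtext("h3"),   # relative lookup, like "./h3/text()"
        "title": node.findtext("h4"),
        "info": node.findtext("p"),
    })
```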

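yield item  # hands the item to the pipeline

- Difference between yield and return: yield hands one result to the caller without ending the function, and execution continues from the same point when the next result is requested. After return, the function is finished. If all the data is collected in a list and returned only once the function completes, memory usage can be large, and everything collected so far is lost if the run is interrupted.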
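
The difference can be seen in a small standalone sketch. The function names collect_all and produce_one_by_one are hypothetical and the squares merely stand in for scraped items.

```python
def collect_all(n):
    """Build the full list first, then return it in one go."""
    results = []
    for i in range(n):
        results.append(i * i)
    return results              # nothing is available until the loop has finished


def produce_one_by_one(n):
    """Hand each value to the caller as soon as it is ready."""
    for i in range(n):
        yield i * i             # execution pauses here and resumes on the next request


squares = produce_one_by_one(1_000_000)
first_three = [next(squares) for _ in range(3)]   # only three values were computed
```

With the generator, no million-element list is ever built; each value can be processed (or written to disk) before the next one is produced.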

pipelines.py和settings.py

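Once an item has been collected in the spider, it is passed to the item pipeline. The pipeline components process the item in the order of the priority numbers assigned to them in settings.py. Typical tasks are:

- Validating the scraped data, for example checking that it contains certain fields such as name

- Detecting and dropping duplicates. URL de-duplication is handled by the scheduler; de-duplicating the data itself is up to you, for example with a set

- Saving the scraped results to a file or a database.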
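
The first two tasks in the list above can be sketched as a single pipeline class. This is a minimal sketch, not code from the project: DedupAndValidatePipeline is a hypothetical name, the choice of name as the de-duplication key is an assumption, and the fallback DropItem class exists only so the snippet also runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch also runs where Scrapy is not installed.
    class DropItem(Exception):
        pass


class DedupAndValidatePipeline:
    """Hypothetical pipeline: require a 'name' field and drop repeated names."""

    def __init__(self):
        self.seen_names = set()

    def process_item(self, item, spider):
        name = item.get("name")
        if not name:
            raise DropItem("missing name")
        if name in self.seen_names:
            raise DropItem(f"duplicate item: {name}")
        self.seen_names.add(name)
        return item


pipeline = DedupAndValidatePipeline()
kept = []
for candidate in [{"name": "A"}, {"name": "A"}, {"title": "no name"}, {"name": "B"}]:
    try:
        kept.append(pipeline.process_item(candidate, spider=None))
    except DropItem:
        pass  # in a real crawl the dropped item goes no further down the pipeline
```

Plain dicts are used as items here to keep the example self-contained.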

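# pipelines.py
import json


class ItcastPipeline(object):
    # process_item is required; __init__ and close_spider are optional and
    # are needed here because the pipeline writes to the local disk.

    # Runs once, when the pipeline is created
    def __init__(self):
        self.f = open("itcast_pipeline.json", 'w', encoding='utf-8')

    # Called for every item
    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + ', \n'
        self.f.write(content)
        # Returning the item tells the engine it has been handled, so the
        # next item can be processed. If other pipelines are enabled, the
        # item is passed on to the next pipeline, which is why every
        # pipeline class needs to return the item.
        return item

    # Also runs once, when the spider closes
    def close_spider(self, spider):
        self.f.close()
# A pipeline that writes to a database can be defined the same way, but it must be enabled in settings.py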

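Pipelines defined in pipelines.py only take effect once they are registered in settings.py.

# settings.py
# Enable pipelines; each pipeline class is followed by a priority number, and lower numbers run first
ITEM_PIPELINES = {
   # 300 is an example value; any integer works, conventionally in the 0-1000 range
   'ITcast.pipelines.ItcastPipeline': 300,
}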
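
The "lower number runs first" rule can be checked with a few lines of plain Python. The pipeline paths and numbers below are hypothetical and chosen only for illustration; the sorting mimics the ordering rule and is not how Scrapy itself loads pipelines.

```python
# Hypothetical pipeline paths and priority numbers, for illustration only.
ITEM_PIPELINES = {
    "myproject.pipelines.SaveToFilePipeline": 800,
    "myproject.pipelines.ValidatePipeline": 100,
    "myproject.pipelines.DedupPipeline": 300,
}

# Lower numbers run first, so sort the class paths by their value.
run_order = [path for path, _ in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
```

Leaving gaps between the numbers makes it easy to slot another pipeline in between later.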

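3. A first project

Goal

Crawl Tencent's job listings. URL: https://hr.tencent.com/position.php?&start=0

Create the project

scrapy startproject tencent

items.py

First define the fields in the TencentItem class to decide which information to collect.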

import scrapy


class TencentItem(scrapy.Item):
    # Job title
    positionName = scrapy.Field()
    # Link to the job's detail page (set by the spider)
    positionLink = scrapy.Field()
    # Job category
    positiontype = scrapy.Field()
    # Number of openings
    requireNum = scrapy.Field()
    # Work location
    positionLocation = scrapy.Field()
    # Posting date
    positionTime = scrapy.Field()

Create the spider

scrapy genspider Tencent "tencent.com"

Write the spider class

# -*- coding: utf-8 -*-
import scrapy
from tencent.items import TencentItem


class TencentSpider(scrapy.Spider):
    name = 'Tencent'
    allowed_domains = ['tencent.com']
    baseurl = 'https://hr.tencent.com/position.php?&start='
    offset = 0
    start_urls = []
    # 10 rows per page; 394 pages cover offsets 0..3930, the limit used in
    # the commented-out alternative at the end of parse()
    for i in range(394):
        start_urls.append(baseurl + str(offset + 10 * i))

    def parse(self, response):
        # Select the table rows with XPath
        node_list = response.xpath("//tr[@class='even'] | //tr[@class='odd']")
        for node in node_list:
            item = TencentItem()
            item['positionName'] = node.xpath("./td[1]/a/text()").extract()[0]
            item['positionLink'] = node.xpath("./td[1]/a/@href").extract()[0]
            # The job category cell can be empty
            if len(node.xpath("./td[2]/text()")):
                item['positiontype'] = node.xpath("./td[2]/text()").extract()[0]
            item['requireNum'] = node.xpath("./td[3]/text()").extract()[0]
            item['positionLocation'] = node.xpath("./td[4]/text()").extract()[0]
            item['positionTime'] = node.xpath("./td[5]/text()").extract()[0]
            yield item
        # Alternative: request the next page from inside parse()
        # if self.offset < 3930:
        #     self.offset += 10
        #     url = self.baseurl + str(self.offset)
        #     yield scrapy.Request(url, callback=self.parse)

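There are several ways to move on to the next page. One is to submit each following URL to the engine from inside parse:

yield scrapy.Request(url, callback=self.parse)

The other is to put all the URLs into start_urls up front. With the second approach the requests can be issued concurrently, so it is more efficient.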
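
Building the full list of page URLs up front can be sketched with a small helper. The function name build_page_urls is hypothetical; the page size of 10 and the last offset of 3930 are taken from the commented-out block in the spider.

```python
def build_page_urls(base_url, page_size, last_offset):
    """Return one URL per page, for offsets 0, page_size, ..., last_offset."""
    return [base_url + str(offset) for offset in range(0, last_offset + 1, page_size)]


urls = build_page_urls("https://hr.tencent.com/position.php?&start=", 10, 3930)
```

Because every URL is known before the crawl starts, the requests do not have to wait for the previous page to be parsed.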

Write the pipeline class

import json

class TencentPipeline(object):
    def __init__(self):
        self.f=open("tencent.json",'w',encoding='utf-8')
    def process_item(self, item, spider):
        contents=json.dumps(dict(item),ensure_ascii=False)+',\n'
        self.f.write(contents)
        return item
    def close_spider(self,spider):
        self.f.close()

Run

scrapy crawl Tencent
