[Python] Introduction to Scrapy crawlers && porting a re crawler to Scrapy

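Scrapy
- The Scrapy crawler framework
  - requests vs. Scrapy
  - Common Scrapy commands
- Basic use of a Scrapy crawler
  - A first Scrapy example
  - Basic use of a Scrapy crawler
Writing a Scrapy crawler example / porting the re crawler
- Douban Top100 crawler
- Original re code of the Douban Top100 crawler
- Environment setup reference
- Sample program reference
- Writing the program
  - Step 1: Create the project
  - Step 2: Create the Spider template
  - Step 3: Write the Spider
  - Step 4: Configure concurrency options to optimize crawl speed: write the Pipelines
- A Scrapy crawler for the Douban top250
  - References
  - douban.py
  - Processing data and saving the file with pipelines
  - The settings.py settings file
  - Running the crawler
  - Error handling
  - The corrected `douban.py`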

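Scrapy

The Scrapy crawler framework

Structure of the Scrapy crawler framework

Crawler frameworks

A crawler framework is a software structure together with a set of functional components that implement crawling.

A crawler framework is a half-finished product: it supplies the common machinery so that users can build a professional web crawler on top of it.

Scrapy consists of five main modules and two middleware layers.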


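Engine (no user modification needed)

Controls the data flow between all modules
Triggers events when conditions are met

Downloader

Downloads web pages according to requests
No user modification needed

Scheduler

No user modification needed
Schedules and manages all crawl requests

Spider (the core module)

Parses the responses (Response) returned by the Downloader
Produces scraped items (Scraped item)
Produces additional crawl requests (Request)

Item Pipelines

Process the items scraped by the Spider as a pipeline
Consist of an ordered sequence of operations, like an assembly line; each operation is an Item Pipeline type.
Possible operations include cleaning, validating, and de-duplicating the HTML data in scraped items, and saving the data to a database.

Middleware: Downloader Middleware

Purpose: user-configurable control over the traffic between the Engine, the Scheduler, and the Downloader

Function: modify, drop, or add requests or responses

Spider Middleware

Purpose: reprocess requests and scraped items

Function: modify, drop, or add requests or scraped items

Users can write configuration code for the middleware.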
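
The data flow between the five modules can be sketched with a toy model in plain Python. This is only an illustration under simplifying assumptions, not how Scrapy is implemented: the function names (engine, downloader, spider_parse, pipeline) and the in-memory SITE dictionary are made up for this example, and the middleware layers are left out.

```python
from collections import deque


def downloader(url, site):
    """Stand-in for the Downloader: look the page up in an in-memory 'site'."""
    return {"url": url, "body": site[url]}


def spider_parse(response):
    """Stand-in for the Spider: yield one item (dict) and follow-up requests (URLs)."""
    body = response["body"]
    yield {"url": response["url"], "title": body["title"]}
    for link in body["links"]:
        yield link


def pipeline(item, store):
    """Stand-in for an Item Pipeline: keep the item in a list."""
    store.append(item)
    return item


def engine(start_urls, site):
    """Stand-in for the Engine: move data between the other parts."""
    scheduler = deque(start_urls)  # stand-in for the Scheduler's request queue
    seen = set(start_urls)         # URLs that were already scheduled
    store = []
    while scheduler:
        response = downloader(scheduler.popleft(), site)
        for result in spider_parse(response):
            if isinstance(result, dict):
                pipeline(result, store)   # items go to the pipeline
            elif result not in seen:
                seen.add(result)          # new requests go back to the scheduler
                scheduler.append(result)
    return store


# Hypothetical three-page site used instead of real HTTP traffic.
SITE = {
    "/": {"title": "Home", "links": ["/a", "/b"]},
    "/a": {"title": "Page A", "links": ["/b"]},
    "/b": {"title": "Page B", "links": []},
}

items = engine(["/"], SITE)
print([item["title"] for item in items])  # ['Home', 'Page A', 'Page B']
```

In real Scrapy the user normally writes only the Spider and the Item Pipelines; the Engine, Scheduler, and Downloader are provided by the framework.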

requests vs. Scrapy


Common Scrapy commands


Basic use of a Scrapy crawler

A first Scrapy example


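The DemoSpider class can have any name, as long as it is a subclass of scrapy.Spider.

parse() handles the response: it parses the content into dictionaries and discovers new URLs to request.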

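Steps to produce it

Step 1: Create a Scrapy crawler project
Step 2: Generate a Scrapy spider inside the project
Step 3: Configure the generated spider
Step 4: Run the crawler and fetch the page.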


cd pycodes
scrapy startproject python123demo
cd python123demo
scrapy genspider demo python123.io  # generate a spider named demo; this creates demo.py

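The yield keyword and generators. A generator is a function that keeps producing values.

A function that contains a yield statement is a generator.

Each time a generator produces a value (at a yield statement), the function is frozen; when it is woken up, it produces the next value.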


Example: a generator that produces the squares of the numbers smaller than n and hands each value back to its caller in turn.


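The ordinary approach stores all the numbers at once.

The generator approach needs only one value at a time, so the storage in use at any moment is that of a single value.

When n is large, the generator has a big advantage.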
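
The two approaches can be sketched side by side. The function names squares_list and squares_gen are made up for this illustration; the original code was shown as an image that is not preserved here.

```python
def squares_list(n):
    """Ordinary approach: build and return the whole list at once."""
    return [i ** 2 for i in range(n)]


def squares_gen(n):
    """Generator approach: produce one square per step, then pause at yield."""
    for i in range(n):
        yield i ** 2


print(squares_list(5))        # [0, 1, 4, 9, 16]

for value in squares_gen(5):  # the for loop wakes the generator up each round
    print(value, end=" ")     # 0 1 4 9 16
print()
```

Both produce the same values; the difference is that squares_list holds all n results in memory, while squares_gen only ever holds the current one.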


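Basic use of a Scrapy crawler

Steps for using a Scrapy crawler

Step 1: Create a project and a Spider template
Step 2: Write the Spider
Step 3: Write the Item Pipeline
Step 4: Tune the configuration

Scrapy data types

The Request class
The Response class
The Item class

The Request class

class scrapy.http.Request()
A Request object represents an HTTP request.

It is generated by the Spider and executed by the Downloader.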


The Response class

class scrapy.http.Response()
A Response object represents an HTTP response.

It is generated by the Downloader and processed by the Spider.


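The Item class

class scrapy.item.Item()
An Item object represents a piece of information extracted from an HTML page.

It is generated by the Spider and processed by the Item Pipelines.

In short: scrape the information, wrap it in a dictionary-like object, and store it.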

Scrapy crawlers support several ways of extracting information from HTML:

Beautiful Soup
lxml
re
XPath Selector
CSS Selector


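Writing a Scrapy crawler example / porting the re crawler

Douban Top100 crawler

Technical approach: scrapy

Goal: fetch the title, rank, score, and related details of the movies on Douban's top-movies list

Output: saved to a file

Original re code of the Douban Top100 crawler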

import re
import time

import requests
import xlwt
from bs4 import BeautifulSoup


def get_one_page(url):
    """Download one page and return its HTML ("" if the request fails)."""
    hd = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
    }
    try:
        r = requests.get(url, timeout=30, headers=hd)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        print("geturl wrong!")
        return ""


def parser_one_page(s, sheet, row):
    """Parse one list page, write one row per movie, and return the next free row."""
    s = re.sub('<br>', ' ', s)
    s = re.sub('&nbsp;', '', s)
    soup = BeautifulSoup(s, "html.parser")
    for item in soup.find_all('div', 'item'):
        item_name = item.find(class_='title').string
        item_img = item.find('a').find('img').get('src')
        item_index = item.find(class_='').string  # rank; selector kept from the original
        item_score = item.find(class_='rating_num').string
        item_author = item.find('p').text
        # Not every movie has a one-line blurb, so fall back to an empty string.
        inq = item.find(class_='inq')
        item_intr = inq.string if inq is not None else ''
        print('Scraped movie: ' + ' | '.join(
            [item_index, item_name, item_img, item_score, item_author, item_intr]))
        fields = [item_name, item_img, item_index, item_score, item_author, item_intr]
        for col, value in enumerate(fields):
            sheet.write(row, col, value)
        row += 1
    return row


if __name__ == '__main__':
    book = xlwt.Workbook(encoding='utf-8', style_compression=0)
    sheet = book.add_sheet('Douban Top250', cell_overwrite_ok=True)
    for col, header in enumerate(['Title', 'Image', 'Rank', 'Score', 'Author', 'Blurb']):
        sheet.write(0, col, header)
    n = 1  # next row to write; row 0 holds the header
    urls = ['https://movie.douban.com/top250?start={}&filter='.format(i) for i in range(0, 250, 25)]
    for url in urls:
        html = get_one_page(url)
        n = parser_one_page(html, sheet, n)
        time.sleep(2)
    # xlwt writes the binary .xls format, so save with an .xls extension.
    book.save('douban_top250_movies.xls')

Environment setup reference

Creating a Scrapy project in PyCharm

Sample program reference

import scrapy


class NewsSpider(scrapy.Spider):
    name = 'news'
    allowed_domains = ['hitwh.edu.cn']
    start_urls = ['http://today.hitwh.edu.cn/1024/list.htm']

    def decode(self, s):
        # Despite its name, this encodes a str into UTF-8 bytes.
        return bytes(s, encoding='utf-8')

    def parse(self, response):
        # The pager element holds the total number of list pages.
        num = response.css(".all_pages ::text").extract()[0]
        for i in range(int(num)):
            url = "http://today.hitwh.edu.cn/1024/list" + str(i + 1) + ".htm"
            yield scrapy.Request(url, callback=self.parse_pages)

    def parse_pages(self, response):
        # Follow every article link on a list page.
        for item in response.css("#righ_list li"):
            url = "http://today.hitwh.edu.cn" + item.css("a ::attr(href)").extract()[0]
            yield scrapy.Request(url, callback=self.parse_detail)

    def parse_detail(self, response):
        # Slice the fixed labels off the date and author strings.
        nav = response.css(".newsNav ::text").extract()
        time = nav[2][6:-2]
        author = nav[0][5:-4]
        if author == '':
            author = "none"
        yield {
            "title": response.css(".newsTitle ::text").extract()[0],
            "time": time,
            "author": author,
        }




# pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import json


class HitwhPipeline:
    def process_item(self, item, spider):
        return item


class pagePipeline:
    def open_spider(self, spider):
        self.f = open("information.json", "w", encoding='utf-8')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        # Write each item as one JSON object per line.
        try:
            line = json.dumps(dict(item), ensure_ascii=False) + "\n"
            self.f.write(line)
        except Exception as e:
            print("ERROR", e)
        return item

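Writing the program

Steps

Step 1: Create the project and the Spider template

Step 2: Write the Spider

Step 3: Write the Item Pipelines

Step 1: Create the project

Open PyCharm and create a new project in a directory of your choice.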


Step 2: Create the Spider template

scrapy startproject Web_scrapy
cd Web_scrapy
scrapy genspider douban movie.douban.com/top250


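Next, modify the spiders/douban.py file.

Step 3: Write the Spider

Configure the douban.py file:

Change how returned pages are handled

Change how newly discovered URL requests are handled

Step 4: Configure concurrency options to optimize crawl speed: write the Pipelines

Configure the pipelines.py file

Define the class that processes scraped items (Scraped Item)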


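A Scrapy crawler for the Douban top250

References

A brief analysis of Python yield

CSS: CSS selectors commonly used in Python crawlers

Using CSS selectors in the Scrapy crawler framework

Scrapy: ways to run a crawler

Extracting the href attribute of an a tag with Scrapy CSS selectors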

douban.py


The CSS selector approach used here is to select every element whose href attribute contains the string https://movie.douban.com/subject/.

# First draft. It still contains the errors discussed under "Error handling" below.
import scrapy


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['movie.douban.com/top250']
    start_urls = ['http://movie.douban.com/top250/']

    def decode(self,s):# encode the data as UTF-8 bytes
        return bytes(s,encoding='utf-8')
    def parse(self, response):
        for i in range(25,225,25):
            try:
                url="http://movie.douban.com/top250?start="+str(i)
                yield scrapy.Request(url,callback=self.parse_pages)
                # A function that contains yield is a generator. Unlike a normal function,
                # calling it looks like a function call but runs none of the function body
                # until next() is called on it (a for loop calls next() automatically).
                # Execution still follows the function's normal flow, but it pauses at every
                # yield and returns one value; the next step resumes right after that yield.
                # It looks as if a normal function were interrupted several times by yield,
                # each interruption handing back the current value.
            except:
                print("wrong page",str(i))
                continue

    def parse_pages(self,response):
        for item in response.css("[href~=https://movie.douban.com/subject/]"):
        # Select the tags whose href contains https://movie.douban.com/subject/; this already selects the a tags directly
        #for item in response.css(".hd")
            url=item.css("a::attr(href)").extract[0]
            yield scrapy.Request(url,callback=self.parse_detail)

    def parse_detail(self,response):
        total={}
        number=response.css(".top250-no::text").extract[0]
        name=response.css("div>h1::text").extract[0]
        year=response.css(".year::text").extract[0]
        # Tidy up year: use strip to remove the leading "(" and trailing ")"
        year.strip("()")
        score=response.css(".ll rating_num::text").extract[0]
        total.update({"rank":number,"title":name,"year":year,"score":score})
        yield total

Processing data and saving the file with pipelines

import json

class WebScrapyPipeline:
    def process_item(self, item, spider):
        return item

class pagePipeline:
    def open_spider(self,spider):
        self.f=open("information.json","w",encoding='utf-8')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        try:
            line = str(json.dumps(dict(item), ensure_ascii=False)) + "\n"
            self.f.write(line)
        except Exception as e:
            print(str(e))
            print("ERROR")
            pass
        return item

The settings.py settings file

Request headers

DEFAULT_REQUEST_HEADERS = {
  'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  #'Accept-Language': 'en',
}

Pipelines

ITEM_PIPELINES = {
   'Web_scrapy.pipelines.WebScrapyPipeline': 300,
   'Web_scrapy.pipelines.pagePipeline': 300,
}
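
Step 4 also mentions configuring concurrency options to tune the crawl speed. A possible settings.py excerpt is sketched below; the setting names are Scrapy's, but the values are illustrative assumptions and were not part of the original project.

```python
# settings.py (excerpt): illustrative values, not tuned for any particular site
CONCURRENT_REQUESTS = 16            # maximum number of concurrent requests overall
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # maximum number of concurrent requests to a single domain
DOWNLOAD_DELAY = 1                  # seconds to wait between requests to the same site
```

Lower concurrency and a non-zero delay slow the crawl down but put less load on the target site, which also makes it less likely that requests get blocked.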

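Running the crawler

Example 1: global run, with scrapy runspider <spider_file.py>

Example 2: project-level run, with scrapy crawl <spider> (for this project, scrapy crawl douban)

<spider> is the name of a crawler, i.e. the name attribute of the spider class (required, and unique within the project).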


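Error handling

Problem 1:

Fix how the URL pool is extracted (the selector in parse_pages; the corrected douban.py below uses div.hd>a).

Problem 2:

TypeError: 'method' object is not subscriptable

Explanation of the fix: extract is a method, so without the parentheses extract[0] tries to index the method object itself instead of the list it returns.

Python TypeError: 'method' object is not subscriptable Solution

# Wrong:
url=item.css("a::attr(href)").extract[0]

# Right:
url=item.css("a::attr(href)").extract()[0]

Problem 3:

Error (the message itself was not preserved; judging by the reference below, an IndexError: list index out of range raised when a selector matches nothing and extract()[0] indexes an empty list)

Reference for the fix:

IndexError: list index out of range and python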
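
One way to guard against this kind of IndexError is to avoid indexing a possibly empty list directly. The helper below is a hypothetical illustration in plain Python, not something from the original project. Scrapy selectors also offer get() and extract_first(), which return None instead of raising when nothing matches.

```python
def first_or_default(values, default=None):
    """Return the first element of a list, or `default` if the list is empty."""
    return values[0] if values else default


matched = ["9.7"]  # what extract() returns when the selector matches
missing = []       # what extract() returns when nothing matches

print(first_or_default(matched))         # 9.7
print(first_or_default(missing, "n/a"))  # n/a
```

An empty result usually means the selector does not match the page that was actually downloaded, so it is worth checking the selector against the real response as well.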

The corrected `douban.py`

Copy the selector of the target tag directly from the browser's developer tools (F12).


import scrapy


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    # allowed_domains takes bare domains, not URL paths:
    # allowed_domains = ['movie.douban.com/top250']
    allowed_domains = ['movie.douban.com']
    start_urls = ['http://movie.douban.com/top250/']

    def decode(self, s):  # encode the data as UTF-8 bytes
        return bytes(s, encoding='utf-8')

    def parse(self, response):
        # Crawl the first page
        url = "http://movie.douban.com/top250"
        yield scrapy.Request(url, callback=self.parse_pages)
        # Crawl the remaining pages
        for i in range(25, 250, 25):
            try:
                url = "http://movie.douban.com/top250?start=" + str(i)
                yield scrapy.Request(url, callback=self.parse_pages)
                # A function that contains yield is a generator. Calling it runs none of
                # the function body until next() is called on it (a for loop does this
                # automatically). Execution pauses at every yield, returns one value,
                # and resumes from the statement after that yield on the next step.
            except Exception:
                print("wrong page", str(i))
                continue

    def parse_pages(self, response):
        # Each movie title on a list page is an a tag directly inside div.hd
        for item in response.css("div.hd>a"):
            url = item.css("a::attr(href)").extract()[0]
            yield scrapy.Request(url, callback=self.parse_detail)

    def parse_detail(self, response):
        total = {}
        number = response.css("span.top250-no::text").extract()[0]
        name = response.css("#content > h1 > span:nth-child(1)::text").extract()[0]
        year = response.css("#content > h1 > span.year::text").extract()[0]
        # Optional tidy-up of year: remove the leading "(" and trailing ")"
        #year = year.strip("()")
        #year = year.replace('(', '').replace(')', '')
        # extract_first() returns a single string, or None if nothing matches
        score = response.css("#interest_sectl > div.rating_wrap.clearbox > div.rating_self.clearfix > strong::text").extract_first()
        total.update({"rank": number, "title": name, "year": year, "score": score})
        yield total
