This article is a worked example of Scrapy's CrawlSpider.
Crawl target: technology news on China.com (中華網)
URL: https://tech.china.com/articles/
1. Create the project
scrapy startproject zhonghuawang
2. Generate the CrawlSpider
cd zhonghuawang
scrapy genspider -t crawl china tech.china.com
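The -t crawl option tells genspider to use Scrapy's crawl template, so the generated china.py starts out roughly like the skeleton below (the exact boilerplate varies slightly between Scrapy versions):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ChinaSpider(CrawlSpider):
    name = 'china'
    allowed_domains = ['tech.china.com']
    start_urls = ['http://tech.china.com/']

    rules = (
        # placeholder rule generated by the template; we replace it below
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item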
3. Page analysis
List page:
Open the F12 developer tools and you will find that all the entries sit inside the div[@class="m2left topborder"] tag; each child div[@class="con_item"] under it holds one news item, including its link. After looking at a few of them, the pattern of the detail-page URLs becomes clear.
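You can verify this interactively in a scrapy shell session (the XPath below reflects the structure described above and is an assumption about the page at the time of writing):

scrapy shell https://tech.china.com/articles/
>>> response.xpath('//div[@class="con_item"]//a/@href').extract()[:5]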
With that, we can write the link-extraction rule for the detail pages. Since we also need to extract data from each detail page, the rule must have a callback function to process the response:
Rule(LinkExtractor(allow=r'/article/\d+/\d+\.html'), callback='parse_item', follow=False)
Next, work out how pagination works; clicking through a few "next page" links makes the URL pattern easy to spot.
Then inspect those pagination links with the F12 developer tools.
That gives us the second Rule. It only needs to extract and follow links, with no further processing, so no callback is required (when a Rule has no callback, follow defaults to True, so the pagination links are followed automatically):
Rule(LinkExtractor(allow=r'/articles/index[0-9_]+\.html'))
The complete rules are as follows:
rules = (
    Rule(LinkExtractor(allow=r'/articles/index[0-9_]+\.html')),
    Rule(LinkExtractor(allow=r'/article/\d+/\d+\.html'), callback='parse_item', follow=False),
)
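Before running the spider, a quick sanity check of the two patterns with Python's re module can save a debugging round trip (the URLs below are made-up examples in the format observed on the site):

import re

listing = re.compile(r'/articles/index[0-9_]+\.html')
detail = re.compile(r'/article/\d+/\d+\.html')

# hypothetical URLs following the observed format
samples = [
    'https://tech.china.com/articles/index_2.html',
    'https://tech.china.com/article/20180628/20180628123456.html',
]
for url in samples:
    kind = 'list' if listing.search(url) else 'detail' if detail.search(url) else 'no match'
    print(url, '->', kind)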
Detail page:
The fields to extract are: title, URL, article body, publish time, source, and site name. With the fields settled, we can write items.py:
import scrapy

class ZhonghuawangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # title
    title = scrapy.Field()
    # URL
    url = scrapy.Field()
    # article body
    content = scrapy.Field()
    # publish time
    datetime = scrapy.Field()
    # source
    source = scrapy.Field()
    # site name
    website = scrapy.Field()
Next, use the F12 developer tools to locate the tag each field lives in, then extract it with an XPath expression:
def parse_item(self, response):
    item = ZhonghuawangItem()
    # title
    item['title'] = response.xpath('//h1[@id="chan_newsTitle"]/text()').extract()[0]
    # URL
    item['url'] = response.url
    # body: join every text node under the detail div, not just the first <p>
    # item['content'] = response.xpath('//div[@id="chan_newsDetail"]/p/text()').extract()[0]
    item['content'] = ''.join(response.xpath('//div[@id="chan_newsDetail"]//text()').extract()).strip()
    # publish time: the first two whitespace-separated tokens of the info line
    item['datetime'] = ' '.join(response.xpath('//div[@id="chan_newsInfo"]/text()').extract()[1].strip().split(' ')[:2])
    # source: the text after the "來源:" label
    # item['source'] = response.xpath('//div[@id="chan_newsInfo"]/text()').extract()[1].strip().split(' ')[-1][3:]
    item['source'] = response.xpath('//div[@id="chan_newsInfo"]/text()').re_first('來源:(.*)').strip()
    # site name
    item['website'] = '中華網'
    yield item
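Note that extract()[0] raises IndexError, and .strip() on a re_first() miss raises AttributeError, whenever a page deviates from the expected layout. A more defensive variant of two of the extractions (a sketch, not the original code) falls back to empty strings:

# fall back to an empty string when the node or pattern is missing
item['title'] = (response.xpath('//h1[@id="chan_newsTitle"]/text()').extract_first() or '').strip()
item['source'] = (response.xpath('//div[@id="chan_newsInfo"]/text()').re_first('來源:(.*)') or '').strip()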
4. Saving the data
After extraction, the items are written to a JSON file (you could save them to a database instead), which means writing pipelines.py:
import json

class ZhonghuawangPipeline(object):
    def __init__(self):
        # utf-8 is needed because ensure_ascii=False writes raw Chinese text
        self.filename = open('zhonghuawang.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.filename.write(text)
        return item

    def close_spider(self, spider):
        self.filename.close()
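Scrapy also invokes an open_spider hook on pipelines; opening the file there rather than in __init__ is the more idiomatic arrangement, since it pairs naturally with close_spider (a sketch of the same pipeline):

import json

class ZhonghuawangPipeline(object):
    def open_spider(self, spider):
        # opened when the spider starts, closed when it finishes
        self.filename = open('zhonghuawang.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.filename.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.filename.close()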
5. Edit settings.py
# do not obey robots.txt
ROBOTSTXT_OBEY = False
# disable cookies
COOKIES_ENABLED = False
# set default request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}
# enable the item pipeline
ITEM_PIPELINES = {
    'zhonghuawang.pipelines.ZhonghuawangPipeline': 300,
}
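If you want to be gentler on the site, Scrapy's built-in throttling settings can go here too (optional; not part of the original configuration):

# optional politeness settings (not in the original tutorial)
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 8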
6. Run the spider
scrapy crawl china
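As an aside: if all you need is a JSON-lines dump, Scrapy's built-in feed exports can replace the custom pipeline entirely:

scrapy crawl china -o zhonghuawang.jl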
The complete code is as follows:
items.py
import scrapy

class ZhonghuawangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # title
    title = scrapy.Field()
    # URL
    url = scrapy.Field()
    # article body
    content = scrapy.Field()
    # publish time
    datetime = scrapy.Field()
    # source
    source = scrapy.Field()
    # site name
    website = scrapy.Field()
china.py
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from zhonghuawang.items import ZhonghuawangItem

class ChinaSpider(CrawlSpider):
    name = 'china'
    allowed_domains = ['tech.china.com']
    start_urls = ['http://tech.china.com/articles/']

    rules = (
        Rule(LinkExtractor(allow=r'/articles/index[0-9_]+\.html')),
        Rule(LinkExtractor(allow=r'/article/\d+/\d+\.html'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        item = ZhonghuawangItem()
        # title
        item['title'] = response.xpath('//h1[@id="chan_newsTitle"]/text()').extract()[0]
        # URL
        item['url'] = response.url
        # body: join every text node under the detail div, not just the first <p>
        # item['content'] = response.xpath('//div[@id="chan_newsDetail"]/p/text()').extract()[0]
        item['content'] = ''.join(response.xpath('//div[@id="chan_newsDetail"]//text()').extract()).strip()
        # publish time: the first two whitespace-separated tokens of the info line
        item['datetime'] = ' '.join(response.xpath('//div[@id="chan_newsInfo"]/text()').extract()[1].strip().split(' ')[:2])
        # source: the text after the "來源:" label
        # item['source'] = response.xpath('//div[@id="chan_newsInfo"]/text()').extract()[1].strip().split(' ')[-1][3:]
        item['source'] = response.xpath('//div[@id="chan_newsInfo"]/text()').re_first('來源:(.*)').strip()
        # site name
        item['website'] = '中華網'
        yield item
pipelines.py
import json

class ZhonghuawangPipeline(object):
    def __init__(self):
        # utf-8 is needed because ensure_ascii=False writes raw Chinese text
        self.filename = open('zhonghuawang.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.filename.write(text)
        return item

    def close_spider(self, spider):
        self.filename.close()
settings.py
# do not obey robots.txt
ROBOTSTXT_OBEY = False
# disable cookies
COOKIES_ENABLED = False
# set default request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}
# enable the item pipeline
ITEM_PIPELINES = {
    'zhonghuawang.pipelines.ZhonghuawangPipeline': 300,
}