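...With nothing better to do, I scraped my beloved Bilibili (B站)~~~ RIP
First, open Bilibili's anime (番劇) index page
ps: I used to browse this index page all the time looking for anime to watch, so I know my way around it~ (smirk)
Paging through the list does not change the URL. The Network tab in Chrome DevTools shows that each page loads an XHR request that returns a JSON response
Paste it into Atom to see what the data looks like
To handle pagination, look at the pattern in the query string: of all those parameters, only page changes
So everything from here on is easy~ hehe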
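Since only page varies, the page URLs can be generated instead of typed out. Here is a minimal sketch of that idea; the endpoint and parameter values are copied from the request used in the spider, while the names BASE_URL, FIXED_PARAMS and build_page_url are made up for illustration.

```python
from urllib.parse import urlencode

# Endpoint and query parameters as seen in the Network tab.
BASE_URL = "https://bangumi.bilibili.com/media/web_api/search/result"
FIXED_PARAMS = {
    "season_version": -1,
    "area": -1,
    "is_finish": -1,
    "copyright": -1,
    "season_status": -1,
    "season_month": -1,
    "pub_date": -1,
    "style_id": -1,
    "order": 3,
    "st": 1,
    "sort": 0,
    "season_type": 1,
    "pagesize": 20,
}


def build_page_url(page):
    """Return the search URL for one page; only `page` varies."""
    params = dict(FIXED_PARAMS, page=page)
    return BASE_URL + "?" + urlencode(params)


# Illustration only: the first three pages.
urls = [build_page_url(p) for p in range(1, 4)]
```

The parameter order differs from the URL in the browser, which normally does not matter for a query string.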
items.py
import scrapy
from scrapy import Field


class BilibiliItem(scrapy.Item):
    title = Field()
    cover = Field()
    sum_index = Field()
    is_finish = Field()
    link = Field()
    follow = Field()
    plays = Field()
    score = Field()
    _id = Field()
bzhan.py
import scrapy
import demjson  # third-party library; install it with pip first
from bilibili.items import BilibiliItem


class BzhanSpider(scrapy.Spider):
    name = 'bzhan'
    allowed_domains = ['bilibili.com']
    start_urls = ['https://bangumi.bilibili.com/media/web_api/search/result?season_version=-1&area=-1&is_finish=-1&copyright=-1&season_status=-1&season_month=-1&pub_date=-1&style_id=-1&order=3&st=1&sort=0&page=1&season_type=1&pagesize=20']

    def parse(self, response):
        json_content = demjson.decode(response.text)
        datas = json_content["result"]["data"]
        for data in datas:
            # Create a fresh item per entry so yielded items do not share state
            item = BilibiliItem()
            title = data['title']
            order = data['order']
            item['_id'] = title  # the title doubles as the MongoDB _id
            item['title'] = title
            item['cover'] = data['cover']
            item['sum_index'] = data['index_show']
            item['is_finish'] = 'finished' if data['is_finish'] == 1 else 'ongoing'
            item['link'] = data['link']
            item['follow'] = order['follow']
            item['plays'] = order['play']
            # Some entries have no score
            item['score'] = order.get('score', 'unknown')
            yield item
        # Queue pages 2-155. Every parsed page yields these again, but
        # Scrapy's default duplicate filter drops the repeated requests.
        urls = ['https://bangumi.bilibili.com/media/web_api/search/result?season_version=-1&area=-1&is_finish=-1&copyright=-1&season_status=-1&season_month=-1&pub_date=-1&style_id=-1&order=3&st=1&sort=0&page={0}&season_type=1&pagesize=20'.format(k) for k in range(2, 156)]
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)
Parsing just means treating the decoded JSON as nested Python dicts.. nothing hard
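To make the dict-style parsing concrete, here is a small sketch that runs without Scrapy. The key names (result, data, order, score and so on) are the ones the spider reads; the sample values and the helper name extract_records are invented for illustration, and the standard json module stands in for demjson.

```python
import json

# Hypothetical sample shaped like the API response; values are invented.
SAMPLE_RESPONSE = """
{"result": {"data": [
  {"title": "Example A", "cover": "https://example.com/a.jpg",
   "index_show": "12 episodes", "is_finish": 1,
   "link": "https://example.com/a",
   "order": {"follow": "1000", "play": "20000", "score": "9.5"}},
  {"title": "Example B", "cover": "https://example.com/b.jpg",
   "index_show": "Episode 3", "is_finish": 0,
   "link": "https://example.com/b",
   "order": {"follow": "50", "play": "700"}}
]}}
"""


def extract_records(text):
    """Decode the JSON text and pull out the fields of interest."""
    payload = json.loads(text)
    records = []
    for entry in payload["result"]["data"]:
        order = entry["order"]
        records.append({
            "title": entry["title"],
            "is_finish": "finished" if entry["is_finish"] == 1 else "ongoing",
            "follow": order["follow"],
            "plays": order["play"],
            # Fall back to a placeholder when an entry has no score.
            "score": order.get("score", "unknown"),
        })
    return records


records = extract_records(SAMPLE_RESPONSE)
```

Using dict.get for the score avoids a try/except around the one key that may be missing.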
pipelines.py
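import pymongo


class BilibiliPipeline(object):
    def open_spider(self, spider):
        # Open one connection per crawl instead of one per item
        self.client = pymongo.MongoClient('localhost', 27017)
        self.collection = self.client['mydb']['bilibili']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Convert the scrapy Item to a plain dict before inserting
        self.collection.insert_one(dict(item))
        print(item)
        return item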
settings.py omitted......
In the end this scrapes more than three thousand records
Feeling sorry for my Bilibili for a second..