
Scrapy Example: Scraping All of Bilibili's Anime Series Info (Ajax API + JSON Parsing)

...Had nothing better to do, so I went and scraped my beloved Bilibili~~~ RIP

First, open Bilibili's anime index page.

ps: I used to browse this index page all the time looking for anime to watch, so the moves come naturally~

[Screenshot: the Bilibili anime index page]

Paging through the list doesn't change the URL at all. The Network tab in Chrome DevTools shows an XHR request firing on each page turn and returning a JSON response.

[Screenshot: DevTools Network tab with the XHR request and its JSON response]
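Before touching Scrapy, it's worth sanity-checking the endpoint with a plain requests call (a quick throwaway sketch; requests is not part of the project itself):

import requests

# the same URL the spider below starts from
url = ('https://bangumi.bilibili.com/media/web_api/search/result'
       '?season_version=-1&area=-1&is_finish=-1&copyright=-1'
       '&season_status=-1&season_month=-1&pub_date=-1&style_id=-1'
       '&order=3&st=1&sort=0&page=1&season_type=1&pagesize=20')
resp = requests.get(url)
print(resp.headers.get('Content-Type'))  # should say application/json
print(list(resp.json().keys()))          # top-level keys of the payload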

Dropping the response into Atom to see what the data looks like:

[Screenshot: the JSON response opened in Atom]
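Stripped down, the response has roughly this shape (the field names are exactly the keys the spider below reads; every value is elided):

{
  "result": {
    "data": [
      {
        "title": "...",
        "cover": "...",
        "index_show": "...",
        "is_finish": 1,
        "link": "...",
        "order": {"follow": "...", "play": "...", "score": "..."}
      }
    ]
  }
}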

To handle pagination, look for a pattern in the query string: of all those parameters, only page changes between requests.

[Screenshot: the query string parameters, with only page varying]
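That means generating every page's URL is a one-liner. With 20 results per page, the 155 pages used below line up with the three-thousand-plus entries the crawl ends up returning:

# '...' stands for the long, unchanging parameter list shown above
base = 'https://bangumi.bilibili.com/media/web_api/search/result?...&page={0}&...'
urls = [base.format(k) for k in range(1, 156)]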

From here on it's all easy going~ hehe

items.py

import scrapy
from scrapy import Field

class BilibiliItem(scrapy.Item):
    title = Field()      # series title
    cover = Field()      # cover image URL
    sum_index = Field()  # episode count display text
    is_finish = Field()  # finished vs. still airing
    link = Field()       # detail page URL
    follow = Field()     # follower count
    plays = Field()      # play count
    score = Field()      # rating
    _id = Field()        # MongoDB primary key (the title)

bzhan.py

import scrapy
import demjson  # needs a pip install; the stdlib json module would do the same job
from bilibili.items import BilibiliItem

class BzhanSpider(scrapy.Spider):
    name = 'bzhan'
    allowed_domains = ['bilibili.com']
    # only the page parameter ever changes between requests
    base_url = ('https://bangumi.bilibili.com/media/web_api/search/result'
                '?season_version=-1&area=-1&is_finish=-1&copyright=-1'
                '&season_status=-1&season_month=-1&pub_date=-1&style_id=-1'
                '&order=3&st=1&sort=0&page={0}&season_type=1&pagesize=20')

    def start_requests(self):
        # 155 pages x 20 entries per page covers the three-thousand-odd series
        for k in range(1, 156):
            yield scrapy.Request(self.base_url.format(k), callback=self.parse)

    def parse(self, response):
        json_content = demjson.decode(response.text)
        for data in json_content['result']['data']:
            item = BilibiliItem()  # a fresh item per entry, not one shared instance
            item['title'] = data['title']
            item['_id'] = data['title']  # the title doubles as the MongoDB key
            item['cover'] = data['cover']
            item['sum_index'] = data['index_show']
            item['is_finish'] = '已完結' if data['is_finish'] == 1 else '未完結'
            item['link'] = data['link']
            item['follow'] = data['order']['follow']
            item['plays'] = data['order']['play']
            try:
                item['score'] = data['order']['score']
            except KeyError:  # some entries have no rating yet
                item['score'] = '未知'
            yield item

The parsing is just treating the JSON as plain Python dicts and indexing in.. not hard.

pipelines.py

import pymongo

class BilibiliPipeline(object):
    def open_spider(self, spider):
        # connect once when the spider starts, not once per item
        self.client = pymongo.MongoClient('localhost', 27017)
        self.bilibili = self.client['mydb']['bilibili']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.bilibili.insert_one(dict(item))  # pymongo wants a plain dict
        print(item)
        return item

settings.py was omitted in the original post......
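For completeness, here's a minimal sketch of what settings.py needs (the pipeline path follows from the project layout above; the delay value is just a polite guess, not from the original):

BOT_NAME = 'bilibili'
SPIDER_MODULES = ['bilibili.spiders']
NEWSPIDER_MODULE = 'bilibili.spiders'

ROBOTSTXT_OBEY = False   # we are hitting an API endpoint, not regular pages
DOWNLOAD_DELAY = 0.5     # assumption: a small delay to go easy on the server

ITEM_PIPELINES = {
    'bilibili.pipelines.BilibiliPipeline': 300,
}

After that, the whole thing runs with scrapy crawl bzhan.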

In the end the crawl pulls in three thousand plus records.
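A quick way to double-check the count, using the same database and collection names as the pipeline:

import pymongo

client = pymongo.MongoClient('localhost', 27017)
print(client['mydb']['bilibili'].count_documents({}))  # expect 3000+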

[Screenshot: the scraped results, three thousand plus records]

A quick moment of sympathy for my poor Bilibili..