Introduction

Today we will use Scrapy to scrape emoji images (meme stickers) from Zhihu. Let's get started.

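Development Tools

Python version: 3.6.4

Environment Setup

Install Python and add it to your PATH environment variable, then use pip to install the required modules (the spider below uses scrapy, requests, and fake_useragent).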

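How It Works

The idea is quite simple. Zhihu has an API endpoint that I already knew about and that has remained usable:

https://www.zhihu.com/node/QuestionAnswerListV2
Send a POST request to this URL with form data in the following format: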

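data = {
    'method': 'next',
    'params': '{"url_token":%s,"page_size":%s,"offset":%s}'
}
1. url_token:
The question ID. For example, the question "https://www.zhihu.com/question/302378021" has the question ID 302378021.
2. page_size:
The number of answers per page (Zhihu allows at most 10).
3. offset:
The offset of the first answer on the current page.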
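
As an illustration of the format above, here is a minimal sketch of how the form data could be built for each page. The helper names build_payload and page_offsets are hypothetical (they are not part of the original project), and the sketch only reproduces the payload format described above; it does not contact Zhihu.

```python
import json


def build_payload(question_id, page_size=10, offset=0):
    """Return the form data for one page of answers (hypothetical helper)."""
    params = {
        "url_token": int(question_id),  # the question ID, sent as a number
        "page_size": page_size,         # answers per page, at most 10
        "offset": offset,               # index of the first answer on the page
    }
    # Compact separators reproduce the '{"url_token":...,"page_size":...}' layout
    return {"method": "next", "params": json.dumps(params, separators=(",", ":"))}


def page_offsets(total_answers, page_size=10):
    """Return the offsets needed to page through total_answers answers."""
    return list(range(0, total_answers, page_size))


if __name__ == "__main__":
    print(build_payload("302378021", offset=10))
    print(page_offsets(25))
```

Building the params string with json.dumps instead of % formatting avoids quoting mistakes if more fields are ever added.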

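Paging through the offsets this way returns every answer to the question. A regular expression then extracts all the image links from each answer.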
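
The extraction step can be sketched on its own as follows. This is a minimal sketch under a few assumptions: the function name extract_image_urls and the sample HTML are made up for illustration, the image link is assumed to sit in a data-original attribute, and only links ending in 'r.jpg' are kept, as in the spider shown later.

```python
import re

# Image links are assumed to be stored in the data-original attribute
IMG_PATTERN = re.compile(r'data-original="(.*?)"', re.S)


def extract_image_urls(answer_html):
    """Return the unique image links ending in 'r.jpg', in first-seen order."""
    urls = []
    for url in IMG_PATTERN.findall(answer_html):
        # Drop backslashes in case the text still contains JSON-escaped slashes
        url = url.replace('\\', '')
        if url.endswith('r.jpg') and url not in urls:
            urls.append(url)
    return urls


if __name__ == "__main__":
    # Made-up sample: one escaped link, one duplicate, one non-matching suffix
    sample = ('<img data-original="https:\\/\\/example.com\\/a_r.jpg">'
              '<img data-original="https://example.com/a_r.jpg">'
              '<img data-original="https://example.com/b_b.jpg">')
    print(extract_image_urls(sample))
```

Keeping the links in a list rather than a set preserves the order in which the images appear in the answer.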

The actual implementation uses Scrapy. First, create a Scrapy project:

scrapy startproject zhihuEmoji

Then create a file named zhihuEmoji.py in the spiders folder and implement the main spider there:

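import json
import os
import re

import requests
import scrapy
from fake_useragent import UserAgent

# Assumption: the item class is defined in the project's items.py (the Scrapy default)
from zhihuEmoji.items import ZhihuemojiItem


# Scrape emoji images from the answers to a Zhihu question
class zhihuEmoji(scrapy.Spider):
    name = 'zhihuEmoji'
    allowed_domains = ['www.zhihu.com']
    question_id = '302378021'
    answer_url = 'https://www.zhihu.com/node/QuestionAnswerListV2'
    page_size = 10
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
        'Accept-Encoding': 'gzip, deflate'
    }
    ua = UserAgent()

    # Request the first page; parse() requests each following page
    def start_requests(self):
        yield self.build_request(offset=0)

    # Build the POST request for the page of answers starting at offset
    def build_request(self, offset):
        data = {
            'method': 'next',
            'params': '{"url_token":%s,"page_size":%s,"offset":%s}' % (self.question_id, self.page_size, offset)
        }
        self.headers['user-agent'] = self.ua.random
        return scrapy.FormRequest(url=self.answer_url, formdata=data, callback=self.parse,
                                  headers=self.headers, meta={'offset': offset})

    # Parse one page of answers
    def parse(self, response):
        # Directory used to store the images
        os.makedirs(self.question_id, exist_ok=True)
        # json.loads replaces eval, which would execute whatever the server returns;
        # it also decodes "\/" to "/", so the patterns below match plain slashes
        answers = json.loads(response.text)['msg']
        # Assumption: 'msg' is empty once the offset passes the last answer,
        # so stop paging here instead of requesting pages forever
        if not answers:
            return
        imgregular = re.compile(r'data-original="(.*?)"', re.S)
        answerregular = re.compile(r'data-entry-url="/question/{question_id}/answer/(.*?)"'.format(question_id=self.question_id), re.S)
        for answer in answers:
            answer_ids = re.findall(answerregular, answer)
            if not answer_ids:
                continue
            # Keep only the unique image links ending in 'r.jpg'
            image_urls = set(each for each in re.findall(imgregular, answer) if each.endswith('r.jpg'))
            for each in image_urls:
                # Create a new item per image so earlier items are not overwritten
                item = ZhihuemojiItem()
                item['answer_id'] = answer_ids[0]
                item['image_url'] = each
                self.headers['user-agent'] = self.ua.random
                self.download(requests.get(each, headers=self.headers, stream=True))
                yield item
        # Request the next page of answers
        yield self.build_request(response.meta['offset'] + self.page_size)

    # Save the downloaded image to disk
    def download(self, response):
        if response.status_code == 200:
            image = response.content
            filepath = os.path.join(self.question_id, str(len(os.listdir(self.question_id))) + '.jpg')
            with open(filepath, 'wb') as f:
                f.write(image)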

Here, ZhihuemojiItem() stores each scraped image link together with the ID of the answer it came from. It is defined as follows:

class ZhihuemojiItem(scrapy.Item):
    image_url = scrapy.Field()
    answer_id = scrapy.Field()

That's it. For the complete source code, see the related files linked from the author's profile.

