Scraping jokes from Qiushibaike with a Python crawler

2023-04-25 01:19:13

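Background

An upcoming project needs a crawler to collect some information, so I picked a small example to practice on and am recording it here.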

Environment

Windows

Python 2.7

IDEA 15

General crawling workflow

[Figure: general crawling workflow]

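Approach in this article

1. Give the crawler a target: the page URL and its parameters

2. Set the necessary request parameters (here, the User-Agent header)

3. Fetch the page source

4. Extract the data

5. Store the data (this article writes to files; in practice it usually goes into a database)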
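Step 5 notes that scraped data usually ends up in a database rather than in files. As a rough illustration of that alternative, here is a minimal sketch that assumes Python's built-in sqlite3 module. The function name `save_jokes`, the table name `jokes`, and the sample strings are all hypothetical and are not part of the script in this article.

```python
import sqlite3


def save_jokes(conn, page, jokes):
    """Store one page of scraped jokes in a hypothetical `jokes` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jokes ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "page INTEGER NOT NULL, "
        "body TEXT NOT NULL)"
    )
    # Parameterized query: scraped text is never spliced into the SQL string
    conn.executemany(
        "INSERT INTO jokes (page, body) VALUES (?, ?)",
        [(page, joke) for joke in jokes]
    )
    conn.commit()


# In-memory database so the sketch leaves no file behind
conn = sqlite3.connect(":memory:")
save_jokes(conn, 1, [u"first sample joke", u"second sample joke"])
rows = conn.execute("SELECT page, body FROM jokes ORDER BY id").fetchall()
print(rows)
```

Passing a file path instead of ":memory:" to sqlite3.connect would keep the data between runs.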

Target

[Figure: screenshot of the target page]

CODE

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import urllib2
import re
import os
import sys


class Spider:
    # Constructor: url is a template with a %s placeholder for the page number
    def __init__(self, url, headers):
        self.url = url
        self.headers = headers

    def spider(self, page, pattern):
        # Compile once; re.S lets "." match newlines so a match can span lines
        regex = re.compile(pattern, re.S)
        # Create the output folder if it does not exist yet
        path = 'qiubai'
        if not os.path.exists(path):
            os.makedirs(path)
        # 13 pages in total (page numbers are assumed to start at 1)
        for i in range(1, page + 1):
            # 1. Request one page and read its source
            try:
                request = urllib2.Request(url=self.url % str(i), headers=self.headers)
                response = urllib2.urlopen(request)
                content = response.read()
            except urllib2.URLError as e:
                # HTTPError is a subclass of URLError, so this handles both
                print e
                sys.exit(1)
            # 2. Extract the data
            items = regex.findall(content)
            # 3. Store the data, one file per page
            file_path = os.path.join(path, "qiubai" + str(i) + ".txt")
            with open(file_path, 'w') as f:
                for item in items:
                    # Drop raw newlines, then turn <br/> into newlines
                    item = item.replace('\n', '').replace('<br/>', '\n')
                    f.write(item + "\n\n")


if __name__ == '__main__':
    url = 'https://www.qiushibaike.com/text/page/%s/'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36'}
    # Write the expression to match your own page's markup
    pattern = '<div class="content">.*?<span>(.*?)</span>.*?</div>'
    s = Spider(url, headers)
    # Page count assumed from the 13-page total noted in spider()
    s.spider(13, pattern)
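To see what the extraction step does without touching the network, the pattern can be tried on a small piece of HTML. The sketch below reuses the pattern from the script; the `sample` fragment and the `extract` helper are made up for illustration and only assume that the real page wraps each joke in a div with class "content" and a span, as the pattern implies.

```python
import re

# Hypothetical HTML fragment, shaped the way the pattern expects
sample = (
    '<div class="content">\n'
    '<span>\n'
    'First line<br/>second line\n'
    '</span>\n'
    '</div>\n'
    '<div class="content">\n'
    '<span>Another joke</span>\n'
    '</div>\n'
)

pattern = '<div class="content">.*?<span>(.*?)</span>.*?</div>'


def extract(html):
    """Return the cleaned joke texts found in html."""
    # re.S makes "." match newlines, so one match can span several lines
    items = re.findall(pattern, html, re.S)
    # Same cleanup as the script: drop raw newlines, then <br/> becomes a newline
    return [item.replace('\n', '').replace('<br/>', '\n') for item in items]


jokes = extract(sample)
print(jokes)
```

Without the re.S flag the same pattern finds nothing in this sample, because "." would stop at the newline between the div and the span.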

Result screenshots

Result folder

[Screenshot: the qiubai output folder]

Result file

[Screenshot: contents of one output file]
