This is a simple Python (2.x) crawler built mainly on the requests and re modules; it is suitable for beginners.
Source:
https://github.com/jingsupo/python-spider/blob/master/day03/04neihanba.py

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import requests, re, time


class Neihanspider(object):
    def __init__(self):
        self.base_url = 'http://www.neihan8.com/article/list_5_'
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"}
        # Regex for the first parsing pass. Do not change any symbol inside
        # the pattern; copy it over exactly as-is.
        self.first_pattern = re.compile(r'<div class="f18 mb20">.*?</div>', re.S)
        # Regex for the second parsing pass: strips all tags, character
        # entities, whitespace and full-width spaces.
        self.second_pattern = re.compile(r'<.*?>|&.*?;|\s|　')
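        # For example, second_pattern.sub('', '<p>ha&nbsp;ha </p>') yields 'haha'
        # (an illustrative snippet, not taken from the target site).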

    # Send a request and return the raw response body
    def send_request(self, url):
        time.sleep(2)
        try:
            response = requests.get(url, headers=self.headers)
            return response.content
        except Exception as e:
            print e

    # Write the parsed jokes to a file
    def write_file(self, data, page):
        with open('04neihanba.txt', 'a') as f:
            filename = '第' + str(page) + '頁的段子\n'
            print filename
            f.write('-' * 10 + '\n')
            f.write(filename)
            f.write('-' * 10 + '\n')
            for first_data in data:
                # Second parsing pass: strip markup from the block
                content = self.second_pattern.sub('', first_data)
                f.write(content)
                # Add a blank line after each joke
                f.write('\n\n')

    # Scheduling method: drives the whole crawl
    def start_work(self):
        for page in range(1, 5):
            # Build the page URL
            url = self.base_url + str(page) + '.html'
            # Send the request
            data = self.send_request(url)
            # Transcode: the page is served as GBK, convert it to UTF-8
            data = data.decode('gbk').encode('utf-8')
            # First parsing pass: extract each content block
            data_list = self.first_pattern.findall(data)
            # Write the data to the file
            self.write_file(data_list, page)


if __name__ == '__main__':
    spider = Neihanspider()
    spider.start_work()
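
To make the two-stage parsing easier to follow without hitting the live site, here is a minimal, self-contained sketch of just the extraction logic, rewritten for Python 3. The two regexes are copied from the spider above; the HTML fragment and variable names are made up for illustration, so treat it as a sketch rather than a drop-in replacement.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import re

# Patterns copied from the spider above.
first_pattern = re.compile(r'<div class="f18 mb20">.*?</div>', re.S)
second_pattern = re.compile(r'<.*?>|&.*?;|\s|\u3000')

# A made-up HTML fragment standing in for one list page.
html = ('<div class="f18 mb20"><p>First joke&nbsp;text</p></div>'
        '<div class="f18 mb20"><p>Second joke\u3000text</p></div>')

# First pass: pull out each content block.
blocks = first_pattern.findall(html)

# Second pass: strip tags, character entities, whitespace and full-width spaces.
for block in blocks:
    print(second_pattern.sub('', block))
# Prints:
# Firstjoketext
# Secondjoketext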