Python Web Scraping: Going Further

2023-08-07 18:47:22

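Python's urllib and urllib2 modules both handle URL requests, but they offer different functionality. (Both are Python 2 modules; in Python 3, urllib2 was split into urllib.request and urllib.error.) The two most notable differences are: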

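urllib2 can accept a Request object, which lets you set the headers for a URL request, whereas urllib accepts only a URL. This means that with urllib alone you cannot, for example, spoof your User-Agent string.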
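To make the Request-with-headers idea concrete, here is a minimal sketch. It assumes a placeholder URL (https://example.com/) and a made-up User-Agent value, neither of which comes from this post, and build_request is a hypothetical helper name. The import fallback lets the same snippet run on Python 2 (urllib2) and Python 3 (urllib.request). Nothing is sent over the network; the request object is only built and inspected.

```python
# Works on Python 2 (urllib2) and Python 3 (urllib.request).
try:
    from urllib2 import Request  # Python 2
except ImportError:
    from urllib.request import Request  # Python 3


def build_request(url, user_agent):
    """Return a Request carrying a custom User-Agent header (nothing is sent yet)."""
    return Request(url, headers={"User-Agent": user_agent})


# example.com and the User-Agent value are placeholders, not taken from the article.
request = build_request("https://example.com/", "Mozilla/5.0 (example)")

# Request stores header names via str.capitalize(), hence the key "User-agent".
print(request.get_header("User-agent"))
```

Passing the resulting object to urlopen would send the request with that header attached.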

urllib provides the urlencode method, which is used to build GET query strings; urllib2 has no such function. This is why urllib and urllib2 are so often used together.
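As a rough illustration of building a GET query string with urlencode, the sketch below assumes a hypothetical search endpoint and made-up parameter names (q, page); build_get_url is likewise a hypothetical helper. The import fallback covers both Python 2 (urllib) and Python 3 (urllib.parse).

```python
try:
    from urllib import urlencode  # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3


def build_get_url(base_url, params):
    """Append URL-encoded query parameters to base_url."""
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and parameters; a list of pairs keeps the order fixed.
url = build_get_url("https://example.com/search",
                    [("q", "python crawler"), ("page", 2)])
print(url)  # https://example.com/search?q=python+crawler&page=2
```

The resulting string can then be handed to a Request object, which is where the two modules meet in practice.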

# -*- coding: utf-8 -*-
# Scrape jokes from Qiushibaike (Python 2)
import re
import sys
import urllib2

def getPage(page_num=1):
    """Fetch one listing page and return its HTML, or None on failure."""
    url = "https://www.qiushibaike.com/8hr/page/" + str(page_num)
    # Send a browser-like User-Agent header with the request.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
    try:
        request = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(request)
        return response.read()
    except urllib2.URLError as e:
        # HTTPError (a URLError subclass) carries a status code;
        # a plain URLError carries only a reason.
        if hasattr(e, "code"):
            print "Failed to connect to the server, error code: {0}".format(e.code)
        elif hasattr(e, "reason"):
            print "Failed to connect to the server, reason: {0}".format(e.reason)
        return None


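def getPageContent(page_num=1):
    """Return a list of [page number, author, text, vote count] for one page."""
    html = getPage(page_num)
    if html is None:
        # The request failed, so there is nothing to parse.
        return []
    # Capture groups: author (the img alt text), story text, vote count.
    re_page = re.compile(
        r'<div class="author.*?>.*?<a.*?<img.*?alt="(.*?)">.*?<div class="content">.*?<span>(.*?)</span>.*?</div>.*?<span class="stats-vote">.*?<i class="number">(\d+)</i>',
        re.S)
    items = re_page.findall(html)
    page_contents = []
    replaceBR = re.compile(r'<br/>')

    for item in items:
        # Turn HTML line breaks into real newlines.
        new_content = replaceBR.sub('\n', item[1])
        page_contents.append([page_num,
                              item[0].strip(),
                              new_content.strip(),
                              item[2].strip()])
    return page_contents


def getOneStory(page_contents):
    """Print one story per Enter key press; typing q or Q quits."""
    for story in page_contents:
        user_input = raw_input()
        if user_input == 'Q' or user_input == 'q':
            sys.exit()
        print "Page {0}\tAuthor: {1}\tLikes: {2}\n{3}\n".format(story[0], story[1], story[3], story[2])


if __name__ == '__main__':
    print("Reading jokes: press Enter for the next one, q to quit")
    num = 1
    while True:
        page_contents = getPageContent(num)
        if not page_contents:
            # Stop instead of requesting further pages forever.
            print("No stories found on page {0}, stopping.".format(num))
            break
        getOneStory(page_contents)
        num += 1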
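
The regular expression in the script above can be tried offline before pointing it at the live site. The sketch below reuses the same pattern on a hand-written HTML fragment; the fragment, the author name "alice", and the helper name parse_stories are all made up for illustration, and the real page markup may differ.

```python
import re

# Same pattern as in the script: author (img alt), story text, vote count.
STORY_RE = re.compile(
    r'<div class="author.*?>.*?<a.*?<img.*?alt="(.*?)">.*?'
    r'<div class="content">.*?<span>(.*?)</span>.*?</div>.*?'
    r'<span class="stats-vote">.*?<i class="number">(\d+)</i>',
    re.S)
BR_RE = re.compile(r'<br/>')

# Hypothetical fragment shaped to match the pattern; not copied from the site.
SAMPLE_HTML = '''
<div class="author clearfix">
<a href="/users/1/"><img src="a.jpg" alt="alice"></a>
</div>
<div class="content">
<span>first line<br/>second line</span>
</div>
<div class="stats"><span class="stats-vote"><i class="number">42</i> likes</span></div>
'''


def parse_stories(html):
    """Return (author, text, votes) tuples, with <br/> turned into newlines."""
    stories = []
    for author, text, votes in STORY_RE.findall(html):
        stories.append((author.strip(), BR_RE.sub('\n', text).strip(), votes))
    return stories


stories = parse_stories(SAMPLE_HTML)
print(stories)
```

Because every gap in the pattern is a non-greedy `.*?` compiled with re.S, the match can span newlines, which is what lets one expression cover the author block, the content block, and the vote counter together.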
