Scraping Baidu's Domestic Breaking News in Python with urllib and re

2017-11-13 23:50:00

Python is already widely used for web crawling. This post uses urllib and re to scrape the domestic breaking news from Baidu News.

Software environment:

Python : 3.6.0

PyCharm: Community 2017.2

Python download: https://www.python.org/downloads/

PyCharm download (the Community edition is free): https://www.jetbrains.com/pycharm/download/#section=windows

Approach:

Request the target URL with urllib to get the page's HTML, then use re to match the news headlines with regular expressions.

Crawling steps:

1. Import the urllib and re modules

<code>import urllib</code>

<code>from urllib import request</code>

<code>import re</code>

2. Open the Baidu News URL with urllib.request.urlopen and read the full HTML

<code>url = "http://news.baidu.com/guonei"</code>

<code>response = urllib.request.urlopen(url)</code>

<code>html = response.read().decode('utf-8')</code>

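urllib.request.urlopen() opens a URL and returns a response object. (In Python 3 the function lives in urllib.request; there is no urllib.urlopen().)

read() downloads the entire page body as bytes, and decode('utf-8') converts those bytes to a string.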
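As a sketch of how the fetch in step 2 could be made a little more defensive, the helper below sets a timeout, sends a browser-like User-Agent, and takes the charset from the response headers instead of hard-coding it. The name `fetch_html`, the header value, and the 10-second timeout are assumptions for illustration and are not part of the original script.

```python
from urllib import request


def fetch_html(url, timeout=10, default_charset="utf-8"):
    """Download a page and return its body as text.

    Hypothetical helper: the User-Agent value and the timeout are
    illustrative choices, not requirements of the site.
    """
    req = request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with request.urlopen(req, timeout=timeout) as response:
        raw = response.read()  # bytes of the whole response body
        # Use the charset declared by the server, if there is one.
        charset = response.headers.get_content_charset() or default_charset
    # errors="replace" keeps going when a few bytes are malformed.
    return raw.decode(charset, errors="replace")
```

With this in place, step 2 would reduce to `html = fetch_html(url)`. Failures such as a timeout or an HTTP error still raise exceptions from `urllib.error`, which the caller may want to catch.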

3. Press F12 in Chrome to inspect the page source; the news items sit under the div with id="instant-news"

4. Extract the HTML of the whole instant-news div and store it in the variable instant_news_html

<code>pattern_of_instant_news = re.compile('<div id="instant-news.*?</div>', re.S)</code>

<code>instant_news_html = re.findall(pattern_of_instant_news, html)[0]</code>
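Indexing the result of `re.findall` with `[0]` raises an IndexError when the div is not in the page. The sketch below shows one way to guard against that with `search`, run against a small hand-written sample rather than the live page. The sample HTML and the function name `extract_instant_news` are assumptions for illustration.

```python
import re

# Hypothetical sample standing in for the downloaded page.
sample_html = '''
<div id="other">ignored</div>
<div id="instant-news" class="news">
<ul>
<li><a href="/a">First headline</a></li>
<li><a href="/b">Second headline</a></li>
</ul>
</div>
<div id="footer">ignored</div>
'''

# re.S lets "." match newlines, so the block can span several lines.
INSTANT_NEWS_PATTERN = re.compile(r'<div id="instant-news.*?</div>', re.S)


def extract_instant_news(html):
    """Return the instant-news block, or None if the page has no such div."""
    match = INSTANT_NEWS_PATTERN.search(html)
    return match.group(0) if match else None


block = extract_instant_news(sample_html)
```

Because `.*?` is non-greedy, the match ends at the first closing div tag after the opening one. That works when the container holds no nested div elements; if it does, the captured block would be cut short.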

5. Match each news headline within the instant-news HTML

<code>pattern_of_news = re.compile('<li><a.*?>(.*?)</a></li>', re.S)</code>

<code>news_list = re.findall(pattern_of_news, instant_news_html)</code>

<code>for news in news_list:</code>

<code>    print(news)</code>

Running the code prints each headline on its own line.
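The captured titles may still contain nested tags, HTML entities, or stray whitespace. One possible cleanup pass is sketched below; the helper name `extract_titles` and the sample markup are hypothetical, and the cleanup rules are an assumption about what the page might contain.

```python
import html
import re

NEWS_PATTERN = re.compile(r'<li><a.*?>(.*?)</a></li>', re.S)
TAG_PATTERN = re.compile(r'<[^>]+>')


def extract_titles(block_html):
    """Return cleaned headline strings from the instant-news block."""
    titles = []
    for raw in NEWS_PATTERN.findall(block_html):
        text = TAG_PATTERN.sub('', raw)   # drop nested tags such as <b>
        text = html.unescape(text)        # turn entities like &amp; into &
        text = ' '.join(text.split())     # collapse runs of whitespace
        if text:
            titles.append(text)
    return titles


# Hypothetical sample block for illustration.
sample_block = '''<div id="instant-news"><ul>
<li><a href="/a"> First <b>headline</b> </a></li>
<li><a href="/b">Q&amp;A
session</a></li>
</ul></div>'''

titles = extract_titles(sample_block)
```

For the sample above, `titles` holds two cleaned strings, one per list item.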

Complete source code:
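The five steps can be assembled into a single script along the following lines. This is a sketch rather than the author's exact listing: the function names `parse_news` and `main` are assumptions, and it only finds headlines if the page still contains a div with id="instant-news" as described in step 3.

```python
import re
from urllib import request

URL = "http://news.baidu.com/guonei"

# Step 4: the whole instant-news container.
INSTANT_NEWS_PATTERN = re.compile(r'<div id="instant-news.*?</div>', re.S)
# Step 5: the text of each link inside a list item.
NEWS_PATTERN = re.compile(r'<li><a.*?>(.*?)</a></li>', re.S)


def parse_news(html):
    """Return the headline strings found in the instant-news div."""
    blocks = INSTANT_NEWS_PATTERN.findall(html)
    if not blocks:
        return []  # the div is missing, so there is nothing to extract
    return NEWS_PATTERN.findall(blocks[0])


def main():
    # Step 2: download the page and decode it to text.
    response = request.urlopen(URL)
    html = response.read().decode("utf-8")
    for news in parse_news(html):
        print(news)

# Call main() to run the crawl; it needs network access.
```

Keeping the parsing in `parse_news` separate from the download means the regular expressions can be checked against a saved or hand-written HTML string without touching the network.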

This article is reposted from yuanzhitang's 51CTO blog. Original link: http://blog.51cto.com/yuanzhitang/2057777. To repost it, please contact the original author.
