BeautifulSoup

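Before getting into the article, a bit of a rant. Python is great in most respects: it is simple and intuitive, which really suits a programming newcomer like me. But it has one nasty pitfall, which is that it changes too fast. The two popular versions right now are 2.7 and 3.5, and some things that work in 2.7 do not work in 3.5. For example, after import sys, 2.7 lets you call reload(sys) directly, while 3.5 does not, because reload is no longer a builtin (it now lives in importlib). Likewise, reduce can be used directly in 2.7 but not in 3.5, where it has to be imported from functools. It was an eye-opener for me that a newer version would refuse code written for an older one!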

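Still, plenty of people say 3.5 is where things are heading, while 2.7 remains more convenient in many places. Online courses are everywhere now, and many instructors jump straight into Python. That by itself is fine, but they talk for ages without ever mentioning which browser they use or which version of the Python interpreter they run. And then there is PyCharm, which seems to update a few hundred times a year; with every update its look changes and things move around inside it. Beginners get lost very easily.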

Right, rant over. However much you complain, life goes on, so let's just deal with it.
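For the two Python 2 idioms complained about above, here is a minimal sketch of the Python 3 spellings. The json module is only a stand-in for "some module that is already imported", and the list of numbers is made up for illustration. In Python 2, reload(sys) was usually a prelude to sys.setdefaultencoding; Python 3 has no such function, because its str type is already Unicode.

```python
import importlib
import json  # stand-in for any module that has already been imported
from functools import reduce  # reduce is no longer a builtin in Python 3

# Python 2: reload(some_module)  ->  Python 3: importlib.reload(some_module)
reloaded = importlib.reload(json)

# Python 2: reduce(f, seq) as a builtin  ->  Python 3: functools.reduce(f, seq)
total = reduce(lambda a, b: a + b, [1, 2, 3, 4])
print(total)  # 10
```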

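Once the BeautifulSoup module is installed, import it with from bs4 import BeautifulSoup. After that you can hand it the content you want to search. BeautifulSoup is mainly meant for web pages: because the tags in a page's source come in matching pairs, BeautifulSoup can locate elements quickly and accurately.

Suppose the content we want to search is: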

html_doc = """<html>
<head>
</head>
<body>
<div class="topic"><a href="www.51cto.com/welcome.html">欢迎来到这里！</a>
<url>
<li><a href="http://www.51cto.com/1.html">这是第一页</a></li>
<li><a href="http://www.51cto.com/2.html">这是第二页</a></li>
<li><a href="http://www.51cto.com/3.html">这是第三页</a></li>
</url>
</div>
</body>
</html>"""

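With that in place, create a soup object:

soup = BeautifulSoup(html_doc, "html.parser")

Here html_doc is the content to search and "html.parser" is the parser used to read it. A third argument, from_encoding="utf-8", names the character encoding of the document, but it only applies when the markup is passed as bytes. html_doc is already a str, so BeautifulSoup would ignore from_encoding and emit a warning; it is left out here.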
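As a minimal sketch of when from_encoding actually matters: it is used to decode markup that arrives as bytes, for example a file opened in binary mode or a raw HTTP response body. The one-link sample below and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one link from the page, encoded to UTF-8 bytes.
text = '<html><body><a href="http://www.51cto.com/1.html">这是第一页</a></body></html>'
raw = text.encode("utf-8")

# With bytes, from_encoding tells BeautifulSoup how to decode the input.
soup_from_bytes = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

# With a str there is nothing left to decode, so from_encoding is omitted.
soup_from_str = BeautifulSoup(text, "html.parser")

print(soup_from_bytes.a.get_text())
print(soup_from_str.a.get_text())
```

Both calls produce the same tree; the only difference is who does the decoding.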

Looking at the links, they all follow the pattern <a href="blablabla">text</a>. So let's collect all of them in a variable named links.

links = soup.find_all("a")

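find_all is used here rather than find because find_all works a bit like re.findall: it keeps going until it has collected every match. find is more like re.search: it stops at the first match.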

for each in links:
    print(each.name, each["href"], each.get_text())

The output will be:

a www.51cto.com/welcome.html 欢迎来到这里！
a http://www.51cto.com/1.html 这是第一页
a http://www.51cto.com/2.html 这是第二页
a http://www.51cto.com/3.html 这是第三页
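To put the find / find_all comparison with re.search / re.findall side by side, here is a small sketch. The two-link snippet and the regular expression are made up for illustration; they are not part of the original example.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical two-link snippet, shortened from the example above.
snippet = '<li><a href="/1.html">one</a></li><li><a href="/2.html">two</a></li>'
soup = BeautifulSoup(snippet, "html.parser")

first = soup.find("a")      # a single Tag, or None when nothing matches
every = soup.find_all("a")  # a list-like result holding every matching Tag
print(first.get_text())               # one
print([a.get_text() for a in every])  # ['one', 'two']

# The regex counterparts behave the same way: first match vs. all matches.
pattern = r'href="([^"]+)"'
print(re.search(pattern, snippet).group(1))  # /1.html
print(re.findall(pattern, snippet))          # ['/1.html', '/2.html']

# When nothing matches, find returns None and find_all returns an empty list.
missing = soup.find("table")
none_found = soup.find_all("table")
print(missing)           # None
print(len(none_found))   # 0
```

One practical consequence: the result of find should be checked for None before calling methods on it, while the result of find_all can always be looped over safely.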

=================================Divider=====================================

from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>"""

soup = BeautifulSoup(html_doc, "html.parser")
# Pin down one exact tag via <a href="http://example.com/tillie">
link = soup.find("a", href="http://example.com/tillie")
print(link.get_text())

Running this program prints Tillie.
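Filtering on href is one option; the same find and find_all calls accept other attributes too. The sketch below is only an illustration on a cut-down copy of the markup above, and the variable names are made up.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment cut down from the markup above.
fragment = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(fragment, "html.parser")

# Filter by id, by CSS class, or by an explicit attribute dictionary.
by_id = soup.find("a", id="link2")
by_class = soup.find_all("a", class_="sister")  # class_ because class is a Python keyword
by_attrs = soup.find("a", attrs={"href": "http://example.com/elsie"})

print(by_id.get_text())  # Lacie
print(len(by_class))     # 3
print(by_attrs["id"])    # link1
```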

================================Divider======================================

# html_doc and soup are the same as in the first experiment; excerpt of html_doc:
#   <div class="topic"><a href="www.51cto.com/welcome.html">欢迎来到这里！</a>
#   <li><a href="http://www.51cto.com/1.html">这是第一页</a></li>
#   <li><a href="http://www.51cto.com/2.html">这是第二页</a></li>
#   <li><a href="http://www.51cto.com/3.html">这是第三页</a></li>
for each in soup.find_all("div"):
    print(each.name, each.get_text())

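This program has the same structure as the first experiment, except that the keyword "a" has been changed to "div". Run it and see what the result is.

What happens if the keyword is changed again, from "div" to "url"? Compare the result for "a" with the result for "url": why are they different?

And if the last line is print(each.name,each["href"],each.get_text()), what is the result then?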
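As a hint for the last question, here is a minimal sketch of how attribute access behaves on a tag that has no href. The one-line document is a made-up reduction of the first experiment, not the full html_doc.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line reduction of the document from the first experiment.
doc = '<div class="topic"><a href="www.51cto.com/welcome.html">欢迎来到这里！</a></div>'
soup = BeautifulSoup(doc, "html.parser")

div = soup.find("div")
print(div.get("href"))   # None: the <div> itself carries no href attribute
print(div.get("class"))  # ['topic']: class is multi-valued, so it comes back as a list

key_error_raised = False
try:
    div["href"]          # subscripting a missing attribute raises KeyError
except KeyError:
    key_error_raised = True
print(key_error_raised)  # True

print(div.get_text())    # the text of everything nested inside the <div>
```

Using tag.get("href") instead of tag["href"] is a simple way to keep a loop running when some of the matched tags lack the attribute.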

This article is reposted from 苏幕遮618's 51CTO blog. Original link: http://blog.51cto.com/chenx1242/1730367
