Parsing Web Pages in Python with the bs4 Module

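I. Introduction to bs4

Beautiful Soup provides a small set of simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you want to extract; because the API is simple, a complete application takes very little code.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it; in that case you only need to specify the original encoding.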
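As a rough sketch of the encoding behavior described above, the snippet below feeds Beautiful Soup some bytes that are not UTF-8 and names the original encoding explicitly. The sample bytes, the choice of Latin-1, and the use of html.parser are assumptions made for illustration; they are not part of the original article.

```python
from bs4 import BeautifulSoup

# Illustrative input (an assumption): markup stored as Latin-1 bytes.
raw = "<p>café</p>".encode("latin-1")

# from_encoding states the original encoding instead of relying on detection.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string                 # text is held as Unicode inside the tree
utf8_bytes = soup.p.encode("utf-8")  # serializing to bytes produces UTF-8

print(text)
print(utf8_bytes)
```

When the encoding is declared in the document or can be detected, the from_encoding argument can simply be left out.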

Official documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

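II. Parsers supported by bs4

1. Python standard library

Usage:

BeautifulSoup(markup, "html.parser")

Advantages:

Built into the Python standard library

Moderate speed

Tolerant of malformed documents

Disadvantages:

Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2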

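2. lxml HTML parser

Usage:

BeautifulSoup(markup, "lxml")

Advantages:

Fast

Tolerant of malformed documents

Disadvantages:

Requires installing lxml, which depends on external C libraries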

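3. lxml XML parser

Usage:

BeautifulSoup(markup, "lxml-xml")

BeautifulSoup(markup, "xml")

Advantages:

Fast

The only XML parser currently supported

Disadvantages:

Requires installing lxml, which depends on external C libraries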

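4. html5lib

Usage:

BeautifulSoup(markup, "html5lib")

Advantages:

Best tolerance of malformed documents

Parses documents the same way a browser does

Produces valid HTML5

Disadvantages:

Slow

Requires an external Python package (html5lib), which must be installed separately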

lxml is the recommended parser because it is faster, but it has to be installed separately.

Installation:

pip install lxml
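To see how the parsers differ in practice, one option is to feed the same malformed fragment to each of them and compare the resulting trees. The sketch below assumes only that bs4 is installed; lxml and html5lib are optional, so parsers that are missing are reported rather than treated as errors. The fragment itself is an arbitrary example.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # deliberately invalid markup

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # lxml and html5lib are third-party packages and may not be installed.
        results[parser] = None

for parser, output in results.items():
    print(parser, "->", output if output is not None else "not installed")
```

Each parser repairs the fragment in its own way, which is why scripts that will run on other machines usually name the parser explicitly.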

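III. Usage

1. If you only want to parse an HTML document, creating a BeautifulSoup object from the document is enough; Beautiful Soup picks a parser automatically (recent versions emit a warning when no parser is named). For example: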

html = """
<html>
<head><title class='title'>story12345</title></head>
<body>
<p class="title" name="dromouse">The Dormouse's story</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><span>westos</span><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister1" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>

<input type="text">
<input type="password">
"""
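from bs4 import BeautifulSoup
# The markup can be an open file handle or a string
with open("filename") as fp:   # "filename" is a placeholder path
    soup = BeautifulSoup(fp)
soup = BeautifulSoup(html)     # pass the html variable, not the literal "html"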

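2. You can also use the features argument (the second positional parameter) to specify which parser to use for the document.

from bs4 import BeautifulSoup
# The markup can be an open file handle or a string
with open("filename") as fp:   # "filename" is a placeholder path
    soup = BeautifulSoup(fp, 'lxml')
soup = BeautifulSoup(html, 'lxml')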
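The two input styles above can be sketched end to end as follows. The document text and the temporary file are stand-ins invented for this example (the article's "filename" is only a placeholder), and html.parser is used so the sketch runs without lxml.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Illustrative markup, not the article's sample document.
doc = "<html><head><title>demo</title></head><body><p>hello</p></body></html>"

# 1) Parse directly from a string.
soup_from_str = BeautifulSoup(doc, "html.parser")

# 2) Parse from an open file handle; a temporary file stands in for a saved page.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(doc)
    path = tmp.name

try:
    with open(path, encoding="utf-8") as fp:
        soup_from_file = BeautifulSoup(fp, "html.parser")
finally:
    os.remove(path)

print(soup_from_str.title.string)
print(soup_from_file.p.string)
```

Opening the file in a with block makes sure the handle is closed once parsing is done.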

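IV. The four object types in bs4

Beautiful Soup converts a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.

1. The BeautifulSoup object

A BeautifulSoup object represents the document as a whole. The objects created in the "Usage" section above are BeautifulSoup objects.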

2. The Tag object

A Tag object, obtained from a BeautifulSoup object, corresponds to a tag in the HTML/XML document.

soup = BeautifulSoup(html, 'lxml')
tag = soup.a
print(type(tag))
#<class 'bs4.element.Tag'>

3. The NavigableString class

Strings are usually contained inside tags. Beautiful Soup wraps the text inside a tag in the NavigableString class:

tag = soup.p
print(tag.string)
# The Dormouse's story
print(type(tag.string))
# <class 'bs4.element.NavigableString'>

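4. The Comment class

The Comment class wraps comments in the document. It is a special kind of NavigableString.

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'lxml')
comment = soup.b.string
print(type(comment))
# <class 'bs4.element.Comment'>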
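One way to see the four types side by side is to walk the children of a tag and classify each node. The markup and the helper name kind are made up for this sketch. Because Comment is a subclass of NavigableString, the check for Comment has to come first.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Illustrative markup: one string, one tag, and one comment inside <p>.
soup = BeautifulSoup("<p>Hello<b>world</b><!-- note --></p>", "html.parser")


def kind(node):
    """Return the name of the bs4 node type (hypothetical helper)."""
    if isinstance(node, Comment):          # must precede NavigableString
        return "Comment"
    if isinstance(node, NavigableString):
        return "NavigableString"
    if isinstance(node, Tag):
        return "Tag"
    return type(node).__name__


kinds = [kind(node) for node in soup.p.children]

print(type(soup).__name__)
print(kinds)
```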

V. Navigating the tree

Common Tag attributes

1. Use soup.<tag name> to get a tag by name; only the first matching tag is returned.

soup = BeautifulSoup(html, 'lxml')
print(soup.a)
#<a class="sister" href="http://example.com/elsie" id="link1"><span>westos</span><!-- Elsie --></a>

2. Use .name to get the name of a Tag object.

soup = BeautifulSoup(html, 'lxml')
tag = soup.a
print(tag.name)
#a

3. Use .attrs to get a tag's attributes.

soup = BeautifulSoup(html, 'lxml')
print(soup.a.attrs)
# Returns a dictionary
#{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
print(soup.a.attrs['href'])
# Individual attributes can be looked up like dictionary keys
#http://example.com/elsie

Common Tag methods

1. get() returns the value of one of the tag's attributes.

print(soup.a.get('href'))
#http://example.com/elsie
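The difference between dictionary-style access and get() shows up when an attribute is missing, and class behaves differently from most attributes because it can hold several values. The tag below is invented for this sketch.

```python
from bs4 import BeautifulSoup

# Illustrative tag with a two-valued class attribute and no target attribute.
soup = BeautifulSoup(
    '<a href="/x" class="sister big" id="link1">x</a>', "html.parser"
)
tag = soup.a

classes = tag["class"]               # multi-valued attribute: a list of strings
tag_id = tag["id"]                   # single-valued attribute: a plain string
has_target = tag.has_attr("target")  # False, the attribute is absent
missing = tag.get("target")          # None instead of an exception
fallback = tag.get("target", "n/a")  # a default can be supplied, as with dict.get

print(classes, tag_id, has_target, missing, fallback)
```

Indexing with tag["target"] would raise KeyError here, so get() is the safer choice when an attribute may not exist.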

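2. The .string attribute (a property, not a method) returns the text inside a tag. It only returns a value when the tag has a single child, either a string or one child tag that itself has a single string; otherwise it is None.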

print(soup.p.string)
# The Dormouse's story
print(soup.a.string)
# None
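The three cases behind .string can be checked with a small made-up document: a tag holding only text, a tag holding a single child tag, and a tag with mixed content.

```python
from bs4 import BeautifulSoup

# Illustrative markup covering the three cases.
soup = BeautifulSoup(
    "<p id='one'>plain</p>"
    "<p id='nested'><b>bold</b></p>"
    "<p id='mixed'>text <b>bold</b></p>",
    "html.parser",
)

one = soup.find(id="one").string        # single text child
nested = soup.find(id="nested").string  # single child tag: its string is used
mixed = soup.find(id="mixed").string    # more than one child: None

print(one, nested, mixed)
```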

3. get_text() returns all the text inside a tag, including the text of its descendants.

print(soup.a.get_text())
# westos
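get_text() also accepts a separator and a strip flag, which helps when the text of several descendants would otherwise run together. The markup below is an assumption for illustration; note that the comment does not appear in the extracted text, just as the Elsie comment is missing from the output above.

```python
from bs4 import BeautifulSoup

# Illustrative markup with nested tags and a comment.
soup = BeautifulSoup(
    "<div><p>one</p><!-- hidden --><p>two <b>three</b></p></div>",
    "html.parser",
)

raw = soup.div.get_text()                              # strings joined as-is
joined = soup.div.get_text(separator="|", strip=True)  # stripped, then joined
pieces = list(soup.div.stripped_strings)               # same pieces as a list

print(repr(raw))
print(joined)
print(pieces)
```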

4. Modifying an attribute value

print(soup.a.get('href'))
#http://example.com/elsie

soup.a['href'] = 'http://www.baidu.com'
print(soup.a.get('href'))
#http://www.baidu.com
print(soup.a)
#<a class="sister" href="http://www.baidu.com" id="link1"><span>westos</span><!-- Elsie --></a>
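Attributes can be added and removed with the same dictionary-style syntax used for changing them. A minimal sketch, using an invented tag and URL:

```python
from bs4 import BeautifulSoup

# Illustrative tag; the replacement URL is a made-up value.
soup = BeautifulSoup(
    '<a href="http://example.com/elsie" id="link1">Elsie</a>', "html.parser"
)
tag = soup.a

tag["href"] = "http://example.com/new"  # change an existing attribute
tag["target"] = "_blank"                # add a new attribute
del tag["id"]                           # remove an attribute

print(tag.attrs)
print(tag)
```

The changes are made on the tree itself, so they show up whenever the tag or the whole soup is printed afterwards.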

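VI. Searching the tree

1. The find_all() method

find_all() searches all descendants of the current tag, tests each against the given filters, and returns every match as a ResultSet, which behaves like a list.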

aTagObj = soup.find_all('a')
print(aTagObj)
# [<a class="sister" href="http://example.com/elsie" id="link1"><span>westos</span><!-- Elsie --></a>, <a class="sister1" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

for item in aTagObj:
    print(item)
# <a class="sister" href="http://example.com/elsie" id="link1"><span>westos</span><!-- Elsie --></a>
# <a class="sister1" href="http://example.com/lacie" id="link2">Lacie</a>
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
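Besides a tag name, find_all() accepts several other kinds of filter. The sketch below uses a trimmed copy of the three links from the sample document and html.parser; the particular filters (class_, id, a regular expression on href, and limit) are chosen only to illustrate the idea.

```python
import re

from bs4 import BeautifulSoup

# Trimmed version of the article's sample links.
doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister1" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(doc, "html.parser")

by_class = soup.find_all("a", class_="sister")  # class_ avoids the keyword
by_id = soup.find_all(id="link2")               # any attribute as a keyword
by_regex = soup.find_all("a", href=re.compile(r"/(elsie|lacie)$"))
first_two = soup.find_all("a", limit=2)         # stop after two matches

for label, tags in [("class", by_class), ("id", by_id),
                    ("regex", by_regex), ("limit", first_two)]:
    print(label, [t["id"] for t in tags])
```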

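2. The find() method

find() takes the same arguments as find_all(); the difference is that it returns only the first matching tag, or None if nothing matches.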
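The difference in return values matters most when nothing matches. A small sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Illustrative markup with two <li> tags and no <table>.
soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")

first = soup.find("li")                   # first match, a Tag
same = soup.find_all("li", limit=1)[0]    # equivalent lookup via find_all
missing_one = soup.find("table")          # no match: None
missing_all = soup.find_all("table")      # no match: empty result

print(first, missing_one, len(missing_all))
```

Since find() can return None, check its result before reading attributes from it.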

3. CSS selectors

In CSS, a tag name is written bare, a class name is prefixed with a period (.), and an id is prefixed with #.

The same syntax can be used to filter elements here through soup.select(), which returns a list-like result.

Tag selector

print(soup.select("title"))
# [<title class="title">story12345</title>]

Class selector

print(soup.select(".sister"))
# [<a class="sister" href="http://example.com/elsie" id="link1"><span>westos</span><!-- Elsie --></a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

ID selector

print(soup.select("#link1"))
# [<a class="sister" href="http://example.com/elsie" id="link1"><span>westos</span><!-- Elsie --></a>]

Attribute selector

print(soup.select("input[type='password']"))
# [<input type="password"/>]
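Selectors can also be combined, and select_one() returns just the first match. The sketch below uses a trimmed copy of the sample links and assumes a Beautiful Soup version recent enough to ship with the soupsieve package, which provides CSS selector support.

```python
from bs4 import BeautifulSoup

# Trimmed version of the article's sample links.
doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1"><span>westos</span></a>
<a href="http://example.com/lacie" class="sister1" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(doc, "html.parser")

first_sister = soup.select_one("a.sister")     # first match, or None
children = soup.select("p.story > a")          # direct children of p.story
spans = soup.select("a span")                  # span anywhere inside an a
ends_with = soup.select('a[href$="tillie"]')   # href ending in "tillie"

print(first_sister["id"])
print([a["id"] for a in children])
print([s.string for s in spans])
print([a["id"] for a in ends_with])
```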
