BeautifulSoup usage notes

2023-08-05 14:41:49

from bs4 import BeautifulSoup
html_str='''
<!DOCTYPE html>
<html >
<head>
    <meta charset="UTF-8">
    <title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" target="_blank" rel="external nofollow" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" target="_blank" rel="external nofollow"  class="sister" id="link2"><!-- Lacie --></a> and
    <a href="http://example.com/tillie" target="_blank" rel="external nofollow"  class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>
</html>
'''
soup = BeautifulSoup(html_str, 'lxml')  # html_str is already a str, so from_encoding is not needed
# soup = BeautifulSoup(open('index.html'), 'lxml')  # alternative: parse a file; html_str must first be saved to index.html
print(soup.prettify())

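If no parser is named, BeautifulSoup picks the best HTML parser that is installed and emits a warning about the guess. The code above names 'lxml' explicitly, which requires the lxml package to be installed.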
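As a minimal sketch of naming the parser explicitly, the snippet below uses the standard-library 'html.parser' instead of 'lxml', so no extra package is needed. The `markup` string is a shortened stand-in for html_str, and the variable names are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for html_str above (an assumption for this sketch).
markup = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p class='title'><b>The Dormouse's story</b></p></body></html>"
)

# Naming the parser keeps the result the same on every machine and
# avoids the warning that is raised when BeautifulSoup has to guess.
soup_builtin = BeautifulSoup(markup, "html.parser")

title_text = soup_builtin.title.string
print(title_text)
```

Different parsers can build slightly different trees from invalid markup, so it is usually worth fixing one parser per project.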

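BeautifulSoup converts a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment. The first three are covered below.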
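The four kinds can be seen side by side in a small sketch. The `fragment` string is a cut-down version of the sample document, and the names `demo`, `first_link`, `tillie`, and `elsie` are made up for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Cut-down fragment of the sample document (an assumption for this sketch).
fragment = '<p class="story"><a id="link1"><!-- Elsie --></a><a id="link3">Tillie</a></p>'
demo = BeautifulSoup(fragment, "html.parser")

first_link = demo.a                     # a Tag
tillie = demo.find(id="link3").string   # a NavigableString
elsie = first_link.string               # a Comment, which subclasses NavigableString

kinds = [
    type(demo).__name__,
    type(first_link).__name__,
    type(tillie).__name__,
    type(elsie).__name__,
]
print(kinds)
```

Because Comment is a subclass of NavigableString, code that wants only visible text may need to check for Comment explicitly.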

1) Tag objects:

<title>The Dormouse's story</title>

<a href="http://example.com/elsie" target="_blank" rel="external nofollow" class="sister" id="link1"><!-- Elsie --></a>

The title and a elements above, together with their contents, are Tag objects.

How do we get at them?

Add the following to the code above:
print(soup.title)
print(soup.a)
print(soup.p)

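Result (each attribute returns the first tag with that name in the document):

<title>The Dormouse's story</title>
<a class="sister" href="http://example.com/elsie" id="link1" rel="external nofollow" target="_blank"><!-- Elsie --></a>
<p class="title"><b>The Dormouse's story</b></p>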

Get the name of the soup object and of a tag:

print(soup.name)        # [document]
print(soup.title.name)  # title

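Rename a tag:

print(soup.title)           # <title>The Dormouse's story</title>
soup.title.name = 'mytitle'
print(soup.name)            # [document]
print(soup.mytitle.name)    # mytitle
print(soup.mytitle)         # <mytitle>The Dormouse's story</mytitle>
print(soup.title)           # None: no tag is named title any more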
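The effect of renaming can be checked in isolation with a short sketch. The one-line document and the names `demo`, `renamed`, and `old_lookup` are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# One-line document used only for this sketch.
demo = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")

# Assigning to .name renames the tag in place; the change shows up
# in any markup generated from the tree afterwards.
demo.title.name = "mytitle"

renamed = str(demo.mytitle)
old_lookup = demo.title  # attribute lookup finds nothing under the old name

print(renamed)
print(old_lookup)
```

Looking up a tag name that does not exist returns None rather than raising, so chained access such as `demo.title.name` would fail with AttributeError after the rename.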

Get a tag's attributes:

print(soup.p['class'])      # ['title']
print(soup.p.get('class'))  # ['title']
print(soup.p.attrs)         # {'class': ['title']}

Modify a tag's attribute:

soup.p['class'] = 'myclass'
print(soup.p)

Result:

<p class="myclass"><b>The Dormouse's story</b></p>
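Reading, changing, and deleting attributes can be sketched together as follows. The `id="intro"` attribute and the variable names are hypothetical additions for this example and are not part of the sample document.

```python
from bs4 import BeautifulSoup

# The id attribute is a hypothetical addition so there is something to delete.
demo = BeautifulSoup('<p class="title" id="intro"><b>The Dormouse\'s story</b></p>', "html.parser")
p = demo.p

classes = p["class"]      # class is multi-valued in HTML, so this is a list
missing = p.get("href")   # .get() returns None; p["href"] would raise KeyError

p["class"] = "myclass"    # overwrite an attribute like a dict entry
del p["id"]               # remove an attribute entirely

after = str(p)
print(classes, missing)
print(after)
```

Using `.get()` is the safer choice when an attribute may be absent, since it avoids the KeyError that subscript access raises.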

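2) NavigableString objects:

Get the text inside a tag:

print(soup.p.string)  # The Dormouse's story

BeautifulSoup wraps the text inside a tag in the NavigableString class. A NavigableString behaves like an ordinary Python Unicode string and can be converted to one directly with str() (unicode() in Python 2).

unicode_string = str(soup.p.string)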
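A short sketch of what the conversion changes: a NavigableString still knows where it sits in the tree, while the plain str does not. The two small documents and the variable names are assumptions made for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

demo = BeautifulSoup("<p><b>The Dormouse's story</b></p>", "html.parser")

nav = demo.p.string   # p has a single child <b>, so .string reaches its text
plain = str(nav)      # plain Python str with no link back to the tree

parent_name = nav.parent.name   # the NavigableString remembers its parent tag

# .string is None when a tag has more than one child.
mixed = BeautifulSoup("<p>Once <b>upon</b> a time</p>", "html.parser")
mixed_string = mixed.p.string

print(plain, parent_name, mixed_string)
```

Converting to str is useful when the text is kept after parsing, because a NavigableString holds a reference to the whole tree it came from.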

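3) BeautifulSoup objects:

The BeautifulSoup object represents the document as a whole. It is not a real HTML or XML tag, so it has no name or attributes of its own.

To keep the interface consistent with Tag, it still exposes both: name is the special value '[document]' and attrs is an empty dict.

print(type(soup.name))  # <class 'str'>
print(soup.name)        # [document]
print(soup.attrs)       # {}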
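Because the BeautifulSoup object shares the Tag interface, code written for tags generally works on the whole document too. A minimal sketch, with a throwaway document and made-up variable names:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

demo = BeautifulSoup("<p>text</p>", "html.parser")

doc_name = demo.name     # special placeholder name for the document
doc_attrs = demo.attrs   # empty, since the document itself has no attributes
is_tag = isinstance(demo, Tag)   # BeautifulSoup is a subclass of Tag

# Tag-style navigation works on the document object as well.
first_p = demo.find("p")

print(doc_name, doc_attrs, is_tag, first_p)
```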
