
BeautifulSoup usage notes

from bs4 import BeautifulSoup
html_str='''
<!DOCTYPE html>
<html >
<head>
    <meta charset="UTF-8">
    <title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" target="_blank" rel="external nofollow"  class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" target="_blank" rel="external nofollow"  class="sister" id="link2"><!-- Lacie --></a> and
    <a href="http://example.com/tillie" target="_blank" rel="external nofollow"  class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>
</html>
'''
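# from_encoding only applies to byte strings; it is ignored (with a warning) for unicode text
soup = BeautifulSoup(html_str, 'lxml', from_encoding='utf-8')
# soup = BeautifulSoup(open('index.html'))  # to use this form, first write html_str to index.html
print(soup.prettify())
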

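If no parser is named, BeautifulSoup picks the best HTML parser that is installed (lxml first, then html5lib, then Python's built-in html.parser). The code above asks for lxml explicitly.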
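The parser can also be chosen by name, which keeps results the same on machines with different libraries installed. A minimal sketch, assuming Python 3 and only the built-in html.parser; the one-line `snippet` is made up for illustration and is not the document used in these notes:

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, used only for this illustration.
snippet = "<html><head><title>The Dormouse's story</title></head><body></body></html>"

# Naming the parser avoids depending on which parsers happen to be installed.
# 'html.parser' ships with Python; 'lxml' and 'html5lib' are separate installs.
soup_builtin = BeautifulSoup(snippet, "html.parser")

title_text = soup_builtin.title.string
print(title_text)
```

Swapping "html.parser" for "lxml" should give the same title here, though the parsers can differ in how they repair badly formed markup.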

BeautifulSoup turns a complex HTML document into a tree in which every node is a Python object. These objects fall into four kinds, described below.

1) Tag objects:

<title>The Dormouse's story</title>

<a href="http://example.com/elsie" target="_blank" rel="external nofollow" class="sister" id="link1"><!-- Elsie --></a>

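The title and a elements above, including their contents, are each represented as a Tag object.

How do we get hold of them?

Add the following to the code above:
print(soup.title)
print(soup.a)
print(soup.p)

Output: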

<title>The Dormouse's story</title>

<a class="sister" href="http://example.com/elsie" target="_blank" rel="external nofollow" id="link1"><!-- Elsie --></a>

<p class="title"><b>The Dormouse's story</b></p>
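Note that attribute-style access such as soup.a only returns the first matching tag. A small sketch of the difference, assuming Python 3 and the built-in html.parser, with a trimmed-down copy of the three links; find_all is the standard BeautifulSoup call for collecting every match:

```python
from bs4 import BeautifulSoup

# Trimmed-down, illustrative version of the "three sisters" links.
snippet = """
<p class="story">
  <a href="http://example.com/elsie" id="link1">Elsie</a>
  <a href="http://example.com/lacie" id="link2">Lacie</a>
  <a href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
demo = BeautifulSoup(snippet, "html.parser")

first_link = demo.a              # only the first <a> in document order
all_links = demo.find_all("a")   # every <a> in the document
no_such_tag = demo.table         # None when no tag has that name

print(first_link["id"])
print([link["id"] for link in all_links])
```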

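Getting the object name and the tag name:

print(soup.name)         # [document]
print(soup.title.name)   # title

Renaming a tag:

print(soup.title)          # <title>The Dormouse's story</title>
soup.title.name = 'mytitle'
print(soup.name)           # [document]
print(soup.mytitle.name)   # mytitle
print(soup.mytitle)        # <mytitle>The Dormouse's story</mytitle>
print(soup.title)          # None: no tag is named title any more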
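The rename happens in place on the tree, so the old name stops matching. A self-contained sketch of that behaviour, assuming Python 3 and html.parser (the short document is illustrative only):

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)

title_tag = demo.title
title_tag.name = "mytitle"    # rename the tag in place

old_lookup = demo.title       # None: nothing is called 'title' any more
new_lookup = demo.mytitle     # the same Tag object, found under its new name
rendered = str(new_lookup)

print(old_lookup)
print(rendered)
```

Because lookups by the old name return None afterwards, code that runs after a rename should guard against None before touching attributes.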

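Reading tag attributes:

print(soup.p['class'])       # list of class names, because class is a multi-valued attribute
print(soup.p.get('class'))   # same value; get() returns None if the attribute is absent
print(soup.p.attrs)          # dict of all attributes on the tag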
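The three access styles differ mainly in how they treat multi-valued and missing attributes. A hedged sketch, assuming Python 3 and html.parser; the p tag and its id and lang attributes are invented for the example:

```python
from bs4 import BeautifulSoup

# Illustrative tag with one multi-valued and one single-valued attribute.
demo = BeautifulSoup('<p class="title story" id="intro">text</p>', "html.parser")
p = demo.p

classes = p["class"]       # multi-valued attribute: a list of strings
ident = p["id"]            # single-valued attribute: a plain string
missing = p.get("lang")    # get() returns None for an absent attribute
all_attrs = dict(p.attrs)  # copy of the full attribute mapping

try:
    p["lang"]              # subscript access raises for an absent attribute
    raised = False
except KeyError:
    raised = True

print(classes, ident, missing, raised)
```

Using get() is the safer choice when an attribute may not be present on every tag.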

Modifying tag attributes:

soup.p['class'] = 'myclass'
print(soup.p)

Output:

<p class="myclass"><b>The Dormouse's story</b></p>
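A tag's attributes behave like a dictionary, so they can be added and deleted as well as overwritten. A minimal sketch under the same assumptions (Python 3, html.parser, an invented id value):

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup('<p class="title">text</p>', "html.parser")
p = demo.p

p["class"] = "myclass"   # overwrite an existing attribute
p["id"] = "intro"        # add a new attribute (hypothetical value)
after_set = str(p)

del p["class"]           # remove an attribute entirely
after_del = str(p)

print(after_set)
print(after_del)
```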

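2) NavigableString objects:

Getting the text inside a tag:

print(soup.p.string)   # The Dormouse's story

BeautifulSoup wraps the text inside a tag in the NavigableString class. A NavigableString behaves like a Python Unicode string, and the unicode() function (str() in Python 3) converts it directly into a plain Unicode string.

unicode_string = unicode(soup.p.string)  # Python 2; in Python 3 use str(soup.p.string)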
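One detail worth knowing: .string only returns a value when the tag contains a single piece of text (possibly nested inside a single child); otherwise it is None. The sketch below assumes Python 3 and html.parser, uses a made-up two-paragraph document, and relies on get_text(), the standard call for joining all text under a tag:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

demo = BeautifulSoup(
    "<p><b>The Dormouse's story</b></p><p>one <i>two</i></p>",
    "html.parser",
)
first_p, second_p = demo.find_all("p")

nav = first_p.string           # single child <b> with a single string: returned
plain = str(nav)               # plain Python 3 str, detached from the tree
mixed = second_p.string        # None: this <p> has more than one child
joined = second_p.get_text()   # concatenates all the text under the tag

print(type(nav).__name__, plain, mixed, joined)
```

Converting to a plain string is useful when the text must outlive the parsed tree, since a NavigableString keeps a reference to its parent.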

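3) BeautifulSoup objects:

The BeautifulSoup object represents the whole document. It is not a real HTML or XML tag, so it has no genuine name or attributes.

To keep its interface consistent with Tag, it still exposes name and attrs:

print(type(soup.name))
print(soup.name)
print(soup.attrs)

Output (Python 2; under Python 3 the first line is <class 'str'>):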

<type 'unicode'>

[document]

{}
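The same check can be reproduced on any parsed document. A short sketch, assuming Python 3 and html.parser, which also shows that the BeautifulSoup class is a subclass of Tag in bs4:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

demo = BeautifulSoup("<p>text</p>", "html.parser")

doc_name = demo.name             # special placeholder name for the document
doc_attrs = demo.attrs           # empty: the document itself has no attributes
is_tag = isinstance(demo, Tag)   # BeautifulSoup subclasses Tag

print(doc_name, doc_attrs, is_tag)
```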

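4) Comment objects: comments in the document

print(soup.a.string)         # output: Elsie

print(type(soup.a.string))   # output: <class 'bs4.element.Comment'>

Because a comment is returned just like ordinary text, check the string's type when you need to pick out comments:

from bs4.element import Comment

if isinstance(soup.a.string, Comment):
    print(soup.a.string)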
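The same type check can separate commented-out text from visible text across several tags. A minimal sketch, assuming Python 3 and html.parser, with a cut-down version of the links (the variable names are illustrative):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Cut-down version of the links: one holds a comment, one holds visible text.
snippet = '<p><a id="link1"><!-- Elsie --></a><a id="link3">Tillie</a></p>'
demo = BeautifulSoup(snippet, "html.parser")

visible = []
comments = []
for link in demo.find_all("a"):
    text = link.string
    if isinstance(text, Comment):
        comments.append(str(text).strip())   # comment text, padding removed
    elif text is not None:
        visible.append(str(text))            # ordinary visible text

print(visible, comments)
```

isinstance is used rather than comparing type() directly, since Comment is itself a subclass of NavigableString.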
