from bs4 import BeautifulSoup
html_str='''
<!DOCTYPE html>
<html >
<head>
<meta charset="UTF-8">
<title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" target="_blank" rel="external nofollow" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" target="_blank" rel="external nofollow" class="sister" id="link2"><!-- Lacie --></a> and
<a href="http://example.com/tillie" target="_blank" rel="external nofollow" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>
</html>
'''
soup = BeautifulSoup(html_str,'lxml',from_encoding='utf-8')
# soup = BeautifulSoup(open('index.html'))  # alternative: requires html_str to be saved to index.html first
print(soup.prettify())
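If no parser is specified, BeautifulSoup automatically picks the best parser that is installed; the code above requests lxml explicitly.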
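As a minimal sketch of naming the parser explicitly (assuming Python 3; the built-in "html.parser" is used here instead of lxml so the snippet runs without extra packages, and html_doc is a shortened stand-in for the document above):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for html_str (an assumption made for brevity).
html_doc = "<p class='title'><b>The Dormouse's story</b></p>"

# Naming the parser keeps the result independent of which parsers
# happen to be installed. "html.parser" ships with Python itself.
soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.p
print(first_p)  # <p class="title"><b>The Dormouse's story</b></p>
```

Different parsers can build slightly different trees from invalid HTML, so fixing the parser name is usually the safer habit.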
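BeautifulSoup converts a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four types, described below.
1) Tag objects: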
<title>The Dormouse's story</title>
<a href="http://example.com/elsie" target="_blank" rel="external nofollow" class="sister" id="link1"><!-- Elsie --></a>
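Elements such as title and a, together with their contents, are Tag objects.
How do we retrieve them?
Add the following to the code above:
print(soup.title)
print(soup.a)
print(soup.p)
Result (each expression returns the first tag with that name):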
<title>The Dormouse's story</title>
<a class="sister" href="http://example.com/elsie" target="_blank" rel="external nofollow" id="link1"><!-- Elsie --></a>
<p class="title"><b>The Dormouse's story</b></p>
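The "first match" behaviour can be checked with a small sketch (assuming Python 3 and the built-in "html.parser"; html_doc is a trimmed version of the story paragraph, and find() / find_all() are the regular BeautifulSoup search methods):

```python
from bs4 import BeautifulSoup

# Trimmed version of the three sister links (an assumption for brevity).
html_doc = """
<p class="story">
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
<a href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Attribute-style access returns only the first matching tag.
first_link = soup.a
print(first_link["id"])  # link1

# soup.a behaves like soup.find("a"); find_all() returns every match.
same_link = soup.find("a")
all_ids = [a["id"] for a in soup.find_all("a")]
print(all_ids)
```

So soup.a is a shortcut for the first link only; to visit all three links, find_all("a") is needed.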
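Getting the object name and the tag name:
print(soup.name)           # [document]
print(soup.title.name)     # title
Changing a tag's name:
print(soup.title)
soup.title.name = 'mytitle'
print(soup.name)
print(soup.mytitle.name)   # mytitle
print(soup.mytitle)
print(soup.title)          # None: the tag is now called mytitle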
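A self-contained sketch of the rename (assuming Python 3 and the built-in "html.parser"; the one-line document is a stand-in for the full page):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>",
                     "html.parser")

tag = soup.title
tag.name = "mytitle"       # rename the tag in place

renamed = soup.mytitle     # the tag is now found under its new name
old_lookup = soup.title    # no <title> tag is left in the tree

print(renamed)             # <mytitle>The Dormouse's story</mytitle>
print(old_lookup)          # None
```

The rename changes the tree itself, so any later lookup by the old name returns None.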
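Getting a tag's attributes:
print(soup.p['class'])       # a list, because class can hold several values
print(soup.p.get('class'))   # same value; get() returns None if the attribute is missing
print(soup.p.attrs)          # dict of all attributes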
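The three access styles can be compared in a short sketch (assuming Python 3 and the built-in "html.parser"; the id and the second class value are added here purely for illustration):

```python
from bs4 import BeautifulSoup

# The extra class value and the id are illustrative assumptions.
soup = BeautifulSoup('<p class="title story" id="intro">text</p>',
                     "html.parser")
p = soup.p

classes = p["class"]      # multi-valued attribute: returned as a list
ident = p["id"]           # single-valued attribute: returned as a string
missing = p.get("lang")   # get() returns None instead of raising KeyError

print(classes)
print(ident)
print(missing)
print(p.attrs)
```

Indexing with p["lang"] would raise KeyError here, which is why get() is the safer choice when an attribute may be absent.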
Modifying a tag's attributes:
soup.p['class'] = 'myclass'
print(soup.p)
Result:
<p class="myclass"><b>The Dormouse's story</b></p>
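Attributes behave like dictionary entries, so they can also be added and deleted. A minimal sketch, assuming Python 3 and the built-in "html.parser" (the id attribute is introduced only for the demonstration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title"><b>The Dormouse\'s story</b></p>',
                     "html.parser")
p = soup.p

p["class"] = "myclass"   # overwrite an existing attribute
p["id"] = "intro"        # add a new attribute (illustrative)
del p["id"]              # remove it again

print(p)  # <p class="myclass"><b>The Dormouse's story</b></p>
```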
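2) NavigableString objects:
Getting the text inside a tag:
print(soup.p.string)
BeautifulSoup wraps the text inside a tag in the NavigableString class, which behaves like a Python Unicode string. In Python 2, unicode() converts a NavigableString directly into a plain Unicode string (in Python 3, use str() instead).
unicode_string = unicode(soup.p.string)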
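The same conversion under Python 3 might look like this (a sketch assuming Python 3 and the built-in "html.parser"; text_node and plain_text are illustrative names):

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<p><b>The Dormouse's story</b></p>", "html.parser")

# <p> has a single child <b>, which has a single string,
# so .string reaches through to that string.
text_node = soup.p.string

# str() gives a plain string that no longer refers to the parse tree.
plain_text = str(text_node)

print(type(text_node).__name__)   # NavigableString
print(type(plain_text).__name__)  # str
```

Converting to a plain string is useful when the text is kept after the soup is discarded, since a NavigableString keeps a reference to the whole tree.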
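3) BeautifulSoup object:
The BeautifulSoup object represents the whole document rather than a real HTML or XML tag, so it has no tag name or attributes of its own.
To keep its interface consistent with Tag objects, it still exposes name and attrs:
print(type(soup.name))
print(soup.name)
print(soup.attrs)
Result (under Python 2):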
<type 'unicode'>
[document]
{}
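The shared interface comes from the BeautifulSoup class being a subclass of Tag, which a short sketch can confirm (assuming Python 3 and the built-in "html.parser"; the one-line document is a stand-in):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p>text</p>", "html.parser")

is_tag_like = isinstance(soup, Tag)   # BeautifulSoup subclasses Tag
root_of_p = soup.p.parent             # top-level tags hang off the soup object

print(is_tag_like)
print(soup.name)          # [document]
print(soup.attrs)         # {}
print(root_of_p is soup)
```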
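4) Comment objects: comments in the document
print(soup.a.string)        # output: Elsie (the comment text, without the <!-- --> markers)
print(type(soup.a.string))  # output: <class 'bs4.element.Comment'>
Because .string returns a comment's text just like ordinary text, check the node's type when you need to pick out comments:
import bs4
if isinstance(soup.a.string, bs4.element.Comment):
    print(soup.a.string)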
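The type check can be applied to every link to separate comments from ordinary text. A minimal sketch, assuming Python 3 and the built-in "html.parser" (html_doc keeps only two of the links, and results is an illustrative name):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Two of the sister links: one holds a comment, one holds plain text.
html_doc = """
<a id="link1"><!-- Elsie --></a>
<a id="link3">Tillie</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

results = []
for link in soup.find_all("a"):
    node = link.string
    # Comment is a subclass of NavigableString, so isinstance()
    # tells the two apart even though both print as text.
    results.append((link["id"], isinstance(node, Comment)))

print(results)
```

Without the check, the text of link1 would be treated as visible content even though a browser never displays it.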