
BeautifulSoup usage notes

from bs4 import BeautifulSoup
html_str='''
<!DOCTYPE html>
<html >
<head>
    <meta charset="UTF-8">
    <title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" target="_blank" rel="external nofollow"  class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" target="_blank" rel="external nofollow"  class="sister" id="link2"><!-- Lacie --></a> and
    <a href="http://example.com/tillie" target="_blank" rel="external nofollow"  class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>
</html>
'''
soup = BeautifulSoup(html_str,'lxml',from_encoding='utf-8')
# soup = BeautifulSoup(open('index.html'))  # alternative: html_str must first be saved to index.html
print soup.prettify()
           

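When no parser is named, BeautifulSoup picks the best one installed (lxml first, then html5lib, then Python's built-in html.parser). The code above names lxml explicitly.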
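A minimal sketch of naming the parser explicitly instead of relying on auto-selection. It assumes Python 3 and uses the standard library's html.parser so that lxml is not required; the `snippet` string is a hypothetical fragment, not the document used in these notes.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for this illustration.
snippet = "<p class='title'><b>The Dormouse's story</b></p>"

# Naming the parser avoids depending on whichever parser happens to be installed.
soup = BeautifulSoup(snippet, "html.parser")

first_p = soup.p
print(first_p.name)      # name of the first <p> tag
print(first_p.b.string)  # text inside its <b> child
```

Pinning the parser keeps results reproducible across machines, since different parsers can build slightly different trees from the same broken HTML.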

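BeautifulSoup converts a complex HTML document into a tree in which every node is a Python object. These objects fall into four kinds, described below.

1) Tag objects: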

<title>The Dormouse's story</title>

<a href="http://example.com/elsie" target="_blank" rel="external nofollow" class="sister" id="link1"><!-- Elsie --></a>

Each of the title and a elements above is represented as a Tag object.

How do we get at them?

Add the following to the code above:
print soup.title
print soup.a
print soup.p      

Output:

<title>The Dormouse's story</title>

<a class="sister" href="http://example.com/elsie" target="_blank" rel="external nofollow" id="link1"><!-- Elsie --></a>

<p class="title"><b>The Dormouse's story</b></p>
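Attribute-style access such as soup.a returns only the first matching tag. A small sketch of that behaviour next to `find_all`, which returns every match; it assumes Python 3 and html.parser, and the `html` string is a hypothetical cut-down version of the sample document.

```python
from bs4 import BeautifulSoup

# Hypothetical cut-down document with two links.
html = '<p><a id="link1">Elsie</a><a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a                                   # first <a> only
first_id = first["id"]
all_ids = [a["id"] for a in soup.find_all("a")]  # every <a> in document order

print(first_id)
print(all_ids)
```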

Getting the object name and the tag name:

print soup.name        # [document]
print soup.title.name  # title

Renaming a tag:

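print soup.title             # <title>The Dormouse's story</title>
soup.title.name = 'mytitle'  # rename the tag in place
print soup.name              # [document]
print soup.mytitle.name      # mytitle
print soup.mytitle           # <mytitle>The Dormouse's story</mytitle>
print soup.title             # None: no <title> tag remains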
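The rename is easy to get wrong, because the old lookup silently returns None afterwards. A minimal sketch, assuming Python 3 and html.parser, with a hypothetical one-tag document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")

tag = soup.title
tag.name = "mytitle"       # rename the tag in place

old_lookup = soup.title    # nothing is named <title> any more
new_lookup = soup.mytitle  # the same Tag object, found under its new name

print(old_lookup)
print(new_lookup.name)
```

Holding on to the `tag` reference, as done here, avoids having to look the element up again by its new name.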

Reading tag attributes:

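print soup.p['class']      # class is multi-valued, so this is a list
print soup.p.get('class')  # same value; get() returns None if the attribute is missing
print soup.p.attrs         # dict of all attributes on the tag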
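The three access styles differ mainly in how they treat multi-valued and missing attributes. A short sketch, assuming Python 3 and html.parser; the `id` and `lang` attributes are hypothetical additions for the illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title" id="intro">text</p>', "html.parser")
p = soup.p

class_value = p["class"]  # multi-valued attribute: a list of class names
id_value = p.get("id")    # single-valued attribute: a plain string
missing = p.get("lang")   # absent attribute: None, where p["lang"] would raise KeyError

print(class_value, id_value, missing)
```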

Modifying a tag attribute:

soup.p['class']='myclass'
print soup.p      

Output:

<p class="myclass"><b>The Dormouse's story</b></p>

2) NavigableString objects:

Getting the text inside a tag:

print soup.p.string      

BeautifulSoup wraps the string inside a tag in the NavigableString class. A NavigableString behaves like a Python Unicode string, and calling unicode() on it converts it to a plain Unicode string.

unicode_string = unicode(soup.p.string)
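unicode() exists only in Python 2. A sketch of the same conversion under the assumption of Python 3, where str() plays that role; html.parser and the one-line document are also assumptions for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<p><b>The Dormouse's story</b></p>", "html.parser")

text_node = soup.p.string  # NavigableString, still attached to the parse tree
plain = str(text_node)     # plain str, with no reference back to the tree

print(type(text_node).__name__)
print(type(plain).__name__)
```

Converting is worthwhile when the string is kept after parsing, since a NavigableString holds a reference to the whole tree.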

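3) BeautifulSoup objects:

The BeautifulSoup object represents the document as a whole. It is not a real HTML or XML tag, so it has no genuine name or attributes.

To keep its interface consistent with Tag, it still exposes name and attrs: name is the special value [document] and attrs is empty.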

print type(soup.name)
print soup.name
print soup.attrs      

Output:

<type 'unicode'>

[document]

{}
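The same check can be sketched under Python 3 (an assumption, as is the use of html.parser), where the type of the name is str rather than unicode:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>text</p>", "html.parser")

doc_name = soup.name    # placeholder name for the whole document
doc_attrs = soup.attrs  # the document object carries no attributes

print(type(doc_name).__name__)
print(doc_name)
print(doc_attrs)
```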

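4) Comment objects: comments in the document

print soup.a.string  ====> output: Elsie

print type(soup.a.string)  ====> output: <class 'bs4.element.Comment'>

The printed text no longer carries the comment markers, so check the string's type when you need to tell comments apart from ordinary text: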

from bs4.element import Comment  # the code above imports only BeautifulSoup
if isinstance(soup.a.string, Comment):
    print soup.a.string
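The type check matters most when collecting text from many tags at once. A minimal sketch that keeps only the visible link text and skips comments, assuming Python 3 and html.parser; the `html` string is a hypothetical cut-down version of the sample document and `visible` is a name introduced for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical cut-down document: one link holds a comment, one holds real text.
html = '<a id="link1"><!-- Elsie --></a><a id="link3">Tillie</a>'
soup = BeautifulSoup(html, "html.parser")

visible = []
for a in soup.find_all("a"):
    s = a.string
    # Comment is a subclass of NavigableString, so isinstance separates the two.
    if s is not None and not isinstance(s, Comment):
        visible.append(str(s))

print(visible)
```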
