BeautifulSoup usage notes

2023-08-05 14:41:49

from bs4 import BeautifulSoup
html_str='''
<!DOCTYPE html>
<html >
<head>
    <meta charset="UTF-8">
    <title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" target="_blank" rel="external nofollow"  class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" target="_blank" rel="external nofollow"  class="sister" id="link2"><!-- Lacie --></a> and
    <a href="http://example.com/tillie" target="_blank" rel="external nofollow"  class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>
</html>
'''
soup = BeautifulSoup(html_str, 'lxml')  # html_str is already text, so from_encoding is not needed
# soup = BeautifulSoup(open('index.html'), 'lxml')  # alternative: requires html_str to be saved to index.html first
print(soup.prettify())

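If no parser is named, BeautifulSoup picks the best parser installed on the machine and issues a warning about the guess. The code above names lxml explicitly, so lxml must be installed.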
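As a minimal sketch of naming the parser explicitly: the snippet below uses Python's built-in 'html.parser' backend so that it runs without lxml. The short markup string and the variable names are assumptions made for illustration, not part of the example above.

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption for this sketch, not the html_str above).
markup = "<p class='title'><b>The Dormouse's story</b></p>"

# Naming the parser keeps the result reproducible across machines.
# 'html.parser' ships with Python; 'lxml' has to be installed separately.
soup_builtin = BeautifulSoup(markup, "html.parser")

print(soup_builtin.p.b.string)   # The Dormouse's story
print(soup_builtin.p["class"])   # ['title']
```

Different parsers can build slightly different trees from malformed HTML, which is one reason to name the parser rather than rely on the automatic choice.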

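BeautifulSoup converts a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment. The first three are covered below.

1) Tag object: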

<title>The Dormouse's story</title>

<a href="http://example.com/elsie" target="_blank" rel="external nofollow" class="sister" id="link1"><!-- Elsie --></a>

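The title and a elements shown above, together with their contents, are Tag objects.

How do you get at them?

Add the following to the code above:
print(soup.title)
print(soup.a)
print(soup.p)

Result: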

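<title>The Dormouse's story</title>
<a href="http://example.com/elsie" target="_blank" rel="external nofollow" class="sister" id="link1"><!-- Elsie --></a>
<p class="title"><b>The Dormouse's story</b></p>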
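Accessing a tag as an attribute of the soup returns only the first matching element. The sketch below illustrates this and contrasts it with find_all(), which returns every match. The trimmed markup and the variable names are assumptions made for illustration, and 'html.parser' is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Trimmed, illustrative markup (an assumption, not the full html_str above).
markup = """
<p class="story">
  <a href="http://example.com/elsie" id="link1">Elsie</a>
  <a href="http://example.com/lacie" id="link2">Lacie</a>
  <a href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(markup, "html.parser")

first_link = soup.a              # first <a> only, same as soup.find("a")
all_links = soup.find_all("a")   # every <a>, in document order
missing = soup.table             # None: there is no <table> in the markup

print(first_link["id"])                 # link1
print([a["id"] for a in all_links])     # ['link1', 'link2', 'link3']
print(missing)                          # None
```

Because a missing tag yields None rather than an exception, it is worth checking the result before chaining further attribute access onto it.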

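Getting the name of the soup object and of a tag:

print(soup.name)         # [document]
print(soup.title.name)   # title

Renaming a tag:

print(soup.title)            # <title>The Dormouse's story</title>
soup.title.name = 'mytitle'
print(soup.name)             # [document]
print(soup.mytitle.name)     # mytitle
print(soup.mytitle)          # <mytitle>The Dormouse's story</mytitle>
print(soup.title)            # None: no tag is called title any more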
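The renaming steps above can be sketched as a self-contained snippet. It assumes a one-line document and the 'html.parser' backend (both chosen for illustration) so that the serialized result is easy to read.

```python
from bs4 import BeautifulSoup

# Illustrative one-line document (an assumption for this sketch).
soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")

tag = soup.title
tag.name = "mytitle"            # rename the element in place

renamed = soup.mytitle          # the same element, found under its new name
old_lookup = soup.title         # None: no <title> element is left

print(renamed)                  # <mytitle>The Dormouse's story</mytitle>
print(old_lookup)               # None
print(str(soup))                # <head><mytitle>The Dormouse's story</mytitle></head>
```

Renaming changes the tree itself, so the new name also appears whenever the document is serialized.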

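Getting tag attributes:

print(soup.p['class'])       # ['title']
print(soup.p.get('class'))   # ['title']
print(soup.p.attrs)          # {'class': ['title']}

Modifying tag attributes:

soup.p['class'] = 'myclass'
print(soup.p)

Result:

<p class="myclass"><b>The Dormouse's story</b></p>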
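A tag's attributes behave much like a dictionary, so they can be read, overwritten, added, and deleted. The following is a minimal sketch under assumed markup: the id and lang attributes are hypothetical additions for illustration, and 'html.parser' is used so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Illustrative markup; the id attribute is an assumption added for this sketch.
markup = '<p class="title" id="intro"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(markup, "html.parser")
p = soup.p

original_class = p["class"]     # ['title']: class is multi-valued, so a list
missing = p.get("lang")         # None: .get() does not raise on a missing key
print(original_class, p["id"], missing)

p["class"] = "myclass"          # overwrite an existing attribute
p["lang"] = "en"                # add a new attribute
del p["id"]                     # remove an attribute

print(p.attrs)                  # {'class': 'myclass', 'lang': 'en'}
print(p)                        # <p class="myclass" lang="en"><b>The Dormouse's story</b></p>
```

Indexing with p["lang"] before the attribute exists would raise KeyError, which is why .get() is the safer choice for optional attributes.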

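2) NavigableString object:

Getting the text inside a tag:

print(soup.p.string)

BeautifulSoup wraps the text inside a tag in the NavigableString class. A NavigableString behaves like a Python Unicode string, and it can be converted to a plain Unicode string with str() (unicode() in Python 2).

unicode_string = str(soup.p.string)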
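To make the distinction concrete, the sketch below inspects the type returned by .string and converts it to a plain str. It also shows that the text of a tag containing only a comment comes back as a Comment, a subclass of NavigableString. The markup is a shortened, assumed version of the example document.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Shortened, illustrative markup (an assumption for this sketch).
markup = '<p class="title"><b>The Dormouse\'s story</b></p><a id="link1"><!-- Elsie --></a>'
soup = BeautifulSoup(markup, "html.parser")

text = soup.p.string                 # NavigableString, still attached to the tree
plain = str(text)                    # plain Python string, detached from the tree

print(type(text).__name__)           # NavigableString
print(isinstance(text, str))         # True
print(type(plain).__name__)          # str

comment = soup.a.string              # the <a> holds only a comment
print(isinstance(comment, Comment))  # True
print(comment.strip())               # Elsie
```

Converting to a plain string is useful when the text must outlive the parsed document, since a NavigableString keeps a reference to the whole tree.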

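3) BeautifulSoup object:

The BeautifulSoup object represents the document as a whole. It is not a real HTML or XML tag, so it has no tag name or attributes of its own.

To keep its interface consistent with Tag, it still exposes name and attrs: name is the special value '[document]' and attrs is an empty dict.

print(type(soup.name))
print(soup.name)
print(soup.attrs)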
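A small self-contained check of these values, assuming a one-line document and the 'html.parser' backend (both chosen only for illustration):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative one-line document (an assumption for this sketch).
soup = BeautifulSoup("<title>The Dormouse's story</title>", "html.parser")

print(type(soup).__name__)     # BeautifulSoup
print(isinstance(soup, Tag))   # True: BeautifulSoup subclasses Tag
print(soup.name)               # [document]
print(soup.attrs)              # {}
print(soup.title.name)         # title
```

Because the soup object shares the Tag interface, code that walks the tree can treat the root and ordinary tags uniformly.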
