1-Introduction to the BeautifulSoup4 library
Note: once a document is loaded, BeautifulSoup builds the whole parse tree up front, so it carries more overhead; lxml is written in C and is faster.
2-Basic usage of the BeautifulSoup4 library
Simple usage:
from bs4 import BeautifulSoup
html = """
<a href="https://www.doutula.com/article/detail/6394359" target="_blank" rel="external nofollow" class="list-group-item random_list tg-article">
<div class="random_title">鬥圖<div class="date">2020-09-07</div>
</div>
<div class="random_article">
<div class="col-xs-6 col-sm-3">
<img referrerpolicy="no-referrer" class="lazy image_dtb img-responsive loaded"
src="http://img.doutula.com/production/uploads/image/2020/09/07/20200907463024_HBeQpW.jpg"
data-original="http://img.doutula.com/production/uploads/image/2020/09/07/20200907463024_HBeQpW.jpg"
data-backup="http://img.doutula.com/production/uploads/image/2020/09/07/20200907463024_HBeQpW.jpg"
alt="" data-was-processed="true">
<p></p>
</div>
<div class="col-xs-6 col-sm-3">
<img referrerpolicy="no-referrer" class="lazy image_dtb img-responsive loaded"
src="http://img.doutula.com/production/uploads/image/2020/09/07/20200907463024_kIOgEX.png"
data-original="http://img.doutula.com/production/uploads/image/2020/09/07/20200907463024_kIOgEX.png"
data-backup="http://img.doutula.com/production/uploads/image/2020/09/07/20200907463024_kIOgEX.png"
alt="" data-was-processed="true">
<p></p>
</div>
</div>
</a>
"""
# Create a BeautifulSoup object
# Use lxml as the parser
soup = BeautifulSoup(html, 'lxml')  # missing tags such as <html> and <body> are completed automatically
print(soup.prettify())  # pretty-print the tree with indentation
Parsers:
Note: html5lib is the most fault-tolerant parser (it automatically repairs non-standard HTML).
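The fault tolerance is easy to see with a deliberately broken fragment. A minimal sketch using the built-in html.parser (lxml or html5lib can be swapped in; html5lib repairs the most, but both are separate installs):

```python
from bs4 import BeautifulSoup

# A deliberately malformed fragment: unclosed <li> tags, no <html>/<body>.
broken = "<ul><li>first<li>second"

soup = BeautifulSoup(broken, "html.parser")
print(soup.prettify())           # the parser still produces a well-formed tree
print(len(soup.find_all("li")))  # both <li> elements are found: 2
```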
Four commonly used objects:
Beautiful Soup turns a complex HTML document into a tree structure in which every node is a Python object. All objects fall into four types:
- Tag
- NavigableString
- BeautifulSoup
- Comment
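All four types can be observed in one small parse; a sketch (html.parser is used here so no extra install is needed):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup("<p>text<!--note--></p>", "html.parser")
p = soup.find("p")

print(type(soup).__name__)                         # BeautifulSoup: the whole document
print(isinstance(p, Tag))                          # True: element nodes are Tags
print(isinstance(p.contents[0], NavigableString))  # True: the text node "text"
print(isinstance(p.contents[1], Comment))          # True: the comment node
```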
3-Extracting data with BeautifulSoup4
# Requirements:
# 1. Get all tr tags
# 2. Get the 2nd tr tag
# 3. Get all tags whose class equals "even"
# 4. Get all a tags whose id equals "test" and whose class also equals "test"
# 5. Get the href attribute of every a tag
# 6. Get all job postings (plain text)
from bs4 import BeautifulSoup
# html = open('tencent.html', encoding='utf-8').read()
soup = BeautifulSoup(html, 'lxml')
# 1. Get all tr tags
trs = soup.find_all('tr')
for tr in trs:
    print(tr)
    print('='*30)
    print(type(tr))  # bs4.element.Tag (its string conversion is overridden, so it prints as HTML)
# 2. Get the 2nd tr tag
# trs = soup.find_all('tr', limit=2)  # returns a list; limit caps how many elements are returned
tr = soup.find_all('tr', limit=2)[1]
print(tr)
# 3. Get all tags whose class equals "even"
# trs = soup.find_all('tr', class_='even')  # class is a Python keyword, so the argument is spelled class_
# Equivalent attrs form:
trs = soup.find_all('tr', attrs={'class': 'even'})
for tr in trs:
    print(tr)
    print('='*30)
# 4. Get all a tags whose id equals "test" and whose class also equals "test"
# aList = soup.find_all('a', id='test', class_='test')
aList = soup.find_all('a', attrs={'id': 'test', 'class': 'test'})
for a in aList:
    print(a)
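Requirement 4 can be checked end-to-end on a hypothetical fragment (the tags below are made up): only the first a tag satisfies both filters, and the keyword form with class_ returns the same result as the attrs form.

```python
from bs4 import BeautifulSoup

# Made-up fragment: only the first <a> carries both id="test" and class="test".
html = """
<a id="test" class="test" href="/a">both</a>
<a id="test" href="/b">id only</a>
<a class="test" href="/c">class only</a>
"""
soup = BeautifulSoup(html, "html.parser")

by_kw = soup.find_all("a", id="test", class_="test")
by_attrs = soup.find_all("a", attrs={"id": "test", "class": "test"})

print([a["href"] for a in by_kw])  # ['/a']: filters are combined with AND
print(by_kw == by_attrs)           # True: both spellings are equivalent
```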
# 5. Get the href attribute of every a tag
aList = soup.find_all('a')
for a in aList:
    # Option 1: subscript access
    href = a['href']
    print(href)
    # Option 2: through the attrs dictionary
    href = a.attrs['href']
    print(href)
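Both access styles above, plus Tag.get(...), which returns None instead of raising KeyError when the attribute is missing; the fragment is made up for illustration:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/jobs">jobs</a><a>no link</a>', "html.parser")
first, second = soup.find_all("a")

print(first["href"])        # '/jobs'  (subscript style; raises KeyError if absent)
print(first.attrs["href"])  # '/jobs'  (via the attrs dict)
print(second.get("href"))   # None     (missing attribute, no exception)
```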
# 6. Get all job postings (plain text)
trs = soup.find_all('tr')[1:]
movies = []
for tr in trs:
    movie = {}
    # Option 1:
    # tds = tr.find_all('td')
    # title = tds[0].string
    # category = tds[1].string
    # nums = tds[2].string
    # city = tds[3].string
    # pubtime = tds[4].string
    # movie['title'] = title
    # movie['category'] = category
    # movie['nums'] = nums
    # movie['city'] = city
    # movie['pubtime'] = pubtime
    # movies.append(movie)
    # Option 2:
    # infos = tr.strings  # every non-tag string; returns a generator
    # infos = list(infos)  # convert to a list
    infos = list(tr.stripped_strings)  # like .strings, but whitespace-only strings are dropped
    movie['title'] = infos[0]
    movie['category'] = infos[1]
    movie['nums'] = infos[2]
    movie['city'] = infos[3]
    movie['pubtime'] = infos[4]
    movies.append(movie)
print(movies)
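Since tencent.html is not included here, the same loop can be exercised on a minimal made-up table of the shape the code expects (a header row followed by one row per posting):

```python
from bs4 import BeautifulSoup

# Made-up data in the expected column order.
table = """
<table>
  <tr><th>title</th><th>category</th><th>nums</th><th>city</th><th>pubtime</th></tr>
  <tr><td>Engineer</td><td>Tech</td><td>2</td><td>Shenzhen</td><td>2020-09-07</td></tr>
</table>
"""
soup = BeautifulSoup(table, "html.parser")

jobs = []
for tr in soup.find_all("tr")[1:]:     # skip the header row
    infos = list(tr.stripped_strings)  # text nodes, whitespace-only ones dropped
    jobs.append(dict(zip(["title", "category", "nums", "city", "pubtime"], infos)))
print(jobs)
```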
BeautifulSoup summary:
4-BeautifulSoup odds and ends
1) The Comment type:
html = """<p><!--I am a comment string--></p>"""
# The <p> tag contains nothing but a comment, so p.string is that comment node
from bs4 import BeautifulSoup
# from bs4.element import Tag
# from bs4.element import NavigableString
soup = BeautifulSoup(html, 'lxml')
p = soup.find('p')
print(type(p))
print(type(p.string))
Result:
<class 'bs4.element.Tag'>
<class 'bs4.element.Comment'>
2) contents and children:
Both walk a tag's direct children: .contents returns them as a list, while .children returns the same nodes as an iterator.
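A short sketch of the difference:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")
div = soup.find("div")

print(type(div.contents))                  # <class 'list'>: already materialised
print(list(div.children) == div.contents)  # True: same direct-child nodes
```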
3) Note: string vs. contents
1. For data on a single line:
<p>string</p>
--- p = soup.find('p')
Using print(p.string) yields "string".
2. For data spread over three lines:
<p>
string
</p>
- print(p.string) may return None: when the tag ends up with more than one child node (surrounding \n whitespace nodes count), .string cannot pick one.
- p.contents shows every child node, e.g. ['\n', 'string', '\n'].
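A sketch of the rule, using a child tag to force multiple child nodes (some parsers merge adjacent plain text, so a nested tag is the reliable way to trigger the None case):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>single</p><p><b>bold</b> tail</p></div>", "html.parser")
single, multi = soup.find_all("p")

print(single.string)   # 'single': exactly one child node, so .string resolves
print(multi.string)    # None: <b>bold</b> plus ' tail' is two child nodes
print(multi.contents)  # [<b>bold</b>, ' tail']
```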