1-Introduction to the BeautifulSoup4 library
Note: once a document is loaded, BeautifulSoup builds the whole parse tree up front, so it carries more overhead; lxml is written in C and is faster.
2-Basic usage of the BeautifulSoup4 library
Simple usage:
from bs4 import BeautifulSoup
html = """
<a href="https://www.doutula.com/article/detail/6394359" target="_blank" rel="external nofollow" class="list-group-item random_list tg-article">
<div class="random_title">鬥圖<div class="date">2020-09-07</div>
</div>
<div class="random_article">
<div class="col-xs-6 col-sm-3">
<img referrerpolicy="no-referrer" class="lazy image_dtb img-responsive loaded"
src="http://img.doutula.com/production/uploads/image/2020/09/07/20200907463024_HBeQpW.jpg"
data-original="http://img.doutula.com/production/uploads/image/2020/09/07/20200907463024_HBeQpW.jpg"
data-backup="http://img.doutula.com/production/uploads/image/2020/09/07/20200907463024_HBeQpW.jpg"
alt="" data-was-processed="true">
<p></p>
</div>
<div class="col-xs-6 col-sm-3">
<img referrerpolicy="no-referrer" class="lazy image_dtb img-responsive loaded"
src="http://img.doutula.com/production/uploads/image/2020/09/07/20200907463024_kIOgEX.png"
data-original="http://img.doutula.com/production/uploads/image/2020/09/07/20200907463024_kIOgEX.png"
data-backup="http://img.doutula.com/production/uploads/image/2020/09/07/20200907463024_kIOgEX.png"
alt="" data-was-processed="true">
<p></p>
</div>
</div>
</a>
"""
# Create a BeautifulSoup object
# Use lxml as the parser
soup = BeautifulSoup(html, 'lxml')  # missing tags such as <html> and <body> are completed automatically
print(soup.prettify())  # pretty-print the tree with indentation
Parsers:
Note: html5lib is the most fault-tolerant parser (it automatically repairs non-standard HTML).
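The fault tolerance is easy to see with a deliberately broken fragment. A minimal sketch using the built-in html.parser (lxml or html5lib can be swapped in; html5lib repairs the most, but both are separate installs):

```python
from bs4 import BeautifulSoup

# A deliberately malformed fragment: unclosed <li> tags, no <html>/<body>.
broken = "<ul><li>first<li>second"

soup = BeautifulSoup(broken, "html.parser")
print(soup.prettify())           # the parser still produces a well-formed tree
print(len(soup.find_all("li")))  # both <li> elements are found: 2
```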
Four commonly used objects:
Beautiful Soup turns a complex HTML document into a tree structure in which every node is a Python object. All objects fall into four types:
- Tag
- NavigableString
- BeautifulSoup
- Comment
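All four types can be observed in one small parse; a sketch (html.parser is used here so no extra install is needed):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup("<p>text<!--note--></p>", "html.parser")
p = soup.find("p")

print(type(soup).__name__)                         # BeautifulSoup: the whole document
print(isinstance(p, Tag))                          # True: element nodes are Tags
print(isinstance(p.contents[0], NavigableString))  # True: the text node "text"
print(isinstance(p.contents[1], Comment))          # True: the comment node
```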
3-Extracting data with BeautifulSoup4
# Requirements:
# 1. Get all tr tags
# 2. Get the 2nd tr tag
# 3. Get all tags whose class equals "even"
# 4. Get all a tags whose id equals "test" and whose class also equals "test"
# 5. Get the href attribute of every a tag
# 6. Get all job postings (plain text)
from bs4 import BeautifulSoup
# html = open('tencent.html', encoding='utf-8').read()
soup = BeautifulSoup(html, 'lxml')
# 1. Get all tr tags
trs = soup.find_all('tr')
for tr in trs:
    print(tr)
    print('='*30)
    print(type(tr))  # bs4.element.Tag (its string conversion is overridden, so it prints as HTML)
# 2. Get the 2nd tr tag
# trs = soup.find_all('tr', limit=2)  # returns a list; limit caps how many elements are returned
tr = soup.find_all('tr', limit=2)[1]
print(tr)
# 3. Get all tags whose class equals "even"
# trs = soup.find_all('tr', class_='even')  # class is a Python keyword, so the argument is spelled class_
# Equivalent attrs form:
trs = soup.find_all('tr', attrs={'class': 'even'})
for tr in trs:
    print(tr)
    print('='*30)
# 4. Get all a tags whose id equals "test" and whose class also equals "test"
# aList = soup.find_all('a', id='test', class_='test')
aList = soup.find_all('a', attrs={'id': 'test', 'class': 'test'})
for a in aList:
    print(a)
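Requirement 4 can be checked end-to-end on a hypothetical fragment (the tags below are made up): only the first a tag satisfies both filters, and the keyword form with class_ returns the same result as the attrs form.

```python
from bs4 import BeautifulSoup

# Made-up fragment: only the first <a> carries both id="test" and class="test".
html = """
<a id="test" class="test" href="/a">both</a>
<a id="test" href="/b">id only</a>
<a class="test" href="/c">class only</a>
"""
soup = BeautifulSoup(html, "html.parser")

by_kw = soup.find_all("a", id="test", class_="test")
by_attrs = soup.find_all("a", attrs={"id": "test", "class": "test"})

print([a["href"] for a in by_kw])  # ['/a']: filters are combined with AND
print(by_kw == by_attrs)           # True: both spellings are equivalent
```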
# 5. Get the href attribute of every a tag
aList = soup.find_all('a')
for a in aList:
    # Option 1: subscript access
    href = a['href']
    print(href)
    # Option 2: through the attrs dictionary
    href = a.attrs['href']
    print(href)
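Both access styles above, plus Tag.get(...), which returns None instead of raising KeyError when the attribute is missing; the fragment is made up for illustration:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/jobs">jobs</a><a>no link</a>', "html.parser")
first, second = soup.find_all("a")

print(first["href"])        # '/jobs'  (subscript style; raises KeyError if absent)
print(first.attrs["href"])  # '/jobs'  (via the attrs dict)
print(second.get("href"))   # None     (missing attribute, no exception)
```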
# 6. Get all job postings (plain text)
trs = soup.find_all('tr')[1:]
movies = []
for tr in trs:
    movie = {}
    # Option 1:
    # tds = tr.find_all('td')
    # title = tds[0].string
    # category = tds[1].string
    # nums = tds[2].string
    # city = tds[3].string
    # pubtime = tds[4].string
    # movie['title'] = title
    # movie['category'] = category
    # movie['nums'] = nums
    # movie['city'] = city
    # movie['pubtime'] = pubtime
    # movies.append(movie)
    # Option 2:
    # infos = tr.strings  # every non-tag string; returns a generator
    # infos = list(infos)  # convert to a list
    infos = list(tr.stripped_strings)  # like .strings, but whitespace-only strings are dropped
    movie['title'] = infos[0]
    movie['category'] = infos[1]
    movie['nums'] = infos[2]
    movie['city'] = infos[3]
    movie['pubtime'] = infos[4]
    movies.append(movie)
print(movies)
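Since tencent.html is not included here, the same loop can be exercised on a minimal made-up table of the shape the code expects (a header row followed by one row per posting):

```python
from bs4 import BeautifulSoup

# Made-up data in the expected column order.
table = """
<table>
  <tr><th>title</th><th>category</th><th>nums</th><th>city</th><th>pubtime</th></tr>
  <tr><td>Engineer</td><td>Tech</td><td>2</td><td>Shenzhen</td><td>2020-09-07</td></tr>
</table>
"""
soup = BeautifulSoup(table, "html.parser")

jobs = []
for tr in soup.find_all("tr")[1:]:     # skip the header row
    infos = list(tr.stripped_strings)  # text nodes, whitespace-only ones dropped
    jobs.append(dict(zip(["title", "category", "nums", "city", "pubtime"], infos)))
print(jobs)
```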
BeautifulSoup summary:
4-BeautifulSoup odds and ends
1) The Comment type:
html = """<p><!--I am a comment string--></p>"""
# The <p> tag contains nothing but a comment, so p.string is that comment node
from bs4 import BeautifulSoup
# from bs4.element import Tag
# from bs4.element import NavigableString
soup = BeautifulSoup(html, 'lxml')
p = soup.find('p')
print(type(p))
print(type(p.string))
Result:
<class 'bs4.element.Tag'>
<class 'bs4.element.Comment'>
2) contents and children:
Both walk a tag's direct children: .contents returns them as a list, while .children returns the same nodes as an iterator.
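A short sketch of the difference:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")
div = soup.find("div")

print(type(div.contents))                  # <class 'list'>: already materialised
print(list(div.children) == div.contents)  # True: same direct-child nodes
```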
3) Note: string vs. contents
1. For data on a single line:
<p>string</p>
--- p = soup.find('p')
Using print(p.string) yields "string".
2. For data spread over three lines:
<p>
string
</p>
- print(p.string) may return None: when the tag ends up with more than one child node (surrounding \n whitespace nodes count), .string cannot pick one.
- p.contents shows every child node, e.g. ['\n', 'string', '\n'].
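A sketch of the rule, using a child tag to force multiple child nodes (some parsers merge adjacent plain text, so a nested tag is the reliable way to trigger the None case):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>single</p><p><b>bold</b> tail</p></div>", "html.parser")
single, multi = soup.find_all("p")

print(single.string)   # 'single': exactly one child node, so .string resolves
print(multi.string)    # None: <b>bold</b> plus ' tail' is two child nodes
print(multi.contents)  # [<b>bold</b>, ' tail']
```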