Parsing Web Pages in Python with the bs4 Module

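I. Introduction to bs4

Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you want to scrape; because it is simple, a complete application takes very little code.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it. In that case you only need to state the original encoding.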
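The encoding behavior described above can be sketched as follows. This is a minimal illustration, not taken from the original article: the GBK sample bytes are made up, and html.parser is used only so that the sketch needs no extra install. The from_encoding argument is how the original encoding is stated.

```python
from bs4 import BeautifulSoup

# Bytes encoded as GBK, with no <meta charset> declaration in the markup,
# so nothing in the document itself says how to decode them.
raw = "<p>中文</p>".encode("gbk")

# from_encoding states the original encoding explicitly.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string          # already decoded to a Unicode str
utf8_bytes = soup.p.encode()  # encode() defaults to UTF-8 on output

print(text)
print(utf8_bytes == "<p>中文</p>".encode("utf-8"))
```

If the declared or detected encoding is already correct, from_encoding can simply be left out.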

Official documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

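II. Parsers supported by bs4

1. Python standard library

Usage:

BeautifulSoup(markup, "html.parser")

Advantages:

Built into Python, nothing extra to install

Moderate speed

Lenient with malformed documents

Disadvantages:

Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2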

2. lxml HTML parser

Usage:

BeautifulSoup(markup, "lxml")

Advantages:

Fast

Lenient with malformed documents

Disadvantages:

Requires an external C library (lxml) to be installed

3. lxml XML parser

Usage:

BeautifulSoup(markup, "lxml-xml")

BeautifulSoup(markup, "xml")

Advantages:

Fast

The only XML parser currently supported

Disadvantages:

Requires an external C library (lxml) to be installed

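4. html5lib

Usage:

BeautifulSoup(markup, "html5lib")

Advantages:

The most lenient parser

Parses the document the same way a web browser does

Produces valid HTML5

Disadvantages:

Slow

Depends on an external Python package (html5lib)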

lxml is the recommended parser because it is faster, but it has to be installed separately.

Install it with:

pip install lxml
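The difference in leniency between parsers shows up most clearly on invalid markup. The sketch below is an illustration under assumptions: the broken snippet is made up, and lxml and html5lib are treated as optional, so a missing parser is reported rather than raising an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # invalid: a closing </p> with no opening <p>

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # lxml and html5lib are optional third-party packages
        results[parser] = None

for parser, tree in results.items():
    print(parser, "->", tree if tree is not None else "not installed")
```

Because each parser repairs the markup differently, it is worth naming the parser explicitly whenever the same code has to behave identically on several machines.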

III. Usage

1. To simply parse an HTML document, pass the document to the BeautifulSoup constructor. If no parser is named, Beautiful Soup picks one automatically from those installed (and warns that it had to guess). For example:

html = """
<html>
<head><title class='title'>story12345</title></head>
<body>
<p class="title" name="dromouse">The Dormouse's story</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><span>westos</span><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister1" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>

<input type="text">
<input type="password">
"""
from bs4 import BeautifulSoup
# A string or an open file handle can be passed in
soup = BeautifulSoup(open("filename"))  # "filename" is a placeholder for a real path
soup = BeautifulSoup(html)  # parse the html string defined above

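2. The features argument (the second positional argument) selects which parser is used for the document.

from bs4 import BeautifulSoup
# A string or an open file handle can be passed in
soup = BeautifulSoup(open("filename"), 'lxml')  # "filename" is a placeholder for a real path
soup = BeautifulSoup(html, 'lxml')  # parse the html string defined above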
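When parsing from a file, a with block makes sure the file handle is closed after parsing. The following is a minimal sketch, assuming a throwaway temporary file stands in for a real page on disk; html.parser is used only to avoid an extra dependency.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Write a small page to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("<html><head><title>story12345</title></head><body></body></html>")
    path = tmp.name

try:
    # The with block closes the file handle once parsing is finished.
    with open(path, encoding="utf-8") as fh:
        soup = BeautifulSoup(fh, "html.parser")
finally:
    os.remove(path)

title = soup.title.string
print(title)
```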

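IV. The four object types in bs4

Beautiful Soup turns a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.

1. The BeautifulSoup object

A BeautifulSoup object represents the document as a whole. The objects created in the usage section above are BeautifulSoup objects.

2. The Tag object

A Tag object, obtained from a BeautifulSoup object, corresponds to a tag in the HTML/XML document.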

soup = BeautifulSoup(html, 'lxml')
tag = soup.a
print(type(tag))
#<class 'bs4.element.Tag'>

3. The NavigableString class

Strings usually sit inside tags. Beautiful Soup wraps the string inside a tag in the NavigableString class:

tag = soup.p
print(tag.string)
# The Dormouse's story
print(type(tag.string))
# <class 'bs4.element.NavigableString'>

4. The Comment class

The Comment class wraps HTML comments. It is a special kind of NavigableString.

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'lxml')
comment = soup.b.string
print(type(comment))
# <class 'bs4.element.Comment'>
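In practice the four types are usually told apart with isinstance checks. The sketch below is an illustration, not part of the original article: it uses a shortened copy of the first link from the sample document and html.parser to stay dependency-free.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

html = '<a id="link1"><span>westos</span><!-- Elsie --></a>'
soup = BeautifulSoup(html, "html.parser")

# The <a> tag has two children: a Tag and a Comment.
kinds = [type(child).__name__ for child in soup.a.children]
print(kinds)

# Comment is a subclass of NavigableString.
comment = soup.a.contents[1]
is_string_subclass = isinstance(comment, Comment) and isinstance(comment, NavigableString)

# Keep only real tags, skipping comments and plain strings.
tags_only = [child.name for child in soup.a.children if isinstance(child, Tag)]
print(tags_only)
```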

V. Traversing the document tree

Common Tag attributes

1. soup.<tag name> returns the tag you ask for, but only the first one that matches.

soup = BeautifulSoup(html, 'lxml')
print(soup.a)
#<a class="sister" href="http://example.com/elsie" id="link1"><span>westos</span><!-- Elsie --></a>

2. The .name attribute gives the name of a Tag object

soup = BeautifulSoup(html, 'lxml')
tag = soup.a
print(tag.name)
#a

3. The .attrs attribute gives the attributes of a tag

soup = BeautifulSoup(html, 'lxml')
print(soup.a.attrs)
# returns a dict
#{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
print(soup.a.attrs['href'])
# a single attribute is looked up like a dict key
#http://example.com/elsie

Common Tag methods

1. get() returns the value of one of the tag's attributes

print(soup.a.get('href'))
#http://example.com/elsie
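The practical difference between get() and square-bracket access is what happens when the attribute is missing. The sketch below is illustrative: the one-line markup and the "title" attribute name are made up for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister big" id="link1">x</a>',
    "html.parser",
)
tag = soup.a

missing = tag.get("title")          # no such attribute: None
fallback = tag.get("title", "n/a")  # or a default of your choice
classes = tag["class"]              # class is multi-valued, so this is a list

try:
    tag["title"]                    # square brackets raise on a missing key
    raised = False
except KeyError:
    raised = True

print(missing, fallback, classes, raised)
```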

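2. The .string attribute returns the text inside a tag. It only returns a value when the tag has exactly one child: either a single string, or a single child tag that itself has a .string. Otherwise it returns None.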

print(soup.p.string)
# The Dormouse's story
print(soup.a.string)
# None
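The three cases for .string can be checked side by side. This is a small sketch with made-up markup and ids, using html.parser to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<p id='one'>text</p>"
    "<p id='nested'><b>text</b></p>"
    "<p id='mixed'>one<b>two</b></p>",
    "html.parser",
)

one = soup.find(id="one").string        # single string child
nested = soup.find(id="nested").string  # single child tag with its own .string
mixed = soup.find(id="mixed").string    # two children, so no single string

print(repr(one), repr(nested), repr(mixed))
```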

3. get_text() returns all the text inside a tag, including the text of its descendants (comments are not included)

print(soup.a.get_text())
# westos
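get_text() also accepts a separator and a strip flag, which helps when the markup contains layout whitespace. The sketch below uses made-up markup and is only an illustration of those two arguments and of the related stripped_strings generator.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div>\n  <p> One </p>\n  <p>Two</p>\n</div>", "html.parser")
div = soup.div

raw = div.get_text()                    # keeps every newline and space
joined = div.get_text("|", strip=True)  # strip each piece, join with "|"
pieces = list(div.stripped_strings)     # the same pieces as a list

print(repr(raw))
print(joined)
print(pieces)
```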

4. Modifying an attribute value

print(soup.a.get('href'))
#http://example.com/elsie

soup.a['href'] = 'http://www.baidu.com'
print(soup.a.get('href'))
print(soup.a)
#http://www.baidu.com
#<a class="sister" href="http://www.baidu.com" id="link1"><span>westos</span><!-- Elsie --></a>
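Attributes can be added and removed with the same dict-style syntax. The following is a minimal sketch on a simplified copy of the first link; the target attribute is added purely for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>',
    "html.parser",
)
tag = soup.a

tag["target"] = "_blank"  # assigning to a new key adds the attribute
del tag["class"]          # del removes an attribute

attrs = dict(tag.attrs)
print(attrs)
```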

VI. Searching the document tree

1. The find_all() method

find_all() looks through all descendants of the current tag, tests each one against the given filters, and returns every match as a list.

aTagObj =  soup.find_all('a')
print(aTagObj)
# [<a class="sister" href="http://example.com/elsie" id="link1"><span>westos</span><!-- Elsie --></a>, <a class="sister1" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

for item in aTagObj:
    print(item)
# <a class="sister" href="http://example.com/elsie" id="link1"><span>westos</span><!-- Elsie --></a>
# <a class="sister1" href="http://example.com/lacie" id="link2">Lacie</a>
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
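Beyond a bare tag name, find_all() accepts several kinds of filters. The sketch below shows a few common ones on a trimmed copy of the sample links. It is an illustration rather than a complete list, and html.parser is used only to avoid an extra dependency.

```python
import re

from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister1" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# class is a Python keyword, so the keyword argument is spelled class_
by_class = [a["id"] for a in soup.find_all("a", class_="sister")]
# any attribute can be used as a keyword filter
by_id = [a["id"] for a in soup.find_all(id="link2")]
# a compiled regular expression is matched against the attribute value
by_regex = [a["id"] for a in soup.find_all(href=re.compile("lacie|tillie"))]
# limit stops the search after the given number of matches
first_two = [a["id"] for a in soup.find_all("a", limit=2)]
# attrs takes a dict, useful for names that are not valid keyword arguments
by_attrs = [a["id"] for a in soup.find_all("a", attrs={"class": "sister1"})]

print(by_class, by_id, by_regex, first_two, by_attrs)
```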

2. The find() method

find() takes the same arguments as find_all(); the difference is that find() returns only the first matching tag.
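One consequence of this difference is the return value when nothing matches: find() gives None, while find_all() gives an empty list. The sketch below uses made-up markup and shows a guard before using the result.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title">The Dormouse\'s story</p>', "html.parser")

first_p = soup.find("p")           # first match, a Tag
no_table = soup.find("table")      # no match: None
no_tables = soup.find_all("table") # no match: empty list

# Guard against None before calling methods on the result.
table_text = no_table.get_text() if no_table is not None else ""

print(first_p.get_text(), no_table, len(no_tables), repr(table_text))
```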

3. CSS selectors

In CSS, a tag name is written bare, a class name is prefixed with a dot (.), and an id is prefixed with #.

The same syntax can be used here to filter elements through soup.select(), which returns a list.

Tag selector

print(soup.select("title"))
# [<title class="title">story12345</title>]

Class selector

print(soup.select(".sister"))
# [<a class="sister" href="http://example.com/elsie" id="link1"><span>westos</span><!-- Elsie --></a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

ID selector

print(soup.select("#link1"))
# [<a class="sister" href="http://example.com/elsie" id="link1"><span>westos</span><!-- Elsie --></a>]

Attribute selector

print(soup.select("input[type='password']"))
# [<input type="password"/>]
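Selectors can also be combined, and select_one() returns just the first match. The sketch below runs on a trimmed copy of the sample document. It assumes a Beautiful Soup release in which select() is backed by the soupsieve package, which is normally installed together with bs4.

```python
from bs4 import BeautifulSoup

html = """
<p class="title">The Dormouse's story</p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister1" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# child combinator: <a> tags directly inside <p class="story">
children = [a["id"] for a in soup.select("p.story > a")]
# attribute suffix match: href ends with "tillie"
suffix = [a["id"] for a in soup.select('a[href$="tillie"]')]
# selector list: either condition may match
combined = [a["id"] for a in soup.select("a.sister, #link2")]
# select_one returns the first match, or None when nothing matches
one = soup.select_one("#link2")
none = soup.select_one("#missing")

print(children, suffix, combined, one.get_text(), none)
```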
