HTML Parsing with BeautifulSoup

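BeautifulSoup4, bs4 for short, is a third-party library that anyone writing web crawlers should learn. It is an HTML/XML parser used mainly to parse HTML/XML and extract data from it. Beautiful Soup is based on the HTML DOM: it loads the whole document and parses the entire DOM tree, whereas lxml (which parses page data using XPath syntax) traverses locally. Beautiful Soup's time and memory overhead is therefore considerably larger, and its performance is lower than lxml's.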

Official documentation: https://beautifulsoup.readthedocs.io/zh_CN/latest/

Install with pip:

pip install bs4

or

pip install beautifulsoup4

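Basic BeautifulSoup usage:

import requests  # assumption: res below is a requests response
from bs4 import BeautifulSoup

# General form: BeautifulSoup(markup_to_parse, "parser name")
# Load a page fetched from the internet into a BeautifulSoup object
res = requests.get("https://example.com")  # placeholder URL
soup = BeautifulSoup(res.text, "lxml")
# Load a local HTML file into a BeautifulSoup object
with open('./demo.html', 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, "lxml")
# Use the object's attributes and methods to locate tags and extract data
print(soup.p)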

Common parsers

lxml is the recommended parser because it is faster.

It has to be installed separately:

pip install lxml
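
Because lxml is a separate install, a script may end up running on a machine that does not have it. The sketch below shows one possible way to fall back to Python's built-in "html.parser". The helper name make_soup is hypothetical and not part of bs4; FeatureNotFound is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when available, else with the stdlib parser.

    make_soup is a hypothetical helper used only for this illustration.
    """
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # bs4 raises this when the requested parser is not installed
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello <b>parser</b></p>")
print(soup.b.string)  # parser
```

The two parsers can build slightly different trees for incomplete markup (lxml wraps fragments in html and body tags), so it is safer to settle on one parser per project.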

常用的屬性和方法

First, define an HTML document string:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
    <body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister1" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

If you do not know the format and structure of the HTML document, use prettify() to print it as an indented tree:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "lxml")
print(soup.prettify())

Accessing tags

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "lxml")
print(type(soup.p))
print(soup.p) # find the p tag; returns the first one
print(soup.title) # find the title tag
print(soup.title.name) # print the name of the title tag
print(soup.body.a) # the first a tag under the body tag
print(soup.a.parent.name) # name of the a tag's parent node
# Output:
# <class 'bs4.element.Tag'>
# <p class="title"><b>The Dormouse's story</b></p>
# <title>The Dormouse's story</title>
# title
# <a class="sister1" href="http://example.com/elsie" id="link1">Elsie</a>
# p
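
Attribute-style access only ever returns the first matching tag, and it returns None when the tag does not exist. The following is a minimal sketch of guarding against that case. The markup is made up for the illustration, and "html.parser" is used only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; html.parser avoids the lxml dependency
soup = BeautifulSoup("<div><p>first</p><p>second</p></div>", "html.parser")

print(soup.p)      # <p>first</p>  (only the first match)
print(soup.table)  # None, because there is no <table> tag

# Guard before chaining: soup.table.name would raise AttributeError
table = soup.table
table_name = table.name if table is not None else "no table found"
print(table_name)  # no table found
```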

Getting tag attribute values

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "lxml")
print(soup.p['class']) # value of the class attribute of the first p tag
#  ['title']
print(soup.a.attrs) # returns a dict with all of the tag's attributes
#  {'href': 'http://example.com/elsie', 'class': ['sister1'], 'id': 'link1'}
print(soup.a.attrs['href']) # get one specific attribute
#  http://example.com/elsie
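
Indexing a tag with a missing attribute raises KeyError, while tag.get() returns None or a default. The sketch below illustrates the difference on a made-up tag. It also shows that class is multi-valued and therefore comes back as a list. "html.parser" is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Made-up tag for illustration
soup = BeautifulSoup('<a href="/a" class="btn primary" id="x">go</a>', "html.parser")
link = soup.a

print(link["class"])                  # ['btn', 'primary']
print(link["id"])                     # x
print(link.get("title"))              # None, the attribute is absent
print(link.get("title", "untitled"))  # untitled

try:
    link["title"]
except KeyError:
    # Square-bracket access fails loudly for missing attributes
    print("no title attribute")
```

In scraping code, get() tends to be the safer choice, because real pages often omit attributes on some elements.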

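Searching the document

Sometimes a tag cannot be reached directly through the document structure; in that case use the find() and find_all() methods.

find(name, attrs, ...): finds a tag and returns a bs4.element.Tag object. If there are several matches, only the first is returned; if there are none, it returns None.

find_all(name, attrs, ...): returns a list of bs4.element.Tag objects (a bs4.element.ResultSet, which is a list subclass). The result is always a list, whether or not anything was found; it is empty when nothing matches.

The most common usage is to pass in the name and attrs arguments to find the tags that meet the requirements.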

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "lxml")
print(soup.find('a')) # find the first a tag
#  <a class="sister1" href="http://example.com/elsie" id="link1">Elsie</a>

# find the a tag whose id attribute is link3
print(soup.find('a', id="link3"))
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

# class is a reserved keyword in Python, so append an underscore: class_
print(soup.find('a',class_="sister1"))
#  <a class="sister1" href="http://example.com/elsie" id="link1">Elsie</a>

#  search for all matching tags
print(soup.find_all('a')) # list of all a tags
# [<a class="sister1" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# index into the list
print(soup.find_all('a')[2])
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# search with a list of names: returns all p tags and a tags
print(soup.find_all(['p','a']))
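
The calls above cover name, id and class_. As a further sketch, the snippet below uses the attrs dictionary (useful for attribute names such as data-id that are not valid keyword arguments), the limit argument, and the empty results returned when nothing matches. The markup is invented for the illustration, and "html.parser" is used only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration
html = """
<ul>
  <li class="item" data-id="1"><a href="/one">One</a></li>
  <li class="item hot" data-id="2"><a href="/two">Two</a></li>
  <li class="other" data-id="3">Three</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# attrs takes a dict of attribute names and values
hot = soup.find("li", attrs={"data-id": "2"})
print(hot.a["href"])  # /two

# class_ matches any one class of a multi-class tag
items = soup.find_all("li", class_="item")
print(len(items))  # 2

# limit stops the search after the given number of matches
first_only = soup.find_all("li", limit=1)
print(len(first_only))  # 1

# find() returns None and find_all() an empty list when nothing matches
print(soup.find("table"))      # None
print(soup.find_all("table"))  # []

# A typical extraction loop over the result list
hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)  # ['/one', '/two']
```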

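Getting tag text content

soup.tag.text / get_text(): gets all the text inside the tag, including the text of its child nodes.

soup.tag.string: gets the text of the tag directly, but only when the tag has a single child, that is, when it is the innermost tag or wraps exactly one such tag; otherwise it returns None.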

print(soup.a.text)
print(soup.a.string)
print(soup.a.get_text())
# Elsie
# Elsie
# Elsie

print(soup.find('p',class_="story").text)
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.

print(soup.find_all('a')[0].text)
# Elsie

print(soup.find('p').contents[0].text)
# The Dormouse's story
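
The difference between .string and get_text() shows up as soon as a tag mixes text with child tags. The sketch below uses a made-up paragraph to show this, along with the separator and strip arguments of get_text(). "html.parser" is used only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up paragraph mixing text and a child tag
soup = BeautifulSoup("<p> Hello <b>big</b> world </p>", "html.parser")
p = soup.p

print(p.string)                     # None: <p> has several children
print(p.b.string)                   # big
print(repr(p.get_text()))           # ' Hello big world '
print(p.get_text(strip=True))       # Hellobigworld
print(p.get_text(" ", strip=True))  # Hello big world
print(list(p.stripped_strings))     # ['Hello', 'big', 'world']
```

Passing a separator together with strip=True is usually what is wanted for display text, since strip=True alone joins the pieces with nothing between them.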
