Python Web Scraping Notes 7: Using Beautiful Soup

Introduction

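Beautiful Soup is also an HTML/XML parser; its main job is parsing HTML/XML documents and extracting data from them.

Beautiful Soup works on the HTML DOM: it loads the whole document and parses the entire DOM tree.
Beautiful Soup makes parsing HTML fairly simple. Its API is very friendly, and it supports CSS selectors as well as several parsers, including the HTML parser in the Python standard library and lxml.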
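As a small illustration of the parser choice mentioned above, here is a minimal sketch that uses Python's built-in html.parser. The markup string and variable names are made up for this example; 'lxml' could be passed instead if it is installed.

```python
from bs4 import BeautifulSoup

# A made-up fragment, used only for illustration.
snippet = "<div><p class='intro'>Hello, <b>soup</b></p></div>"

# "html.parser" ships with Python, so no extra install is needed.
doc = BeautifulSoup(snippet, "html.parser")

first_p = doc.p                  # attribute access returns the first <p>
print(first_p.get_text())        # text of the tag and all its descendants
print(doc.select("p.intro b"))   # a CSS selector lookup returns a list
```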

Basic usage:

# -*- coding: UTF-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a Beautiful Soup object
soup = BeautifulSoup(html, 'lxml')

# Alternatively, create the object from a local HTML file
# soup = BeautifulSoup(open('index.html'), 'lxml')

# Pretty-print the contents of the soup object
print(soup.prettify())

The four kinds of objects

Beautiful Soup turns a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds:

Tag
NavigableString
BeautifulSoup
Comment

Tag

Put simply, a Tag is one of the tags in the HTML. It has two important attributes: name and attrs.

print(soup.name)
# [document]  (the soup object itself is special: its name is [document])

print(soup.head.name)
# head  (for any other tag, the value is the name of the tag itself)

print(soup.p.attrs)
# {'class': ['title'], 'name': 'dromouse'}
# This prints all attributes of the p tag; the result is a dictionary.

print(soup.p['class'])  # same as soup.p.get('class') when the attribute exists
# ['title']  (get() takes the attribute name and returns the same value)

soup.p['class'] = "newClass"
print(soup.p)  # attributes and content can be modified
# <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>

del soup.p['class']  # an attribute can also be deleted
print(soup.p)
# <p name="dromouse"><b>The Dormouse's story</b></p>

name and attrs only describe the tag itself; they do not give you the text the tag contains.
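The difference between indexing a tag and calling get() shows up when an attribute is missing. A minimal sketch, assuming a made-up one-line document parsed with html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
tag_doc = BeautifulSoup('<p class="title big" id="intro">Hi</p>', "html.parser")
p = tag_doc.p

print(p.name)            # p
print(p["class"])        # ['title', 'big']  (class is multi-valued, so a list)
print(p["id"])           # intro
print(p.get("lang"))     # None: get() does not raise for a missing attribute
print(p.has_attr("id"))  # True

try:
    p["lang"]            # indexing a missing attribute raises KeyError
except KeyError:
    print("no lang attribute")
```

When an attribute may be absent, get() avoids the need for a try/except.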

NavigableString

A NavigableString holds the text inside a tag; use .string to get it.

print(soup.p.string)
# The Dormouse's story

print(type(soup.p.string))
# <class 'bs4.element.NavigableString'>
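One caveat is that .string only returns a value when the tag has a single child. The sketch below, on a made-up fragment, compares it with get_text() and .strings:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <p> has two children, a <b> tag and a string.
s_doc = BeautifulSoup("<p><b>bold</b> and plain</p>", "html.parser")

print(s_doc.b.string)         # bold
print(s_doc.p.string)         # None, because <p> has more than one child
print(s_doc.p.get_text())     # bold and plain
print(list(s_doc.p.strings))  # ['bold', ' and plain']
```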

BeautifulSoup

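A BeautifulSoup object represents the document as a whole. Most of the time it can be treated as a Tag object; it is a special Tag, and we can look at its name, the type of that name, and its attributes: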

print(type(soup.name))
# <class 'str'>

print(soup.name)
# [document]

print(soup.attrs)  # the document itself has no attributes
# {}

Comment

A Comment is a special type of NavigableString; when printed, its content does not include the comment markers.
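Because a Comment prints like ordinary text, it can be worth checking the type before using .string. A minimal sketch, using a shortened, made-up version of the sister links:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup: the first link holds a comment, the second plain text.
c_doc = BeautifulSoup(
    '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>', "html.parser"
)

for a in c_doc.find_all("a"):
    kind = "comment" if isinstance(a.string, Comment) else "text"
    # repr() makes the spaces around " Elsie " visible
    print(a["id"], kind, repr(str(a.string)))
```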

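Navigating the document tree

The attribute-style access used so far, such as soup.p, returns as soon as the first matching tag is found. The properties below walk the tree instead.

Child nodes (.contents, .children)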

# 1. .contents returns the tag's direct children as a list
print(soup.head.contents)
# [<title>The Dormouse's story</title>]

# 2. Use a list index to get one of the elements
print(soup.head.contents[0])
# <title>The Dormouse's story</title>

# 3. .children is an iterator over the same direct children, not a list
print(soup.head.children)
# prints an iterator object, e.g. <list_iterator object at 0x7f71457f5710>
for child in soup.body.children:
    print(child)
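A loop over .children usually yields whitespace strings between the tags as well as the tags themselves. One way to keep only the tags, sketched on a made-up list fragment:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment with newlines between the <li> tags.
w_doc = BeautifulSoup("<ul>\n<li>a</li>\n<li>b</li>\n</ul>", "html.parser")

all_children = list(w_doc.ul.children)
tag_children = [c for c in all_children if isinstance(c, Tag)]

print(len(all_children))               # 5: three "\n" strings and two <li> tags
print([c.name for c in tag_children])  # ['li', 'li']
```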

All descendants (.descendants)

.contents and .children contain only a tag's direct children. .descendants iterates recursively over all of the tag's descendants.

for child in soup.descendants:
    print(child)
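To make the difference concrete, the sketch below counts direct children and descendants of the same tag. The fragment is made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <div> has two <p> children, one with a nested <b>.
t_doc = BeautifulSoup(
    "<div><p>one <b>two</b></p><p>three</p></div>", "html.parser"
)
div = t_doc.div

direct = list(div.children)         # only the two <p> tags
everything = list(div.descendants)  # tags and strings at every depth

print(len(div.contents), len(direct), len(everything))  # 2 2 6
for node in everything:
    print(type(node).__name__, repr(str(node)))
```

Note that .descendants yields strings as well as tags, which is why the count is 6 rather than 3.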

Searching the document tree

find_all

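find_all(name, attrs, recursive, string, limit, **kwargs)

Searches all descendants of the current tag and returns those that match the filter conditions.

name: finds all tags whose name is name.

attrs: a dictionary used to search for tags that have particular attributes.

recursive: by default Beautiful Soup examines all descendants of the current tag; pass recursive=False to search only its direct children.

string: searches the strings in the document instead of the tags (older versions call this argument text).

limit: limits the number of results returned.

kwargs: any other keyword argument is treated as a filter on the attribute of that name; for example, class_="title" finds every tag whose class attribute is title. These filters accept strings and regular expressions, among others.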

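# 1. Find tags named 'b'
soup.find_all('b')
# [<b>The Dormouse's story</b>]

# 2. Find by regular expression (tags whose name starts with "b")
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

# 3. Pass a list to find both a tags and b tags
soup.find_all(["a", "b"])

# 4. Find by string content (exact match; "Elsie" would not match here
#    because it only appears inside a comment, as " Elsie ")
soup.find_all(string="Lacie")
# ['Lacie']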
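The remaining parameters (attrs, recursive, limit, and keyword filters) can be tried on a small document. This is a sketch under assumptions: the markup and the id "box" are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: link2 is nested one level deeper than link1 and link3.
f_doc = BeautifulSoup(
    '<div id="box"><a class="sister" id="link1">A</a>'
    '<p><a class="sister" id="link2">B</a></p>'
    '<a class="other" id="link3">C</a></div>',
    "html.parser",
)
box = f_doc.find(id="box")

print(len(box.find_all("a")))                        # 3: all descendants
print(len(box.find_all("a", recursive=False)))       # 2: direct children only
print(box.find_all("a", limit=1))                    # stops after the first match
print(box.find_all("a", attrs={"class": "sister"}))  # filter with a dictionary
print(box.find_all("a", class_="sister"))            # same filter as a keyword
print(box.find_all(id="link3"))                      # keyword filter on id
```

class is a reserved word in Python, which is why the keyword form is spelled class_.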

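Searching with CSS selectors

When writing CSS:

Tag name: written with no prefix

Class name: prefixed with .

id: prefixed with #

bs4 offers a lookup method built on the same rules: soup.select().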

soup.select()

Search by tag name

print(soup.select('title'))
# [<title>The Dormouse's story</title>]

Search by class name

print(soup.select('.sister'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Search by id

print(soup.select('#link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

Combined search

The rules are the same as when writing a CSS file: tag names, class names, and ids can be combined. To select a descendant, separate the parts with a space.

print(soup.select('p #link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

Search by attribute

Attribute conditions go in square brackets. Note that the attribute belongs to the same node as the tag, so there must be no space between them; otherwise nothing will match.

print(soup.select('a[class="sister"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
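A few other selector forms may also be useful. The sketch below is not from the original notes: it tries the child combinator, a prefix match on an attribute, and select_one(), which returns a single element or None. The markup is a shortened, made-up version of the story paragraph.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
css_doc = BeautifulSoup(
    '<p class="story">'
    '<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>'
    '<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>'
    '</p>',
    "html.parser",
)

by_child = css_doc.select("p.story > a")                       # direct children of p.story
by_prefix = css_doc.select('a[href^="http://example.com/l"]')  # href starts with the value
first = css_doc.select_one("a.sister")                         # first match only
missing = css_doc.select_one("a#nope")                         # no match

print([a["id"] for a in by_child])   # ['link1', 'link2']
print([a["id"] for a in by_prefix])  # ['link2']
print(first["id"], missing)          # link1 None
```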

Getting the content

All of the select calls above return their results as a list. Iterate over the list and call get_text() on each element to get its content.

soup = BeautifulSoup(html, 'lxml')
print(type(soup.select('title')))
print(soup.select('title')[0].get_text())

for title in soup.select('title'):
    print(title.get_text())
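In scraping work the text is often wanted together with an attribute such as href. A minimal sketch that collects both, assuming a made-up fragment in which one link has no href:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the last <a> has no href attribute.
g_doc = BeautifulSoup(
    '<p><a href="http://example.com/lacie"> Lacie </a>'
    '<a href="http://example.com/tillie">Tillie</a><a>no link</a></p>',
    "html.parser",
)

# a[href] matches only <a> tags that carry an href attribute,
# and strip=True trims the whitespace around each piece of text.
links = [(a.get_text(strip=True), a["href"]) for a in g_doc.select("a[href]")]
for text, href in links:
    print(text, href)
```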
