Source: Mofan's web scraping tutorial

https://mofanpy.com/tutorials/data-manipulation/scraping/understand-website/

Web scraping workflow

Choose the URL to scrape
Open that URL with Python (urlopen, etc.)
Read the page content (with read())
Pass the content to BeautifulSoup
Use BeautifulSoup to select tags and their content (instead of regular expressions)
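The steps above can be sketched end to end as follows. This is a minimal sketch, not the tutorial's own code: `page_title` is a hypothetical helper, the sample markup is made up, and a `data:` URL stands in for a real page so the example runs offline. It assumes the page is UTF-8 encoded and uses the standard-library `html.parser` backend.

```python
from urllib.parse import quote
from urllib.request import urlopen

from bs4 import BeautifulSoup


def page_title(url):
    """Hypothetical helper: fetch a page and return the text of its <title>."""
    raw = urlopen(url).read()                  # open the URL and read the raw bytes
    html = raw.decode("utf-8")                 # assumption: the page is UTF-8
    soup = BeautifulSoup(html, "html.parser")  # hand the markup to BeautifulSoup
    return soup.title.get_text()               # select a tag and take its text


# Made-up markup served through a data: URL, so no network access is needed.
sample = "<html><head><title>Scraping demo</title></head><body></body></html>"
demo_url = "data:text/html;charset=utf-8," + quote(sample)

print(page_title(demo_url))
```

Passing a real address such as the tutorial page to `page_title` should work the same way, as long as the page is UTF-8 and has a `<title>` tag.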

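Use Python to open this page and print its HTML source code. Note that the page contains Chinese: to display it properly, after calling read() we need to convert the result with decode() into a form that shows Chinese correctly.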

from urllib.request import urlopen

# read() returns bytes; decode('utf-8') turns them into str so Chinese text displays correctly
html = urlopen(
    "https://mofanpy.com/static/scraping/basic-structure.html"
).read().decode('utf-8')
print(html)
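To see why decode() matters, here is a small offline sketch. The `raw` value is a made-up stand-in for what read() returns; no request is made.

```python
# Simulate what read() returns: a bytes object holding UTF-8 encoded HTML.
raw = "<title>莫煩Python</title>".encode("utf-8")
print(type(raw))                 # a bytes object, not str

# decode() turns the bytes back into readable text.
text = raw.decode("utf-8")
print(text)

# Decoding with the wrong codec does not raise here, but the Chinese comes out garbled.
garbled = raw.decode("latin-1")
print(garbled == text)
```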

Matching page content

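So here we use Python regular expressions (RegEx) to match text and filter information. I have a pretty good tutorial on regular expressions. For basic page matching, regular expressions are perfectly adequate; for more advanced or more tedious matching, I still recommend BeautifulSoup. No rush, I know you want to take the easy route, and I will teach Beautiful Soup right after this. For now, let's work through a few simple examples with regular expressions so you get a feel for the pattern.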

If we want to find this page's title in code, we can write it like this: pick the tag name to use, <title>, and match it with a regular expression.

import re
res = re.findall(r"<title>(.+?)</title>", html)
print("\nPage title is: ", res[0])

# Page title is:  Scraping tutorial 1 | 莫煩Python
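The `?` in `(.+?)` makes the match lazy, which matters as soon as a tag appears more than once. A minimal sketch on a made-up string (not the tutorial page):

```python
import re

# Made-up markup with two <title> tags on one line.
sample = "<title>First</title><title>Second</title>"

# Lazy: stop at the nearest closing tag.
lazy = re.findall(r"<title>(.+?)</title>", sample)
# Greedy: run on to the last closing tag.
greedy = re.findall(r"<title>(.+)</title>", sample)

print(lazy)    # ['First', 'Second']
print(greedy)  # ['First</title><title>Second']
```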

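To find the paragraph in the middle, <p>, we use the approach below. Because this paragraph spans several lines in the HTML (it contains tabs and newlines), we pass flags=re.DOTALL so that . also matches newline characters.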

res = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)    # re.DOTALL if multi line
print("\nPage paragraph is: ", res[0])

# Page paragraph is:
#  這是一個在 <a href="https://mofanpy.com/">莫煩Python</a>
#  <a href="https://mofanpy.com/tutorials/scraping">爬蟲教程</a> 中的簡單測試.
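The effect of re.DOTALL is easiest to see by running the same pattern with and without it. A minimal sketch on a made-up multi-line paragraph:

```python
import re

# Made-up paragraph that spans several lines, with tabs and newlines.
sample = "<p>\n\tline one\n\tline two\n</p>"

# Without DOTALL, "." stops at a newline, so nothing matches.
without_flag = re.findall(r"<p>(.*?)</p>", sample)
# With DOTALL, "." also matches newlines, so the whole paragraph is captured.
with_flag = re.findall(r"<p>(.*?)</p>", sample, flags=re.DOTALL)

print(without_flag)          # []
print(with_flag[0].strip())
```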

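The last exercise is to find all the links. This one is quite useful: sometimes you want to find the links on a page and then download some content to your computer, and this is how you do it.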

res = re.findall(r'href="(.*?)"', html)
print("\nAll links: ", res)
# All links:
# ['https://mofanpy.com/static/img/description/tab_icon.png',
# 'https://mofanpy.com/',
# 'https://mofanpy.com/tutorials/scraping']
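Links on real pages are often relative, so before downloading anything it helps to resolve them against the page address. A minimal sketch using `urllib.parse.urljoin`; the sample markup and its relative links are made up for illustration.

```python
import re
from urllib.parse import urljoin

page_url = "https://mofanpy.com/static/scraping/basic-structure.html"

# Made-up markup mixing root-relative, absolute, and relative links.
sample = (
    '<a href="/tutorials/scraping">a</a> '
    '<a href="https://mofanpy.com/">b</a> '
    '<a href="img/icon.png">c</a>'
)

# Extract every href value, then resolve it against the page URL.
links = [urljoin(page_url, href) for href in re.findall(r'href="(.*?)"', sample)]

for link in links:
    print(link)
```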

Using BeautifulSoup for the tasks outlined above

Installing Beautiful Soup

If you are using a recent version of Debian or Ubuntu, you can install it through the system package manager:

$ apt-get install python3-bs4

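Beautiful Soup 4 is published on PyPI, so if you cannot install it with the system package manager, you can install it with easy_install or pip instead. The package name is beautifulsoup4, and it is compatible with both Python 2 and Python 3.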

$ easy_install beautifulsoup4

$ pip install beautifulsoup4

Installing a parser

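Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml. How you install lxml depends on your operating system; with pip, for example, it is pip install lxml.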


Pass a document to the BeautifulSoup constructor to get a document object. You can pass in a string or a file handle.

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("index.html"))

soup = BeautifulSoup("<html>data</html>")

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:

BeautifulSoup("Sacr&eacute; bleu!")
<html><head></head><body>Sacré bleu!</body></html>

Then Beautiful Soup picks the most suitable parser to parse the document. If you specify a parser manually, Beautiful Soup uses that parser instead (see "Parsing XML").
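A minimal sketch of these points, assuming the standard-library backend `html.parser` is chosen explicitly so the result does not depend on which third-party parsers are installed. `io.StringIO` stands in for an opened file, and the markup is made up.

```python
import io

from bs4 import BeautifulSoup

markup = "<p>Sacr&eacute; bleu!</p>"

# The same document passed as a string and as a file-like object.
soup_from_string = BeautifulSoup(markup, "html.parser")
soup_from_file = BeautifulSoup(io.StringIO(markup), "html.parser")

# The &eacute; entity has been converted to the Unicode character.
print(soup_from_string.p.get_text())
print(soup_from_file.p.get_text())
```

Naming the parser also keeps the output stable across machines, since different parsers can build slightly different trees from the same markup.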

Read the page as usual.

from bs4 import BeautifulSoup
from urllib.request import urlopen

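# read() returns bytes; decode('utf-8') turns them into str so Chinese text displays correctly
# assumed: list.html is the tutorial's demo page with the month/jan lists used below
# (basic-structure.html has no such lists)
html = urlopen("https://mofanpy.com/static/scraping/list.html").read().decode('utf-8')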
print(html)


Matching by class is simple. For example, to find everything with class=month and print the text inside those tags:

soup = BeautifulSoup(html, features='lxml')

# use class to narrow search
month = soup.find_all('li', {"class": "month"})
for m in month:
    print(m.get_text())

"""
一月
二月
三月
四月
五月
"""

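The same class lookup can be tried offline. This is a minimal sketch on made-up markup, using `html.parser` so lxml is not required; it also shows the `class_` keyword, which Beautiful Soup accepts as an alternative to the dictionary form.

```python
from bs4 import BeautifulSoup

# Made-up list; only two items carry class="month".
sample = """
<ul>
  <li class="month">Jan</li>
  <li class="month">Feb</li>
  <li class="day">Monday</li>
</ul>
"""

soup = BeautifulSoup(sample, "html.parser")

# Dictionary form, as in the tutorial code.
by_dict = [li.get_text() for li in soup.find_all("li", {"class": "month"})]
# Keyword form; the trailing underscore avoids Python's reserved word "class".
by_keyword = [li.get_text() for li in soup.find_all("li", class_="month")]

print(by_dict)
print(by_keyword)
```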
Or find the <ul> with class=jan, then keep searching inside that <ul> for its <li> items. Information nested layer by layer like this is very easy to find.

jan = soup.find('ul', {"class": 'jan'})
d_jan = jan.find_all('li')              # use jan as a parent
for d in d_jan:
    print(d.get_text())

"""
一月一号
一月二号
一月三号
"""

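One detail worth knowing when searching layer by layer: `find` returns `None` when nothing matches, so calling `find_all` on the result would fail. A minimal sketch on made-up markup that guards against this:

```python
from bs4 import BeautifulSoup

# Made-up markup with two lists; there is no list with class "mar".
sample = """
<ul class="jan"><li>Jan 1</li><li>Jan 2</li></ul>
<ul class="feb"><li>Feb 1</li></ul>
"""

soup = BeautifulSoup(sample, "html.parser")

# Use the matched <ul> as the parent for the second search.
jan = soup.find("ul", {"class": "jan"})
jan_days = [li.get_text() for li in jan.find_all("li")] if jan else []

# A lookup that matches nothing returns None rather than raising.
missing = soup.find("ul", {"class": "mar"})

print(jan_days)
print(missing)
```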
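If you want to find information that follows a certain format, for example using regular expressions to find similar items, you can also embed regular expressions in BeautifulSoup, which makes it even more powerful. Read on to see how.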
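As a first taste, here is a minimal sketch of passing a compiled pattern as an attribute filter. The image names are made up, and `html.parser` is assumed as the backend.

```python
import re

from bs4 import BeautifulSoup

# Made-up markup: three images, two of them JPEG files.
sample = '<img src="cat.jpg"><img src="dog.png"><img src="bird.jpg">'

soup = BeautifulSoup(sample, "html.parser")

# A compiled pattern can be used wherever an attribute value is expected.
jpg_tags = soup.find_all("img", {"src": re.compile(r"\.jpg$")})
jpgs = [img["src"] for img in jpg_tags]

print(jpgs)
```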

https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
