Python Web Scraping for Beginners (4)

Practice with the bs4 library

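Table of contents

Python Web Scraping for Beginners (4)
- I. Organizing and extracting information
  - 1. Three main forms of information organization
  - 2. General methods of information extraction
    - Example: extracting all URL links from HTML
  - 3. HTML content search methods in the bs4 library
    - The find_all method
      - The name parameter
      - The attrs parameter
      - The recursive parameter
      - The string parameter
- II. Example: scraping vegetable prices
  - 1. View the page source
  - 2. Check the status code
  - 3. Create the BeautifulSoup object
  - 4. Find the data (the price table)
  - 5. Find the specific data in the price table
  - 6. Write to a file
  - 7. Complete code
- III. Example: scraping images
  - 1. View the page source
  - 2. Check the status code and encoding
  - 3. Create the BeautifulSoup object
  - 4. Find the section on the main page
  - 5. Get the URL of the page that holds each image
  - 6. Find the image download address on each child page
  - 7. Download the images
  - 8. Complete code
- IV. Problems encountered
  - 1. writerow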

I. Organizing and extracting information

1. Three main forms of information organization

XML

JSON

YAML
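To make the difference between the three formats concrete, here is a small sketch that writes one record in each of them. The record itself is made up for illustration, and the YAML text is written by hand because parsing YAML needs a third-party package (for example PyYAML), which this sketch does not assume is installed.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record used only for this illustration
record = {
    "name": "Basic Python",
    "url": "http://www.icourse163.org/course/BIT-268001",
}

# JSON: key-value pairs, with strings in double quotes
json_text = json.dumps(record)

# XML: every piece of information is wrapped in a pair of tags
course = ET.Element("course")
for key, value in record.items():
    ET.SubElement(course, key).text = value
xml_text = ET.tostring(course, encoding="unicode")

# YAML: "key: value" lines, nesting expressed through indentation
# (hand-written here, not produced by a YAML library)
yaml_text = (
    "name: Basic Python\n"
    "url: http://www.icourse163.org/course/BIT-268001\n"
)

print(json_text)
print(xml_text)
print(yaml_text)
```

All three carry the same information; what changes is how the structure is marked, which in turn decides how you extract it later.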


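2. General methods of information extraction

Extract the content of interest from marked-up information.

Example: extracting all URL links from HTML

Method:

(1) Find all <a> tags.

(2) Parse each <a> tag and extract the link stored in its href attribute.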

import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo =r.text
soup = BeautifulSoup(demo,"html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))

Output

http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
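The same two steps can be tried without a network connection. The sketch here parses a hand-written stand-in for the demo page (an assumption: it is a trimmed copy typed in for illustration, with one extra <a> tag that has no href) and wraps the steps in a hypothetical helper called extract_links.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the demo page (assumption: not fetched from the site)
DEMO_HTML = """
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language.
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.
<a id="top">an anchor without href</a></p>
</body></html>
"""


def extract_links(html):
    """Return the href of every <a> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only the <a> tags that actually carry an href attribute,
    # so the result never contains None
    return [a["href"] for a in soup.find_all("a", href=True)]


links = extract_links(DEMO_HTML)
for link in links:
    print(link)
```

Filtering with href=True is a design choice for this sketch: link.get('href') returns None for an <a> tag without the attribute, which is easy to trip over later.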

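3. HTML content search methods in the bs4 library

The find_all method

find_all(name, attrs, recursive, string, **kwargs) returns a list of every match it finds.

The name parameter

import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(soup.find_all('a'))         # every <a> tag
print(soup.find_all(['a', 'b']))  # every <a> and <b> tag

Output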

[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

for tag in soup.find_all(True):  # True matches every tag
    print(tag.name)

Output

html
head
title
body
p
b
p
a
a

The attrs parameter

print(soup.find_all('p', 'course'))  # <p> tags whose class is "course"
print(soup.find_all(id='link1'))     # tags whose id is exactly "link1"

Output

[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]

import re
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo =r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.find_all(id=re.compile('link')))

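This uses a regular expression. bs4 applies it as a search, so the call returns every tag whose id contains "link" (on this page both ids happen to start with it).

Output

[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]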
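A quick way to see how the regular expression is applied is to try it on ids that do not all start with "link". The HTML snippet here is made up for the purpose; the point is only the difference between an unanchored pattern and one anchored with ^.

```python
import re
from bs4 import BeautifulSoup

# Made-up snippet: one id starts with "link", one merely contains it
html = '<p><a id="link1">A</a><a id="mylink">B</a><a id="other">C</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Unanchored pattern: matches any id that contains "link"
contains_link = [a["id"] for a in soup.find_all(id=re.compile("link"))]

# Anchored pattern: matches only ids that start with "link"
starts_with_link = [a["id"] for a in soup.find_all(id=re.compile("^link"))]

print(contains_link)
print(starts_with_link)
```

If you really want "starts with", anchoring the pattern with ^ is the safer choice.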

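The recursive parameter

Defaults to True, which searches all descendants. To search only the direct children, set it to False.

print(soup.find_all('a'))
print(soup.find_all('a', recursive=False))  # search direct children only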
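Since no output is shown for this one, here is a small offline sketch of what to expect. The HTML is a made-up two-link snippet; the <a> tags sit inside <p>, so they are descendants of the document but not its direct children.

```python
from bs4 import BeautifulSoup

# Made-up snippet: the <a> tags are nested inside <html>, <body> and <p>
html = '<html><body><p><a href="#1">one</a> <a href="#2">two</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")                      # searches every descendant
direct_links = soup.find_all("a", recursive=False)  # only direct children of soup

# Called on the <p> tag, the <a> tags are direct children, so they are found
p = soup.find("p")
p_children = p.find_all("a", recursive=False)

print(len(all_links), len(direct_links), len(p_children))
```

The empty result from the top-level call is expected: the only direct child of the parsed document here is the <html> tag.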

The string parameter

print(soup.find_all(string="Basic Python"))
print(soup.find_all(string=re.compile("python")))  # regular expression

Output

['Basic Python']
['This is a python demo page', 'The demo python introduces several python courses.']

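Because find_all is used so often, bs4 provides two equivalent shorthand forms: calling a tag directly, <tag>(...), is the same as <tag>.find_all(...), and soup(...) is the same as soup.find_all(...).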
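The shorthand is simply calling the soup or a tag as if it were a function. A minimal check on a made-up snippet:

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration
html = '<p class="course"><a id="link1">Basic Python</a><a id="link2">Advanced Python</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Calling the soup object is shorthand for soup.find_all(...)
same_on_soup = soup("a") == soup.find_all("a")

# The same shorthand works on any tag
p = soup.find("p")
same_on_tag = p("a") == p.find_all("a")

print(same_on_soup, same_on_tag)
```

The explicit find_all spelling is usually easier to read, so the shorthand is mostly worth recognizing in other people's code.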


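II. Example: scraping vegetable prices

Scrape vegetable price information from the Xinfadi market in Beijing.

1. View the page source

View the page source and check whether the information to scrape appears in it directly.

It does, so move on to the next step.

2. Check the status code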

import  requests
url="http://www.xinfadi.com.cn/marketanalysis/0/list/1.shtml"
resp=requests.get(url)
print(resp.status_code)

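The status code returned shows that the page can be accessed normally.

Fetching the page source shows that the Chinese text displays correctly, so the character set is not a problem.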

3. Create the BeautifulSoup object

Don't forget the parser.

# Specify the HTML parser
page = BeautifulSoup(resp.text, "html.parser")

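4. Find the data (the price table)

Methods:

find(tag, attribute=value) returns only the first match.

find_all returns every match.

The content we want is inside a table, so start by looking for <table>. The page has many <table> tags; what distinguishes them is the value of their class attribute.

# The find method
table = page.find("table", class_="hq_table")

Because class is a Python keyword, bs4 accepts class_ with a trailing underscore to avoid the clash.

Another way to write it:

table = page.find("table", attrs={"class": "hq_table"})

This form avoids the class keyword problem altogether.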
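Both spellings can be compared on a small made-up page with two tables. Only the class name hq_table comes from the post; the rest of the HTML is invented for this sketch.

```python
from bs4 import BeautifulSoup

# Made-up page with two tables; only the class name hq_table is from the post
html = (
    '<table class="nav"><tr><td>menu</td></tr></table>'
    '<table class="hq_table"><tr><td>name</td></tr><tr><td>cabbage</td></tr></table>'
)
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find("table", class_="hq_table")          # trailing underscore
by_attrs = soup.find("table", attrs={"class": "hq_table"})  # attrs dictionary

# find returns None when nothing matches, so check before using the result
missing = soup.find("table", class_="no_such_class")

print(by_keyword is by_attrs)
print(missing)
```

The None case matters in practice: if the site changes its markup, table.find_all(...) on a None result raises an AttributeError.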

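5. Find the specific data in the price table

<tr> represents a row and <td> represents a cell (column) within it.

The first <tr> tag in the table is the header row; the content we actually want to extract starts at the second <tr> tag.

In each <tr> tag, find all the <td> tags. In order, they hold the product name, lowest price, average price, highest price, specification, unit, and release date.

The code is as follows: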

trs = table.find_all("tr")[1:]  # skip the first row (the header) for now
for tr in trs:  # each row
    tds = tr.find_all("td")
    # .text returns the text enclosed by the tag
    name = tds[0].text     # product name
    lowest = tds[1].text   # lowest price
    mean = tds[2].text     # average price
    highest = tds[3].text  # highest price
    guige = tds[4].text    # specification
    unit = tds[5].text     # unit
    date = tds[6].text     # release date
    print(name, lowest, mean, highest, guige, unit, date)

My Python syntax is still shaky 🌵🐶, so I went and looked up a tutorial on Python slicing.
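For reference, this is all the [1:] slice does. The list is a made-up stand-in for the rows returned by find_all("tr").

```python
# Made-up stand-in for the list of rows returned by find_all("tr")
rows = ["header", "cabbage", "carrot", "potato"]

data_rows = rows[1:]  # from index 1 to the end: everything except the header
first_two = rows[:2]  # from the start up to, but not including, index 2
last_row = rows[-1]   # negative indexes count from the end

print(data_rows)
print(first_two)
print(last_row)
```

Slicing never raises an error on a short list: on a list with only the header, rows[1:] is simply empty.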

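6. Write to a file

import csv
# newline="" stops the csv module from writing blank lines between rows on Windows
f = open("菜價.csv", mode="w", encoding="utf-8", newline="")
csvwriter = csv.writer(f)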
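A minimal round trip shows the writing step on its own. The rows are invented sample data, and the file goes into a temporary directory so the sketch leaves nothing behind; the with statement is a swapped-in alternative to calling f.close() by hand.

```python
import csv
import os
import tempfile

# Invented sample rows in the same column order as the price table
header = ["name", "lowest", "mean", "highest", "guige", "unit", "date"]
rows = [
    ["cabbage", "0.5", "0.6", "0.7", "", "jin", "2021-01-01"],
    ["carrot", "1.0", "1.2", "1.4", "", "jin", "2021-01-01"],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "prices.csv")  # hypothetical file name

    # The with block closes the file even if writing fails part-way
    with open(path, mode="w", encoding="utf-8", newline="") as f:
        csvwriter = csv.writer(f)
        csvwriter.writerow(header)  # one row: pass a single list
        csvwriter.writerows(rows)   # many rows: pass a list of lists

    # Read the file back to confirm what was written
    with open(path, encoding="utf-8", newline="") as f:
        read_back = list(csv.reader(f))

print(read_back)
```

Everything read back from a CSV file is a string, so prices need converting with float() before any arithmetic.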

7. Complete code

# Scrape vegetable prices
import csv

import requests
from bs4 import BeautifulSoup

url = "http://www.xinfadi.com.cn/marketanalysis/0/list/1.shtml"
resp = requests.get(url)
# Specify the HTML parser
page = BeautifulSoup(resp.text, "html.parser")
# table = page.find("table", class_="hq_table")
# newline="" stops the csv module from writing blank lines between rows on Windows
f = open("菜價.csv", mode="w", encoding="utf-8", newline="")
csvwriter = csv.writer(f)

table = page.find("table", attrs={"class": "hq_table"})
# Get the data in every row
trs = table.find_all("tr")[1:]  # skip the first row (the header)
for tr in trs:  # each row
    tds = tr.find_all("td")
    # .text returns the text enclosed by the tag
    name = tds[0].text     # product name
    lowest = tds[1].text   # lowest price
    mean = tds[2].text     # average price
    highest = tds[3].text  # highest price
    guige = tds[4].text    # specification
    unit = tds[5].text     # unit
    date = tds[6].text     # release date
    csvwriter.writerow([name, lowest, mean, highest, guige, unit, date])

f.close()
print("done!")
resp.close()

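III. Example: scraping images

Goal: scrape anime images from the 優美圖庫 (umei.net) image site.

1. View the page source

The target is the images themselves, so first check whether the content to scrape shows up in the page source.

Take the first image, the two-sword Zoro: it is there, so we can continue.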

2. Check the status code and encoding

import requests
from bs4 import BeautifulSoup

url = "https://umei.net/katongdongman/"
resp = requests.get(url)
print(resp.status_code)
print(resp.apparent_encoding)
print(resp.encoding)

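The status code is 200.

resp.apparent_encoding is utf-8, while resp.encoding is ISO-8859-1, so the encoding needs to be changed:

resp.encoding = resp.apparent_encoding

Besides printing the two encodings and comparing them as above, another way to check is to print(resp.text) first and look for garbled characters. If there are any, look up the encoding declared in the page's charset and set resp.encoding to that.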
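What the garbled characters look like, and why changing the encoding fixes them, can be reproduced without any request. The sketch takes a made-up Chinese string, encodes it as UTF-8 the way a server would send it, and then decodes the bytes with the wrong and the right codec. requests falls back to ISO-8859-1 when the Content-Type header declares a text type without a charset, which is the situation seen above.

```python
# Made-up text standing in for the page content
original = "動漫圖片"
raw_bytes = original.encode("utf-8")  # the bytes the server actually sends

# Decoding with the wrong codec does not fail, it just produces mojibake
garbled = raw_bytes.decode("iso-8859-1")

# Decoding with the right codec recovers the text
correct = raw_bytes.decode("utf-8")

# The garbled text still contains all the information, so it can be repaired
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(correct)
print(repaired == original)
```

Setting resp.encoding before reading resp.text has the same effect as choosing the right codec here: the bytes stay the same, only their interpretation changes.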

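3. Create the BeautifulSoup object

mainpage = BeautifulSoup(resp.text, "html.parser")

4. Find the section on the main page

Look for a distinctive piece of the page source. A Ctrl+F search shows that TypeList appears in only one place in the source, so it makes a good target:

ChildPages = mainpage.find("div", attrs={"class": "TypeList"})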
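Here is the idea on a made-up page. Only the class name TypeList and the variable names mainpage and ChildPages come from the post; the surrounding HTML structure is an assumption made for this sketch.

```python
from bs4 import BeautifulSoup

# Made-up page: a navigation block plus the one block marked TypeList
html = """
<div class="Nav"><a href="/">home</a></div>
<div class="TypeList">
  <ul>
    <li><a href="/page/1.htm">first</a></li>
    <li><a href="/page/2.htm">second</a></li>
  </ul>
</div>
"""
mainpage = BeautifulSoup(html, "html.parser")

# Narrow the search to the distinctive block first...
ChildPages = mainpage.find("div", attrs={"class": "TypeList"})

# ...then collect the links inside it, ignoring links elsewhere on the page
hrefs = [a.get("href") for a in ChildPages.find_all("a")]

print(hrefs)
```

Searching inside the block rather than the whole page is what keeps the navigation link out of the result.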

5. Get the URL of the page that holds each image

childs = ChildPages.find_all("a")  # find_all returns a list
for child in childs:
    # Read the value of the href attribute from the tag with .get()
    print(child.get("href"))

The hrefs are not complete URLs, so they need to be joined onto the site's domain.

for child in childs:
    # Read the value of the href attribute from the tag with .get()
    # print(child.get("href"))

    child_page_resp = requests.get("https://umei.net/" + child.get("href"))
    child_page_resp.encoding = 'utf-8'
    child_page_text = child_page_resp.text
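As an alternative to joining strings by hand, the standard library's urljoin handles the different shapes an href can take. This is a swapped-in technique, not what the post uses, and the example paths are hypothetical.

```python
from urllib.parse import urljoin

base = "https://umei.net/katongdongman/"  # the page the links were found on

# Hypothetical href values in the three shapes you may run into
from_root = urljoin(base, "/bizhitupian/1.htm")           # starts with "/"
relative = urljoin(base, "2.htm")                         # relative to the page
absolute = urljoin(base, "https://example.com/pic/3.jpg") # already complete

# Plain concatenation produces a doubled slash when the href starts with "/"
concatenated = "https://umei.net/" + "/bizhitupian/1.htm"

print(from_root)     # https://umei.net/bizhitupian/1.htm
print(relative)      # https://umei.net/katongdongman/2.htm
print(absolute)      # https://example.com/pic/3.jpg
print(concatenated)  # https://umei.net//bizhitupian/1.htm
```

Whether the doubled slash causes trouble depends on the server, which is why urljoin is the less fragile option.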

6. Find the image download address on each child page

Taking the two-sword Zoro image as an example, look for a distinctive piece of source code on its child page.

for child in childs:
    # Read the value of the href attribute from the tag with .get()
    # print(child.get("href"))

    child_page_resp = requests.get("https://umei.net/" + child.get("href"))
    child_page_resp.encoding = 'utf-8'
    child_page_text = child_page_resp.text

    child_page = BeautifulSoup(child_page_text, "html.parser")
    p = child_page.find("p", align="center")
    img = p.find("img")
    print(img)
    break  # try it on one page first, hence the break

It works!

What we actually need is the src attribute. To read an attribute from a tag, use the .get() method.
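A small offline sketch of reading src, with guards for the case where a page does not have the expected markup. The HTML and the image URL are made up; only the <p align="center"> pattern comes from the post.

```python
from bs4 import BeautifulSoup

# Made-up child page; the image URL is a placeholder
html = '<p align="center"><img src="https://example.com/pic/zoro.jpg" alt="demo"></p>'
child_page = BeautifulSoup(html, "html.parser")

p = child_page.find("p", align="center")

# find returns None when nothing matches, so guard each step
img = p.find("img") if p is not None else None
src = img.get("src") if img is not None else None

# .get() returns None for an attribute the tag does not have,
# whereas img["data-original"] would raise a KeyError
missing = img.get("data-original")

print(src)
print(missing)
```

In a loop over many child pages, a guard like this lets you skip a page with unexpected markup instead of crashing half-way through.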

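7. Download the images

Just request the image's address.

    img = p.find("img")
    src = img.get("src")

    # Download the image
    img_resp = requests.get(src)
    # Use whatever follows the last "/" in the URL as the file name
    img_name = src.split("/")[-1]
    # Save into the picture folder, which must be created beforehand
    with open("picture/" + img_name, mode="wb") as f:
        # Write the image content to the file
        f.write(img_resp.content)  # img_resp.content is bytes
    time.sleep(1)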
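The saving step can be tried without downloading anything. In this sketch save_image is a hypothetical helper, the bytes and URL are placeholders, and os.makedirs is a swapped-in convenience that creates the folder when it does not exist yet, instead of relying on it being made by hand.

```python
import os
import tempfile


def save_image(content, src, folder):
    """Write image bytes to folder, named after the last part of the URL."""
    os.makedirs(folder, exist_ok=True)  # create the folder if it is missing
    img_name = src.split("/")[-1]       # text after the last "/" in the URL
    path = os.path.join(folder, img_name)
    with open(path, mode="wb") as f:    # "wb": write in binary mode
        f.write(content)
    return path


# Placeholder bytes and URL standing in for img_resp.content and src
fake_content = b"\x89PNG fake image bytes"
fake_src = "https://example.com/pic/zoro.jpg"

with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_image(fake_content, fake_src, os.path.join(tmp_dir, "picture"))
    saved_name = os.path.basename(saved_path)
    with open(saved_path, mode="rb") as f:  # "rb": read the bytes back
        saved_bytes = f.read()

print(saved_name)
```

Binary mode matters here: opening the file with mode="w" and writing bytes raises a TypeError.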

To follow the progress, add a status message:

# for example:
	print('done')

To avoid getting your IP banned:

import time

# inside the for loop:
	time.sleep(1)

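Images are binary data, so the file is opened in a binary mode: mode='wb' for writing (mode='rb' is the matching mode for reading it back).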

Done!

8. Complete code

import requests
from bs4 import BeautifulSoup
import time

url = "https://umei.net/katongdongman/"
resp = requests.get(url)

resp.encoding = resp.apparent_encoding
mainpage = BeautifulSoup(resp.text, "html.parser")

# Find the section on the main page

ChildPages = mainpage.find("div", attrs={"class": "TypeList"})
# print(ChildPages)
childs = ChildPages.find_all("a")
for child in childs:
    # Read the value of the href attribute from the tag with .get()
    # print(child.get("href"))

    child_page_resp = requests.get("https://umei.net/" + child.get("href"))
    child_page_resp.encoding = 'utf-8'
    child_page_text = child_page_resp.text

    child_page = BeautifulSoup(child_page_text, "html.parser")
    p = child_page.find("p", align="center")
    img = p.find("img")
    src = img.get("src")

    # Download the image
    img_resp = requests.get(src)
    # Use whatever follows the last "/" in the URL as the file name
    img_name = src.split("/")[-1]
    # Save into the picture folder, which must be created beforehand
    with open("picture/" + img_name, mode="wb") as f:
        # Write the image content to the file
        f.write(img_resp.content)  # img_resp.content is bytes
    time.sleep(1)  # pause between requests to avoid getting the IP banned
    print('done', img_name)

print('all done!')

resp.close()

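IV. Problems encountered

1. writerow

At first, the call that should have been

csvwriter.writerow([name,lowest,mean,highest,guige,unit,date])

was mistakenly written as

csvwriter.writerow(name,lowest,mean,highest,guige,unit,date)

writerow takes exactly one argument, the whole row as a single iterable, so the values have to be wrapped in a list.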
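The mistake is easy to reproduce in isolation. The sketch writes into an in-memory buffer instead of a file and uses invented values; it also shows a related trap, passing a bare string.

```python
import csv
import io

buffer = io.StringIO()  # in-memory stand-in for the CSV file
csvwriter = csv.writer(buffer)

# Wrong: several separate arguments raise a TypeError
raised = False
try:
    csvwriter.writerow("cabbage", "0.5", "0.6")
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# Right: one list holding the whole row
csvwriter.writerow(["cabbage", "0.5", "0.6"])
correct_text = buffer.getvalue()

# Related trap: a bare string is itself iterable, so it is split into characters
other = io.StringIO()
csv.writer(other).writerow("abc")
split_text = other.getvalue()

print(repr(correct_text))
print(repr(split_text))
```

The second trap is the sneakier one because it raises no error at all.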
