This article is aimed mainly at beginners in Python web scraping.

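Preface
- I. Importing the libraries
- II. Simulating login
- III. Scraping the information
- IV. Saving the data
- V. The complete code
- VI. Pitfalls along the way
- VII. Closing remarks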

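Preface

Study materials that helped a great deal with this post:

1. The videos in the NetEase Cloud Classroom course "Python網絡爬蟲實戰" (Python Web Crawler in Practice) are very useful; I recommend working through them carefully.

2. The post by the blogger kelvinmao, "python網絡爬蟲學習(五) 模拟登陸北郵資訊門戶并爬取資訊" (Learning Python web crawling, part 5: simulating login to the BUPT information portal and scraping it). It spared me a lot of tedious work on login authentication, though I am not sure whether that was good or bad for my own skills.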

I. Importing the libraries

import requests
import http.cookiejar as cookielib
from bs4 import BeautifulSoup as bs
import re
import pandas as pd
import os
import datetime
today = datetime.date.today().isoformat()  # date format: 2019-07-05
# create a folder on the desktop for saving the files
folder_path = 'C:/Users/john/OneDrive/桌面/' + today +"/"
if not os.path.exists(folder_path):
    os.makedirs(folder_path)
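
The folder setup above can also be sketched as a small helper. This is only an illustration under stated assumptions: the name `make_output_folder` and its `base_dir` parameter are hypothetical and not part of the original script, and a temporary directory stands in for the desktop path. It relies on `os.makedirs(..., exist_ok=True)`, which makes the separate existence check unnecessary, and on `os.path.join`, which avoids assembling the path separators by hand.

```python
import datetime
import os
import tempfile


def make_output_folder(base_dir, day=None):
    """Create (if needed) and return a folder named after the ISO date.

    `make_output_folder` and `base_dir` are illustrative names, not part
    of the original script.
    """
    day = day or datetime.date.today()
    folder = os.path.join(base_dir, day.isoformat())
    # exist_ok=True: no error is raised if the folder already exists
    os.makedirs(folder, exist_ok=True)
    return folder


if __name__ == "__main__":
    # A temporary directory stands in for the desktop path here.
    with tempfile.TemporaryDirectory() as tmp:
        print(make_output_folder(tmp, datetime.date(2019, 7, 5)))
```

Note that this helper returns the path without a trailing slash, so files would be added with `os.path.join(folder, name)` rather than by string concatenation.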

II. Simulating login

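[Image: Python 3.7 crawler example for beginners: BUPT information portal, scraping the daily notices and saving them with their attachments to the desktop]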

Analyzing the submitted form

1. username # student ID

2. password # password

3. lt # a serial number issued by webflow

4. execution # look closely and you will see it is a constant value

5. _eventId # also a constant value

6. rmShown # likewise a constant value

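P.S. A note on lt, as I understand it: when the page is opened (the GET request), the server issues a serial number, which can be found in the page source. This raises a problem: if the POST request (carrying the form data) is sent as a fresh, unrelated request, lt will have changed. How to solve this? Use a requests Session, which keeps the cookies, so that the lt obtained from the GET is still valid for the POST. (In effect it is still the first visit, only now carrying the data.)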


# simulate a browser's request headers
header={
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0'
}

# collect every named input of the login form (including the hidden lt);
# defined before it is called below
def getLt(html):
    lt=bs(html,'html.parser')
    dic={}
    for inp in lt.form.find_all('input'):
        if inp.get('name') is not None:
            dic[inp.get('name')]=inp.get('value')
    print(dic)
    return dic

# setting cookie: the session keeps cookies across requests
s=requests.Session()
s.cookies=cookielib.CookieJar()
r=s.get('https://auth.bupt.edu.cn/authserver/login?service=http%3A%2F%2Fmy.bupt.edu.cn%2Findex.portal',headers=header)
dic=getLt(r.text)

postdata={
    'username':'######',  # your student ID goes here
    'password':'######',  # your password
    'lt':dic['lt'],
    'execution':'e1s1',
    '_eventId':'submit',
    'rmShown':'1'
}
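
To try the form analysis without touching the real server, the extraction of the form fields can be sketched offline. This is a rough sketch under assumptions: the sample HTML is made up (the real login page may look different), `HiddenInputParser` and `build_postdata` are hypothetical names, and the standard-library `html.parser` is used in place of BeautifulSoup so that the example runs without extra packages. Unlike `getLt`, it collects the inputs of the whole page rather than of the first form only. It also reads `execution`, `_eventId` and `rmShown` from the page instead of hard-coding them, which may be a little more robust if those values ever change.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect the name/value pair of every <input> that has a name."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        if attrs.get("name") is not None:
            self.fields[attrs["name"]] = attrs.get("value")


def build_postdata(login_html, username, password):
    """Merge the server-issued fields with the credentials."""
    parser = HiddenInputParser()
    parser.feed(login_html)
    postdata = dict(parser.fields)
    postdata["username"] = username
    postdata["password"] = password
    return postdata


# Made-up stand-in for the login page; the real markup may differ.
SAMPLE_LOGIN_HTML = """
<form id="casLoginForm" method="post">
  <input name="username" type="text">
  <input name="password" type="password">
  <input type="hidden" name="lt" value="LT-12345-abcdef">
  <input type="hidden" name="execution" value="e1s1">
  <input type="hidden" name="_eventId" value="submit">
  <input type="hidden" name="rmShown" value="1">
  <input type="submit" value="Login">
</form>
"""

if __name__ == "__main__":
    print(build_postdata(SAMPLE_LOGIN_HTML, "student-id", "secret"))
```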

III. Scraping the information

1. Logging in

# log in by POSTing the form data
response=s.post('https://auth.bupt.edu.cn/authserver/login?service=http%3A%2F%2Fmy.bupt.edu.cn%2Findex.portal',data=postdata,headers=header)
# GET the "campus notices" page
res=s.get('http://my.bupt.edu.cn/index.portal?.pn=p1778',headers=header)
# parse the HTML with BeautifulSoup
soup=bs(res.text,'html.parser')

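2. Finding the target URLs

In the page source, each notice is an a tag whose href, with the site prefix prepended, gives the URL of the notice, which is the information I need. The a tag has neither an id nor a class, so I match on the href itself: the first part of every href is the same, so re.compile() with the string "detach" is used as the pattern.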

url=[]
for j in soup.find_all(href=re.compile("detach")):
    url.append('http://my.bupt.edu.cn/'+j.get('href'))
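
The same matching idea can be tried offline on a small fragment. This is a sketch under assumptions: the sample hrefs are invented, `LinkCollector` and `notice_urls` are hypothetical names, and the standard-library parser stands in for BeautifulSoup. It also swaps the string concatenation for `urllib.parse.urljoin`, which gives the same result for relative hrefs and additionally copes with an href that starts with a slash.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://my.bupt.edu.cn/"

# Made-up fragment standing in for the notice list page.
SAMPLE_LIST_HTML = """
<a href="detach.portal?action=view&amp;id=1">Notice A</a>
<a href="index.portal">Home</a>
<a href="/detach.portal?action=view&amp;id=2">Notice B</a>
"""


class LinkCollector(HTMLParser):
    """Collect the href of every <a> whose href matches a pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # search() matches anywhere in the href
        if href and self.pattern.search(href):
            self.links.append(href)


def notice_urls(list_html, base_url=BASE_URL, pattern="detach"):
    """Return absolute URLs for all links whose href matches `pattern`."""
    collector = LinkCollector(pattern)
    collector.feed(list_html)
    return [urljoin(base_url, href) for href in collector.links]


if __name__ == "__main__":
    for link in notice_urls(SAMPLE_LIST_HTML):
        print(link)
```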

3. Finding the publication date of each notice

The dates have the class time, so the code is:

date=[]
for j in soup.find_all(class_='time'):
    date.append(j.string)

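4. Scraping the body text

① The title has the class text-center, so it is selected with soup.select('.text-center').

② The body has the class singleinfo; the text of all the p tags inside it needs to be merged.

③ If the notice has attachments, they need to be downloaded. The attachment URLs are handled the same way as in step 2, with re.compile(). The download is written with open() in 'wb' mode (opens the file for writing only, in binary format; if the file already exists it is written from the beginning, so its previous contents are deleted; if it does not exist, a new file is created).

The function below takes a URL, downloads any attachments, and returns a dict containing Title and article.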

def getNewsDetail(newsurl):
    result={}
    res=s.get(newsurl,headers=header)
    res.encoding='utf-8'
    soup=bs(res.text,'html.parser')
    result['Title']=soup.select('.text-center')[0].text
    # merge the text of every p tag in the body
    article=[]
    for p in soup.select('.singleinfo p'):
        article.append(p.text.strip())
    result['article']='\n'.join(article)
    # collect the attachment links and their display names
    downloadurl=[]
    filename=[]
    Docurl=soup.find_all(href=re.compile("attachment"))
    for k in Docurl:
        downloadurl.append('http://my.bupt.edu.cn/'+k.get('href'))
        filename.append(k.string)
    # download each attachment into folder_path
    if filename:
        for k in range(0,len(filename)):
            download=s.get(downloadurl[k],headers=header)
            # the with block closes the file, so no f.close() is needed
            with open(folder_path+filename[k],"wb") as f:
                f.write(download.content)
    return result
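
One thing the function above leaves open is what happens when the link text used as a file name contains characters that Windows does not accept, or when `k.string` is None. The following is a possible sketch, not the original author's method: `safe_filename` and `save_bytes` are hypothetical helpers, and the replacement character `_` is an arbitrary choice. It also shows the truncating behaviour of 'wb' mode described earlier.

```python
import os
import re
import tempfile

# Printable characters that Windows does not allow in file names.
_FORBIDDEN = re.compile(r'[\\/:*?"<>|]')


def safe_filename(name, default="attachment"):
    """Return `name` with forbidden characters replaced by '_'."""
    cleaned = _FORBIDDEN.sub("_", (name or "").strip())
    # Windows also rejects names that end in a dot or a space.
    cleaned = cleaned.rstrip(". ")
    return cleaned or default


def save_bytes(folder, name, content):
    """Write `content` to folder/name in 'wb' mode and return the path."""
    path = os.path.join(folder, safe_filename(name))
    with open(path, "wb") as f:  # 'wb' discards any previous contents
        f.write(content)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = save_bytes(tmp, "report 2019/07:05?.docx", b"first")
        save_bytes(tmp, "report 2019/07:05?.docx", b"second")
        with open(path, "rb") as f:
            print(os.path.basename(path), f.read())
```

In the scraper, `save_bytes(folder_path, filename[k], download.content)` could take the place of the `open(...)` block, assuming these helpers are defined first.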

IV. Saving the data

If a notice's date is today, call getNewsDetail() and append the result to news_total; finally, use pandas' DataFrame() and to_excel() to save the file.

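news_total=[]
# pair each date with its URL; zip stops at the shorter list,
# so no index can run past the end
for d,u in zip(date,url):
    # the date text may carry trailing whitespace
    if (d or '').strip()!=today:
        continue
    newsary=getNewsDetail(u)
    news_total.append(newsary)
df=pd.DataFrame(news_total)
df.to_excel(folder_path+'news.xlsx')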
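
The date filter can also be isolated in a small function and checked on sample data before running it against the portal. This is only a sketch: `todays_notices` is a hypothetical name, and the sample dates and URLs are invented.

```python
import datetime


def todays_notices(dates, urls, today=None):
    """Return the URLs whose date string equals today's ISO date."""
    today = today or datetime.date.today().isoformat()
    selected = []
    # zip stops at the shorter list, so no index can go out of range
    for raw_date, link in zip(dates, urls):
        # tolerate None and surrounding whitespace in the scraped text
        if (raw_date or "").strip() == today:
            selected.append(link)
    return selected


if __name__ == "__main__":
    sample_dates = ["2019-07-05 ", "2019-07-04 ", "2019-07-05"]
    sample_urls = ["u1", "u2", "u3"]
    print(todays_notices(sample_dates, sample_urls, today="2019-07-05"))
```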

V. The complete code

# -*- coding: utf-8 -*-
"""
Created on Fri Jul  5 16:49:28 2019

@author: byrwyj
"""

import requests
import http.cookiejar as cookielib
from bs4 import BeautifulSoup as bs
import re
import pandas as pd
import os
import datetime
today = datetime.date.today().isoformat()
folder_path = 'C:/Users/john/OneDrive/桌面/' + today +"/"
if not os.path.exists(folder_path):
    os.makedirs(folder_path)

# collect every named input of the login form (including the hidden lt)
def getLt(html):
    lt=bs(html,'html.parser')
    dic={}
    for inp in lt.form.find_all('input'):
        if inp.get('name') is not None:
            dic[inp.get('name')]=inp.get('value')
    return dic

header={
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}

#setting cookie
s=requests.Session()
s.cookies=cookielib.CookieJar()
r=s.get('https://auth.bupt.edu.cn/authserver/login?service=http%3A%2F%2Fmy.bupt.edu.cn%2Findex.portal',headers=header)
dic=getLt(r.text)
postdata={
    'username':'######',  # your student ID goes here
    'password':'######',  # your password
    'lt':dic['lt'],
    'execution':'e1s1',
    '_eventId':'submit',
    'rmShown':'1'
}

def getNewsDetail(newsurl):
    result={}
    res=s.get(newsurl,headers=header)
    res.encoding='utf-8'
    soup=bs(res.text,'html.parser')
    result['Title']=soup.select('.text-center')[0].text
    article=[]
    for p in soup.select('.singleinfo p'):
        article.append(p.text.strip())
    # merge the text of every p tag in the body
    result['article']='\n'.join(article)
    downloadurl=[]
    filename=[]
    Docurl=soup.find_all(href=re.compile("attachment"))
    for k in Docurl:
        downloadurl.append('http://my.bupt.edu.cn/'+k.get('href'))
        filename.append(k.string)
    if filename:
        for k in range(0,len(filename)):
            download=s.get(downloadurl[k],headers=header)
            # the with block closes the file, so no f.close() is needed
            with open(folder_path+filename[k],"wb") as f:
                f.write(download.content)
    return result

response=s.post('https://auth.bupt.edu.cn/authserver/login?service=http%3A%2F%2Fmy.bupt.edu.cn%2Findex.portal',data=postdata,headers=header)
res=s.get('http://my.bupt.edu.cn/index.portal?.pn=p1778',headers=header)
soup=bs(res.text,'html.parser')
news_total=[]
date=[]
url=[]
for j in soup.find_all(href=re.compile("detach")):
    url.append('http://my.bupt.edu.cn/'+j.get('href'))
for j in soup.find_all(class_='time'):
    date.append(j.string)
# pair each date with its URL; zip stops at the shorter list
for d,u in zip(date,url):
    if (d or '').strip()!=today:
        continue
    newsary=getNewsDetail(u)
    news_total.append(newsary)
df=pd.DataFrame(news_total)
df.to_excel(folder_path+'news.xlsx')

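VI. Pitfalls along the way

1. I watched the tutorial videos at 2x speed with the sound off, so I never picked up the overall framework and had to go back and watch them again.

2. Regular expressions made my head spin, and I later found I could get by without them. They really have to be learned carefully and written bit by bit, without rushing.

3. Near the end I got confused over / versus \ in file paths, which left me with a file that could not be deleted: Windows reported a path error. Rebooting, compressing and then deleting, and a file shredder all failed. It was a real nuisance, but it was solved in the end:

① Open Notepad and enter:

DEL /F /A /Q \\?\%1

RD /S /Q \\?\%1

② Save the Notepad file as del.bat.

③ Drag the file to be deleted onto del.bat with the mouse, and it is deleted.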

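Taken together, the two commands force-delete the file or folder dragged onto the script (even if it is read-only), without asking the user whether to continue or stop.

DEL: the delete command.

/F forces deletion of read-only files.

/S deletes the specified files from all subdirectories (not used in the DEL line above).

/Q quiet mode; no confirmation is requested when deleting with a global wildcard.

/A selects the files to delete by attribute.

RD: the remove-directory command.

/S removes, in addition to the directory itself, all subdirectories and files in the specified directory; used to delete a directory tree.

/Q quiet mode; with /S, no confirmation is requested before the directory tree is deleted.

\\?\%1: %1 is the path of the file dragged onto del.bat, and the \\?\ prefix tells Windows to use that path as given, without its usual normalization.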

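VII. Closing remarks

This code surely has many shortcomings, but I have not written much code yet and cannot improve it much; I am still too much of a novice. Only spare time is truly your own, the time in which you can get something done. This post took me two days, which is fine, and I hope to keep learning more.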
