Open Sina Weibo, log in, and open yz's photo album.
Open Chrome DevTools and, in the Sources panel, click + New snippet, then paste in the following:
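// Ask how often (in seconds) the page should scroll to the bottom.
var timeout = Number(prompt("Set timeout (Second):"));
var count = 0;
var current = location.href;

// Scroll to the bottom so the album loads more photos, then schedule the next pass.
function reload() {
    setTimeout(reload, 1000 * timeout);
    count++;
    console.log('Auto-refresh every (' + timeout + ') seconds, refresh count: ' + count);
    window.scrollTo(0, document.body.scrollHeight);
}

if (timeout > 0) {
    setTimeout(reload, 1000 * timeout);
} else {
    // No valid interval was entered: just reload the current page.
    location.replace(current);
}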
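Right-click the snippet and choose Run. Wait for it to finish, then in the Elements panel right-click the body element and choose Copy element.
Save the copied markup as yz.txt.
Then run this script: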
import datetime
import os

import requests
from lxml import etree

# Parse the <body> markup saved from DevTools.
html = etree.parse('yz.txt', etree.HTMLParser(encoding='utf-8'))

# Every <ul group_id="..."> is one group of photos; its id is used as the folder name.
groups = html.xpath('//ul/@group_id')
curr_time = datetime.datetime.now()

for group in groups:
    path = str(group)
    # Group ids that carry no year get the current year as a prefix.
    if "年" not in path:
        path = str(curr_time.year) + "年" + path
    os.makedirs(path, exist_ok=True)
    print(path)

    # Collect the src of every image inside this group.
    query = '//ul[@group_id="' + str(group) + '"]//img/@src'
    for src in html.xpath(query):
        link = str(src)
        # Add the scheme when the src does not already start with https:.
        if not link.startswith('https:'):
            link = 'https:' + link
        # Request the full-size image instead of the thumbnail.
        link = link.replace("/thumb300/", "/large/")
        print(link)
        # Note: verify=False skips TLS certificate verification.
        response = requests.get(link, verify=False)
        fn = link[link.rfind('/') + 1:]
        # Images from 2009 and 2010 may have no extension; save them as .jpg.
        if path.startswith(('2009', '2010')) and ".jpg" not in fn:
            fn += ".jpg"
        with open(os.path.join(path, fn), "wb") as f:
            f.write(response.content)
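The two string rewrites in the script (thumbnail URL to full-size URL, group id to folder name) can be pulled out into small functions that are easy to check without touching the network. This is only a sketch: the function names normalize_image_url and group_dir_name are made up for illustration, the example host is a placeholder, and the shape of the group_id values ("3月", "2010年5月") is an assumption about what the saved page contains.

```python
from datetime import date


def normalize_image_url(src):
    """Turn an <img> src from the saved page into a full-size HTTPS URL."""
    # Protocol-relative links (//host/path) need a scheme in front.
    url = "https:" + src if src.startswith("//") else src
    # Swap the thumbnail path segment for the full-size one.
    return url.replace("/thumb300/", "/large/")


def group_dir_name(group_id, today=None):
    """Return the folder name for a group, adding the year if it is missing."""
    today = today or date.today()
    if "年" in group_id:
        return group_id
    return f"{today.year}年{group_id}"


# Placeholder host, not a real image server.
print(normalize_image_url("//example.invalid/thumb300/abc.jpg"))
print(group_dir_name("3月", date(2021, 5, 1)))
```

Unlike the check in the script, this version only prepends https: to protocol-relative links, so a src that already has a scheme is left alone.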
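For now the script only saves images; saving videos can be added later.
Surprisingly, some of the very old images have no file extension.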
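Instead of keying the fix on the years 2009 and 2010, the missing extension could be detected on the file name itself. A minimal sketch, assuming that any image without an extension is a JPEG (the same assumption the script makes); ensure_extension is a hypothetical helper name:

```python
import os


def ensure_extension(filename, default_ext=".jpg"):
    """Return filename unchanged if it has an extension, else append default_ext."""
    _, ext = os.path.splitext(filename)
    return filename if ext else filename + default_ext


print(ensure_extension("abc123"))
print(ensure_extension("abc123.gif"))
```

This keeps existing extensions such as .gif intact and only touches names that have none.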
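I also tried the approach in the article below. It works, but it cannot download everything; it only gets 200 pages.
"Python Crawler: Batch-Downloading Weibo Images (Without Cookies)"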