This does not guarantee that every desired image gets scraped.
First, crawl the uploader's album page and save the URLs of all of its image posts to a text file:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import ActionChains
import time

url = 'https://space.bilibili.com/177023891/album'
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)
#driver = webdriver.Chrome()  # non-headless variant, useful for debugging
driver.get(url)
try:
    for i in range(1, 9):  # this uploader's album has 8 pages
        # wait until the image-post links on the current page have rendered
        WebDriverWait(driver, 20, 0.5).until(
            EC.presence_of_all_elements_located((By.CLASS_NAME, 'picture')))
        with open('beyond.txt', 'a') as f:
            for link in driver.find_elements(By.CLASS_NAME, 'picture'):
                f.write(link.get_attribute('href') + '\n')
        # look for the button that leads to the next page
        next_button = None
        for page in driver.find_elements(By.CLASS_NAME, 'panigation'):  # pagination buttons
            if page.text == str(i + 1):
                next_button = page
                break
        if next_button is None:  # no next-page button: this was the last page
            break
        ActionChains(driver).move_to_element(next_button).click(next_button).perform()
        time.sleep(5)  # give the next page time to load
finally:
    driver.quit()
When the last album page is reached there is no button for page 9, so no next-page button is found and the loop exits cleanly; that is the natural end of the crawl.
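Because beyond.txt is opened in append mode, rerunning the crawler appends the same links again. An optional cleanup pass (not part of the original flow) can deduplicate the file before the download step:

# Keep the first occurrence of each URL, preserving order (Python 3.7+ dict semantics).
with open('beyond.txt') as f:
    links = [line.strip() for line in f if line.strip()]
with open('beyond.txt', 'w') as f:
    f.write('\n'.join(dict.fromkeys(links)) + '\n')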
Next, read the image-post URLs back from the text file one by one, visit each page, and download its images:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from tellwlib.py import download

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
with open('beyond.txt', 'r') as f:
    contents = f.readlines()
for idx, content in enumerate(contents):
    print('processing link %d: %s' % (idx + 1, content.strip()))
    driver = webdriver.Chrome(options=chrome_options)  # a fresh browser per link
    driver.get(content.strip())
    try:
        # wait for the image container to render, then collect every <img> inside it
        WebDriverWait(driver, 20, 0.5).until(
            EC.presence_of_all_elements_located((By.CLASS_NAME, 'images')))
        for pic in driver.find_elements(By.CSS_SELECTOR, '.images > img'):
            picurl = pic.get_attribute('src')
            download.download_file(picurl, 'beyond/' + picurl.split('/')[-1])
    except TimeoutException:
        print('timed out, skipping %s' % content.strip())
    finally:
        driver.quit()
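tellwlib is my own helper library. For readers who don't have it, a rough stand-in with the same download_file(url, path) call shape, built on requests, might look like the following (an assumption; the original helper's exact behavior is unknown):

import os
import requests

def download_file(url, path):
    """Stream url to path; a guess at tellwlib's download.download_file semantics."""
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
    resp = requests.get(url, stream=True, timeout=30)
    resp.raise_for_status()
    with open(path, 'wb') as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)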
I haven't yet worked out how to make a singleton class or class variable for the Chrome driver, so this still wastes some resources; I hope to improve on that in the future.
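A minimal sketch of that shared-driver idea (an illustration, not the original code): create one headless Chrome up front, reuse it for every link, and quit only at the end, using the same selectors as above.

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from tellwlib.py import download

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)  # single instance, reused for all links
try:
    with open('beyond.txt') as f:
        links = [line.strip() for line in f if line.strip()]
    for idx, link in enumerate(links):
        print('processing link %d: %s' % (idx + 1, link))
        driver.get(link)  # navigate the existing browser instead of starting a new one
        try:
            WebDriverWait(driver, 20, 0.5).until(
                EC.presence_of_all_elements_located((By.CLASS_NAME, 'images')))
        except TimeoutException:
            continue  # skip links that fail to render in time
        for pic in driver.find_elements(By.CSS_SELECTOR, '.images > img'):
            picurl = pic.get_attribute('src')
            download.download_file(picurl, 'beyond/' + picurl.split('/')[-1])
finally:
    driver.quit()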
Reference:
Python+Selenium+ChromeDriver之浏览器爬虫入门