
A yande.re crawler: automatically scraping images from the site

    Recently I wanted to automatically download some Touhou Project fan art. After looking at a few fan-art sites I decided to try my hand on yande.re first: it requires no login, the pages are free of distracting ads, and the images are generally high quality. The code follows.

    If you want to use it yourself, just change the tag, the save path, and the page range and it runs as-is; the snippet below points at the three knobs.
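    Concretely, the values to edit are the tags= parameter in the listing URL, the file_path default of save_img(), and the range() in the page loop. If you prefer them gathered at the top of the file, an optional prelude could look like this (TAG, SAVE_DIR and PAGES are names introduced here purely for illustration; the script itself hard-codes the values):

TAG = "touhou"                      # substituted into the tags= query parameter
SAVE_DIR = r"D:\images\from_yande"  # passed to save_img() as file_path
PAGES = range(1, 10)                # listing pages to crawl (1 through 9)

    With that prelude, the loop header becomes for page in PAGES: and the listing URL "https://yande.re/post?page=" + str(page) + "&tags=" + TAG.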

import os
import re
import socket
import time
import urllib.error
import urllib.request
def delRepeat(a):
    # Remove duplicates while preserving first-occurrence order
    # (dict preserves insertion order on Python 3.7+).
    return list(dict.fromkeys(a))

def name(photo):
    # photo typically looks like "<32-char-hash>/yande.re%20<id>%20<tags>.<ext>";
    # drop the leading 33 characters (the hash plus "/") and undo a few URL escapes.
    a = photo[33:]
    b = a.replace("%20", "_").replace("%28", "(").replace("%29", ")")
    return b

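# Note: urllib.parse.unquote() decodes every %XX escape in one call, while the
# replace() chain in name() only covers %20, %28 and %29. If you would rather
# keep real spaces and parentheses in the saved filenames, a drop-in variant is:
#
#     from urllib.parse import unquote
#     def name(photo):
#         return unquote(photo[33:])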

def save_img(img_url, file_name, file_path=r'D:\images\from_yande'):
    # Save the image into the folder file_path on disk.
    try:
        if not os.path.exists(file_path):
            print('Folder', file_path, 'does not exist, creating it')
            os.makedirs(file_path)
        file_suffix = os.path.splitext(img_url)[1]  # image extension, e.g. ".jpg"
        filename = '{}{}{}{}'.format(file_path, os.sep, file_name, file_suffix)  # full target path
        urllib.request.urlretrieve(img_url, filename=filename)  # download the image to disk
    except IOError as e:
        print('File operation failed:', e)
    except Exception as e:
        print('Error:', e)
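# Note: the Python docs describe urllib.request.urlretrieve() as a legacy
# interface that might be deprecated in the future; urlopen() together with
# shutil.copyfileobj() is the long-form equivalent if it ever disappears.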


count = 0  # running count of saved images across all pages
for page in range(1, 10):  # crawl listing pages 1-9; widen the range() for more
    time.sleep(3)  # be polite to the server between listing pages
    url = "https://yande.re/post?page=" + str(page) + "&tags=touhou"  # put your own tag here
    html = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    find_index = re.findall(r'id="p\d{3,}', html)  # posts appear as id="p<number>" in the listing
    for each in find_index:  # walk every post found on this page
        try:
            time_start = time.time()
            count += 1
            words = "Saving page " + str(page) + ", image " + str(count)
            print(words, end='')
            n = each[5:]  # drop the leading 'id="p' to keep only the numeric post id
            page_url = "https://yande.re/post/show/" + str(n)
            page_html = urllib.request.urlopen(page_url).read().decode("utf-8", "ignore")
            photo_find = delRepeat(re.findall(r'href="https://files\.yande\.re/image/([\s\S]*?)" target="_blank" rel="external nofollow" ', page_html))
            if len(photo_find) == 0:
                continue
            photo_url = "https://files.yande.re/image/" + photo_find[0]
            photo_name = name(photo_find[0])
            save_img(photo_url, photo_name)
            time_end = time.time()
            print(' took %d s' % (time_end - time_start))
        except (urllib.error.URLError, socket.gaierror, ConnectionAbortedError) as e:
            print('Error:', e)
            continue