Crawler (4) -- Concurrent Downloading

Concurrent downloading simply means using multiple threads or multiple processes to download pages at the same time.

A partial example of a multithreaded crawler follows. Note that threads in the same process share memory by default, so every worker can pop URLs from the same crawl_queue.

import threading

def process_queue():
    while True:
        try:
            url = crawl_queue.pop()
        except IndexError:
            break  # the queue is empty: this worker exits
        else:
            html = D(url)  # download the page
            ...

threads = []
while threads or crawl_queue:
    # reap threads that have finished (iterate over a copy so
    # removing items does not skip elements)
    for thread in list(threads):
        if not thread.is_alive():
            threads.remove(thread)
    # top the pool back up while URLs remain and we are below max_threads
    while len(threads) < max_threads and crawl_queue:
        thread = threading.Thread(target=process_queue)
        thread.daemon = True  # daemon thread: will not block interpreter exit
        thread.start()
        threads.append(thread)
           

While there are URLs to crawl, the loop above keeps creating download threads until the pool reaches max_threads. If the queue runs dry during the crawl, workers hit the IndexError and exit early, shrinking the pool; as soon as new URLs appear in the queue and the thread count is below the maximum, another download thread is started.
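
For reference, here is a self-contained, runnable sketch of the same pool pattern. The fetch helper, the seed_urls parameter, and the example.com seed are illustrative assumptions rather than part of the original, and a short sleep is added so the management loop does not spin while workers are busy.

import threading
import time
import urllib.request
from collections import deque

def fetch(url):
    # hypothetical downloader standing in for D(url)
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def threaded_crawler(seed_urls, max_threads=5):
    crawl_queue = deque(seed_urls)  # shared by every worker thread

    def process_queue():
        while True:
            try:
                url = crawl_queue.pop()
            except IndexError:
                break  # queue drained: this worker exits early
            else:
                html = fetch(url)
                # parse html here and crawl_queue.append() any new URLs

    threads = []
    while threads or crawl_queue:
        threads = [t for t in threads if t.is_alive()]  # drop finished workers
        while len(threads) < max_threads and crawl_queue:
            t = threading.Thread(target=process_queue, daemon=True)
            t.start()
            threads.append(t)
        time.sleep(0.1)  # avoid busy-waiting while workers run

threaded_crawler(['https://example.com'])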

A partial example of a multiprocess crawler follows.

def threaded_crawler(...):
    ...
    crawl_queue.push(seed_url)
    def process_queue():
        while True:
            try:
                url = crawl_queue.pop()
            except KeyError:
                break  # the shared queue has no outstanding URLs
            else:
                ...
                crawl_queue.complete(url)  # mark this URL as processed
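
Unlike threads, processes do not share memory, so the crawl_queue above cannot be an ordinary in-memory container: its push/pop/complete interface, and the KeyError raised when it is empty, point to a queue kept in external storage that every process can reach. Below is a minimal sketch of such a queue backed by MongoDB; the MongoQueue name, the cache database, and the status schema are assumptions for illustration, not the original implementation.

from datetime import datetime
from pymongo import MongoClient, errors

class MongoQueue:
    # states a queued URL can be in
    OUTSTANDING, PROCESSING, COMPLETE = range(3)

    def __init__(self):
        self.db = MongoClient().cache  # assumed database name

    def push(self, url):
        try:
            self.db.crawl_queue.insert_one(
                {'_id': url, 'status': self.OUTSTANDING})
        except errors.DuplicateKeyError:
            pass  # this URL is already queued

    def pop(self):
        # atomically claim an outstanding URL for the calling process
        record = self.db.crawl_queue.find_one_and_update(
            {'status': self.OUTSTANDING},
            {'$set': {'status': self.PROCESSING,
                      'timestamp': datetime.now()}})
        if record is None:
            raise KeyError()  # matches the except KeyError above
        return record['_id']

    def complete(self, url):
        self.db.crawl_queue.update_one(
            {'_id': url}, {'$set': {'status': self.COMPLETE}})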
           
import multiprocessing

def process_link_crawler(args, **kwargs):
    # launch one crawling process per CPU core
    num_cpus = multiprocessing.cpu_count()
    print("Starting {} processes".format(num_cpus))
    processes = []
    for i in range(num_cpus):
        p = multiprocessing.Process(target=threaded_crawler,
                                    args=[args], kwargs=kwargs)
        p.start()
        processes.append(p)
    # wait for all crawling processes to finish
    for p in processes:
        p.join()
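
Each process runs its own threaded_crawler with its own thread pool, so the total concurrency is roughly num_cpus × max_threads, and the final join() calls block until every process has finished. A hypothetical invocation (the seed URL and the max_threads keyword are placeholders):

if __name__ == '__main__':  # required where child processes are spawned
    process_link_crawler('http://example.com', max_threads=5)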