Crawler (4) -- Concurrent Downloading

Concurrent downloading simply means using multiple threads or multiple processes to download pages at the same time.

A partial example of a multithreaded crawler follows. Note that threads in the same process share memory by default, so every worker can pop URLs from the same crawl_queue.

import threading

def process_queue():
    while True:
        try:
            url = crawl_queue.pop()
        except IndexError:
            break  # the queue is empty: this worker exits
        else:
            html = D(url)  # download the page
            ...

threads = []
while threads or crawl_queue:
    # reap threads that have finished (iterate over a copy so
    # removing items does not skip elements)
    for thread in list(threads):
        if not thread.is_alive():
            threads.remove(thread)
    # top the pool back up while URLs remain and we are below max_threads
    while len(threads) < max_threads and crawl_queue:
        thread = threading.Thread(target=process_queue)
        thread.daemon = True  # daemon thread: will not block interpreter exit
        thread.start()
        threads.append(thread)
           

While there are URLs to crawl, the loop above keeps creating download threads until the pool reaches max_threads. If the queue runs dry during the crawl, workers hit the IndexError and exit early, shrinking the pool; as soon as new URLs appear in the queue and the thread count is below the maximum, another download thread is started.
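
For reference, here is a self-contained, runnable sketch of the same pool pattern. The fetch helper, the seed_urls parameter, and the example.com seed are illustrative assumptions rather than part of the original, and a short sleep is added so the management loop does not spin while workers are busy.

import threading
import time
import urllib.request
from collections import deque

def fetch(url):
    # hypothetical downloader standing in for D(url)
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def threaded_crawler(seed_urls, max_threads=5):
    crawl_queue = deque(seed_urls)  # shared by every worker thread

    def process_queue():
        while True:
            try:
                url = crawl_queue.pop()
            except IndexError:
                break  # queue drained: this worker exits early
            else:
                html = fetch(url)
                # parse html here and crawl_queue.append() any new URLs

    threads = []
    while threads or crawl_queue:
        threads = [t for t in threads if t.is_alive()]  # drop finished workers
        while len(threads) < max_threads and crawl_queue:
            t = threading.Thread(target=process_queue, daemon=True)
            t.start()
            threads.append(t)
        time.sleep(0.1)  # avoid busy-waiting while workers run

threaded_crawler(['https://example.com'])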

A partial example of a multiprocess crawler follows.

def threaded_crawler(...):
    ...
    crawl_queue.push(seed_url)
    def process_queue():
        while True:
            try:
                url = crawl_queue.pop()
            except KeyError:
                break  # the shared queue has no outstanding URLs
            else:
                ...
                crawl_queue.complete(url)  # mark this URL as processed
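
Unlike threads, processes do not share memory, so the crawl_queue above cannot be an ordinary in-memory container: its push/pop/complete interface, and the KeyError raised when it is empty, point to a queue kept in external storage that every process can reach. Below is a minimal sketch of such a queue backed by MongoDB; the MongoQueue name, the cache database, and the status schema are assumptions for illustration, not the original implementation.

from datetime import datetime
from pymongo import MongoClient, errors

class MongoQueue:
    # states a queued URL can be in
    OUTSTANDING, PROCESSING, COMPLETE = range(3)

    def __init__(self):
        self.db = MongoClient().cache  # assumed database name

    def push(self, url):
        try:
            self.db.crawl_queue.insert_one(
                {'_id': url, 'status': self.OUTSTANDING})
        except errors.DuplicateKeyError:
            pass  # this URL is already queued

    def pop(self):
        # atomically claim an outstanding URL for the calling process
        record = self.db.crawl_queue.find_one_and_update(
            {'status': self.OUTSTANDING},
            {'$set': {'status': self.PROCESSING,
                      'timestamp': datetime.now()}})
        if record is None:
            raise KeyError()  # matches the except KeyError above
        return record['_id']

    def complete(self, url):
        self.db.crawl_queue.update_one(
            {'_id': url}, {'$set': {'status': self.COMPLETE}})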
           
import multiprocessing

def process_link_crawler(args, **kwargs):
    # launch one crawling process per CPU core
    num_cpus = multiprocessing.cpu_count()
    print("Starting {} processes".format(num_cpus))
    processes = []
    for i in range(num_cpus):
        p = multiprocessing.Process(target=threaded_crawler,
                                    args=[args], kwargs=kwargs)
        p.start()
        processes.append(p)
    # wait for all crawling processes to finish
    for p in processes:
        p.join()
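
Each process runs its own threaded_crawler with its own thread pool, so the total concurrency is roughly num_cpus × max_threads, and the final join() calls block until every process has finished. A hypothetical invocation (the seed URL and the max_threads keyword are placeholders):

if __name__ == '__main__':  # required where child processes are spawned
    process_link_crawler('http://example.com', max_threads=5)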