
Scraping the Douban Movie List with a Python Crawler

Motivation

The whole thing started because I had a sudden urge to scrape the Douban movie list and pull out the text content from it.

I had previously followed an online tutorial on scraping a web page's images and downloading them. It worked, but it didn't feel satisfying, so I spent a day or two studying crawlers for myself, finally worked them out, and am sharing the result here.

How a crawler works

The principle is simple: you request a web page, and the site sends back an HTML document. From that document you extract the elements you want, using Python regular expressions, the BeautifulSoup library, or XPath. That's all there is to it; a crawler really is a simple thing, so don't imagine it to be harder than it is.
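As a quick illustration, here is a minimal fetch-and-parse sketch in that spirit (the URL is a placeholder and the extracted tags are generic, not the real Douban markup):

import requests
from bs4 import BeautifulSoup

# Request the page; the server answers with an HTML document.
resp = requests.get("https://example.com/")
soup = BeautifulSoup(resp.text, "html.parser")

# Pull out the elements you want, e.g. the title and every link target.
print(soup.title.get_text())
for a in soup.find_all("a"):
    print(a.get("href"))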

Technologies used

Concurrent ("multiplexed") fetching of many pages at once (see the sketch after this list)

HTML parsing with the BeautifulSoup library
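"Multiplexed" crawling here just means sending a batch of HTTP requests concurrently instead of one after another, which is what the grequests library does. A minimal sketch, with a hypothetical second page URL added to show a batch:

import grequests

urls = [
    "https://www.douban.com/doulist/240962/",
    "https://www.douban.com/doulist/240962/?start=25",  # hypothetical second page
]

# Build the request objects first; nothing is sent yet.
reqs = (grequests.get(u) for u in urls)

# map() fires them all concurrently (gevent under the hood) and
# returns the responses once they have all arrived.
for resp in grequests.map(reqs):
    print(resp.url, resp.status_code)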

The code

The code below parses the Douban movie list and extracts its text content. Note that if a crawler merely fetches pages, it really isn't hard; the part with real substance is parsing the content you fetch, and that is where your attention should go.

#!/usr/bin/python
# encoding: utf-8
import grequests  # import first: it monkey-patches the stdlib via gevent
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" target="_blank" rel="external nofollow"  class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" target="_blank" rel="external nofollow"  class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" target="_blank" rel="external nofollow"  class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
visited = []  # URLs that have already been crawled

def findINFO(url):
    """Depth-first variant: crawl pages one at a time, recursively."""
    if url in visited:
        return
    visited.append(url)
    html = urlopen(url).read()
    for next_url in ReturnList(html):
        findINFO(next_url)

def find_group(urls):
    """Breadth-first variant: crawl a whole batch of URLs concurrently."""
    urls = [url for url in urls if url not in visited]
    if len(urls) == 0:
        return

    new_urls = set()

    # Fire all the requests for this batch concurrently.
    reqs = (grequests.get(u) for u in urls)
    for res in grequests.map(reqs):
        if res is None:  # a request failed; skip it
            continue
        visited.append(res.url)
        for url in ReturnList(res.text):
            new_urls.add(url)

    find_group(new_urls)



def ReturnList(HTML):
    """Print the text of each list entry and return the pagination links."""
    links = []
    soup = BeautifulSoup(HTML, "html.parser")
    # Parse the page: each movie sits in a .doulist-item block.
    for item in soup.find_all(class_="doulist-item"):
        print(item.find(class_="title").a.get_text())
        print(item.find(class_="rating").find(class_="rating_nums").get_text())
        # Assumption: the last <span> under .rating holds the vote count,
        # based on the doulist markup at the time of writing.
        print(item.find(class_="rating").find_all("span")[-1].get_text())
        print(item.find(class_="abstract").get_text())

    # Collect the pagination links so the crawl can continue
    # (the bar may be absent on a single-page list).
    paginator = soup.find(class_="paginator")
    if paginator is not None:
        for pp in paginator.find_all("a"):
            links.append(pp.attrs["href"])
    return links



# findINFO("https://www.douban.com/doulist/240962/")
# print(visited)
find_group(["https://www.douban.com/doulist/240962/"])