
Scraping the Douban Movie List with a Python Crawler

Motivation

The whole thing started because I had a sudden urge to scrape the Douban movie list and pull out the text content from it.

I had previously followed an online tutorial on scraping a web page's images and downloading them. It worked, but it didn't feel satisfying, so I spent a day or two studying crawlers for myself, finally worked them out, and am sharing the result here.

How a crawler works

The principle is simple: you request a web page, and the site sends back an HTML document. From that document you extract the elements you want, using Python regular expressions, the BeautifulSoup library, or XPath. That's all there is to it; a crawler really is a simple thing, so don't imagine it to be harder than it is.
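As a quick illustration, here is a minimal fetch-and-parse sketch in that spirit (the URL is a placeholder and the extracted tags are generic, not the real Douban markup):

import requests
from bs4 import BeautifulSoup

# Request the page; the server answers with an HTML document.
resp = requests.get("https://example.com/")
soup = BeautifulSoup(resp.text, "html.parser")

# Pull out the elements you want, e.g. the title and every link target.
print(soup.title.get_text())
for a in soup.find_all("a"):
    print(a.get("href"))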

Technologies used

Concurrent ("multiplexed") fetching of many pages at once (see the sketch after this list)

HTML parsing with the BeautifulSoup library
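"Multiplexed" crawling here just means sending a batch of HTTP requests concurrently instead of one after another, which is what the grequests library does. A minimal sketch, with a hypothetical second page URL added to show a batch:

import grequests

urls = [
    "https://www.douban.com/doulist/240962/",
    "https://www.douban.com/doulist/240962/?start=25",  # hypothetical second page
]

# Build the request objects first; nothing is sent yet.
reqs = (grequests.get(u) for u in urls)

# map() fires them all concurrently (gevent under the hood) and
# returns the responses once they have all arrived.
for resp in grequests.map(reqs):
    print(resp.url, resp.status_code)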

The code

The code below parses the Douban movie list and extracts its text content. Note that if a crawler merely fetches pages, it really isn't hard; the part with real substance is parsing the content you fetch, and that is where your attention should go.

#!/usr/bin/python
# encoding: utf-8
import grequests  # import first: it monkey-patches the stdlib via gevent
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" target="_blank" rel="external nofollow"  class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" target="_blank" rel="external nofollow"  class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" target="_blank" rel="external nofollow"  class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
visited = []  # URLs that have already been crawled

def findINFO(url):
    """Depth-first variant: crawl pages one at a time, recursively."""
    if url in visited:
        return
    visited.append(url)
    html = urlopen(url).read()
    for next_url in ReturnList(html):
        findINFO(next_url)

def find_group(urls):
    """Breadth-first variant: crawl a whole batch of URLs concurrently."""
    urls = [url for url in urls if url not in visited]
    if len(urls) == 0:
        return

    new_urls = set()

    # Fire all the requests for this batch concurrently.
    reqs = (grequests.get(u) for u in urls)
    for res in grequests.map(reqs):
        if res is None:  # a request failed; skip it
            continue
        visited.append(res.url)
        for url in ReturnList(res.text):
            new_urls.add(url)

    find_group(new_urls)



def ReturnList(HTML):
    """Print the text of each list entry and return the pagination links."""
    links = []
    soup = BeautifulSoup(HTML, "html.parser")
    # Parse the page: each movie sits in a .doulist-item block.
    for item in soup.find_all(class_="doulist-item"):
        print(item.find(class_="title").a.get_text())
        print(item.find(class_="rating").find(class_="rating_nums").get_text())
        # Assumption: the last <span> under .rating holds the vote count,
        # based on the doulist markup at the time of writing.
        print(item.find(class_="rating").find_all("span")[-1].get_text())
        print(item.find(class_="abstract").get_text())

    # Collect the pagination links so the crawl can continue
    # (the bar may be absent on a single-page list).
    paginator = soup.find(class_="paginator")
    if paginator is not None:
        for pp in paginator.find_all("a"):
            links.append(pp.attrs["href"])
    return links



# findINFO("https://www.douban.com/doulist/240962/")
# print(visited)
find_group(["https://www.douban.com/doulist/240962/"])