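Writing a Simple Web Crawler in Python 3

This post records one small stage of my Python 3 studies: a crawler I wrote while following a video course. It scrapes streamer names and popularity counts, then sorts them in descending order. I am writing it up as a review exercise.

Course video: https://coding.imooc.com/class/136.html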

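···Preparation

1) Define the goal: the streamers for one game on a live-streaming platform, and their popularity

2) Find the corresponding page, for example: https://www.huya.com/g/dnf

3) Use the browser's Inspect Element tool to find where that text sits in the markup

···Coding

1) Simulate an HTTP request to the server and get back the HTML

2) Extract the key data with regular expressions

3) Refine and sort the key data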

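Preparation

Open a browser and go to https://www.huya.com/g/dnf

Press F12 to open the browser's Inspect Element tool, press Ctrl+B to pick an element with the mouse, and locate the markup that holds the streamer name and popularity.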

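Coding

1) Import the built-in module needed to simulate an HTTP request.

Create a Spider class with a private method __fetch_content() and a public method go() that serves as the entry point: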

from urllib import request

class Spider():
    url = 'https://www.huya.com/g/dnf'

    def __fetch_content(self):  # private method: send the request and fetch the HTML
        r = request.urlopen(Spider.url)
        html = r.read()  # read() returns the page as bytes
        html = str(html, encoding='UTF-8')  # decode the bytes into a string
        print(html)  # print the result as a test
        return html

    def go(self):  # public entry point
        html = self.__fetch_content()  # call the private method

spider = Spider()
spider.go()
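The fetch step can be hardened a little. The sketch below is not from the course: it assumes the server may expect a browser-like User-Agent header and that a timeout is desirable. The names build_request, decode_body and fetch, and the USER_AGENT value, are made up for illustration. It uses bytes.decode(), which does the same job as the str(html, encoding='UTF-8') call above.

```python
from urllib import request

# Hypothetical header value; any browser-like string would serve the same purpose.
USER_AGENT = 'Mozilla/5.0 (compatible; simple-spider-demo)'


def build_request(url):
    """Create a Request object that carries a custom User-Agent header."""
    return request.Request(url, headers={'User-Agent': USER_AGENT})


def decode_body(raw, encoding='utf-8'):
    """Turn the bytes returned by read() into a str.

    errors='replace' substitutes malformed bytes instead of raising.
    """
    return raw.decode(encoding, errors='replace')


def fetch(url, timeout=10):
    """Download url and return its body as text (makes a real network call)."""
    with request.urlopen(build_request(url), timeout=timeout) as response:
        return decode_body(response.read())
```

Calling fetch(Spider.url) would then stand in for the body of __fetch_content(); whether the extra header is actually needed for this site is an assumption.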

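2) Once the HTML text has been fetched, find the block of markup that holds the data.

The two child span tags under the parent span tag with class="txt" hold exactly the data we need.

So the regular expression pattern should be

<span class="txt">[\s\S]*?</span>

Here the square brackets define a character set containing \s (whitespace characters) and \S (non-whitespace characters), so together they match any character at all, line breaks included. The * after the brackets repeats that set zero or more times, so [\s\S]* can match any run of text.

The ? that follows switches the repetition to non-greedy mode: matching stops at the first </span> found in the text that follows.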
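To make the difference concrete, here is a small sketch on a made-up two-item string (not the real page). It compares the non-greedy pattern with a greedy one, and with a plain dot, which does not match line breaks by default.

```python
import re

# Made-up sample: two blocks whose content spans several lines.
sample = ('<span class="txt">\n  A\n</span>'
          '<span class="txt">\n  B\n</span>')

# Non-greedy: each match ends at the nearest </span>.
lazy = re.findall(r'<span class="txt">[\s\S]*?</span>', sample)

# Greedy: one match that runs to the last </span>.
greedy = re.findall(r'<span class="txt">[\s\S]*</span>', sample)

# "." does not match "\n" unless re.DOTALL is set, so nothing matches here.
dot_lazy = re.findall(r'<span class="txt">.*?</span>', sample)

print(len(lazy), len(greedy), len(dot_lazy))  # prints: 2 1 0
```

This is why the pattern uses [\s\S] rather than a dot: the markup being matched contains line breaks.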

3) Import the re module needed for regular expression matching.

Define an analysis method __analysis() to parse out the relevant data:

from urllib import request
import re

class Spider():
    url = 'https://www.huya.com/g/dnf'

    def __fetch_content(self):
        r = request.urlopen(Spider.url)
        html = r.read()
        html = str(html, encoding='UTF-8')
        return html

    def __analysis(self, html):
        # raw string, so the backslashes reach the regex engine unchanged
        root_pattern = r'<span class="txt">[\s\S]*?</span>'

        root_html = re.findall(root_pattern, html)

        print(root_html[0])  # print the first match as a test
        return root_html

    def go(self):
        html = self.__fetch_content()
        html = self.__analysis(html)

spider = Spider()
spider.go()

The result is:

<span class="txt">
        <span class="avatar fl">
            <img data-original="https://huyaimg.msstatic.com/avatar/1069/a8/028a1c545ab90184f502cd7a9b1a2a_180_135.jpg" src="//a.msstatic.com/huya/main/assets/img/default/84x84.jpg" onerror="this.onerror=null; this.src='//a.msstatic.com/huya/main/assets/img/default/84x84.jpg';" alt="AzZ丶狂人" title="AzZ丶狂人">
            <i class="nick" title="AzZ丶狂人">AzZ丶狂人</i>
        </span>

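Because the regular expression is non-greedy, matching stops at the first

</span>

so the match contains only one of the two values. The right boundary therefore needs to be changed to something relatively unique:

</li>

Running the test again, the match now contains both values we need: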

<span class="txt">
        <span class="avatar fl">
            <img data-original="https://huyaimg.msstatic.com/avatar/1069/a8/028a1c545ab90184f502cd7a9b1a2a_180_135.jpg" src="//a.msstatic.com/huya/main/assets/img/default/84x84.jpg" onerror="this.onerror=null; this.src='//a.msstatic.com/huya/main/assets/img/default/84x84.jpg';" alt="AzZ丶狂人" title="AzZ丶狂人">
            <i class="nick" title="AzZ丶狂人">AzZ丶狂人</i>
        </span>
                <span class="num"><i class="num-icon"></i><i class="js-num">11.6萬</i></span>
    </span>
</li>
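The effect of the boundary choice can be reproduced offline. The sketch below uses a simplified, made-up list item with the same nesting as the output above; the nick and number values are placeholders.

```python
import re

# Simplified stand-in for one list item on the page (placeholder values).
item = (
    '<li>'
    '<span class="txt">'
    '<span class="avatar fl"><i class="nick" title="demo">demo</i></span>'
    '<span class="num"><i class="js-num">1.2萬</i></span>'
    '</span>'
    '</li>'
)

# Ending at </span> stops at the first nested closing tag.
by_span = re.findall(r'<span class="txt">[\s\S]*?</span>', item)

# Ending at </li> keeps going until the whole item has been consumed.
by_li = re.findall(r'<span class="txt">[\s\S]*?</li>', item)

print('js-num' in by_span[0])  # prints: False
print('js-num' in by_li[0])    # prints: True
```

Regular expressions cannot track tag nesting, so picking a closing tag that appears only once per item is the simple workaround used here.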

4) Run further regular expression matches on each result to extract the streamer name and the popularity separately:

from urllib import request
import re

class Spider():
    url = 'https://www.huya.com/g/dnf'

    def __fetch_content(self):
        r = request.urlopen(Spider.url)
        html = r.read()
        html = str(html, encoding='UTF-8')
        return html

    def __analysis(self, html):
        root_pattern = r'<span class="txt">[\s\S]*?</li>'

        # In the patterns below, ([\s\S]*?) is wrapped in parentheses, so only the text inside the group is kept
        name_pattern = r'<i class="nick" title="[\s\S]*?">([\s\S]*?)</i>'
        number_pattern = r'<i class="js-num">([\s\S]*?)</i>'

        root_html = re.findall(root_pattern, html)
        anchors = []  # create a list
        for html in root_html:  # store each result as a dict and append it to the list
            name = re.findall(name_pattern, html)
            number = re.findall(number_pattern, html)
            anchor = {'name': name, 'number': number}
            anchors.append(anchor)
        print(anchors[0])  # print the first entry as a test
        return anchors

    def go(self):
        html = self.__fetch_content()
        html = self.__analysis(html)

spider = Spider()
spider.go()

Sample result:

{'name': ['AzZ丶七天'], 'number': ['6.1萬']}
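The values in that dictionary are lists because re.findall() always returns a list, even for a single hit. A short sketch on a made-up fragment shows what the parentheses change, and how re.search() could return a plain string instead; the fragment and variable names are illustrative only.

```python
import re

# Made-up fragment with placeholder values.
fragment = '<i class="nick" title="demo">demo</i><i class="js-num">1.2萬</i>'

# Without a group, findall returns the whole matched text.
whole = re.findall(r'<i class="js-num">[\s\S]*?</i>', fragment)

# With one group, findall returns only what the group captured.
inner = re.findall(r'<i class="js-num">([\s\S]*?)</i>', fragment)

# search() yields a match object; group(1) is the captured string itself.
match = re.search(r'<i class="js-num">([\s\S]*?)</i>', fragment)
first = match.group(1) if match else None

print(whole)  # prints: ['<i class="js-num">1.2萬</i>']
print(inner)  # prints: ['1.2萬']
print(first)  # prints: 1.2萬
```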

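5) Define a __refine() method to refine the data in anchors, that is, to keep only the essential characters:

The strip() method removes the specified characters from the start and end of a string. By default it removes whitespace, including spaces and line breaks; an argument, if given, is treated as a set of characters to remove.

Note: this method only removes characters at the start or end of the string, never characters in the middle.

In go(), store the returned result as a list: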

from urllib import request
import re

class Spider():
    url = 'https://www.huya.com/g/dnf'

    def __fetch_content(self):
        r = request.urlopen(Spider.url)
        html = r.read()
        html = str(html, encoding='UTF-8')
        return html

    def __analysis(self, html):
        root_pattern = r'<span class="txt">[\s\S]*?</li>'
        name_pattern = r'<i class="nick" title="[\s\S]*?">([\s\S]*?)</i>'
        number_pattern = r'<i class="js-num">([\s\S]*?)</i>'

        root_html = re.findall(root_pattern, html)
        anchors = []
        for html in root_html:
            name = re.findall(name_pattern, html)
            number = re.findall(number_pattern, html)
            anchor = {'name': name, 'number': number}
            anchors.append(anchor)
        return anchors

    def __refine(self, anchors):
        # anonymous function
        f = lambda x: {
            'name': x['name'][0].strip(),  # strip() removes surrounding whitespace
            'number': x['number'][0]
        }
        return map(f, anchors)  # map() applies the function above to every element of the list

    def go(self):
        html = self.__fetch_content()
        html = self.__analysis(html)
        html = list(self.__refine(html))  # convert the map object to a list
        print(html[0])

spider = Spider()
spider.go()

The test output is:

{'name': 'AzZ丶七天', 'number': '5.4萬'}
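The list() call in go() matters because map() in Python 3 returns a lazy iterator that can be consumed only once. The sketch below shows this on a made-up entry; refine is a hypothetical standalone version of the lambda in __refine().

```python
def refine(anchor):
    """Unwrap the one-element lists and trim surrounding whitespace."""
    return {
        'name': anchor['name'][0].strip(),
        'number': anchor['number'][0],
    }


# Made-up sample shaped like one entry produced by __analysis().
raw_anchors = [{'name': ['\n    demo streamer \n'], 'number': ['1.2萬']}]

lazy = map(refine, raw_anchors)  # nothing has been computed yet
first_pass = list(lazy)          # consumes the iterator
second_pass = list(lazy)         # already exhausted, so this is empty

print(first_pass)   # prints: [{'name': 'demo streamer', 'number': '1.2萬'}]
print(second_pass)  # prints: []
```

Note that strip() left the space inside 'demo streamer' alone; only the leading and trailing whitespace was removed.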

6) Define a __sort() method to sort anchors, and a __show() method to print the result:

from urllib import request
import re

class Spider():
    url = 'https://www.huya.com/g/dnf'

    def __fetch_content(self):
        r = request.urlopen(Spider.url)
        html = r.read()
        html = str(html, encoding='UTF-8')
        return html

    def __analysis(self, html):
        root_pattern = r'<span class="txt">[\s\S]*?</li>'
        name_pattern = r'<i class="nick" title="[\s\S]*?">([\s\S]*?)</i>'
        number_pattern = r'<i class="js-num">([\s\S]*?)</i>'

        root_html = re.findall(root_pattern, html)
        anchors = []
        for html in root_html:
            name = re.findall(name_pattern, html)
            number = re.findall(number_pattern, html)
            anchor = {'name': name, 'number': number}
            anchors.append(anchor)
        return anchors

    def __refine(self, anchors):
        l = lambda x: {
            'name': x['name'][0].strip(),
            'number': x['number'][0]
        }
        return map(l, anchors)

    def __sort(self, anchors):
        # number contains the character '萬' (ten thousand), so a custom key is needed;
        # reverse=True switches from ascending to descending order
        return sorted(anchors, key=self.__sort_seed, reverse=True)

    def __sort_seed(self, anchor):
        # \d+\.?\d* also captures the decimal part, e.g. '11.6'
        r = re.findall(r'\d+\.?\d*', anchor['number'])
        # r is a list; r[0] is the number matched by the pattern above
        number = float(r[0])
        if '萬' in anchor['number']:
            number *= 10000
        return number

    def __show(self, anchors):
        # print only the top five instead of all len(anchors) entries
        for rank in range(0, min(5, len(anchors))):
            print(rank + 1, anchors[rank]['name']
                  + '-----' + anchors[rank]['number'])

    def go(self):
        html = self.__fetch_content()
        anchors = self.__analysis(html)
        anchors = list(self.__refine(anchors))
        anchors = self.__sort(anchors)
        self.__show(anchors)

spider = Spider()
spider.go()

Output:

勝哥-----萬
 AzZ丶小古子-----萬
 AzZ丶小炜-----萬
 AzZ丶仇冬生-----萬
 AzZ丶傑哥助您圓夢-----萬
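The sort key is the part most worth checking in isolation, since strings such as '11.6萬' and '9800' do not compare correctly as text. Below is a minimal standalone sketch with made-up entries. popularity_to_number is a hypothetical helper, not the course's code, and accepting the simplified character '万' as well is an assumption about what the live page may contain.

```python
import re


def popularity_to_number(text):
    """Convert strings like '11.6萬' or '9800' into a float for sorting."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    if match is None:
        return 0.0  # no digits found; treat as zero (an illustrative choice)
    value = float(match.group())
    # '萬' (traditional) and '万' (simplified) both mean ten thousand.
    if '萬' in text or '万' in text:
        value *= 10000
    return value


# Made-up entries shaped like the output of __refine().
anchors = [
    {'name': 'a', 'number': '9800'},
    {'name': 'b', 'number': '11.6萬'},
    {'name': 'c', 'number': '6.1萬'},
]

ranked = sorted(anchors,
                key=lambda a: popularity_to_number(a['number']),
                reverse=True)

print([a['name'] for a in ranked])  # prints: ['b', 'c', 'a']
```

Sorting the raw strings would instead compare them character by character, which puts '9800' ahead of '11.6萬'.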

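Conclusion

With that, this simple crawler is essentially complete.

The main work up front is analyzing the page markup and locating the key pieces of code.

Next comes fetching the HTML text, paying attention to the encoding.

Parsing and extracting the data is the work done after the fetch.

The basic idea of a crawler comes down to fetching, then analyzing and refining,

with each of those split into smaller tasks.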
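On the point about encodings: rather than hard-coding UTF-8, one option is to read the charset the server declares in its Content-Type header and fall back to UTF-8 when none is given. This is a sketch under that assumption. pick_encoding is a hypothetical helper, and the headers are built by hand here so the example runs without a network connection.

```python
from email.message import Message


def pick_encoding(headers, default='utf-8'):
    """Return the charset declared in Content-Type, or default if absent."""
    return headers.get_content_charset() or default


# Hand-built headers standing in for response.headers from urlopen(),
# which is a Message subclass and offers the same method.
declared = Message()
declared['Content-Type'] = 'text/html; charset=GBK'

undeclared = Message()
undeclared['Content-Type'] = 'text/html'

print(pick_encoding(declared))    # prints: gbk
print(pick_encoding(undeclared))  # prints: utf-8
```

In the crawler this would look like r.read().decode(pick_encoding(r.headers)), where r is the response returned by request.urlopen().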
