今天這裡介紹一下python3爬取http://www.goubanjia.com的代理ip的方法,這個網站的html有點變態,還做了js加密。對於初學python的我還是有一定的難度,但是研究了一段時間,寫下了一個demo。接下來跟大家分享一下。
from bs4 import BeautifulSoup
from urllib import parse,request
class Spider:
    """Crawl the free-proxy listing pages of goubanjia.com and append the
    decoded ip:port pairs to portFile.txt.

    The site obfuscates each row: decoy child tags hidden with CSS are mixed
    into the ip cell, and the port is encoded as a letter word in a class
    attribute that must be decoded (see get_poxy).
    """

    def __init__(self):
        # Page range is read interactively from the user (1-based, inclusive).
        self.beginPage = int(input("請輸入起始頁:"))
        self.endPage = int(input("請輸入終止頁:"))
        self.url = 'http://www.goubanjia.com/free/'
        # Spoof a browser UA so the site serves the normal listing page.
        self.ua_header = {"User-Agent" : "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1 Trident/5.0;"}

    def tiebaSpider(self):
        """Visit every listing page from beginPage through endPage."""
        # BUG FIX: original read `self.endPage + ` (missing operand); the
        # upper bound must be endPage + 1 so the final page is included.
        for page in range(self.beginPage, self.endPage + 1):
            # Build the concrete page URL, e.g. .../free/index1.shtml
            myUrl = self.url + "index%d.shtml" % page
            self.loadPage(myUrl)

    def loadPage(self, url):
        """Fetch one listing page and write every decoded ip:port found."""
        req = request.Request(url, headers=self.ua_header)
        resHtml = request.urlopen(req).read()
        # Parse the response; lxml tolerates the site's messy markup.
        html = BeautifulSoup(resHtml, "lxml")
        # Each proxy row keeps its address inside <td class="ip">.
        tdResultList = html.select('td[class="ip"]')
        for tdResult in tdResultList:
            # Walk every descendant tag of the cell and filter decoys.
            childList = tdResult.find_all()
            text = ""
            for child in childList:
                if 'style' in child.attrs.keys():
                    # Only children rendered inline-block carry real digits;
                    # the rest (display:none) are anti-scraping decoys.
                    if child.attrs['style'].replace(' ', '') == "display:inline-block;":
                        if child.string is not None:
                            text = text + child.string
                elif 'class' in child.attrs.keys():
                    classList = child.attrs['class']
                    if 'port' in classList:
                        # BUG FIX: original indexed `classList[]` (syntax
                        # error). The encoded port word is the second class
                        # token, e.g. class="port CFACE" -> "CFACE".
                        port = self.get_poxy(classList[1])
                        # Append the decoded port after the ip.
                        text = text + ":" + str(port)
                else:
                    # Un-styled, class-less children hold plain ip fragments.
                    if child.string is not None:
                        text = text + child.string
            # Persist one "ip:port" line per row.
            self.writeToTxt(text)

    def get_poxy(self, port_word):
        """Decode goubanjia's obfuscated port word into the real port number.

        Each letter maps to a digit via its index in 'ABCDEFGHIZ'
        (A=0 ... H=7, I=8, Z=9); the resulting number is then divided
        by 8, e.g. "CFACE" -> 25024 -> 25024 >> 3 == 3128.
        """
        num_list = []
        for item in port_word:
            num = 'ABCDEFGHIZ'.find(item)
            num_list.append(str(num))
        # BUG FIX: the original line was truncated at `>>`; the site's JS
        # right-shifts the decoded value by 3 (integer-divides by 8).
        port = int("".join(num_list)) >> 3
        return port

    def writeToTxt(self, text):
        """Append one ip:port line to portFile.txt."""
        # Context manager guarantees the handle is closed even on error
        # (the original open/write/close leaked the handle on exceptions).
        with open("portFile.txt", 'a+') as txtFile:
            txtFile.write(text + "\n")
# 模拟 main 函數
if __name__ == "__main__":
# 首先建立爬蟲對象
mySpider = Spider()
# 調用爬蟲對象的方法,開始工作
mySpider.tiebaSpider()
因為初學python,代碼寫得還不夠好,希望大家多提意見,共同學習,也希望對你有所幫助。