A Python Implementation of an IP Proxy Pool

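When a crawler collects data, hitting the same site too frequently gets its IP blocked. Some sites ban access for 3 hours; others blacklist the IP outright. To avoid being banned, three measures are commonly taken:

1. Slow down the crawl by setting a time interval between requests;
2. Simulate browser behavior, for example with Selenium + PhantomJS;
3. Use IP proxies and switch the proxy IP regularly, so the site does not see all requests coming from one IP.

This article implements the third approach.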
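As a rough illustration of what "switching the proxy IP" looks like in code, the sketch below cycles through a list of proxies using the standard library's urllib.request. It is written for Python 3. The proxy addresses are placeholders from a range reserved for documentation, and the helper names make_opener and fetch_via are invented for this example.

```python
import itertools
import urllib.request

# Placeholder proxies (203.0.113.0/24 is reserved for documentation);
# replace them with entries from your own pool.
PROXIES = ["203.0.113.10:8080", "203.0.113.11:3128"]


def make_opener(proxy):
    """Build an opener that routes HTTP and HTTPS requests through `proxy`."""
    handler = urllib.request.ProxyHandler({
        "http": "http://" + proxy,
        "https": "http://" + proxy,
    })
    return urllib.request.build_opener(handler)


# itertools.cycle yields the proxies round-robin, forever.
rotation = itertools.cycle(PROXIES)


def fetch_via(url, timeout=10):
    """Fetch `url` through the next proxy in the rotation."""
    proxy = next(rotation)
    opener = make_opener(proxy)
    with opener.open(url, timeout=timeout) as response:
        return proxy, response.read()
```

Calling fetch_via("http://example.com/") repeatedly would send each request through a different proxy in turn. A real pool would also drop proxies that fail or time out.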

Many sites in China publish IP proxy lists. We use one of them as an example: http://www.haodailiip.com

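The IP-scraping class is implemented in three steps:

1. Parse the IPs and ports from the web page
2. Ping every IP address to measure its connection speed
3. Sort from fastest to slowest and save to a file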

I. Parsing the IPs and ports from the web page

The pages are fetched and parsed with urllib + BeautifulSoup.

Pages to parse: http://www.haodailiip.com/guonei/page, where page = 1, 2, …, 10
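The ten listing URLs can be generated up front. This small sketch assumes the page number is simply appended as the last path segment (…/guonei/1, …/guonei/2, and so on), which is how the pattern above reads. The function name page_urls is made up for this example.

```python
BASE_URL = "http://www.haodailiip.com/guonei/"


def page_urls(first=1, last=10):
    """Return the listing URLs for pages first..last, inclusive."""
    # Assumption: the page number is the final path segment of the URL.
    return [BASE_URL + str(page) for page in range(first, last + 1)]


for url in page_urls():
    print(url)
```

Each URL can then be passed to the parse step in turn.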

import urllib.request
from bs4 import BeautifulSoup

# Method of the proxy-scraping class; results accumulate in self.ip_items.
def parse(self, url):
    try:
        page = urllib.request.urlopen(url)
        data = page.read()
        # html5lib is the most lenient parser and recovers the whole table
        soup = BeautifulSoup(data, "html5lib")
        print(soup.get_text())  # debug: dump the page text
        body_data = soup.find('table', attrs={'class': 'content_table'})
        res_list = body_data.find_all('tr')
        for res in res_list:
            each_data = res.find_all('td')
            # Skip the header row and rows whose first cell is not an IP
            if len(each_data) > 3 and 'IP' not in each_data[0].get_text() and '.' in each_data[0].get_text():
                print(each_data[0].get_text().strip(), each_data[1].get_text().strip())
                item = IPItem()
                item.ip = each_data[0].get_text().strip()
                item.port = each_data[1].get_text().strip()
                item.addr = each_data[2].get_text().strip()
                item.tpye = each_data[3].get_text().strip()
                self.ip_items.append(item)
    except Exception as e:
        print(e)

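BeautifulSoup picks lxml by default when it is installed, but for this site lxml parsed the page content incompletely. The code therefore uses html5lib, the most fault-tolerant parser, at the cost of some speed.

For an introduction to BeautifulSoup's parsers, see: http://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id9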

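The BeautifulSoup parsing steps are:

* First find the table tag with class="content_table";
* Then find all tr elements inside that table;
* The information we need is in the td cells of each tr;
* Store the results in the IPItem class.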
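The same table → tr → td walk can be tried offline. The sketch below is not the article's BeautifulSoup code: it uses only the standard library's html.parser so it runs without extra packages, and it swaps the article's "contains a dot" check for a stricter test with the ipaddress module. The sample HTML and the names ProxyTableParser and is_ip are invented for illustration.

```python
from html.parser import HTMLParser
import ipaddress

# Invented sample that mimics the structure described above.
SAMPLE_HTML = """
<table class="content_table">
  <tr><td>IP</td><td>Port</td><td>Location</td><td>Type</td></tr>
  <tr><td> 203.0.113.10 </td><td>8080</td><td>Beijing</td><td>HTTP</td></tr>
  <tr><td>not-an-ip</td><td>80</td><td>Nowhere</td><td>HTTP</td></tr>
</table>
<table class="other"><tr><td>198.51.100.7</td><td>1</td><td>x</td><td>y</td></tr></table>
"""


class ProxyTableParser(HTMLParser):
    """Collect the <td> texts of every <tr> inside table.content_table."""

    def __init__(self):
        super().__init__()
        self.in_table = False  # inside the target table?
        self.in_cell = False   # inside a <td> of the target table?
        self.row = None        # cells of the row being read
        self.rows = []         # all finished rows

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            self.in_table = "content_table" in classes
        elif self.in_table and tag == "tr":
            self.row = []
        elif self.in_table and tag == "td" and self.row is not None:
            self.in_cell = True
            self.row.append("")

    def handle_data(self, data):
        if self.in_cell and self.row:
            self.row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr" and self.row is not None:
            self.rows.append([cell.strip() for cell in self.row])
            self.row = None
            self.in_cell = False
        elif tag == "table":
            self.in_table = False


def is_ip(text):
    """True if `text` is a valid IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False


parser = ProxyTableParser()
parser.feed(SAMPLE_HTML)
# Keep rows with at least four cells whose first cell is a real IP.
proxies = [row[:4] for row in parser.rows if len(row) > 3 and is_ip(row[0])]
print(proxies)
```

The header row and the malformed row are filtered out, and the second table is ignored because its class does not match.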

Definition of IPItem:

class IPItem:
    def __init__(self):
        self.ip = ''     # IP address
        self.port = ''   # port
        self.addr = ''   # location
        self.tpye = ''   # type: http or https
        self.speed = -1  # speed: ping time in ms (-1 until measured)

II. Pinging every IP address to measure its connection speed

import re
import pexpect

def test_ip_speed(ip_items):
    tmp_items = []
    for item in ip_items:
        # Send a single ping and give up after 5 seconds
        (command_output, exitstatus) = pexpect.run("ping -c1 %s" % item.ip, timeout=5,
                                                   withexitstatus=True, encoding='utf-8')
        if exitstatus == 0:
            print(command_output)
            # Pull the round-trip time (ms) out of the ping output
            m = re.search(r"time=([\d.]+)", command_output)
            if m:
                print('time=', m.group(1))
                item.speed = float(m.group(1))
                tmp_items.append(item)

    # Return only the proxies that answered the ping
    return tmp_items

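This relies on the pexpect module to call the system ping command. The author tested the code on Mac OS X 10.11.1; note that the -c flag is the macOS/Linux form of ping.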
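If pexpect is not available, the same idea can be sketched with the standard library's subprocess module. This is an alternative, not the author's code: the names parse_ping_time and ping_once are invented here, the sample output line is made up in the usual macOS/Linux ping format, and capture_output requires Python 3.7 or later.

```python
import re
import subprocess

# Matches the "time=12.3" field in a line of ping output.
TIME_RE = re.compile(r"time=([\d.]+)")


def parse_ping_time(output):
    """Return the round-trip time in ms found in ping output, or None."""
    match = TIME_RE.search(output)
    return float(match.group(1)) if match else None


def ping_once(ip, timeout=5):
    """Send a single ping (macOS/Linux `-c 1`); return ms or None on failure."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", ip],
            capture_output=True, text=True, timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    if result.returncode != 0:
        return None
    return parse_ping_time(result.stdout)


# Invented sample line, used here so the parsing can be checked offline.
sample = "64 bytes from 203.0.113.10: icmp_seq=0 ttl=54 time=23.456 ms"
print(parse_ping_time(sample))
```

Keep in mind that ping only measures whether the host is reachable and how quickly it answers; it does not confirm that the proxy port itself accepts connections.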

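III. Sorting from fastest to slowest and saving to a file

Saving to a file uses the pandas module, which handles it in a single line of code.

1. First convert ip_items into a pandas DataFrame;
2. Sort by the 'Speed' column with df.sort_values() (older pandas releases spelled this df.sort_index(by=...));
3. Write the result to an Excel file with to_excel().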

from pandas import DataFrame

def save_data(self):
    df = DataFrame({'IP': [item.ip for item in self.ip_items],
                    'Port': [item.port for item in self.ip_items],
                    'Addr': [item.addr for item in self.ip_items],
                    'Type': [item.tpye for item in self.ip_items],
                    'Speed': [item.speed for item in self.ip_items]
                    }, columns=['IP', 'Port', 'Addr', 'Type', 'Speed'])
    print(df[:10])
    # GetNowTime() and GetNowDate() are the author's helper functions (not shown)
    df['Time'] = GetNowTime()
    # Lowest ping time (fastest proxy) first
    df = df.sort_values(by='Speed')

    now_data = GetNowDate()
    file_name = self.dir_path + 'ip_proxy_' + now_data + '.xlsx'

    # Writing .xlsx needs an Excel engine such as openpyxl installed
    df.to_excel(file_name)
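For readers without pandas, the sort-then-save step can be approximated with the standard library alone. The sketch below writes CSV rather than Excel, so it is a substitute for the author's approach rather than a copy of it. The Item tuple stands in for IPItem, the sample rows are invented, and write_sorted_csv is a hypothetical helper name.

```python
import csv
import io
from collections import namedtuple

# Stand-in for IPItem, using the same field names as the article's class.
Item = namedtuple("Item", "ip port addr tpye speed")

# Invented sample data (documentation-range addresses).
items = [
    Item("203.0.113.10", "8080", "Beijing", "HTTP", 48.2),
    Item("203.0.113.11", "3128", "Shanghai", "HTTPS", 12.7),
    Item("203.0.113.12", "80", "Guangzhou", "HTTP", 30.1),
]


def write_sorted_csv(items, fileobj):
    """Write items to `fileobj` as CSV, lowest ping time first."""
    ordered = sorted(items, key=lambda item: item.speed)
    writer = csv.writer(fileobj)
    writer.writerow(["IP", "Port", "Addr", "Type", "Speed"])
    for item in ordered:
        writer.writerow([item.ip, item.port, item.addr, item.tpye, item.speed])
    return ordered


# An in-memory buffer keeps the example side-effect free; for a real file,
# use open("ip_proxy.csv", "w", newline="") instead.
buffer = io.StringIO()
ordered = write_sorted_csv(items, buffer)
print(buffer.getvalue())
```

Sorting ascending on the ping time puts the fastest proxy in the first data row, which matches the ordering the pandas version produces.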

The generated Excel file looks like this:

[Screenshot: the generated Excel file]
