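Python selenium: an automated web scraper

(Have a happy day, every day~ ---蟲瘾師)

Straight to the point: Python selenium drives a browser automatically to scrape data from web pages. This post covers clicking buttons, navigating between pages, typing into a search box, storing the valuable data from a page, and tagging each record with an auto-generated id in mongodb.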

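1. First, a quick introduction to Python selenium. It is an automated testing tool that controls a browser to operate on web pages. For scraping, it pairs seamlessly with BeautifulSoup, apart from some nasty verification pages on foreign sites. For image CAPTCHAs, I have written my own cracking code, with a success rate of about 85%.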

For details, ask in QQ group 607021567 (this is not an ad; the group shares plenty of Python resources, plus some big-data material [hadoop]).

2. BeautifulSoup needs no detailed introduction; here is the link:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

(the official BeautifulSoup documentation)

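3. About auto-generating ids in mongodb. Every document stored in mongodb already has a fixed id (the _id field), but that id is hard for humans to read, even though it is trivial for machines. So whenever I store data, I also assign my own new id to each record!

To use mongodb from Python, import the module with from pymongo import MongoClient, ASCENDING, DESCENDING. That module is what you rely on!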

Now for the program itself, going straight to examples (one step at a time):

Import the modules:

from selenium import webdriver
from bs4 import BeautifulSoup
import requests
from pymongo import MongoClient, ASCENDING, DESCENDING
import time
import re

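Every one of these modules has already been explained; re and requests were covered in earlier posts. They are all core pieces, and none can be left out!

First, a small example: automating a search on Taobao (source code).

Before that, here are selenium's locator methods: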

find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector
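
A note for readers on newer versions: the find_element_by_* helpers listed above belong to the Selenium 3 API and were removed in Selenium 4.3, where the same lookups go through find_element() with a locator strategy. The sketch below maps each old helper name to its strategy string (these strings are the values of the By constants in selenium.webdriver.common.by). The names find_legacy and FakeDriver are made up for illustration; FakeDriver only stands in for a real browser so the snippet runs without one.

```python
# Strategy strings accepted by Selenium 4's find_element(by, value).
# They equal the By.* constants (By.ID == "id", By.XPATH == "xpath", ...).
LEGACY_TO_STRATEGY = {
    "find_element_by_id": "id",
    "find_element_by_name": "name",
    "find_element_by_xpath": "xpath",
    "find_element_by_link_text": "link text",
    "find_element_by_partial_link_text": "partial link text",
    "find_element_by_tag_name": "tag name",
    "find_element_by_class_name": "class name",
    "find_element_by_css_selector": "css selector",
}


def find_legacy(driver, legacy_name, value):
    """Hypothetical helper: run an old-style lookup through find_element()."""
    return driver.find_element(LEGACY_TO_STRATEGY[legacy_name], value)


class FakeDriver:
    """Stand-in for a real WebDriver; it just echoes the lookup it received."""

    def find_element(self, by, value):
        return (by, value)


if __name__ == "__main__":
    # With a real driver this would return the matching WebElement instead.
    print(find_legacy(FakeDriver(), "find_element_by_xpath", '//input[@name="q"]'))
```

With a real browser, the equivalent direct call would be driver.find_element("xpath", '//input[@name="q"]'), or the same call with By.XPATH in place of the string.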

Source code:

from selenium import webdriver
from bs4 import BeautifulSoup
import requests
from pymongo import MongoClient, ASCENDING, DESCENDING
import time
import re

def TaoBao():
    driver = webdriver.Chrome()
    try:
        Taobaourl = 'https://www.taobao.com/'
        driver.get(Taobaourl)
        time.sleep(5)  # pause here, otherwise the program is likely to be detected as a spider
        text = 'Strong Man'  # text to type into the search box
        # find_element_by_xpath is the Selenium 3 API; send_keys() returns None,
        # so nothing can be chained onto it
        driver.find_element_by_xpath('//input[@class="search-combobox-input"]').send_keys(text)
        driver.find_element_by_xpath('//button[@class="btn-search tb-bg"]').click()
    except Exception as e:
        print(e)
    finally:
        driver.quit()  # always close the browser, even after an error

if __name__ == '__main__':
    TaoBao()

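To see it work, just copy the code and run it! I only used the xpath method, because it is the most dependable. The xpath strings (shown in orange in the original listing, unless I am color-blind) identify the elements being located, and you can find them in the page's HTML.

Next comes combining it with BeautifulSoup. So far we have only opened the page and have no source to parse, so we need the page_source attribute, written as <driver variable>.page_source. It will make your dream come true, you know?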

ht = driver.page_source
# print(ht)  # print it if you want to take a look
soup = BeautifulSoup(ht, 'html.parser')

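What follows is ordinary BeautifulSoup syntax. The previous post covers walking the data structure and extracting from it in detail!!!

Never mind! Here is the simplest possible locate-and-extract: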

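soup = BeautifulSoup(ht, 'html.parser')
a = soup.find('table', id="ctl00_ContentMain_SearchResultsGrid_grid")
if a:  # this check is required: if the page lacks the element, find() returns None and the next call would crash the program
    tr = a.find_all('tr', class_="SearchResultsRowOdd")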

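When filtering on the class attribute, the keyword argument must be written class_ (class is a reserved word in Python). Be sure to remember this!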
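
Here is a minimal sketch of the id lookup, the None check, and the class_ filter working together. The HTML string, the id "results", the class names "odd" and "even", and the function name extract_links are all invented for illustration; the string simply stands in for driver.page_source.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for driver.page_source.
html = """
<table id="results">
  <tr class="odd"><td>1</td><td><a href="/item/1">first</a></td></tr>
  <tr class="even"><td>2</td><td><a href="/item/2">second</a></td></tr>
</table>
"""


def extract_links(page_html):
    """Return the href of the first link in every odd/even row of the table."""
    soup = BeautifulSoup(page_html, "html.parser")
    table = soup.find("table", id="results")
    if table is None:  # the page may not contain the table at all
        return []
    links = []
    # class_ also accepts a list: a row matches if it carries any of the classes.
    for row in table.find_all("tr", class_=["odd", "even"]):
        a = row.find("a")
        if a is not None and a.has_attr("href"):
            links.append(a["href"])
    return links


print(extract_links(html))                    # ['/item/1', '/item/2']
print(extract_links("<p>no table here</p>"))  # []
```

Returning an empty list when the table is missing keeps the caller's loop simple, which is the same idea as the if check above.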

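Ha! Now for mongodb, where the details matter. First we need the module: from pymongo import MongoClient, ASCENDING, DESCENDING

mongodb's own query syntax still applies in Python, so we need to define a database handle that is global, plus a global variable for the connection to your machine.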

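if __name__ == '__main__':
    # names assigned at module level are already global, so the functions can read db and table
    table = 'mouser_product'                      # collection name
    mconn = MongoClient("mongodb://localhost")    # server address
    db = mconn.test                               # the database named "test"
    db.authenticate('test', 'test')               # username and password (pymongo 3.x API, removed in PyMongo 4)
    TaoBao()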

With these defined, we need our new id to track and label the data:

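def getNewsn():
    # add 1 to this table's counter, creating the counter document on first use (upsert)
    db.sn.find_and_modify({"_id": table}, update={"$inc": {'currentIdValue': 1}}, upsert=True)
    dic = db.sn.find({"_id": table}).limit(1)  # read the counter back from the same collection
    return dic[0].get("currentIdValue")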

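This function is generic, so you only need to remember the mongodb syntax in it! Because it returns a value, it has to live in a function body. Don't get hung up on how it works; a general understanding is enough. The real focus is the process of storing the data: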

count = db[table].find({'data': data}).count()  # look the data up in the database
if count <= 0:                                  # not stored yet
    ids = getNewsn()                            # ids is our own id: an incrementing id that starts at 1
    db[table].insert({"ids": ids, "data": data})

That stores our data directly in the mongodb database. As for why mongodb is so popular in big-data work: it is compact and fast!
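
For readers on PyMongo 4, where find_and_modify, Cursor.count and insert no longer exist, the counter and the insert-if-missing check can be sketched with find_one_and_update, count_documents and insert_one. This is only a sketch under assumptions: the function names next_sn and store_if_new, the classes FakeCollection and FakeDb, and the example.com URLs are invented here, and the fake classes are an in-memory stand-in so the snippet runs without a MongoDB server.

```python
def next_sn(db, table):
    """Return the next 1-based id for `table` with a single atomic update.

    By default find_one_and_update returns the counter document as it was
    BEFORE the increment, or None when the upsert has just created it.
    """
    before = db.sn.find_one_and_update(
        {"_id": table}, {"$inc": {"currentIdValue": 1}}, upsert=True
    )
    return (before["currentIdValue"] if before else 0) + 1


def store_if_new(db, table, url):
    """Insert {"sn", "url"} unless the url is already stored; return sn or None."""
    if db[table].count_documents({"url": url}, limit=1) > 0:
        return None
    sn = next_sn(db, table)
    db[table].insert_one({"sn": sn, "url": url})
    return sn


class FakeCollection:
    """In-memory stand-in implementing only the three calls used above."""

    def __init__(self):
        self.docs = []

    def find_one_and_update(self, flt, update, upsert=False):
        doc = next((d for d in self.docs if d["_id"] == flt["_id"]), None)
        before = dict(doc) if doc else None
        if doc is None:
            if not upsert:
                return None
            doc = {"_id": flt["_id"]}
            self.docs.append(doc)
        for field, step in update["$inc"].items():
            doc[field] = doc.get(field, 0) + step
        return before

    def count_documents(self, flt, limit=0):
        return sum(1 for d in self.docs
                   if all(d.get(k) == v for k, v in flt.items()))

    def insert_one(self, doc):
        self.docs.append(dict(doc))


class FakeDb:
    """Stand-in for a pymongo Database: db.name and db["name"] both work."""

    def __init__(self):
        self._collections = {}

    def __getitem__(self, name):
        return self._collections.setdefault(name, FakeCollection())

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        return self[name]


db = FakeDb()
print(store_if_new(db, "mouser_product", "https://example.com/a"))  # 1
print(store_if_new(db, "mouser_product", "https://example.com/b"))  # 2
print(store_if_new(db, "mouser_product", "https://example.com/a"))  # None
```

Against a real server, db would instead be MongoClient("mongodb://localhost").test. Doing the increment and the read in one call means two scrapers running at once cannot be handed the same id.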

Finally, a complete example:

# Written against the Selenium 3 and pymongo 3 APIs
# (find_element_by_xpath, find_and_modify, authenticate, count, insert).
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
from pymongo import MongoClient, ASCENDING, DESCENDING
import time
import re

def parser():
    try:
        with open('sitemap.txt', 'r') as f:
            for line in f:
                sorturl = line.strip()
                driver = webdriver.Firefox()
                try:
                    driver.get(sorturl)
                    time.sleep(50)
                    ht = driver.page_source
                    #pageurl(ht)
                    soup = BeautifulSoup(ht, 'html.parser')
                    a = soup.find('a', class_="first-last")
                    if a:
                        pagenum = int(a.get_text().strip())
                        print(pagenum)
                        for page in range(1, pagenum):
                            element = driver.find_element_by_xpath('//a[@id="ctl00_ContentMain_PagerTop_%s"]' % page)
                            element.click()
                            html = driver.page_source  # page_source belongs to the driver, not the element
                            pageurl(html)
                            time.sleep(50)
                finally:
                    driver.quit()  # close this browser once all of its pages are done
    except Exception as e:
        print(e)

def pageurl(ht):
    try:
        soup = BeautifulSoup(ht, 'html.parser')
        a = soup.find('table', id="ctl00_ContentMain_SearchResultsGrid_grid")
        if a:
            # odd rows first, then even rows; both are handled identically
            for row_class in ("SearchResultsRowOdd", "SearchResultsRowEven"):
                for row in a.find_all('tr', class_=row_class):
                    td = row.find_all('td')
                    if len(td) > 2:
                        url = td[2].find('a')
                        if url:
                            # '網址' ("site URL") is the author's placeholder for the site's base URL
                            producturl = '網址' + url['href']
                            print(producturl)
                            count = db[table].find({"url": producturl}).count()
                            if count <= 0:
                                sn = getNewsn()
                                db[table].insert({"sn": sn, "url": producturl})
                                print(str(sn) + ' inserted successfully')
                                time.sleep(3)
                            else:
                                print('exists url')
    except Exception as e:
        print(e)

def getNewsn():
    db.sn.find_and_modify({"_id": table}, update={"$inc": {'currentIdValue': 1}}, upsert=True)
    dic = db.sn.find({"_id": table}).limit(1)
    return dic[0].get("currentIdValue")

if __name__ == '__main__':
    # module-level names are global, so pageurl() and getNewsn() can read db and table
    table = 'mous_product'
    mconn = MongoClient("mongodb://localhost")
    db = mconn.test
    db.authenticate('test', 'test')
    parser()

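This code came out of my run-in with a tedious CAPTCHA page on a foreign site, which left me truly speechless! The cracking method is still a work in progress. This is the complete source code, uncut and written entirely by hand!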

