Scraping company data from 智塔鍊庫 with Python and selenium_20200526_

2023-08-05 12:27:30

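This post scrapes company data from 智塔鍊庫 using Python's selenium.

The site shows a common problem: the raw page source is script code rather than the data itself, so the information cannot be read from it directly. The workaround is to drive a real browser so that it renders the code into usable content, and then read that content with xpath through html.etree.

Most importantly, pause for a short time after opening the page and before reading the page source; otherwise the data read back may be empty. For example: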

self.browser.get(self.url)  # load the page

time.sleep(2)  # wait 2 seconds; this statement is essential
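
A fixed sleep is either longer than needed or too short on a slow connection. As a rough alternative, the sketch below polls until the expected content shows up or a timeout passes. The helper `wait_for_content` and its parameters are hypothetical names, not part of selenium, and a short list of strings stands in for `browser.page_source` so the example runs without a browser. selenium also ships its own explicit-wait helper, `WebDriverWait`, which is probably the better choice in real code.

```python
import time


def wait_for_content(read_source, is_ready, timeout=10.0, interval=0.5):
    """Poll read_source() until is_ready(text) is true or the timeout expires.

    read_source and is_ready are hypothetical callables supplied by the caller,
    for example lambda: browser.page_source and a check for a marker string.
    Returns the last text read, which may still be incomplete after a timeout.
    """
    deadline = time.monotonic() + timeout
    text = read_source()
    while not is_ready(text) and time.monotonic() < deadline:
        time.sleep(interval)
        text = read_source()
    return text


# Stand-in for a page that only finishes rendering on the third read.
snapshots = iter([
    '<html><body></body></html>',
    '<html><body></body></html>',
    '<div id="enterpriseList"><ul><a><li class="enterprise_One">Demo Co.</li></a></ul></div>',
])

rendered = wait_for_content(
    read_source=lambda: next(snapshots),
    is_ready=lambda text: 'enterprise_One' in text,
    timeout=5.0,
    interval=0.01,
)
print('enterprise_One' in rendered)
```

With a real driver, the call would look something like `wait_for_content(lambda: browser.page_source, lambda t: 'enterprise_One' in t)`, assuming that class name appears only once the list has rendered.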

from selenium import webdriver  # selenium browser automation module
from lxml import html  # lxml HTML module, used for xpath parsing
import time  # time module
etree = html.etree  # alias for lxml's etree module
from selenium.webdriver.chrome.options import Options
import pandas as pd
import os

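class BlockData():
    """Scraper for 連塔智庫"""
    def __init__(self):
        os.chdir(r'F:\python工作環境\python\代碼運作檔案夾\學習筆記\網絡爬蟲')  # working directory; the raw string keeps the backslashes literal
        ch_op = Options()  # options object used to start Chrome in headless mode
        ch_op.add_argument('--headless')  # run Chrome without a visible window
        ch_op.add_argument('--disable-gpu')
        ch_op.add_argument('blink-settings=imagesEnabled=false')  # skip image loading for speed
        self.browser = webdriver.Chrome(options=ch_op)  # create the Chrome driver (the older chrome_options= keyword is deprecated)
        self.browser.implicitly_wait(10)  # implicit wait: when looking up an element that has not loaded yet, wait up to 10 seconds
        self.info_s = []  # collected company records
        self.error = 0  # number of rows that failed to parse
        self.base_url = 'http://www.blockdata.club/site/company?page='  # base URL; the page number is appended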

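    def get_html(self):
        """Fetch the source of a single page"""
        self.browser.get(self.url)  # load the page
        time.sleep(2)  # wait 2 seconds; essential, and it must sit between browser.get(self.url) and reading browser.page_source
        # otherwise a slow page may not have rendered yet and the read returns empty data
        text_s = self.browser.page_source  # rendered page source as text
        # print(text_s)  # at this point it is still raw markup; the fields have not been extracted yet
        self.text_s = text_s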

    def get_info(self):
        """Parse the page source and extract the data"""
        # use xpath to turn the markup into strings that hold the information
        tree = etree.HTML(self.text_s)  # parse the markup into an element tree
        uls = tree.xpath('//div[@id="enterpriseList"]/ul')  # first pass: select each company's ul

        # collect every company on the page
        for ul in uls:
            try:
                company = ul.xpath('a/li[@class="enterprise_One"]//text()')[0]  # company name; a/li//text() returns the text inside the li directly
                location = ul.xpath('a/li[@class="enterprise_Two"]//text()')[0]  # province
                commander = ul.xpath('a/li[@class="enterprise_Ser"]//text()')[0]  # manager
                money = ul.xpath('a/li[@class="enterprise_Four"]//text()')[0]  # registered capital
                date = ul.xpath('a/li[@class="enterprise_Five"]//text()')[0]  # founding date
                company_type = ul.xpath('a/li[@class="enterprise_Six"]//text()')[0]  # company type (renamed so it does not shadow the built-in type)
                self.info_s.append([company, location, commander, money, date, company_type])
            except IndexError:  # a field is missing, so [0] failed on an empty result
                self.error += 1
                continue

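    def spec_pages(self, pages):
        """Scrape the given number of pages, starting from page 1"""
        for page in range(1, pages + 1):
            try:
                self.url = self.base_url + str(page)
                self.get_html()
                self.get_info()
                print('Completed ' + str(round(page / pages * 100)) + '%')  # round after scaling to avoid output such as 7.000000000000001
            except Exception:  # skip a failed page, but still let Ctrl+C stop the run
                continue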

    def to_excel(self):
        """Save the scraped data to an Excel file"""
        data=pd.DataFrame(self.info_s)
        data.to_excel('data.xlsx')

if __name__=='__main__':
    block_data=BlockData()
    block_data.spec_pages(100)
    block_data.to_excel()
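
The XPath expressions in get_info can be checked without launching a browser by running them against a small piece of sample markup. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree in place of lxml so that it runs with no third-party packages, which means it supports only a limited XPath subset and needs well-formed markup. The `SAMPLE` string and the `first_text` helper are made up for this example, and the real page structure may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample that mimics the structure the XPath
# expressions in get_info expect. The second company lacks a province.
SAMPLE = """
<html><body>
<div id="enterpriseList">
  <ul><a href="#">
    <li class="enterprise_One">Demo Co.</li>
    <li class="enterprise_Two">Beijing</li>
  </a></ul>
  <ul><a href="#">
    <li class="enterprise_One">Example Ltd.</li>
  </a></ul>
</div>
</body></html>
""".strip()


def first_text(node, path):
    """Return the stripped text of the first match for path, or None if absent."""
    match = node.find(path)
    if match is None:
        return None
    return ''.join(match.itertext()).strip()


root = ET.fromstring(SAMPLE)
rows = []
for ul in root.findall(".//div[@id='enterpriseList']/ul"):
    rows.append([
        first_text(ul, "a/li[@class='enterprise_One']"),  # company name
        first_text(ul, "a/li[@class='enterprise_Two']"),  # province
    ])

print(rows)
```

Unlike get_info, which skips a whole row when any field is missing, this variant keeps the row and records the missing field as None. Which behavior is preferable depends on how the data will be used afterwards.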
