Python爬linkedin——規模在1001-5000的瑞士公司(非官方API)

1、說明

使用Python爬linkedin網站，爬公司規模在1001-5000的瑞士公司資訊，隻爬公司名、公司規模、所在行業、所在地這四個資訊，且不使用linkedin官方API。

2、步驟說明

1>、模拟登入

2>、爬公司資訊

3>、擷取下一頁的URL

3、代碼片段

1>、模拟登入

從登入頁面填寫form表單，該表單中有一些JS驗證，在JS開啟時，會随着表單一起送出。不過不填也沒事，畢竟不是每個人都會開啟JS，故跳過。

這裡使用BeautifulSoup子產品來提取html頁面中的一些資訊。

def login(s):
    #打開頁面
    r = s.get('https://www.linkedin.com/uas/login')
    
    #html處理
    soup = BeautifulSoup(r.text, "lxml")
    soup = soup.find(id="login")      #查找表單

    #提取表單資訊
    loginCsrfParam = soup.find('input', id = 'loginCsrfParam-login')['value']
    csrfToken = soup.find('input', id = 'csrfToken-login')['value']
    sourceAlias = soup.find('input', id = 'sourceAlias-login')['value']
    isJsEnabled = soup.find('input',attrs={"name" :'isJsEnabled'})['value']
    source_app = soup.find('input', attrs={"name" :'source_app'})['value']
    tryCount = soup.find('input', id = 'tryCount')['value']
    clickedSuggestion = soup.find('input', id = 'clickedSuggestion')['value']
    signin = soup.find('input', attrs={"name" :'signin'})['value']
    session_redirect = soup.find('input', attrs={"name" :'session_redirect'})['value']
    trk = soup.find('input', attrs={"name" :'trk'})['value']
    fromEmail = soup.find('input', attrs={"name" :'fromEmail'})['value']

    #填充表單
    payload = {
        'isJsEnabled':isJsEnabled,
        'source_app':source_app,
        'tryCount':tryCount,
        'clickedSuggestion':clickedSuggestion,
        'session_key':'**********',
        'session_password':'*********',
        'signin':signin,
        'session_redirect':session_redirect,
        'trk':trk,
        'loginCsrfParam':loginCsrfParam,
        'fromEmail':fromEmail,
        'csrfToken':csrfToken,
        'sourceAlias':sourceAlias
    }

    #送出表單
    s.post('https://www.linkedin.com/uas/login-submit', data=payload)
    return s

2>、爬公司資訊

linkedin将公司内容放在注釋裡面，囧！格式類似于下面這樣：

<code id="voltron_srp_main-content" style="display:none;">
     <!--{"content":{"lix_header_lowercase":"control","lix_instant_co......."type":"companies"}},"status":"ok"}-->
</code>

這裡使用正則比對提取内容，然後使用json子產品格式化提取的資訊，最後提取資訊。

def getCompanins(s, start_url):
    r= s.get(start_url)

    #使用正則比對提取内容
    html = r.text.encode("utf-8")
    code = re.search(r'<code id="voltron_srp_main-content" style="display:none;"><!--.+--></code>', html).group()
    code = code.replace(r'<code id="voltron_srp_main-content" style="display:none;"><!--', '')
    code = code.replace(r'--></code>', '')

    #使用json子產品格式化提取的資訊
    code_json = json.loads(code)

    #提取資訊

    company_json = code_json["content"]["page"]["voltron_unified_search_json"]["search"]["results"]    
    company_list = []    
    company_list = []
    for company in company_json:
        company = company["company"]
        name = company["fmt_canonicalName"]
        fmt_industry = company["fmt_industry"]
        fmt_size= company["fmt_size"]
        fmt_location= company["fmt_location"]
        company_list.append("%s\t%s\t%s\t%s\n" % (name, fmt_industry, fmt_size, fmt_location))
    return company_list

3>、擷取下一頁的URL

擷取下一頁的連結，該連結還是在注釋裡面，原理同上步驟2>進行提取。

需要說明的是，目前頁是最後一頁時，是沒有nextpage的，是以這裡需要進行判斷一下，如沒有nextpage，則傳回NULL作為傳回标記。

def getNextPageURL(s ,start_url):
    r= s.get(start_url)

    html = r.text.encode("utf-8")
    code = re.search(r'<code id="voltron_srp_main-content" style="display:none;"><!--.+--></code>', html).group()
    code = code.replace(r'<code id="voltron_srp_main-content" style="display:none;"><!--', '')
    code = code.replace(r'--></code>', '')

    code_json = json.loads(code)
    resultPagination = code_json["content"]["page"]["voltron_unified_search_json"]["search"]["baseData"]["resultPagination"]
    
    #判斷是否有下一頁
    if "nextPage" in resultPagination:
        nextPageURL = "http://www.linkedin.com/" + resultPagination["nextPage"]["pageURL"]
    else:
        nextPageURL = "NULL"

    return nextPageURL

4、完整代碼

這裡是将重複部分合并後的代碼。

import requests
import re
import json
from bs4 import BeautifulSoup, Comment

def login(s):
    r = s.get('https://www.linkedin.com/uas/login')
    soup = BeautifulSoup(r.text, "lxml")
    soup = soup.find(id="login")
    loginCsrfParam = soup.find('input', id = 'loginCsrfParam-login')['value']
    csrfToken = soup.find('input', id = 'csrfToken-login')['value']
    sourceAlias = soup.find('input', id = 'sourceAlias-login')['value']
    isJsEnabled = soup.find('input',attrs={"name" :'isJsEnabled'})['value']
    source_app = soup.find('input', attrs={"name" :'source_app'})['value']
    tryCount = soup.find('input', id = 'tryCount')['value']
    clickedSuggestion = soup.find('input', id = 'clickedSuggestion')['value']
    signin = soup.find('input', attrs={"name" :'signin'})['value']
    session_redirect = soup.find('input', attrs={"name" :'session_redirect'})['value']
    trk = soup.find('input', attrs={"name" :'trk'})['value']
    fromEmail = soup.find('input', attrs={"name" :'fromEmail'})['value']

    payload = {
        'isJsEnabled':isJsEnabled,
        'source_app':source_app,
        'tryCount':tryCount,
        'clickedSuggestion':clickedSuggestion,
        'session_key':'*********',
        'session_password':'*********',
        'signin':signin,
        'session_redirect':session_redirect,
        'trk':trk,
        'loginCsrfParam':loginCsrfParam,
        'fromEmail':fromEmail,
        'csrfToken':csrfToken,
        'sourceAlias':sourceAlias
    }

    s.post('https://www.linkedin.com/uas/login-submit', data=payload)
    return s

def getCompanins(s, start_url):
    r= s.get(start_url)

    html = r.text.encode("utf-8")
    code = re.search(r'<code id="voltron_srp_main-content" style="display:none;"><!--.+--></code>', html).group()
    code = code.replace(r'<code id="voltron_srp_main-content" style="display:none;"><!--', '')
    code = code.replace(r'--></code>', '')

    code_json = json.loads(code)
    company_json = code_json["content"]["page"]["voltron_unified_search_json"]["search"]["results"]

    company_list = []
    for company in company_json:
        company = company["company"]
        name = company["fmt_canonicalName"]
        fmt_industry = company["fmt_industry"]
        fmt_size= company["fmt_size"]
        fmt_location= company["fmt_location"]
        company_list.append("%s\t%s\t%s\t%s\n" % (name, fmt_industry, fmt_size, fmt_location))
    return company_list

def getCompanins(code_json):
    company_json = code_json["content"]["page"]["voltron_unified_search_json"]["search"]["results"]

    company_list = []
    for company in company_json:
        company = company["company"]
        name = company["fmt_canonicalName"]
        fmt_industry = company["fmt_industry"]
        fmt_size= company["fmt_size"]
        fmt_location= company["fmt_location"]
        company_list.append("%s\t%s\t%s\t%s\n" % (name, fmt_industry, fmt_size, fmt_location))
    return company_list

def getNextPageURL(s ,start_url):
    r= s.get(start_url)

    html = r.text.encode("utf-8")
    code = re.search(r'<code id="voltron_srp_main-content" style="display:none;"><!--.+--></code>', html).group()
    code = code.replace(r'<code id="voltron_srp_main-content" style="display:none;"><!--', '')
    code = code.replace(r'--></code>', '')

    code_json = json.loads(code)
    resultPagination = code_json["content"]["page"]["voltron_unified_search_json"]["search"]["baseData"]["resultPagination"]
    if "nextPage" in resultPagination:
        nextPageURL = "http://www.linkedin.com/" + resultPagination["nextPage"]["pageURL"]
    else:
        nextPageURL = "NULL"

    return nextPageURL

def getNextPageURL(code_json):
    resultPagination = code_json["content"]["page"]["voltron_unified_search_json"]["search"]["baseData"]["resultPagination"]
    if "nextPage" in resultPagination:
        nextPageURL = "http://www.linkedin.com/" + resultPagination["nextPage"]["pageURL"]
    else:
        nextPageURL = "NULL"

    return nextPageURL

def search(s ,start_url):
    with open("result", "wb") as of:
        while True:
            if start_url == "NULL":
                break
            r= s.get(start_url)
            html = r.text.encode("utf-8")
            code = re.search(r'<code id="voltron_srp_main-content" style="display:none;"><!--.+--></code>', html).group()
            code = code.replace(r'<code id="voltron_srp_main-content" style="display:none;"><!--', '')
            code = code.replace(r'--></code>', '')
            code_json = json.loads(code)
            start_url = getNextPageURL(code_json)
            company_list = getCompanins(code_json)
            for line in company_list:
                of.write(line)

if __name__ == '__main__':
    s = requests.session()
    s = login(s)
    start_url = r"http://www.linkedin.com/vsearch/c?type=companies&keywords=*&orig=FCTD&rsid=4625091521449736984098&pageKey=oz-winner&trkInfo=tarId%3A1449736114717&search=Search&f_CCR=gb%3A4634&openFacets=N,CCR,JO,CS&f_CS=G"
    search(s ,start_url)

5、結果

name    industry    size    location
Dyson	Electrical/Electronic Manufacturing	1001-5000 employees	Swindon, United Kingdom
National Trust	Nonprofit Organization Management	1001-5000 employees	Swindon, United Kingdom
STFC	Research	1001-5000 employees	Swindon, United Kingdom
Honda of the UK Manufacturing Ltd.	Automotive	1001-5000 employees	Swindon, United Kingdom
NERC	Research	1001-5000 employees	Swindon, United Kingdom
Swindon Borough Council	Government Administration	1001-5000 employees	Swindon, United Kingdom
Smiths News	Wholesale	1001-5000 employees	Swindon, United Kingdom
Defence Academy of the UK	Defense & Space	1001-5000 employees	Swindon, United Kingdom
ARI UK	Automotive	1001-5000 employees	Swindon, United Kingdom
Aster Group UK	Nonprofit Organization Management	1001-5000 employees	Swindon, United Kingdom
RWE Systems UK	Utilities	1001-5000 employees	Swindon, United Kingdom
Great Western Ambulance Service NHS Trust	Hospital & Health Care	1001-5000 employees	Swindon, United Kingdom
EOH (Europe)	Information Technology and Services	1001-5000 employees	Swindon, United Kingdom
Westinghouse Rail Systems Australia	Computer Software	1001-5000 employees	Swindon, United Kingdom
St Johns Marlborough	Education Management	1001-5000 employees	Swindon, United Kingdom

Python爬linkedin——規模在1001-5000的瑞士公司(非官方API)

1、說明

2、步驟說明

3、代碼片段

4、完整代碼

5、結果

繼續閱讀

無法解析的外部符号 wmain，該符号在函數 "void cdecl mainCRTStartupHelper(struct HINSTANCE *,unsigned short con......

TestLink導出用例轉換工具(XML2Excel)

YAML簡介和PyYAML安全操作YAML支援的類型YAML的優點：yaml的基本文法python操作

Small tricks

libsvm for python 安裝

學習軟體測試基礎測試第七天

Zeppelin 配置通路 REST APIApache Zeppelin Configuration REST API

【Torch】最簡潔logging使用指南

27. Remove Element(清單)題目代碼

sort()函數到底是怎樣進行數字排序的

Cloud Studio初體驗

使用 ctypes 進行 Python 和 C 的混合程式設計

【python】【資料處理】畫多元資料分布圖

【python】netconf協定對接管理裝置

「Python 網絡自動化」NETCONF —— Python 使用 NETCONF 管理配置 H3C 網絡裝置

在python中建立excel并寫入