
Python Web Scraping in Practice: Scraping Job Listings from Boss Zhipin

Goal: scrape job postings from Boss Zhipin, store them in a database, and finally visualize the results.


0 Environment setup

MacBook Air (13-inch, 2017)

CPU:1.8 GHz Intel Core i5

RAM:8 GB 1600 MHz DDR3

IDE: Anaconda 3.6 | Jupyter Notebook

Python version: Python 3.6.5 :: Anaconda, Inc.

1 Installing Scrapy

The full installation process is covered in the reference link; below I only note where my experience differed.

pip install scrapy

I hit an error: gcc could not be invoked. Fix: macOS automatically shows a prompt offering to install gcc; just click "Install".

The install succeeded, but along the way the terminal printed "distributed 1.21.8 requires msgpack, which is not installed."

Fix:

conda install -c anaconda msgpack-python

pip install msgpack

2 Creating the project

scrapy startproject www_zhipin_com

Run scrapy -h to see the available commands.


Project source file layout (screenshot).

The tree command is handy — Windows cmd ships with it, but Python has no built-in equivalent; you could write your own for fun based on code found online.
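A minimal tree clone really is a short exercise. Here is a sketch using only the standard library (the output style is simplified compared to the real command):

```python
import os

def tree(root, prefix=""):
    """Print a simple directory tree, similar to the Windows `tree` command."""
    entries = sorted(os.listdir(root))
    for i, name in enumerate(entries):
        last = (i == len(entries) - 1)
        print(prefix + ("└── " if last else "├── ") + name)
        path = os.path.join(root, name)
        if os.path.isdir(path):
            # Continue the vertical guide only for non-last entries
            tree(path, prefix + ("    " if last else "│   "))

tree(".")  # print the current project directory
```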

3 Defining the item to scrape

Essentially the same as the original author's source code.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class WwwZhipinComItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pid = scrapy.Field()
    positionName = scrapy.Field()
    positionLables = scrapy.Field()
    city = scrapy.Field()
    experience = scrapy.Field()
    educational = scrapy.Field()
    salary = scrapy.Field()
    company = scrapy.Field()
    industryField = scrapy.Field()
    financeStage = scrapy.Field()
    companySize = scrapy.Field()
    time = scrapy.Field()
    updated_at = scrapy.Field()

4 Analyzing the page

The site has since been redesigned, with a small change to how the publish time is rendered.

(Screenshot: the job listing page)

The HTML structure is as follows (screenshot).

5 Spider code

Some of this step I didn't fully understand; I pushed on and wrote it anyway, noting the unclear parts for later.

5.1 About request headers

For example, in my own browser I couldn't find some of the headers the author used, such as x-devtools-emulate-network-conditions-client-id and postman-token. I should study request headers properly; for now my approach is to copy the author's headers, replace the values my browser does report, and keep the author's originals (such as the x-devtools one) for those it doesn't.
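For what it's worth, headers like postman-token and x-devtools-emulate-network-conditions-client-id are artifacts of the tool the request was copied from (Postman, Chrome DevTools) and can safely be dropped. A sketch of filtering a copied header dict down to common request headers (the helper name and placeholder values are my own, not from the original):

```python
# Well-known request headers worth keeping; everything else in a dict
# copied from DevTools/Postman is usually tool-specific noise.
COMMON = {
    'accept', 'accept-encoding', 'accept-language', 'content-type',
    'cookie', 'origin', 'referer', 'user-agent',
}

def clean_headers(headers):
    """Drop nonstandard keys such as postman-token or x-devtools-*."""
    return {k: v for k, v in headers.items() if k.lower() in COMMON}

copied = {
    'user-agent': 'Mozilla/5.0 ...',                                # placeholder
    'postman-token': 'abc123',                                      # Postman artifact
    'x-devtools-emulate-network-conditions-client-id': 'xyz',       # DevTools artifact
}
print(clean_headers(copied))  # only 'user-agent' survives
```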

5.2 About extract_first() and extract()

The difference between extract_first() and extract(): .extract() returns every match as a list; .extract_first() returns only the first match, as a string.

Selectors pick data out of the page with CSS expressions (CSS being the more common choice): response.selector.css('title::text') selects the text content of the title element. Because selector.css is used so often, a css shortcut is defined on the response itself, so the above can also be written as response.css('title::text').

Running the spider generates an item.json file in the project directory containing the scraped data:

scrapy crawl zhipin -o item.json

After debugging the last error, step 5 finally ran end to end — here's a screenshot:

(Screenshot: crawling Python positions on Boss Zhipin)

The JSON file looks a bit odd — no Chinese characters, only escape sequences; step 6 should fix this:

{"pid": "23056497", "positionName": "", "salary": "8k-9k", "city": "\u5317\u4eac", "experience": "\u4e0d\u9650", "educational": "\u672c\u79d1", "company": "\u4eca\u65e5\u5934\u6761", "positionLables": [], "time": "\u53d1\u5e03\u4e8e07\u670812\u65e5", "updated_at": "2018-07-17 00:04:05"},

{"pid": "23066797", "positionName": "", "salary": "18k-25k", "city": "\u5317\u4eac", "experience": "1-3\u5e74", "educational": "\u672c\u79d1", "company": "\u5929\u4e0b\u79c0", "positionLables": [], "time": "\u53d1\u5e03\u4e8e07\u670813\u65e5", "updated_at": "2018-07-17 00:04:05"},
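Those \uXXXX sequences are just JSON's ASCII-safe encoding of the Chinese text; they decode back losslessly. Re-serializing with ensure_ascii=False keeps the characters readable (a stdlib sketch with a made-up record):

```python
import json

# One record as it might appear in item.json (non-ASCII escaped as \uXXXX)
raw = '{"city": "\\u5317\\u4eac", "salary": "18k-25k"}'

record = json.loads(raw)
print(record['city'])  # 北京 — the escapes decode back to real characters

# Re-serialize without escaping non-ASCII so the file stays human-readable
print(json.dumps(record, ensure_ascii=False))
# {"city": "北京", "salary": "18k-25k"}
```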

Because of the site redesign, the publish time (time) handling in step 5 needed a tweak; everything else worked unchanged. Here is my source:

# 2018-07-17
# Author limingxuan
# [email protected]
# blog: https://www.jianshu.com/p/a5907362ba72

import scrapy
import time
from www_zhipin_com.items import WwwZhipinComItem

class ZhipinSpider(scrapy.Spider):
    name = 'zhipin'
    allowed_domains = ['www.zhipin.com']
    start_urls = ['http://www.zhipin.com/']
    positionUrl = 'https://www.zhipin.com/job_detail/?query=python&scity=101010100'
    curPage = 1

    # My browser didn't show some of the fields in the original source, e.g.
    #   x-devtools-emulate-network-conditions-client-id
    #   upgrade-insecure-requests
    #   dnt
    #   cache-control
    #   postman-token
    # so I left them out and filled in only what my browser reported;
    # so far it seems to run fine without them.
    headers = {
        'accept': "application/json, text/javascript, */*; q=0.01",
        'accept-encoding': "gzip, deflate, br",
        'accept-language': "zh-CN,zh;q=0.9,en;q=0.8",
        'content-type': "application/x-www-form-urlencoded; charset=UTF-8",
        'cookie': "JSESSIONID=""; __c=1530137184; sid=sem_pz_bdpc_dasou_title; __g=sem_pz_bdpc_dasou_title; __l=r=https%3A%2F%2Fwww.zhipin.com%2Fgongsi%2F5189f3fadb73e42f1HN40t8~.html&l=%2Fwww.zhipin.com%2Fgongsir%2F5189f3fadb73e42f1HN40t8~.html%3Fka%3Dcompany-jobs&g=%2Fwww.zhipin.com%2F%3Fsid%3Dsem_pz_bdpc_dasou_title; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1531150234,1531231870,1531573701,1531741316; lastCity=101010100; toUrl=https%3A%2F%2Fwww.zhipin.com%2Fjob_detail%2F%3Fquery%3Dpython%26scity%3D101010100; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1531743361; __a=26651524.1530136298.1530136298.1530137184.286.2.285.199",
        'origin': "https://www.zhipin.com",
        'referer': "https://www.zhipin.com/job_detail/?query=python&scity=101010100",
        'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
    }

    def start_requests(self):
        return [self.next_request()]

    def parse(self, response):
        print("request -> " + response.url)
        job_list = response.css('div.job-list > ul > li')
        for job in job_list:
            item = WwwZhipinComItem()
            job_primary = job.css('div.job-primary')
            item['pid'] = job.css(
                'div.info-primary > h3 > a::attr(data-jobid)').extract_first().strip()
            # job-title differs from the original source here — a result of the redesign
            item['positionName'] = job_primary.css(
                'div.info-primary > h3 > a > div.job-title::text').extract_first().strip()
            item['salary'] = job_primary.css(
                'div.info-primary > h3 > a > span::text').extract_first().strip()
            # .extract() returns every match as a list
            # .extract_first() returns the first match as a string
            info_primary = job_primary.css(
                'div.info-primary > p::text').extract()
            item['city'] = info_primary[0].strip()
            item['experience'] = info_primary[1].strip()
            item['educational'] = info_primary[2].strip()
            item['company'] = job_primary.css(
                'div.info-company > div.company-text > h3 > a::text').extract_first().strip()
            company_infos = job_primary.css(
                'div.info-company > div.company-text > p::text').extract()
            if len(company_infos) == 3:
                item['industryField'] = company_infos[0].strip()
                item['financeStage'] = company_infos[1].strip()