Scraping Tencent recruitment data from https://careers.tencent.com/search.html and saving it as a JSON file
Analyzing the page
Right-click and view the page source: the main body of the page is loaded dynamically by JavaScript, so we need to capture the underlying network requests.
Open the Network tab and look through the requests: the first page of data all comes from this URL: https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1601278633129&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn
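To confirm, you can request that URL directly. A quick sketch using the requests library (an assumption on my part: the endpoint answers a plain GET, and timestamp looks like a cache-buster, so a stale value should still work):

import requests

url = ('https://careers.tencent.com/tencentcareer/api/post/Query'
       '?timestamp=1601278633129&countryId=&cityId=&bgIds=&productId='
       '&categoryId=&parentCategoryId=&attrId=&keyword='
       '&pageIndex=1&pageSize=10&language=zh-cn&area=cn')
data = requests.get(url).json()
# Title of the first post on page 1
print(data['Data']['Posts'][0]['RecruitPostName'])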

Once the data source is found, the rest is easy. Watch how the URL changes as you page through: the only parts that differ are timestamp (the current time in milliseconds) and pageIndex (the page number), so we can build every page's URL with format and a for loop, as sketched below.
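A minimal sketch of that idea (the same template is used in the spider later on):

import time

template = ('https://careers.tencent.com/tencentcareer/api/post/Query'
            '?timestamp={}&countryId=&cityId=&bgIds=&productId=&categoryId='
            '&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10'
            '&language=zh-cn&area=cn')
# One URL per page: fill in the current millisecond timestamp and the page index
urls = [template.format(int(time.time() * 1000), page) for page in range(1, 6)]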
Creating the Scrapy project
Open PyCharm's Terminal and run:
scrapy startproject tencent
Change into the project directory:
cd tencent
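startproject generates the standard Scrapy layout:

tencent/
    scrapy.cfg
    tencent/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py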
Writing the spider
Create the spider file:
scrapy genspider recruit tencent.com
Writing the spider code
Open the newly created recruit.py.
Modify start_urls, then write the parse method:
def parse(self, response):
    """
    Parse the JSON response.
    :param response: the response data
    """
    result_data = json.loads(response.text)
    result = result_data['Data']['Posts']
    for temp in result:
        # Job title
        name = temp['RecruitPostName'].strip()
        # Location
        address = temp['LocationName'].strip()
        # Job responsibilities
        responsibility = temp['Responsibility'].strip()
        item = TencentItem()
        item['name'] = name
        item['duty'] = responsibility
        item['address'] = address
        yield item
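For reference, each element of Data.Posts is a dict shaped roughly like this (field values here are made up for illustration; only the three fields we actually use are shown):

{
    "Data": {
        "Posts": [
            {
                "RecruitPostName": "后台开发工程师",
                "LocationName": "深圳",
                "Responsibility": "负责后台服务的设计与开发"
            }
        ]
    }
}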
The complete code is as follows:
import json
import time

import scrapy

from ..items import TencentItem


class RecruitSpider(scrapy.Spider):
    name = 'recruit'
    allowed_domains = ['tencent.com']
    # One URL per page; only timestamp and pageIndex change between pages
    start_urls = [
        'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp={}&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'.format(
            int(time.time() * 1000), index
        )
        for index in range(1, 638)
    ]

    def parse(self, response):
        """
        Parse the JSON response.
        :param response: the response data
        """
        result_data = json.loads(response.text)
        result = result_data['Data']['Posts']
        for temp in result:
            # Job title
            name = temp['RecruitPostName'].strip()
            # Location
            address = temp['LocationName'].strip()
            # Job responsibilities
            responsibility = temp['Responsibility'].strip()
            item = TencentItem()
            item['name'] = name
            item['duty'] = responsibility
            item['address'] = address
            yield item
Writing the items
import scrapy


class TencentItem(scrapy.Item):
    name = scrapy.Field()
    duty = scrapy.Field()
    address = scrapy.Field()
Writing the pipeline
import json


class TencentPipeline:
    def open_spider(self, spider):
        # Open the output file once when the spider starts
        self.fp = open('tencent.json', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese readable instead of \uXXXX escapes
        json.dump(dict(item), self.fp, ensure_ascii=False)
        self.fp.write('\n')  # one JSON object per line
        return item

    def close_spider(self, spider):
        self.fp.close()
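Because the pipeline writes one JSON object per line (the JSON Lines format), the output can be read back line by line. A quick sketch:

import json

with open('tencent.json', encoding='utf-8') as f:
    for line in f:
        record = json.loads(line)
        print(record['name'], record['address'])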
Modifying settings
Find ITEM_PIPELINES in settings.py and uncomment it.
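The entry startproject generated looks like this once uncommented:

ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}

If requests get filtered out by robots.txt, setting ROBOTSTXT_OBEY = False in the same file is a common workaround.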
Running the Scrapy project
scrapy crawl recruit
The JSON file is saved in the project directory.
Feel free to leave a comment if you have any questions!