
Scrapy Example 2: Tencent Recruitment

Scrape Tencent's recruitment listings from https://careers.tencent.com/search.html and save them as a JSON file.

Analyzing the page

Right-click and view the page source: the job listings are not in the static HTML but are loaded dynamically through a Query API, so we need to capture the network requests.

Open the Network tab in the developer tools and you'll find the first page of data comes from this URL: https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1601278633129&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn
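Before wiring this into Scrapy, it's worth hitting the endpoint once outside the framework to confirm it returns plain JSON. A quick check with requests (not part of the final project; the Data/Posts structure matches what the spider parses later):

import time

import requests

url = ('https://careers.tencent.com/tencentcareer/api/post/Query'
       '?timestamp={}&countryId=&cityId=&bgIds=&productId=&categoryId='
       '&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10'
       '&language=zh-cn&area=cn').format(int(time.time() * 1000))

data = requests.get(url).json()
# each post carries RecruitPostName, LocationName, Responsibility, ...
for post in data['Data']['Posts']:
    print(post['RecruitPostName'])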


Now that we've found the data, the rest is straightforward. Watch how the URL changes as you page through the results: only timestamp (a millisecond timestamp) and pageIndex (the page number) differ between pages, so we can build every page's URL with format() and a for loop, as sketched below.
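A minimal sketch of the URL generation (the same pattern the spider's start_urls uses later; the page range is shortened here to keep it readable):

import time

base_url = ('https://careers.tencent.com/tencentcareer/api/post/Query'
            '?timestamp={}&countryId=&cityId=&bgIds=&productId=&categoryId='
            '&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10'
            '&language=zh-cn&area=cn')

# one URL per page; the timestamp only needs to be a current millisecond value
urls = [base_url.format(int(time.time() * 1000), page) for page in range(1, 4)]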


Creating the Scrapy project

Open PyCharm's Terminal and enter:

scrapy startproject tencent

Change into the Scrapy project directory:

cd tencent

Writing the spider

Create the spider file:

scrapy genspider recruit tencent.com
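After these two commands the project has the standard layout Scrapy generates:

tencent/
    scrapy.cfg
    tencent/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            recruit.py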


Writing the spider code

Open the newly created recruit.py.

Modify start_urls (the final version appears in the complete code below).

Write the parse method:

def parse(self, response):
    """
    Parse the JSON response.
    :param response: the response object
    """
    result_data = json.loads(response.text)
    result = result_data['Data']['Posts']
    for temp in result:
        # job title
        name = temp['RecruitPostName'].strip()
        # location
        address = temp['LocationName'].strip()
        # job responsibilities
        responsibility = temp['Responsibility'].strip()
        item = TencentItem()
        item['name'] = name
        item['duty'] = responsibility
        item['address'] = address
        yield item
           

The complete code:

import json
import time

import scrapy

from ..items import TencentItem


class RecruitSpider(scrapy.Spider):
    name = 'recruit'
    allowed_domains = ['tencent.com']
    # one URL per page: the timestamp is a current millisecond value,
    # pageIndex runs from 1 to 637
    start_urls = [
        ('https://careers.tencent.com/tencentcareer/api/post/Query'
         '?timestamp={}&countryId=&cityId=&bgIds=&productId=&categoryId='
         '&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10'
         '&language=zh-cn&area=cn').format(int(time.time() * 1000), index)
        for index in range(1, 638)
    ]

    def parse(self, response):
        """
        Parse the JSON response.
        :param response: the response object
        """
        result_data = json.loads(response.text)
        result = result_data['Data']['Posts']
        for temp in result:
            # job title
            name = temp['RecruitPostName'].strip()
            # location
            address = temp['LocationName'].strip()
            # job responsibilities
            responsibility = temp['Responsibility'].strip()
            item = TencentItem()
            item['name'] = name
            item['duty'] = responsibility
            item['address'] = address
            yield item
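One caveat about range(1, 638): it hardcodes the page count, which goes stale as listings are added or removed. A possible alternative, sketched here under the assumption that the API returns an empty or null Posts list past the last page, is to follow pages one at a time and stop when nothing comes back, threading the page number through cb_kwargs:

import json
import time

import scrapy

from ..items import TencentItem

BASE_URL = ('https://careers.tencent.com/tencentcareer/api/post/Query'
            '?timestamp={}&countryId=&cityId=&bgIds=&productId=&categoryId='
            '&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10'
            '&language=zh-cn&area=cn')


class RecruitSpider(scrapy.Spider):
    name = 'recruit'
    allowed_domains = ['tencent.com']

    def start_requests(self):
        # begin at page 1 instead of generating all URLs up front
        yield scrapy.Request(BASE_URL.format(int(time.time() * 1000), 1),
                             cb_kwargs={'page': 1})

    def parse(self, response, page):
        posts = json.loads(response.text)['Data']['Posts']
        if not posts:
            return  # assumed stop condition: no posts past the last page
        for temp in posts:
            item = TencentItem()
            item['name'] = temp['RecruitPostName'].strip()
            item['duty'] = temp['Responsibility'].strip()
            item['address'] = temp['LocationName'].strip()
            yield item
        # follow the next page
        yield scrapy.Request(BASE_URL.format(int(time.time() * 1000), page + 1),
                             cb_kwargs={'page': page + 1})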
           

Writing the items

import scrapy


class TencentItem(scrapy.Item):
    
    name = scrapy.Field()
    duty = scrapy.Field()
    address = scrapy.Field()
           

Writing the pipeline

import json


class TencentPipeline:
    def open_spider(self, spider):
        # open the file once when the spider starts, not once per item
        self.fp = open('tencent.json', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes
        json.dump(dict(item), self.fp, ensure_ascii=False)
        self.fp.write('\n')  # one JSON object per line
        return item

    def close_spider(self, spider):
        self.fp.close()
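Because the pipeline writes one JSON object per line, the results read back naturally line by line. A quick sanity check (not part of the project):

import json

with open('tencent.json', encoding='utf-8') as f:
    jobs = [json.loads(line) for line in f]

print(len(jobs))
print(jobs[0]['name'], jobs[0]['address'])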
           

Modifying settings

Find ITEM_PIPELINES in settings.py and uncomment it, as shown below.
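With the project name used here, the uncommented block in settings.py looks like this:

ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}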


Running the Scrapy project

scrapy crawl recruit
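As an aside, Scrapy's built-in feed exports can also write JSON without a custom pipeline; the output filename below is just an example:

scrapy crawl recruit -o tencent_feed.json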


The JSON file is saved under the project directory.

Feel free to leave a comment if you have any questions!