Scraping Tencent recruitment data from https://careers.tencent.com/search.html and saving it as a JSON file
Analyzing the page
Right-click and view the page source: the main body of the page is loaded dynamically by JavaScript, so we need to capture the underlying network requests.
Open the Network tab and look through the requests: the first page of data all comes from this URL: https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1601278633129&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn
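To confirm, you can request that URL directly. A quick sketch using the requests library (an assumption on my part: the endpoint answers a plain GET, and timestamp looks like a cache-buster, so a stale value should still work):

import requests

url = ('https://careers.tencent.com/tencentcareer/api/post/Query'
       '?timestamp=1601278633129&countryId=&cityId=&bgIds=&productId='
       '&categoryId=&parentCategoryId=&attrId=&keyword='
       '&pageIndex=1&pageSize=10&language=zh-cn&area=cn')
data = requests.get(url).json()
# Title of the first post on page 1
print(data['Data']['Posts'][0]['RecruitPostName'])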

Once the data source is found, the rest is easy. Watch how the URL changes as you page through: the only parts that differ are timestamp (the current time in milliseconds) and pageIndex (the page number), so we can build every page's URL with format and a for loop, as sketched below.
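A minimal sketch of that idea (the same template is used in the spider later on):

import time

template = ('https://careers.tencent.com/tencentcareer/api/post/Query'
            '?timestamp={}&countryId=&cityId=&bgIds=&productId=&categoryId='
            '&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10'
            '&language=zh-cn&area=cn')
# One URL per page: fill in the current millisecond timestamp and the page index
urls = [template.format(int(time.time() * 1000), page) for page in range(1, 6)]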
Creating the Scrapy project
Open PyCharm's Terminal and run:
scrapy startproject tencent
Change into the project directory:
cd tencent
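startproject generates the standard Scrapy layout:

tencent/
    scrapy.cfg
    tencent/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py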
Writing the spider
Create the spider file:
scrapy genspider recruit tencent.com
Writing the spider code
Open the newly created recruit.py.
Modify start_urls, then write the parse method:
def parse(self, response):
    """
    Parse the JSON response.
    :param response: the response data
    """
    result_data = json.loads(response.text)
    result = result_data['Data']['Posts']
    for temp in result:
        # Job title
        name = temp['RecruitPostName'].strip()
        # Location
        address = temp['LocationName'].strip()
        # Job responsibilities
        responsibility = temp['Responsibility'].strip()
        item = TencentItem()
        item['name'] = name
        item['duty'] = responsibility
        item['address'] = address
        yield item
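For reference, each element of Data.Posts is a dict shaped roughly like this (field values here are made up for illustration; only the three fields we actually use are shown):

{
    "Data": {
        "Posts": [
            {
                "RecruitPostName": "后台开发工程师",
                "LocationName": "深圳",
                "Responsibility": "负责后台服务的设计与开发"
            }
        ]
    }
}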
The complete code is as follows:
import json
import time

import scrapy

from ..items import TencentItem


class RecruitSpider(scrapy.Spider):
    name = 'recruit'
    allowed_domains = ['tencent.com']
    # One URL per page; only timestamp and pageIndex change between pages
    start_urls = [
        'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp={}&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'.format(
            int(time.time() * 1000), index
        )
        for index in range(1, 638)
    ]

    def parse(self, response):
        """
        Parse the JSON response.
        :param response: the response data
        """
        result_data = json.loads(response.text)
        result = result_data['Data']['Posts']
        for temp in result:
            # Job title
            name = temp['RecruitPostName'].strip()
            # Location
            address = temp['LocationName'].strip()
            # Job responsibilities
            responsibility = temp['Responsibility'].strip()
            item = TencentItem()
            item['name'] = name
            item['duty'] = responsibility
            item['address'] = address
            yield item
Writing the items
import scrapy


class TencentItem(scrapy.Item):
    name = scrapy.Field()
    duty = scrapy.Field()
    address = scrapy.Field()
Writing the pipeline
import json


class TencentPipeline:
    def open_spider(self, spider):
        # Open the output file once when the spider starts
        self.fp = open('tencent.json', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese readable instead of \uXXXX escapes
        json.dump(dict(item), self.fp, ensure_ascii=False)
        self.fp.write('\n')  # one JSON object per line
        return item

    def close_spider(self, spider):
        self.fp.close()
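Because the pipeline writes one JSON object per line (the JSON Lines format), the output can be read back line by line. A quick sketch:

import json

with open('tencent.json', encoding='utf-8') as f:
    for line in f:
        record = json.loads(line)
        print(record['name'], record['address'])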
Modifying settings
Find ITEM_PIPELINES in settings.py and uncomment it.
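The entry startproject generated looks like this once uncommented:

ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}

If requests get filtered out by robots.txt, setting ROBOTSTXT_OBEY = False in the same file is a common workaround.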
Running the Scrapy project
scrapy crawl recruit
The JSON file is saved in the project directory.
Feel free to leave a comment if you have any questions!