Every time I have something new to crawl I forget the steps and end up searching Baidu all over again, so I'm writing them down here for future reference.
Steps
Define the Item
First, wrap the data you want to crawl into an `Item`, defined in `items.py`. This makes it easier to process the `item` later in the `pipelines`.
```python
import scrapy


class MaterialInfo(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    areaCode = scrapy.Field()
    deptName = scrapy.Field()
    qlName = scrapy.Field()
    qlInnerCode = scrapy.Field()
    materialName = scrapy.Field()
    materialForm = scrapy.Field()
```
Create the spider file
Two main things happen here: one is giving the crawler an entry point so it can send requests; the other is wrapping the crawled data into an `item` and passing it to the `pipeline`.
One gotcha when building the request: the headers must not include `Content-Type`, because scrapy needs to work that out itself; otherwise the POST comes back with a 400.
`start_requests` is the crawler's entry point; call `Request` here to issue the requests.
`parse` receives the response. In it you can either add new links to the queue or wrap the data into an `item`; scrapy tells the two apart on its own. In my case I only need to build items, with no new links to enqueue.
```python
import scrapy
import traceback
import json
from scrapy.http import FormRequest
from QlsxCrawl.items import MaterialInfo


class QlsxSpider(scrapy.Spider):
    name = 'QlsxSpider'

    def start_requests(self):
        url = ''
        # one qlInnerCode per line in the 'code' file
        with open('code', 'r') as fp:
            seq = fp.readlines()
        for innerCode in seq:
            innerCode = innerCode.strip()
            data = {'qlInnerCode': innerCode}
            yield scrapy.Request(url, method='POST', body=json.dumps(data),
                                 callback=self.parse, meta={'qlInnerCode': innerCode})
            # yield FormRequest(url=url, formdata=data, callback=self.parse, headers=self.headers)

    def getBaseQlsx(self, js, meta):
        # fill in the fields shared by every material of this item
        baseItem = MaterialInfo()
        baseItem['deptName'] = js['basicInfoDTO']['entityName']
        baseItem['qlInnerCode'] = meta['qlInnerCode']
        baseItem['areaCode'] = js['basicInfoDTO']['adCode']
        baseItem['qlName'] = js['basicInfoDTO']['matName']
        return baseItem

    def parse(self, response):
        try:
            js = json.loads(response.text)['data']
            materials = js['materialDTOs']
            for mat in materials:
                print(mat)
                # yield one item per material
                baseItem = self.getBaseQlsx(js, response.meta)
                baseItem['materialName'] = mat['materialName']
                if 'materialForm' in mat:
                    baseItem['materialForm'] = mat['materialForm']
                yield baseItem
        except Exception:
            traceback.print_exc()
```
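For context, the JSON body that `parse` expects is shaped roughly like the following. This is only inferred from the field accesses in the code above; all names of departments and materials here are made-up placeholders.

```python
# Hypothetical response body, shaped after the fields parse() reads; values are invented.
example_response = {
    'data': {
        'basicInfoDTO': {
            'entityName': 'Some Department',   # -> deptName
            'adCode': '330100',                # -> areaCode
            'matName': 'Some Item Name',       # -> qlName
        },
        'materialDTOs': [
            {'materialName': 'ID card copy', 'materialForm': 1},
            {'materialName': 'Application form'},  # materialForm may be absent
        ],
    }
}
```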
Fill in the pipeline
The pipeline is where the `item` gets processed; in my case it writes the `item` to the database. The actual work goes in `process_item`.
```python
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import scrapy
import pymysql


class QlsxcrawlPipeline(object):
    def __init__(self):
        # open one MySQL connection for the whole crawl
        self.connect = pymysql.connect(
            host='localhost',
            port=3306,
            db='qlsx_crawl',
            user='root',
            password='',
            charset='utf8mb4',
            use_unicode=True)
        self.cursor = self.connect.cursor()

    def process_item(self, item: scrapy.Item, spider):
        # -1 marks materials whose form was missing from the response
        item.setdefault('materialForm', -1)
        insertWords = ('insert into `material_infos`'
                       '(`areaCode`, `deptName`, `qlname`, `qlInnerCode`, `materialName`, `materialForm`) '
                       'values(%s, %s, %s, %s, %s, %s)')
        print(insertWords)
        # let pymysql quote the values instead of formatting them into the SQL string
        self.cursor.execute(insertWords, (item['areaCode'], item['deptName'], item['qlName'],
                                          item['qlInnerCode'], item['materialName'], item['materialForm']))
        self.connect.commit()
        return item
```
Configure the settings file
There are mainly a few things to configure:
- Default request headers
- Enabling the pipeline
- Download delay
I won't paste my actual code here, but a rough sketch follows.
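A minimal sketch of what those `settings.py` entries could look like, assuming the project layout used above; the header values and the delay are illustrative placeholders, not the ones actually used.

```python
# settings.py (sketch; values are placeholders)

# Default headers sent with every request.
# Deliberately no Content-Type here, so scrapy can handle it itself for the POST body.
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'application/json, text/plain, */*',
    'User-Agent': 'Mozilla/5.0',
}

# Enable the pipeline that writes items to MySQL.
ITEM_PIPELINES = {
    'QlsxCrawl.pipelines.QlsxcrawlPipeline': 300,
}

# Wait a bit between requests.
DOWNLOAD_DELAY = 1
```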