Scrapy quick reference
- Generate a project
- Modify the robots protocol
- Set up requests and parsing
- Set headers
- Set a proxy
- Set up pipelines
- Run the project
- Constructing requests
- Constructing responses
- Other common functions
Generate a project
scrapy startproject <project_name> #generate the project files
scrapy genspider mySpider 163.com #generate a basic spider from the default template
scrapy genspider -l #list the available spider templates
scrapy genspider -d template #preview a template
scrapy genspider [-t template] <name> <domain> #generate a spider with the given name and domain; the template is optional
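For orientation, a freshly generated project (assuming the name testproject) looks roughly like this:
testproject/
    scrapy.cfg            # deploy configuration
    testproject/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            mySpider.py   # created by genspider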
Modify the robots protocol
- settings
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
Set up requests and parsing
- spider
# -*- coding: utf-8 -*-
import scrapy

class mySpider(scrapy.Spider):
    name = 'mySpider'
    allowed_domains = ['163.com']  # restrict crawling to this domain
    start_urls = ['http://163.com/']

    def start_requests(self):
        # must return an iterable of Requests
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def make_requests_from_url(self, url):
        return scrapy.Request(url, callback=self.parse, method='GET',
                              encoding='utf-8', dont_filter=False,
                              errback=None)  # errback: optional callable invoked on request failure
        # return scrapy.FormRequest(url, formdata={}, callback=self.parse)

    def parse(self, response):
        response.text  # the body as readable text
        response.body.decode(encoding='utf-8')  # the raw bytes, decoded manually
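A sketch of a fuller parse callback that extracts a field and follows links; the selectors are placeholders for a real page:
def parse(self, response):
    # extract one field from the page
    yield {'title': response.css('title::text').extract_first()}
    # follow every link; urljoin turns relative hrefs into absolute URLs
    for href in response.css('a::attr(href)').extract():
        yield scrapy.Request(response.urljoin(href), callback=self.parse)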
Set headers
- settings
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
'Accept-Language': 'en',
}
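Headers can also be set on a single request rather than globally; a minimal sketch (url is a placeholder):
yield scrapy.Request(url, headers={'User-Agent': 'custom-agent'}, callback=self.parse)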
Set a proxy
- middlewares
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://127.0.0.1:9743'
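A variant that reads the proxy address from settings instead of hard-coding it; PROXY_URL is an assumed custom setting name, not a Scrapy built-in:
class ProxyMiddleware(object):
    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_URL is a custom name you would add to settings.py
        return cls(crawler.settings.get('PROXY_URL'))

    def process_request(self, request, spider):
        request.meta['proxy'] = self.proxy_url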
- settings
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
'testproject.middlewares.ProxyMiddleware': 543,
}
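A built-in downloader middleware can be disabled by mapping its path to None in the same dict:
DOWNLOADER_MIDDLEWARES = {
    'testproject.middlewares.ProxyMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # disable the built-in User-Agent middleware
}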
Set up pipelines
- pipelines
from scrapy.exceptions import DropItem
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
import pandas as pd

class Pipeline_Format(object):
    def process_item(self, item, spider):
        # convert the item to a one-row DataFrame for the MySQL pipeline below
        item = pd.DataFrame([dict(item)])
        return item

class Pipeline_MySql(object):
    def __init__(self, user, password, port, database, charset):
        self.user = user
        self.password = password
        self.port = port
        self.database = database
        self.charset = charset

    # initialize the pipeline from the crawler's settings
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            port=crawler.settings.get('MYSQL_PORT'),
            database=crawler.settings.get('MYSQL_DATABASE'),
            charset=crawler.settings.get('MYSQL_CHARSET')
        )

    # called when the spider opens: build the engine and connect to the database
    def open_spider(self, spider):
        cracom = 'mysql+pymysql://{user}:{password}@127.0.0.1:{port}/{database}?charset={charset}'
        self.engine = create_engine(cracom.format(
            user=self.user,
            password=self.password,
            port=self.port,
            database=self.database,
            charset=self.charset)
        )
        self.session = sessionmaker(bind=self.engine)()

    # called when the spider closes: disconnect from the database
    def close_spider(self, spider):
        self.session.close()

    # process the item: write it to the database and return it
    def process_item(self, item, spider):
        item.to_sql('tbname', con=self.engine, if_exists='append', index=False)
        return item
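Pipeline_Format above converts items to DataFrames, so the spider should yield dict-like items. A minimal items.py sketch; the class and field names are placeholders:
import scrapy

class TestprojectItem(scrapy.Item):
    title = scrapy.Field()  # placeholder field
    url = scrapy.Field()    # placeholder field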
- settings
# enable pipelines; lower numbers are processed first
ITEM_PIPELINES = {
    'testproject.pipelines.Pipeline_Format': 300,
    'testproject.pipelines.Pipeline_MySql': 400,
}
# parameters for the database connection
MYSQL_DATABASE='scrapy_test'
MYSQL_USER='root'
MYSQL_PASSWORD='123456'
MYSQL_PORT=3306
MYSQL_CHARSET='utf8mb4'
Run the project
scrapy crawl mySpider
scrapy crawl mySpider -o fname.json
-o fname.jl #JSON lines (one object per line)
-o fname.csv
-o ftp://user:pass@host/path/file_name.csv #export to remote storage (e.g. FTP)
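A crawl can also be started from a plain Python script instead of the CLI; a minimal sketch:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('mySpider')  # spider name as registered in the project
process.start()            # blocks until the crawl finishes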
Constructing requests
- class scrapy.http.Request()
url (string) – the URL of this request
callback (callable) – the function that will be called with the response of this request (once it's downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn't specify a callback, the spider's parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
method (string) – the HTTP method of this request. Defaults to 'GET'.
meta (dict) – the initial values for the Request.meta attribute. If given, the dict passed in this parameter will be shallow copied.
body (str or unicode) – the request body. If a unicode is passed, then it's encoded to str using the encoding passed (which defaults to utf-8). If body is not given, an empty string is stored. Regardless of the type of this argument, the final value stored will be a str (never unicode or None).
headers (dict) – the headers of this request. The dict values can be strings (for single valued headers) or lists (for multi-valued headers). If None is passed as value, the HTTP header will not be sent at all.
cookies (dict or list) – the request cookies. These can be sent in two forms.
- scrapy.FormRequest(url, formdata={}, callback=self.parse [, …])
- scrapy.FormRequest.from_response(response, formdata={}, callback=self.parse [, …]) #from_response takes the Response object, not a URL
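A typical login flow with from_response, which pre-fills fields found in the page's form; the field names, credentials, and callback are placeholders:
def parse_login(self, response):
    yield scrapy.FormRequest.from_response(
        response,
        formdata={'username': 'user', 'password': 'pass'},  # placeholder credentials
        callback=self.after_login,  # hypothetical follow-up callback
    )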
Constructing responses
- class scrapy.http.Response()
url (string) – the URL of this response
headers (dict) – the headers of this response. The dict values can be strings (for single valued headers) or lists (for multi-valued headers).
status (integer) – the HTTP status of the response. Defaults to 200.
body (str) – the response body. It must be str, not unicode, unless you're using an encoding-aware Response subclass, such as TextResponse.
meta (dict) – the initial values for the Response.meta attribute. If given, the dict will be shallow copied.
flags (list) – is a list containing the initial values for the Response.flags attribute. If given, the list will be shallow copied.
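Constructing a Response by hand is mostly useful for testing parse methods offline; a minimal sketch:
from scrapy.http import HtmlResponse

html = b'<html><head><title>demo</title></head><body></body></html>'
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
response.css('title::text').extract_first()  # 'demo'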
Other common functions
response.body #returns the raw body as bytes
response.text #returns the body as readable text
response.urljoin(href) #returns an absolute URL
- Selector examples
# selector methods
sel.extract() #returns a list of the extracted contents
sel.extract_first(default='') #returns the first match; returns '' if nothing matched
sel.re('(.*)') #returns a list of the contents captured by the () groups
sel.re_first('(.*)') #returns the first match
# sample page: http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
# locate by tag
res.xpath('//div/a')
res.css('div a')
# locate by attribute value
res.xpath('//div[@id="images"]/a')
res.css('div[id=images] a')
# locate by attribute value containing a substring
res.xpath('//a[contains(@href,"image")]/img')
res.css('a[href*=image] img')
# select the text inside a tag
res.xpath('//title/text()')
res.css('title::text')
# get an attribute's value
response.xpath('//a/@href')
response.css('a::attr(href)')
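To experiment with these selectors interactively, load the sample page cited above in the Scrapy shell:
scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
>>> response.css('a[href*=image]::attr(href)').extract()  # href values of the matching links
>>> response.xpath('//title/text()').extract_first()      # the page title text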