Scrapy quick reference
- Generate a project
- Modify the robots protocol
- Set up requests and parsing
- Set headers
- Set a proxy
- Set up pipelines
- Run the project
- Constructing requests
- Constructing responses
- Other common functions
Generate a project
scrapy startproject <project_name> #generate the project files
scrapy genspider mySpider 163.com #generate a basic spider from the default template
scrapy genspider -l #list the available spider templates
scrapy genspider -d template #preview a template
scrapy genspider [-t template] <name> <domain> #generate a spider with the given name and domain; the template is optional
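For orientation, a freshly generated project (assuming the name testproject) looks roughly like this:
testproject/
    scrapy.cfg            # deploy configuration
    testproject/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            mySpider.py   # created by genspider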
Modify the robots protocol
- settings
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
Set up requests and parsing
- spider
# -*- coding: utf-8 -*-
import scrapy

class mySpider(scrapy.Spider):
    name = 'mySpider'
    allowed_domains = ['163.com']  # restrict crawling to this domain
    start_urls = ['http://163.com/']

    def start_requests(self):
        # must return an iterable of Requests
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def make_requests_from_url(self, url):
        return scrapy.Request(url, callback=self.parse, method='GET',
                              encoding='utf-8', dont_filter=False,
                              errback=None)  # errback: optional callable invoked on request failure
        # return scrapy.FormRequest(url, formdata={}, callback=self.parse)

    def parse(self, response):
        response.text  # the body as readable text
        response.body.decode(encoding='utf-8')  # the raw bytes, decoded manually
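A sketch of a fuller parse callback that extracts a field and follows links; the selectors are placeholders for a real page:
def parse(self, response):
    # extract one field from the page
    yield {'title': response.css('title::text').extract_first()}
    # follow every link; urljoin turns relative hrefs into absolute URLs
    for href in response.css('a::attr(href)').extract():
        yield scrapy.Request(response.urljoin(href), callback=self.parse)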
Set headers
- settings
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
'Accept-Language': 'en',
}
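Headers can also be set on a single request rather than globally; a minimal sketch (url is a placeholder):
yield scrapy.Request(url, headers={'User-Agent': 'custom-agent'}, callback=self.parse)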
Set a proxy
- middlewares
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://127.0.0.1:9743'
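A variant that reads the proxy address from settings instead of hard-coding it; PROXY_URL is an assumed custom setting name, not a Scrapy built-in:
class ProxyMiddleware(object):
    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_URL is a custom name you would add to settings.py
        return cls(crawler.settings.get('PROXY_URL'))

    def process_request(self, request, spider):
        request.meta['proxy'] = self.proxy_url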
- settings
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
'testproject.middlewares.ProxyMiddleware': 543,
}
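A built-in downloader middleware can be disabled by mapping its path to None in the same dict:
DOWNLOADER_MIDDLEWARES = {
    'testproject.middlewares.ProxyMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # disable the built-in User-Agent middleware
}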
Set up pipelines
- pipelines
from scrapy.exceptions import DropItem
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
import pandas as pd

class Pipeline_Format(object):
    def process_item(self, item, spider):
        # convert the item to a one-row DataFrame for the MySQL pipeline below
        item = pd.DataFrame([dict(item)])
        return item

class Pipeline_MySql(object):
    def __init__(self, user, password, port, database, charset):
        self.user = user
        self.password = password
        self.port = port
        self.database = database
        self.charset = charset

    # initialize the pipeline from the crawler's settings
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            port=crawler.settings.get('MYSQL_PORT'),
            database=crawler.settings.get('MYSQL_DATABASE'),
            charset=crawler.settings.get('MYSQL_CHARSET')
        )

    # called when the spider opens: build the engine and connect to the database
    def open_spider(self, spider):
        cracom = 'mysql+pymysql://{user}:{password}@127.0.0.1:{port}/{database}?charset={charset}'
        self.engine = create_engine(cracom.format(
            user=self.user,
            password=self.password,
            port=self.port,
            database=self.database,
            charset=self.charset)
        )
        self.session = sessionmaker(bind=self.engine)()

    # called when the spider closes: disconnect from the database
    def close_spider(self, spider):
        self.session.close()

    # process the item: write it to the database and return it
    def process_item(self, item, spider):
        item.to_sql('tbname', con=self.engine, if_exists='append', index=False)
        return item
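Pipeline_Format above converts items to DataFrames, so the spider should yield dict-like items. A minimal items.py sketch; the class and field names are placeholders:
import scrapy

class TestprojectItem(scrapy.Item):
    title = scrapy.Field()  # placeholder field
    url = scrapy.Field()    # placeholder field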
- settings
# enable pipelines; lower numbers are processed first
ITEM_PIPELINES = {
    'testproject.pipelines.Pipeline_Format': 300,
    'testproject.pipelines.Pipeline_MySql': 400,
}
# parameters for the database connection
MYSQL_DATABASE='scrapy_test'
MYSQL_USER='root'
MYSQL_PASSWORD='123456'
MYSQL_PORT=3306
MYSQL_CHARSET='utf8mb4'
Run the project
scrapy crawl mySpider
scrapy crawl mySpider -o fname.json
-o fname.jl #JSON lines (one object per line)
-o fname.csv
-o ftp://user:pass@host/path/file_name.csv #export to remote storage (e.g. FTP)
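A crawl can also be started from a plain Python script instead of the CLI; a minimal sketch:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('mySpider')  # spider name as registered in the project
process.start()            # blocks until the crawl finishes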
Constructing requests
- class scrapy.http.Request()
url (string) – the URL of this request
callback (callable) – the function that will be called with the response of this request (once it's downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn't specify a callback, the spider's parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
method (string) – the HTTP method of this request. Defaults to 'GET'.
meta (dict) – the initial values for the Request.meta attribute. If given, the dict passed in this parameter will be shallow copied.
body (str or unicode) – the request body. If a unicode is passed, then it's encoded to str using the encoding passed (which defaults to utf-8). If body is not given, an empty string is stored. Regardless of the type of this argument, the final value stored will be a str (never unicode or None).
headers (dict) – the headers of this request. The dict values can be strings (for single valued headers) or lists (for multi-valued headers). If None is passed as value, the HTTP header will not be sent at all.
cookies (dict or list) – the request cookies. These can be sent in two forms.
- scrapy.FormRequest(url, formdata={}, callback=self.parse [, …])
- scrapy.FormRequest.from_response(response, formdata={}, callback=self.parse [, …]) #from_response takes the Response object, not a URL
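A typical login flow with from_response, which pre-fills fields found in the page's form; the field names, credentials, and callback are placeholders:
def parse_login(self, response):
    yield scrapy.FormRequest.from_response(
        response,
        formdata={'username': 'user', 'password': 'pass'},  # placeholder credentials
        callback=self.after_login,  # hypothetical follow-up callback
    )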
Constructing responses
- class scrapy.http.Response()
url (string) – the URL of this response
headers (dict) – the headers of this response. The dict values can be strings (for single valued headers) or lists (for multi-valued headers).
status (integer) – the HTTP status of the response. Defaults to 200.
body (str) – the response body. It must be str, not unicode, unless you're using an encoding-aware Response subclass, such as TextResponse.
meta (dict) – the initial values for the Response.meta attribute. If given, the dict will be shallow copied.
flags (list) – is a list containing the initial values for the Response.flags attribute. If given, the list will be shallow copied.
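Constructing a Response by hand is mostly useful for testing parse methods offline; a minimal sketch:
from scrapy.http import HtmlResponse

html = b'<html><head><title>demo</title></head><body></body></html>'
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
response.css('title::text').extract_first()  # 'demo'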
Other common functions
response.body #returns the raw body as bytes
response.text #returns the body as readable text
response.urljoin(href) #returns an absolute URL
- Selector examples
# selector methods
sel.extract() #returns a list of the extracted contents
sel.extract_first(default='') #returns the first match; returns '' if nothing matched
sel.re('(.*)') #returns a list of the contents captured by the () groups
sel.re_first('(.*)') #returns the first match
# sample page: http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
# locate by tag
res.xpath('//div/a')
res.css('div a')
# locate by attribute value
res.xpath('//div[@id="images"]/a')
res.css('div[id=images] a')
# locate by attribute value containing a substring
res.xpath('//a[contains(@href,"image")]/img')
res.css('a[href*=image] img')
# select the text inside a tag
res.xpath('//title/text()')
res.css('title::text')
# get an attribute's value
response.xpath('//a/@href')
response.css('a::attr(href)')
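To experiment with these selectors interactively, load the sample page cited above in the Scrapy shell:
scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
>>> response.css('a[href*=image]::attr(href)').extract()  # href values of the matching links
>>> response.xpath('//title/text()').extract_first()      # the page title text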