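Contents

I. XPath
- 1. Introduction to XPath
- 2. XPath syntax
II. CSS selectors
III. Crawling Jobbole: basic version
- 1. Create the Scrapy project
- 2. Write items.py
- 3. Write the spider
- 4. Write the pipelines file: saving to JSON
- 5. Configure settings.py
- 6. Run the crawler
IV. Crawling Jobbole: advanced version
- 1. The item loader mechanism
  - (1) Approach
  - (2) spider.py
  - (3) items.py
- 2. The pipelines file
  - (1) Environment setup (MySQL, Navicat)
  - (2) Saving to MySQL (synchronous)
  - (3) Saving to MySQL (asynchronous)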

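For a basic introduction to Scrapy, see here.

I. XPath

1. Introduction to XPath

XPath uses path expressions to navigate XML and HTML documents.
XPath includes a standard function library.
XPath is a W3C standard.

XPath node relationships

Parent, child, sibling, ancestor, and descendant nodes.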

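2. XPath syntax

Expression	Description
article	Selects all article elements that are children of the current node
/article	Selects the root element article
article/a	Selects all a elements that are children of article
//div	Selects all div elements, wherever they appear in the document
article//div	Selects all div elements that are descendants of article, at any depth below it
//@class	Selects all attributes named class
/article/div[1]	Selects the first div element that is a child of article
/article/div[last()]	Selects the last div element that is a child of article
/article/div[last()-1]	Selects the second-to-last div element that is a child of article
//div[@lang]	Selects all div elements that have a lang attribute
//div/a | //div/p	Selects all a and p elements that are children of div elements
//span | //ul	Selects all span and ul elements in the document
article/div/p | //span	Selects all p elements that are children of a div child of article, plus all span elements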
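
A few of these expressions can be tried without Scrapy. The sketch below uses Python's standard-library xml.etree.ElementTree, which supports only a subset of XPath (positional predicates and attribute tests, but not the | union or //@class). The sample document is made up for illustration, and because the root element here is article, a path such as div[1] plays the role of /article/div[1] from the table.

```python
import xml.etree.ElementTree as ET

# Made-up sample document for illustration
DOC = """
<article>
  <div lang="en"><a href="/first">first</a></div>
  <div><p>middle</p></div>
  <div lang="zh"><a href="/last">last</a></div>
</article>
"""

root = ET.fromstring(DOC)  # root is the <article> element

first_div = root.find("div[1]")                 # first div child of article
last_div = root.find("div[last()]")             # last div child of article
second_last_div = root.find("div[last()-1]")    # second-to-last div child
divs_with_lang = root.findall(".//div[@lang]")  # every div that has a lang attribute
links = [a.get("href") for a in root.findall(".//a")]  # every a element at any depth

print(first_div.get("lang"))
print(last_div.get("lang"))
print(second_last_div.find("p").text)
print(len(divs_with_lang))
print(links)
```

Inside a spider, response.xpath() accepts the full expressions from the table, including unions and attribute selection.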

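II. CSS selectors

Expression	Description
*	Selects all nodes
#container	Selects the node whose id is container
.container	Selects all nodes whose class contains container
li a	Selects all a nodes that are descendants of an li
ul + p	Selects the p element that immediately follows a ul (adjacent sibling)
div#container > ul	Selects the ul elements that are direct children of the div whose id is container
ul ~ p	Selects all p elements that are siblings following a ul
a[title]	Selects all a elements that have a title attribute
a[href="http://jobbole.com"]	Selects all a elements whose href is exactly http://jobbole.com
a[href*="jobbole"]	Selects all a elements whose href contains jobbole
a[href^="http"]	Selects all a elements whose href starts with http
a[href$=".jpg"]	Selects all a elements whose href ends with .jpg
input[type=radio]:checked	Selects the radio inputs that are checked
div:not(#container)	Selects all div elements whose id is not container
li:nth-child(3)	Selects every li that is the third child of its parent
tr:nth-child(2n)	Selects every even-numbered tr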

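III. Crawling Jobbole: basic version

The usual steps for writing a crawler:

Create the project (scrapy startproject xxx): create a new crawler project
Define the target (write items.py): define the structured data to extract
Build the spider (spiders/xxspider.py): crawl the pages and extract the structured data
Store the content (pipelines.py): design pipelines that store the scraped content

Goal: crawl all technical articles on Jobbole and extract the title, creation time, URL, URL id, cover image URL, cover image path, bookmark count, like count, comment count, full text, and tags.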

1. Create the Scrapy project

scrapy startproject Article
cd Article

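2. Write items.py

Define one field for each piece of data to scrape: title, creation time, URL, URL id, cover image URL, cover image path, bookmark count, like count, comment count, full text, and tags.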

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TestarticleItem(scrapy.Item):
    title = scrapy.Field()             # title
    time = scrapy.Field()              # creation time
    url = scrapy.Field()               # URL
    url_object_id = scrapy.Field()     # URL id (MD5 of the URL)
    front_image_url = scrapy.Field()   # cover image URL
    front_image_path = scrapy.Field()  # cover image path
    coll_nums = scrapy.Field()         # bookmark count
    comment_nums = scrapy.Field()      # comment count
    fav_nums = scrapy.Field()          # like count
    content = scrapy.Field()           # full text
    tags = scrapy.Field()              # tags

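3. Write the spider

Create a basic spider class with the following command:

scrapy genspider jobbole blog.jobbole.com

Here jobbole is the spider name and blog.jobbole.com is the domain the spider is restricted to.

Running the command creates jobbole.py in the Article/spiders folder. The code below fills it in, extracting the fields once with XPath and once with CSS selectors.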

# -*- coding: utf-8 -*-
import re
import scrapy
import datetime
from scrapy.http import Request
from urllib import parse

# The project created above is named Article, and items.py defines TestarticleItem
from Article.items import TestarticleItem
from Article.utils.common import get_md5

class JobboleSpider(scrapy.Spider):
    name = "jobbole"
    allowed_domains = ["python.jobbole.com"]
    start_urls = ['http://python.jobbole.com/all-posts/']

    def parse(self, response):
        """
        1. Extract the article URLs from the list page and hand them to Scrapy to download and parse
        2. Extract the next-page URL and hand it to Scrapy to download; the response comes back to parse
        """

        # Extract every article URL on the list page and hand it to Scrapy to download and parse
        post_nodes = response.css("#archive .floated-thumb .post-thumb a")
        for post_node in post_nodes:
            image_url = post_node.css("img::attr(src)").extract_first("")
            post_url = post_node.css("::attr(href)").extract_first("")
            # Links may be relative, so join them with the URL of the current page
            yield Request(url=parse.urljoin(response.url, post_url), meta={"front_image_url": image_url}, callback=self.parse_detail_xpath)

        # Extract the next-page link and hand it to Scrapy to download
        next_url = response.css(".next.page-numbers::attr(href)").extract_first("")
        if next_url:
            yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)

    def parse_detail_xpath(self, response):
        article_item = TestarticleItem()

        # Extract the individual article fields with XPath
        front_image_url = response.meta.get("front_image_url", "")
        title = response.xpath('//div[@class="entry-header"]/h1/text()').extract()[0]
        time = response.xpath('//div[@class="entry-meta"]/p/text()').extract()[0].strip().replace("·","").strip()
        fav_nums = response.xpath('//div[@class="post-adds"]/span[1]/h10/text()').extract()[0]
        coll_nums = response.xpath('//div[@class="post-adds"]/span[2]/text()').extract()[0]
        # Non-greedy ".*?" so the group captures the whole number rather than only its last digit
        match_re = re.match(r".*?(\d+).*", coll_nums)
        if match_re:
            coll_nums = int(match_re.group(1))
        else:
            coll_nums = 0
        comment_nums = response.xpath('//div[@class="post-adds"]/a[@href="#article-comment"]/span/text()').extract()[0]
        match_re = re.match(r".*?(\d+).*", comment_nums)
        if match_re:
            comment_nums = int(match_re.group(1))
        else:
            comment_nums = 0

        content = response.xpath('//div[@class="entry"]').extract()[0]
        tag_list = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/a/text()').extract()
        # "评论" means "comments": drop the "N 评论" link that sits among the tag links
        tag_list = [element for element in tag_list if not element.strip().endswith("评论")]
        tags = ",".join(tag_list)

        article_item['title'] = title
        article_item['url'] = response.url
        article_item['url_object_id'] = get_md5(response.url)
        try:
            # Same date format as in the CSS version below
            time = datetime.datetime.strptime(time, '%Y/%m/%d').date()
        except ValueError:
            # Fall back to today's date when the text cannot be parsed
            time = datetime.datetime.now().date()
        article_item['time'] = time
        article_item['front_image_url'] = [front_image_url]
        article_item['fav_nums'] = fav_nums
        article_item['coll_nums'] = coll_nums
        article_item['comment_nums'] = comment_nums
        article_item['tags'] = tags
        article_item['content'] = content

        yield article_item


    def parse_detail_css(self, response):
        article_item = TestarticleItem()
        # Extract the fields with CSS selectors
        front_image_url = response.meta.get("front_image_url", "")       # cover image
        title = response.css(".entry-header h1::text").extract()[0]
        time = response.css("p.entry-meta-hide-on-mobile::text").extract()[0].strip().replace("·","").strip()
        # .vote-post-up is the like button and .bookmark-btn the bookmark button,
        # matching fav_nums (likes) and coll_nums (bookmarks) as defined in items.py
        fav_nums = response.css(".vote-post-up h10::text").extract()[0]
        coll_nums = response.css(".bookmark-btn::text").extract()[0]
        match_re = re.match(r".*?(\d+).*", coll_nums)
        if match_re:
            coll_nums = int(match_re.group(1))
        else:
            coll_nums = 0

        comment_nums = response.css("a[href='#article-comment'] span::text").extract()[0]
        match_re = re.match(r".*?(\d+).*", comment_nums)
        if match_re:
            comment_nums = int(match_re.group(1))
        else:
            comment_nums = 0

        content = response.css("div.entry").extract()[0]

        tag_list = response.css("p.entry-meta-hide-on-mobile a::text").extract()
        # "评论" means "comments": drop the "N 评论" link that sits among the tag links
        tag_list = [element for element in tag_list if not element.strip().endswith("评论")]
        tags = ",".join(tag_list)

        article_item["url_object_id"] = get_md5(response.url)
        article_item["title"] = title
        article_item["url"] = response.url
        try:
            time = datetime.datetime.strptime(time, "%Y/%m/%d").date()
        except ValueError:
            # Fall back to today's date when the text cannot be parsed
            time = datetime.datetime.now().date()
        article_item["time"] = time
        article_item["front_image_url"] = [front_image_url]
        article_item["coll_nums"] = coll_nums
        article_item["comment_nums"] = comment_nums
        article_item["fav_nums"] = fav_nums
        article_item["tags"] = tags
        article_item["content"] = content
        yield article_item
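
The spider joins every extracted link with response.url before requesting it, because links on a list page may be relative. Here is a minimal sketch of what urllib.parse.urljoin does in each case. The paths /12345/ and page/2/ are made-up examples, not real article URLs.

```python
from urllib import parse

base = "http://python.jobbole.com/all-posts/"

# A root-relative link replaces the whole path of the base URL
root_relative = parse.urljoin(base, "/12345/")

# A relative link is resolved against the directory of the base URL
relative = parse.urljoin(base, "page/2/")

# An absolute link is returned unchanged
absolute = parse.urljoin(base, "http://python.jobbole.com/12345/")

print(root_relative)
print(relative)
print(absolute)
```

Scrapy responses also provide response.urljoin(url), which performs the same join against response.url.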

Create utils/common.py under the Article directory to hold helper functions shared across the project.

# -*- coding: utf-8 -*-
import hashlib

def get_md5(url):
	if isinstance(url, str):
		url = url.encode('utf-8')
	m = hashlib.md5()
	m.update(url)
	return m.hexdigest()
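
A quick check of what get_md5 produces. The function is repeated here in condensed form so the snippet runs on its own, and the second URL is a made-up example. URLs of any length map to a fixed 32-character hex string, which is convenient for a key column such as url_object_id.

```python
import hashlib


def get_md5(url):
    # hashlib works on bytes, so encode str input first
    if isinstance(url, str):
        url = url.encode("utf-8")
    return hashlib.md5(url).hexdigest()


short_id = get_md5("http://python.jobbole.com/all-posts/")
long_id = get_md5("http://python.jobbole.com/12345/?utm_source=example&page=2")  # hypothetical URL

print(len(short_id), len(long_id))  # both digests have the same length
print(short_id == get_md5("http://python.jobbole.com/all-posts/"))  # same input, same id
```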

4. Write the pipelines file: saving to JSON

There are two ways to save the items to a JSON file:

Using the json module
Using Scrapy's JsonItemExporter

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exporters import JsonItemExporter
import codecs
import json

class ArticlePipeline(object):
    def process_item(self, item, spider):
        return item

# Save to a JSON file with the json module
class JsonWithEncodingPipeline(object):
    """Write each item as one JSON line to article.json."""
    def __init__(self):
        self.file = codecs.open('article.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Convert the date to str first, otherwise json.dumps raises
        # `TypeError: Object of type 'date' is not JSON serializable`
        item["time"] = str(item["time"])
        lines = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(lines)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider on pipelines when the spider finishes
        self.file.close()

# Use Scrapy's built-in export facility: JsonItemExporter
class JsonExporterPipeline(object):
	"""docstring for JsonExporterPipeline"""
	def __init__(self):
		self.file = open('articleExport.json', 'wb')
		self.exporter = JsonItemExporter(self.file, encoding='utf-8', ensure_ascii=False)
		self.exporter.start_exporting()
	def close_spider(self, spider):
		self.exporter.finish_exporting()
		self.file.close()
	def process_item(self, item, spider):
		self.exporter.export_item(item)
		return item

5. Configure settings.py

ITEM_PIPELINES sets the order in which the classes in the pipelines file run: the lower the number, the earlier the pipeline runs. Comment out either 'Article.pipelines.JsonWithEncodingPipeline' or 'Article.pipelines.JsonExporterPipeline' to switch between the two JSON saving methods.

# Set the default request headers, including a User-Agent
DEFAULT_REQUEST_HEADERS = {
    "User-Agent" : "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
}

# Configure the item pipelines
ITEM_PIPELINES = {
# 	'Article.pipelines.ArticlePipeline': 300,
    'Article.pipelines.JsonWithEncodingPipeline': 2,
#    'Article.pipelines.JsonExporterPipeline': 2,
}

6. Run the crawler

scrapy crawl jobbole

Error:

TypeError: Object of type 'date' is not JSON serializable

Fix: item["time"] is a date object and must be converted to str before serializing:

item["time"] = str(item["time"])
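
The error and the fix can be reproduced with the standard library alone. This is a small sketch with a made-up item dict. Besides converting the field by hand, json.dumps also accepts a default callable that is applied to any value it cannot serialize.

```python
import datetime
import json

# Made-up item for illustration
item = {"title": "Example post", "time": datetime.date(2018, 1, 1)}

# Serializing a date directly fails
try:
    json.dumps(item)
    error = None
except TypeError as exc:
    error = str(exc)

# Option 1: convert the field to str before dumping (the fix used in the pipeline)
line_a = json.dumps({**item, "time": str(item["time"])}, ensure_ascii=False)

# Option 2: let json.dumps call str() on anything it cannot serialize itself
line_b = json.dumps(item, ensure_ascii=False, default=str)

print(error)
print(line_a)
print(line_b)
```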

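IV. Crawling Jobbole: advanced version

1. The item loader mechanism

In the previous section the spider itself extracted and parsed every field defined in items.py, which makes the parsing logic hard to reuse. The item loader mechanism provides a more convenient way to populate scraped Items. Items can be filled through their dict-like API, but Item Loaders offer a more convenient API that processes the raw data and assigns it to the Item.

(1) Approach

Reference article: "Scrapy crawler study series, part 7: Item Loaders"

Load the Item through an item loader (in the spider file)
- The three main item loader methods are add_css(), add_xpath(), and add_value()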

from scrapy.loader import ItemLoader

# ArticleItem is the item class declared in items.py; response is the response being parsed.
item_loader = ItemLoader(item=ArticleItem(), response=response)
item_loader.add_css("title", ".entry-header h1::text")
item_loader.add_value("url", response.url)
...
# load_item() runs the processors and returns the populated item; by default every field value is a list.
article_item = item_loader.load_item()

Process the data in items.py
- Import the processors you need, for example from scrapy.loader.processors import MapCompose, TakeFirst, Join
  
  Processing functions, including custom ones, can be attached to each scrapy.Field
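
To make the role of the three processors concrete, here is a pure-Python sketch of their behaviour. The functions map_compose, take_first and join are simplified stand-ins written for this illustration, not Scrapy's implementation, and the raw tag strings are made up.

```python
def map_compose(*functions):
    """Simplified stand-in for MapCompose: apply each function to every value in turn."""
    def process(values):
        for func in functions:
            results = []
            for value in values:
                result = func(value)
                if result is None:
                    continue  # a None result drops the value
                if isinstance(result, (list, tuple)):
                    results.extend(result)  # list results are flattened
                else:
                    results.append(result)
            values = results
        return values
    return process


def take_first(values):
    """Simplified stand-in for TakeFirst: return the first non-empty value."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


def join(separator):
    """Simplified stand-in for Join: concatenate the values with a separator."""
    return lambda values: separator.join(values)


# Hypothetical raw strings, as a CSS selector might return them for a tags field
raw_tags = ["  Python ", "   ", " Scrapy "]

# Input processor: strip whitespace, then turn empty strings into None so they are dropped
clean = map_compose(str.strip, lambda text: text or None)
cleaned = clean(raw_tags)

print(cleaned)              # ['Python', 'Scrapy']
print(take_first(cleaned))  # Python
print(join(",")(cleaned))   # Python,Scrapy
```

In a real project the processors imported from Scrapy are passed to scrapy.Field as input_processor or output_processor.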

(2) spider.py

Part of the code in spider.py:

from Article.items import ArticleItem, ArticleItemLoader
from Article.utils.common import get_md5


class JobboleSpider(scrapy.Spider):
    """
    Only the added method is shown; unchanged parts are omitted.
    Remember to point the Request callback in parse() at self.parse_detail.
    """
    def parse_detail(self, response):
        front_image_url = response.meta.get("front_image_url", "")   # cover image
        # Use the custom loader from items.py so every field defaults to TakeFirst
        item_loader = ArticleItemLoader(item=ArticleItem(), response=response)
        item_loader.add_css("title", ".entry-header h1::text")
        item_loader.add_value("url", response.url)
        item_loader.add_value("url_object_id", get_md5(response.url))
        item_loader.add_css("time", "p.entry-meta-hide-on-mobile::text")
        item_loader.add_value("front_image_url", [front_image_url])
        # .vote-post-up is the like button (fav_nums), .bookmark-btn the bookmark button (coll_nums)
        item_loader.add_css("fav_nums", ".vote-post-up h10::text")
        item_loader.add_css("comment_nums", "a[href='#article-comment'] span::text")
        item_loader.add_css("coll_nums", ".bookmark-btn::text")
        item_loader.add_css("tags", "p.entry-meta-hide-on-mobile a::text")
        item_loader.add_css("content", "div.entry")

        article_item = item_loader.load_item()

        yield article_item

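(3) items.py

Define the processing functions, then attach them through the input_processor or output_processor arguments of scrapy.Field. Input processors run on each value as it is added to the loader; output processors run on the collected values when load_item() is called.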

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy.loader import ItemLoader
# Newer Scrapy releases expose these processors from itemloaders.processors instead
from scrapy.loader.processors import MapCompose, TakeFirst, Join

import datetime
import re


def date_convert(value):
    # The raw text carries surrounding whitespace and a trailing "·" separator
    value = value.strip().replace("·", "").strip()
    try:
        time = datetime.datetime.strptime(value, "%Y/%m/%d").date()
    except ValueError:
        # Fall back to today's date when the text is not a date
        time = datetime.datetime.now().date()
    return time

def get_nums(value):
    # Non-greedy ".*?" so the group captures the whole number
    match_re = re.match(r".*?(\d+).*", value)
    if match_re:
        nums = int(match_re.group(1))
    else:
        nums = 0
    return nums

def return_value(value):
    return value

def remove_comment_tags(value):
    # Drop the "N 评论" (comments) entry picked up among the tags.
    # Returning None makes MapCompose discard the value, so Join adds no empty entry.
    if "评论" in value:
        return None
    else:
        return value

class ArticleItemLoader(ItemLoader):
    # Custom item loader: take the first value of every field by default
    default_output_processor = TakeFirst()


class ArticleItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    time = scrapy.Field(input_processor=MapCompose(date_convert))
    url = scrapy.Field()
    url_object_id = scrapy.Field()  # md5 of the URL

    # Override the default TakeFirst so that front_image_url stays a list
    front_image_url = scrapy.Field(output_processor=MapCompose(return_value))
    front_image_path = scrapy.Field()

    fav_nums = scrapy.Field(input_processor=MapCompose(get_nums))
    coll_nums = scrapy.Field(input_processor=MapCompose(get_nums))
    comment_nums = scrapy.Field(input_processor=MapCompose(get_nums))

    content = scrapy.Field()
    tags = scrapy.Field(input_processor=MapCompose(remove_comment_tags),
                        output_processor=Join(","))


    def get_insert_sql(self):
        insert_sql = """
            insert into article(title, time, url, url_object_id, front_image_url, front_image_path, coll_nums, comment_nums, fav_nums, content, tags)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
            ON DUPLICATE KEY UPDATE fav_nums=VALUES(fav_nums)
        """

        # front_image_url is kept as a list; store only its first element
        front_image_url = ""
        if self["front_image_url"]:
            front_image_url = self["front_image_url"][0]
        # front_image_path is never filled in by the spider above, so default to an empty string
        params = (self["title"], self["time"], self["url"], self["url_object_id"], front_image_url,
                  self.get("front_image_path", ""), self["coll_nums"], self["comment_nums"],
                  self["fav_nums"], self["content"], self["tags"])
        return insert_sql, params
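
The get_nums function above relies on a non-greedy prefix in its regular expression. A short sketch of why that matters, using a made-up button text (" 12 收藏", where 收藏 means "bookmarks"):

```python
import re

text = " 12 收藏"  # hypothetical raw text of the bookmark button

# Greedy: ".*" swallows as much as it can, leaving only the last digit for the group
greedy = re.match(r".*(\d+).*", text).group(1)

# Non-greedy: ".*?" stops at the first digit, so the group captures the whole number
lazy = re.match(r".*?(\d+).*", text).group(1)

print(greedy)  # 2
print(lazy)    # 12
```

With the greedy form, any count of 10 or more would be stored as its last digit only.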

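2. The pipelines file

(1) Environment setup (MySQL, Navicat)

To set up the environment, see: Installing MySQL and Navicat on Ubuntu 18.04

## mysqlclient is a MySQL driver
pip install mysqlclient
## The pipelines below import pymysql, which is installed separately
pip install pymysql

The table definition is shown in the figure below:

[Figure: definition of the article table]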

(2) Saving to MySQL (synchronous)

import pymysql
import pymysql.cursors


class MysqlPipeline(object):
    """Write items to MySQL synchronously."""
    def __init__(self):
        # pymysql.connect(host=..., user=..., password=..., db=..., charset=..., use_unicode=True)
        self.conn = pymysql.connect(host='localhost', user='root', password='asdfjkl;', db='atricle', charset="utf8mb4", use_unicode=True)
        self.cursor = self.conn.cursor()
        # Convert the table to utf8mb4 once at start-up instead of on every insert
        self.cursor.execute("alter table article convert to character set utf8mb4;")

    def process_item(self, item, spider):
        insert_sql = """
            insert into article(title, url, url_object_id, time, coll_nums, comment_nums, fav_nums, content)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
        """
        # The driver escapes the parameters itself, so no manual escape_string call is needed
        self.cursor.execute(insert_sql, (
            item["title"],
            item["url"],
            item["url_object_id"],
            item["time"],
            item["coll_nums"],
            item["comment_nums"],
            item["fav_nums"],
            item["content"],
        ))
        self.conn.commit()
        # Return the item so that any later pipeline still receives it
        return item

    def close_spider(self, spider):
        self.conn.close()
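
When values are passed to execute() as parameters, the database driver takes care of quoting them. The sketch below shows the idea with the standard-library sqlite3 module standing in for MySQL, so it runs without a server. The three-column table and the sample item are simplifications made up for this example. sqlite3 uses ? placeholders where pymysql uses %s, and INSERT OR REPLACE is SQLite syntax used here in place of MySQL's ON DUPLICATE KEY UPDATE.

```python
import sqlite3

# sqlite3 stands in for MySQL here; the table is a cut-down version of article
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article (url_object_id TEXT PRIMARY KEY, title TEXT, fav_nums INTEGER)"
)


def save_article(conn, item):
    # Values travel separately from the SQL text, so quotes need no manual escaping
    conn.execute(
        "INSERT OR REPLACE INTO article (url_object_id, title, fav_nums) VALUES (?, ?, ?)",
        (item["url_object_id"], item["title"], item["fav_nums"]),
    )
    conn.commit()


# Made-up item whose title contains both kinds of quotes
item = {"url_object_id": "abc123", "title": "It's a \"quoted\" title", "fav_nums": 3}
save_article(conn, item)
save_article(conn, {**item, "fav_nums": 5})  # same key: the row is replaced, not duplicated

rows = conn.execute("SELECT title, fav_nums FROM article").fetchall()
print(rows)
```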

(3) Saving to MySQL (asynchronous)

When a crawl is large, pages are scraped faster than the database can be written, so large crawls usually store their data asynchronously.

import pymysql
import pymysql.cursors
from twisted.enterprise import adbapi


class MysqlTwistedPipeline(object):
    """Write items to MySQL asynchronously through Twisted's adbapi connection pool."""
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        '''Build the connection pool from values defined in settings.py'''
        dbparams = dict(
            host = settings['MYSQL_HOST'],
            db = settings['MYSQL_DB'],
            user = settings['MYSQL_USER'],
            password = settings['MYSQL_PASSWORD'],
            charset = "utf8mb4",
            cursorclass = pymysql.cursors.DictCursor,
            use_unicode = True,
        )
        dbpool = adbapi.ConnectionPool("pymysql", **dbparams)
        return cls(dbpool)

    def process_item(self, item, spider):
        # Use Twisted to run the MySQL insert asynchronously
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle errors
        # Return the item so that any later pipeline still receives it
        return item

    def handle_error(self, failure, item, spider):
        # Handle errors raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        # Perform the actual insert. item.get_insert_sql() from items.py could be used here
        # to build a different statement for each item type.
        # The article table is assumed to use utf8mb4 already; run the ALTER TABLE once,
        # outside the pipeline, rather than on every insert.
        insert_sql = """
            insert into article(title, url, url_object_id, time, coll_nums, comment_nums, fav_nums, content)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
        """
        # The driver escapes the parameters itself, so no manual escape_string call is needed
        cursor.execute(insert_sql, (
            item["title"],
            item["url"],
            item["url_object_id"],
            item["time"],
            item["coll_nums"],
            item["comment_nums"],
            item["fav_nums"],
            item["content"],
        ))
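
from_settings reads four MYSQL_* keys that the article never shows being defined. A possible settings.py fragment is sketched below. The key names come from the pipeline code above, but every value is a placeholder to be replaced with your own, and the priority number 1 is an arbitrary choice.

```python
# settings.py (all values are placeholders, not taken from the article)
MYSQL_HOST = "localhost"
MYSQL_DB = "your_database"
MYSQL_USER = "your_user"
MYSQL_PASSWORD = "your_password"

# Enable the asynchronous pipeline
ITEM_PIPELINES = {
    "Article.pipelines.MysqlTwistedPipeline": 1,
}
```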

The stored results are shown in the figure below:

[Figure: stored results]
