Goal: crawl all novels in the first X pages of category X on Qidian (起點) and build a word cloud from their synopses.
Powered by:
- Python 3.6
- Scrapy 1.4
- pymysql
- wordcloud
- jieba
- macOS 10.12.6
Project repository: https://github.com/Dengqlbq/NovelSpiderAndWordcloud.git
Step 1: Create the project
cd YOURPATH
scrapy startproject QiDian
The default structure of the QiDian project:
Step 2: Write the item
Scrapy's division of labor (simplified, and specific to this example):
spider: crawls pages, parses the content into Items, and puts newly discovered requests into the queue
item: a container for the scraped content
pipelines: process each Item, e.g. save it to a database or write it to a file
The Item defines what data we store; in this example that is the book title, author name, and synopsis.
# items.py
import scrapy


class QiDianNovelItem(scrapy.Item):
    # define the fields for your item here
    name = scrapy.Field()
    author = scrapy.Field()
    intro = scrapy.Field()
Step 3: Write the spider
Create 'QiDianNovelSpider.py' in the spiders folder.
# QiDianNovelSpider.py
from QiDian.items import QiDianNovelItem
from scrapy.spiders import Spider
from scrapy import Request


class QiDianNovelSpider(Spider):
    name = 'qi_dian_novel_spider'
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/53.0.2785.143 Safari/537.36'
    }
    page = 1
    url = 'http://f.qidian.com/all?chanId=21&orderId=&vip=hidden' \
          '&style=1&pageSize=20&siteid=1&hiddenField=1&page=%d'

    def start_requests(self):
        yield Request(self.url % self.page, headers=self.header)

    def parse(self, response):
        novels = response.xpath('//ul[@class="all-img-list cf"]/li/div[@class="book-mid-info"]')
        for novel in novels:
            # create a fresh Item per novel rather than reusing one instance
            item = QiDianNovelItem()
            item['name'] = novel.xpath('.//h4/a/text()').extract()[0]
            item['author'] = novel.xpath('.//p[@class="author"]/a[1]/text()').extract()[0]
            item['intro'] = novel.xpath('.//p[@class="intro"]/text()').extract()[0]
            yield item
        if self.page < 20:
            self.page += 1
            yield Request(self.url % self.page, headers=self.header)
When the spider starts, start_requests() supplies the initial request objects; after that, requests are taken from the queue.
The spider fetches each page according to its request and wraps the result in a response object.
By default the spider calls parse() to handle the response.
parse() unpacks the response, extracts the structured data into items, and puts newly generated requests into the queue.
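The request-queue loop described above can be sketched outside Scrapy with plain Python. The Request/Response stand-ins and the URLs below are illustrative only, not Scrapy's actual implementation:

```python
from collections import deque

# Minimal stand-ins for Scrapy's Request/Response (illustrative, not real Scrapy)
class Request:
    def __init__(self, url):
        self.url = url

class Response:
    def __init__(self, url):
        self.url = url

def fetch(request):
    # pretend download step: turn a request into a response
    return Response(request.url)

def parse(response):
    # yields items and/or follow-up requests, like the spider's parse()
    page = int(response.url.rsplit('=', 1)[1])
    yield {'item_from_page': page}
    if page < 3:                    # paginate, like `if self.page < 20`
        yield Request('http://example.com/all?page=%d' % (page + 1))

queue = deque([Request('http://example.com/all?page=1')])   # start_requests()
items = []
while queue:
    response = fetch(queue.popleft())
    for result in parse(response):
        if isinstance(result, Request):
            queue.append(result)    # new request goes back into the queue
        else:
            items.append(result)    # item goes on to the pipelines

print(items)
```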
Step 4: Write the pipeline
With the spider and item written, the crawler can already run; at this point the items' data is printed to the screen,
or can be written to a file via command-line options. We, however, want to store it in a database.
# pipelines.py
import pymysql


class QiDianPipeline(object):

    def __init__(self):
        self.connect = pymysql.connect(
            host='127.0.0.1',
            db='Scrapy_test',
            user='Your_user',
            passwd='Your_pass',
            charset='utf8',
            use_unicode=True)
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        sql = 'insert into Scrapy_test.novel(name,author,intro) values (%s,%s,%s)'
        self.cursor.execute(sql, (item['name'], item['author'], item['intro']))
        self.connect.commit()
        return item
Then register the item pipeline in settings.py:
ITEM_PIPELINES = {
    'QiDian.pipelines.QiDianPipeline': 300,
}
Remember to create the database and table first.
The database credentials are best kept in settings.py and read back with e.g. host=settings['MYSQL_HOST'].
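One way to do that is to give the pipeline a from_crawler() factory, which Scrapy calls with access to the project settings. This is a sketch: the MYSQL_* setting names are my own choice, not part of the project, and the real connection call is left out:

```python
# settings.py would hold (illustrative names):
# MYSQL_HOST = '127.0.0.1'
# MYSQL_DB = 'Scrapy_test'
# MYSQL_USER = 'Your_user'
# MYSQL_PASSWD = 'Your_pass'

class QiDianPipeline(object):
    def __init__(self, host, db, user, passwd):
        # In real code you would call pymysql.connect(...) here (or in
        # open_spider()); this sketch only keeps the configuration values.
        self.host, self.db, self.user, self.passwd = host, db, user, passwd

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this classmethod when building the pipeline;
        # crawler.settings exposes the values from settings.py.
        s = crawler.settings
        return cls(s['MYSQL_HOST'], s['MYSQL_DB'],
                   s['MYSQL_USER'], s['MYSQL_PASSWD'])
```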
Step 5: Crawl the data
The code is written; test that the spider works:
cd YOURPATH
scrapy crawl qi_dian_novel_spider
Step 6: Build the word cloud
The spider works and the data is in the database; now build the word cloud.
# mywordcloud.py
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from PIL import Image
import numpy
import jieba
import pymysql

connect = pymysql.connect(
    host='127.0.0.1',
    db='Scrapy_test',
    user='Your_user',
    passwd='Your_pass',
    charset='utf8',
    use_unicode=True)
cursor = connect.cursor()

sql = 'select intro from Scrapy_test.novel'
cursor.execute(sql)
result = cursor.fetchall()

txt = ''
for r in result:
    txt += r[0].strip() + '。'

wordlist = jieba.cut(txt)
ptxt = ' '.join(wordlist)

image = numpy.array(Image.open('Girl.png'))  # custom mask image
# Use a font that supports Chinese: the font bundled with wordcloud does not,
# so Chinese characters would come out garbled.
wc = WordCloud(background_color='white', max_words=500, max_font_size=60,
               mask=image, font_path='FangSong_GB2312.ttf').generate(ptxt)
plt.imshow(wc)
plt.axis("off")
plt.show()
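generate() tokenizes the space-joined string and counts word frequencies internally; the counting step can be sketched with the standard library, and WordCloud's generate_from_frequencies() accepts such a dict directly. The sample tokens below stand in for jieba's output:

```python
from collections import Counter

# jieba.cut would produce the tokens; this pre-segmented string is a stand-in.
ptxt = '少年 修煉 少年 崛起 修煉 少年'
frequencies = Counter(ptxt.split())
print(frequencies.most_common(2))  # [('少年', 3), ('修煉', 2)]

# WordCloud can consume the dict directly instead of re-tokenizing:
# wc = WordCloud(font_path='FangSong_GB2312.ttf').generate_from_frequencies(frequencies)
```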
The result: