Crawler Example 1: Scraping a News List and Publication Times

2017-11-11 23:50:00

1. Create the project

scrapy startproject shop

2. Code for items.py:

import scrapy


class ShopItem(scrapy.Item):
    title = scrapy.Field()  # list of headline strings
    time = scrapy.Field()   # list of publication-date strings

3. Spider code in shopspider.py

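# -*- coding: UTF-8 -*-
import scrapy

from shop.items import ShopItem


class shopSpider(scrapy.Spider):
    name = "shop"
    allowed_domains = ["news.xxxxxxx.xx.cn"]
    # The original post does not show start_urls, but the spider needs one to
    # issue its first request. The value below is only a placeholder built
    # from the redacted domain above.
    start_urls = ["http://news.xxxxxxx.xx.cn/"]

    def parse(self, response):
        item = ShopItem()
        # Each XPath returns a list with one entry per <li> in the news block.
        item['title'] = response.xpath("//div[@class='txttotwe2']/ul/li/a/text()").extract()
        item['time'] = response.xpath("//div[@class='txttotwe2']/ul/li/font/text()").extract()
        yield item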
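The spider yields a single item whose title and time fields are two parallel lists. If one record per article is preferred, the lists can be paired up before use. The following is a minimal sketch of that pairing in plain Python; the function name pair_articles and the sample values are assumptions for illustration and are not part of the original project.

```python
def pair_articles(titles, times):
    """Pair each title with the time at the same position.

    Both arguments stand in for the lists that .extract() returns in the
    spider. zip() stops at the shorter list, so extra entries in the
    longer list are dropped.
    """
    return [{"title": t, "time": d} for t, d in zip(titles, times)]


# Hypothetical sample data standing in for the scraped lists.
sample_titles = ["First headline", "Second headline"]
sample_times = ["(2017-06-16)", "(2017-06-15)"]

articles = pair_articles(sample_titles, sample_times)
for article in articles:
    print(article["title"], article["time"])
```

Pairing by position only works if every list entry has both a link and a date; if some entries lack a date, the two lists no longer line up.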

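4. Code for pipelines.py (prints the content):

Note: under Python 2, printing the extracted lists directly in shopspider.py shows unicode escapes (u'\u....') instead of readable text, because printing a list displays the repr of each element. The pipeline below prints each string on its own, so the content displays normally.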
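A small Python 3 sketch can reproduce the effect. It assumes the difference comes from printing a whole list versus printing single strings; ascii() stands in for the way Python 2 displayed a list of unicode strings, and the sample titles are made up.

```python
# Python 3 sketch; ascii() approximates the Python 2 list display.
titles = ["食品安全", "考試報名"]  # hypothetical sample titles

# Printing a whole list shows the repr of each element.
# In Python 2 that repr was escaped, which ascii() imitates here.
escaped = ascii(titles)
print(escaped)

# Printing one string at a time shows the readable text.
for title in titles:
    print("biaoti: " + title)
```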

class ShopPipeline(object):
    def process_item(self, item, spider):
        count = len(item['title'])
        # Single-argument print() calls behave the same under Python 2 and 3.
        print('news count: %d' % count)
        for i in range(count):
            print('biaoti: ' + item['title'][i])
            print('shijian: ' + item['time'][i])
        return item
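Scrapy only calls a pipeline that is enabled in the project's settings.py, which the post does not show. Assuming the default layout produced by scrapy startproject shop, the entry would look roughly like this; the priority 300 is an arbitrary choice.

```python
# settings.py (fragment)
# Maps the pipeline's import path to its order; lower values run first.
# Scrapy's convention is to use integers in the 0-1000 range.
ITEM_PIPELINES = {
    "shop.pipelines.ShopPipeline": 300,
}
```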

5. Crawl output:

root@kali:~/shop# scrapy crawl shop --nolog

news count: 40

biaoti: xxx建成國家食品安全示範城市

shijian: (2017-06-16)

biaoti: xxxx考試開始報名

……………………

…………………..
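The shijian values in the output are plain strings wrapped in parentheses, such as (2017-06-16). If real dates are needed later, for sorting or filtering, they could be converted along these lines. This is a sketch that assumes every value follows the (YYYY-MM-DD) pattern seen above; parse_listing_date is a hypothetical helper name.

```python
from datetime import datetime


def parse_listing_date(raw):
    """Convert a scraped string such as '(2017-06-16)' into a date object."""
    cleaned = raw.strip().strip("()")  # drop whitespace, then the parentheses
    return datetime.strptime(cleaned, "%Y-%m-%d").date()


parsed = parse_listing_date("(2017-06-16)")
print(parsed.isoformat())  # prints 2017-06-16
```

strptime raises ValueError for a string that does not match the format, so a page with a different date style would fail loudly rather than produce wrong dates.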

This article is reposted from the 51CTO blog of 老鷹a. Original link: http://blog.51cto.com/laoyinga/1940001
