Crawling a Blog with the Scrapy Framework: A Worked Example

Target: a Livedoor Blog site (Japanese)
Data collected: blog link, title, category, post time, comment count
Built with Python and the Scrapy framework

Contents

Crawling a Blog with the Scrapy Framework: A Worked Example
- Define items.py
- Create Blogspider.py in the spiders folder
- Add your browser information to the settings file
- Output

Define items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class BlogscrapingItem(scrapy.Item):
    # Blog post title
    name = scrapy.Field()
    # Post time
    time = scrapy.Field()
    # Category
    category = scrapy.Field()
    # Comment count
    comment = scrapy.Field()
    # Link to the post
    link = scrapy.Field()
    # Article text
    str = scrapy.Field()
    # Number of images
    img_number = scrapy.Field()

Create Blogspider.py in the spiders folder

# -*- coding: utf-8 -*-
import scrapy
from Blogscraping.items import BlogscrapingItem


class Blogspider(scrapy.Spider):
    # Spider name: this is the name passed to `scrapy crawl`
    name = 'Blog_spider'
    allowed_domains = ['jin115.com']
    # Start URL
    start_urls = ['http://jin115.com/']

    # Parse callback: extracts one item per article on the index page
    def parse(self, response):
        blog_list = response.xpath("//div[@class='autopagerize_page_element']/section[@class='index_article_container']")
        
        for i_item in blog_list:
            blog_item = BlogscrapingItem()
            # Blog post title
            blog_item['name'] = i_item.xpath(".//div[@class='index_article']/div[@class='index_article_header']/h2/a/text()").extract_first()
            # Category (all whitespace is stripped from the text)
            content = i_item.xpath(".//div[@class='index_article_header_header']/div[@class='index_article_header_category']/a[1]/text()").extract()
            for i_content in content:
                content_s = "".join(i_content.split())
                blog_item['category'] = content_s
            # Post time
            blog_item['time'] = i_item.xpath(".//div[@class='index_article']/div[@class='index_article_header']/div[@class='index_article_header_header']/div[@class='index_article_header_date']/time/text()").extract_first()
            # Comment count
            blog_item['comment'] = i_item.xpath(".//div[@class='index_article']//div[@class='index_article_footer_comment']/a[2]/text()").extract_first()
            # Link to the post
            blog_item['link'] = i_item.xpath(".//div[@class='index_article']/div[@class='index_article_header']/h2/a/@href").extract_first()
            yield blog_item
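        # Next page: follow the pagination link, if there is one
        next_link = response.xpath("//div[@id='footer_navi']//li[@class='paging-next']/a/@href").extract_first()
        if next_link:
            # urljoin resolves a relative href against the current page URL
            yield scrapy.Request(response.urljoin(next_link), callback=self.parse)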
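
Two small steps in the spider are easy to miss: the category text has all of its whitespace stripped, and the next-page link has to be an absolute URL before it can be requested. The sketch below reproduces both steps with the standard library only, so it can be tried without Scrapy. The helper names `clean_category` and `next_page_url` and the sample inputs are assumptions made for illustration; they are not part of the original project.

```python
from urllib.parse import urljoin


def clean_category(raw_texts):
    """Mimic the spider's category loop on a list of extracted text nodes.

    Each text has every whitespace character removed; the last one wins.
    Returns None when the list is empty (the spider then sets no category).
    """
    category = None
    for text in raw_texts:
        category = "".join(text.split())
    return category


def next_page_url(current_url, hrefs):
    """Return an absolute URL for the first pagination href, or None."""
    if not hrefs:
        return None
    # urljoin leaves absolute hrefs untouched and resolves relative ones.
    return urljoin(current_url, hrefs[0])


# Hypothetical inputs, shaped like what .extract() would return.
category = clean_category(["\n    ゲーム \n"])
absolute = next_page_url("http://jin115.com/", ["http://jin115.com/?p=2"])
relative = next_page_url("http://jin115.com/", ["/?p=2"])
```

Inside a Scrapy callback, `response.urljoin(href)` does the same resolution against the URL of the current response.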

Add your browser information to the settings file
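
A minimal sketch of what this step might look like, assuming that "browser information" means the User-Agent string Scrapy sends with each request. `USER_AGENT`, `ROBOTSTXT_OBEY`, and `DOWNLOAD_DELAY` are standard Scrapy settings; the values below are illustrative placeholders and are not taken from the original project.

```python
# settings.py (excerpt)

# Assumption: replace this placeholder with the User-Agent string of your
# own browser (visible in the browser's developer tools, under the request
# headers of any page load).
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)

# Optional and illustrative: respect robots.txt and pause between requests
# so the crawl stays polite toward the target site.
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1  # seconds
```

Setting a browser-like User-Agent matters because some sites respond differently to the default Scrapy identifier.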

Output

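# Run the spider (terminal command); the argument is the spider's name attribute

scrapy crawl Blog_spider

# Run the spider and save the items as CSV (terminal command)

scrapy crawl Blog_spider -o filename.csv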
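
Once the CSV exists, it can be inspected with Python's `csv` module. The sketch below reads rows by column name, so it does not depend on the order in which the columns were exported. The sample rows, the helper names `load_rows` and `posts_per_category`, and the inline data are hypothetical and stand in for a real export.

```python
import csv
import io
from collections import Counter

# Hypothetical sample shaped like the exported file: a header row with the
# item field names, then one row per scraped post.
SAMPLE_CSV = """name,time,category,comment,link
Post A,2020-01-01 10:00,Games,12,http://example.com/a.html
Post B,2020-01-01 12:00,Anime,3,http://example.com/b.html
Post C,2020-01-02 09:30,Games,45,http://example.com/c.html
"""


def load_rows(fileobj):
    """Read every CSV row into a dict keyed by the header names."""
    return list(csv.DictReader(fileobj))


def posts_per_category(rows):
    """Count how many posts fall into each category."""
    return Counter(row["category"] for row in rows)


rows = load_rows(io.StringIO(SAMPLE_CSV))
counts = posts_per_category(rows)
```

For a real export, pass `open("filename.csv", newline="", encoding="utf-8")` to `load_rows` in place of the `StringIO` object.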

# Results of the example

[Screenshots of the crawl results]
