A working, up-to-date Scrapy spider for scraping Lianjia rental listings

2023-04-24 21:20:16

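I looked through a lot of Lianjia scrapers online, but Lianjia has since changed its page structure, so the older code no longer works. I just wrote an up-to-date spider that retrieves rental listings.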

Tools:

Python 3.6

Scrapy 1.6.0

VS Code editor

I won't cover the basics of using Scrapy here; a quick search turns up plenty of tutorials.

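Note: listings on Lianjia are not laid out uniformly, and some fields are missing, which makes extraction much harder. As a result a few records come out wrong, though the large majority are fine.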
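One way to keep a missing field from crashing the spider is a small None-safe cleanup helper. This is only a sketch: `clean_text` is a hypothetical name that does not appear in the spider, and it assumes the only cleanup needed is removing spaces and newlines, as the spider does.

```python
def clean_text(value, default=''):
    """Strip spaces and newlines from value; return default when value is None."""
    if value is None:
        # extract_first() returns None when the XPath matched nothing
        return default
    return value.replace(' ', '').replace('\n', '')
```

Scrapy's own `extract_first(default='')` covers the same case at the selector level, so the helper is mainly useful when the cleanup is repeated for many fields.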

Here is the main spider code:

# -*- coding: utf-8 -*-
from scrapy import Spider, Request
from szhouse.items import SzhouseItem


class licaiSpider(Spider):

    name = 'lianjiahouse'
    # Listing pages look like https://su.lianjia.com/zufang/pg3/#contentList
    start_urls = ['https://su.lianjia.com/zufang/']

    custom_settings = {
        'ITEM_PIPELINES': {'szhouse.pipelines.SzhousePipeline': 300},
        'DOWNLOAD_DELAY': 0.1,
        'DOWNLOAD_TIMEOUT': 20,
    }

    def parse(self, response):
        # Each listing is one div.content__list--item on the page.
        trs = response.xpath('//div[@class="content w1150"]/div[@class="content__article"]/div[@class="content__list"]/div[@class="content__list--item"]')

        for tr in trs:
            # Create a fresh item per listing so yielded items never share state.
            item = SzhouseItem()
            # default='' keeps .replace() from failing when the title is missing
            item['house_name'] = tr.xpath('./div[@class="content__list--item--main"]/p[@class="content__list--item--title twoline"]/a/text()').extract_first(default='').replace(' ', '').replace('\n', '')
            tt_link = tr.xpath('./a/@href').extract_first(default='')
            # urljoin resolves the relative href against the page URL
            item['link'] = response.urljoin(tt_link)

            # The description line holds the area and the room layout as bare text nodes.
            des = tr.xpath('./div[@class="content__list--item--main"]/p[@class="content__list--item--des"]')
            tt_ad = des.xpath('./a[1]/text()').extract_first()
            if tt_ad is None:
                # "jinsheng" case (the original label): there is no address link,
                # the address sits in a <span>, and the text nodes shift one position earlier.
                item['address'] = des.xpath('./span/text()').extract_first()
                tt_fx = des.xpath('./text()[5]').extract_first(default='')
                tt = des.xpath('./text()[3]').extract_first(default='')
            else:
                item['address'] = tt_ad
                tt_fx = des.xpath('./text()[6]').extract_first(default='')
                tt = des.xpath('./text()[4]').extract_first(default='')
            # fangxing = room layout, mianji = floor area
            item['fangxing'] = tt_fx.replace(' ', '').replace('\n', '')
            item['mianji'] = tt.replace(' ', '').replace('\n', '')

            
            # fabuDate = publication date of the listing
            item['fabuDate'] = tr.xpath('./div[@class="content__list--item--main"]/p[@class="content__list--item--time oneline"]/text()').extract_first()
            # Join the tag labels (the <i> elements) into one comma-separated string;
            # default='' avoids a TypeError when a tag has no text.
            d_list = tr.xpath('./div[@class="content__list--item--main"]/p[@class="content__list--item--bottom oneline"]/i')
            item['detail'] = ','.join(ii.xpath('./text()').extract_first(default='') for ii in d_list)
            item['money'] = tr.xpath('./div[@class="content__list--item--main"]/span/em/text()').extract_first()

            yield item
            
        
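        # Queue listing pages 1-100. Every parsed page runs this loop again;
        # Scrapy's default duplicate filter drops requests for URLs already seen.
        for i in range(1, 101):
            url = 'https://su.lianjia.com/zufang/pg{}/'.format(i)
            yield Request(url, callback=self.parse)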

I saved the scraped data to a CSV file: 100 pages in total, roughly 8,000-plus records.
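The pipeline that writes the CSV is not shown in this post. Below is a minimal sketch of what `szhouse.pipelines.SzhousePipeline` could look like, not the actual implementation. The field names come from the spider; the column order, the `path` parameter, and the output filename `lianjia.csv` are assumptions.

```python
import csv

# Field names taken from the spider; the column order is an assumption.
FIELDS = ['house_name', 'link', 'address', 'fangxing', 'mianji',
          'fabuDate', 'detail', 'money']


class SzhousePipeline(object):
    """Write every scraped item as one row of a CSV file."""

    def __init__(self, path='lianjia.csv'):
        # 'lianjia.csv' is a hypothetical default filename
        self.path = path

    def open_spider(self, spider):
        # utf-8-sig adds a BOM so spreadsheet tools detect the encoding
        self.file = open(self.path, 'w', newline='', encoding='utf-8-sig')
        self.writer = csv.DictWriter(self.file, fieldnames=FIELDS)
        self.writer.writeheader()

    def process_item(self, item, spider):
        # Missing fields become empty cells instead of raising KeyError
        self.writer.writerow({name: item.get(name, '') for name in FIELDS})
        return item

    def close_spider(self, spider):
        self.file.close()
```

If no custom processing is needed, Scrapy's built-in feed export can also produce a CSV directly, for example `scrapy crawl lianjiahouse -o houses.csv` (the filename is arbitrary).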
