Extracting Web Page Content with CSS Selectors

2023-06-08 08:19:54

Yesterday I extracted the page content with XPath; today I extract the same fields again using CSS selectors.

Pick any article on Jobbole (伯樂線上). The one used here is: http://blog.jobbole.com/113555/

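# Extract the page fields with CSS selectors
# (this fragment sits inside the spider's parse method and needs `import re` at the top of the module)
        # Title
        title = response.css(".entry-header h1::text").extract_first()
        # Publication date: take the first text node and trim whitespace
        create_data = response.css(".entry-meta-hide-on-mobile::text").extract()[0].strip()
        # Tags: drop the entry ending with "評論" (the comment-count link), then join with commas
        tag_list = response.css(".entry-meta-hide-on-mobile a::text").extract()
        tag_list = [element for element in tag_list if not element.strip().endswith("評論")]
        tags = ",".join(tag_list)
        # Number of likes
        praise_nums = response.css(".vote-post-up h10::text").extract_first()
        # Number of bookmarks: keep only the digits captured by the regex
        fav_nums = response.css("span.btn-bluet-bigger:nth-child(2)::text").extract_first()
        match_re = re.match(r".*?(\d+).*", fav_nums)
        if match_re:
            fav_nums = match_re.group(1)
        # Number of comments: keep only the digits captured by the regex
        comment_nums = response.css("a[href='#article-comment'] span::text").extract_first()
        match_re = re.match(r".*?(\d+).*", comment_nums)
        if match_re:
            comment_nums = match_re.group(1)
        # Article body (the full HTML of the first matching element)
        content = response.css("div .entry").extract()[0]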

Stepping through the code in the debugger confirms that these selectors extract the expected values.
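The tag filtering and the digit extraction are plain string handling, so they can be tried out without running the spider. The following is a minimal sketch, not the code from the original spider: the helper names `clean_tags` and `extract_first_int` are hypothetical, and the fallback to `0` when no digits are found is an assumption about the desired behavior. Unlike the spider fragment, `clean_tags` also strips surrounding whitespace from each tag.

```python
import re


def clean_tags(raw_tags, comment_suffix="評論"):
    """Drop the comment-count entry and join the remaining tags with commas.

    comment_suffix is the text that marks the comment-count link ("comments").
    """
    tags = [t.strip() for t in raw_tags if not t.strip().endswith(comment_suffix)]
    return ",".join(tags)


def extract_first_int(text, default=0):
    """Return the first run of digits in text as an int.

    Falls back to default when text is None or contains no digits
    (an assumed convention, e.g. a post with no bookmarks yet).
    """
    if text is None:
        return default
    match = re.search(r"\d+", text)
    return int(match.group()) if match else default
```

In the spider, the bookmark count could then be written as `fav_nums = extract_first_int(response.css("span.btn-bluet-bigger:nth-child(2)::text").extract_first())`, which also avoids passing `None` to the regex when the selector matches nothing.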

Saving the extracted content in JSON format was covered in the previous post:

http://blog.csdn.net/shengshengshiwo/article/details/79248421
