Python + Requests + Scrapy: A One-Week Weather Forecast

2023-04-15 06:26:05

Open the China Weather (中國天氣) home page and pick a city, for example Guangzhou.

Go to that city's detailed forecast page and select the "7天" (7-day) view.

Inspect the structure of the one-week forecast and find its root node. The root node is the div tag with id="7d".

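Next, examine the nodes of the daily forecast list. The list is a ul tag with class="t clearfix" that contains multiple li tags, one per day. Each daily forecast carries four basic elements: date, weather, temperature, and wind.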

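Because the China Weather city forecast page is static and its data is not generated by JavaScript, fetching the page source is not complicated, and the overall procedure is simple. Fetch the HTML source of the city's 7-day forecast page (the requests library works; the final code uses the standard library's urllib.request), parse the nodes of the daily forecast list with scrapy's Selector and XPath, and extract the four basic elements. Each element is wrapped in simple HTML tags, so stripping the tags is all that remains.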
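
For the fetching step, a version based on the requests library could look like the sketch below. It is an illustration under assumptions rather than the article's own code: `fetch_html`, `URL`, `HEADERS` and the `http` parameter are hypothetical names, and the User-Agent value is only an example.

```python
import requests

# Assumed values: the URL is the Guangzhou page used in this article;
# the User-Agent string is only an example.
URL = 'http://www.weather.com.cn/weather/101280101.shtml'
HEADERS = {'User-Agent': 'Mozilla/5.0'}


def fetch_html(url=URL, timeout=3, http=requests):
    """Fetch a page and return its source as text.

    `http` is any object with a requests-style get(); it defaults to the
    requests module and exists so the function can be exercised offline.
    """
    resp = http.get(url, headers=HEADERS, timeout=timeout)
    resp.raise_for_status()   # raise on 4xx/5xx instead of parsing an error page
    resp.encoding = 'utf-8'   # assumption: the page is served as UTF-8
    return resp.text          # requests decompresses gzip bodies transparently
```

Calling `fetch_html()` with no arguments would request the Guangzhou page, and the returned text can be passed straight to `Selector(text=...)`.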

The final code is as follows:

"""
@author: MR.N
@created: 2021-08-22 12:30 AM Sun.

"""
import gzip
import urllib.request

from scrapy import Selector


# 7-day forecast page for Guangzhou
GUANG_ZHOU = 'http://www.weather.com.cn/weather/101280101.shtml'
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 7.0; SM-G892A Build/NRD90M; wv) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/67.0.3396.87 Mobile Safari/537.36',
    # Advertise gzip only: it is the only encoding decoded below
    'Accept-Encoding': 'gzip',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Site': 'none',
    'Upgrade-Insecure-Requests': '1',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive'
}


def get_local_weather(timeout=3):
    """Return the 7-day forecast as text, one line per day ('' on failure)."""
    req = urllib.request.Request(GUANG_ZHOU, headers=headers)
    with urllib.request.urlopen(req, timeout=timeout) as res:
        if res.getcode() != 200:
            print(res.getcode())
            return ''
        raw = res.read()
        # Decompress only when the server actually sent a gzip body
        if res.headers.get('Content-Encoding', '').lower() == 'gzip':
            raw = gzip.decompress(raw)
    data = raw.decode('UTF-8', errors='strict')
    if len(data) < 100:
        print('no data', data)
        return ''
    sel = Selector(text=data)
    # One <li> per day under the root node <div id="7d">
    groups = sel.xpath('//div[@id="7d"]/ul[@class="t clearfix"]/li')
    if not groups:
        print('no match')
        return ''
    print('[data]', len(groups))
    lines = []
    for group in groups:
        # Relative XPaths (".//") keep each query inside the current <li>
        date_str = group.xpath('.//h1/text()').get(default='')
        wea_str = group.xpath('.//p[@class="wea"]/text()').get(default='')
        # Joining the text nodes drops the <span>/<i> tags around the temperatures
        temp_str = ''.join(group.xpath('.//p[@class="tem"]//text()').getall()).replace('\n', '')
        # text() is already entity-decoded, so no '&lt;' replacement is needed
        win_str = group.xpath('.//p[@class="win"]/i/text()').get(default='')
        lines.append('[' + date_str + ']  ' + wea_str + ' ' + temp_str + ' ' + win_str)
    return '\n'.join(lines)
