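Project Goal
Atmospheric motion is extremely complex, many factors influence the weather, and our ability to understand atmospheric motion is limited, so forecast accuracy remains relatively low. In practice every forecast is an involved process: the forecaster has to analyze many inputs together and predict individual meteorological elements such as temperature and precipitation. This project trains a binary classification model that predicts, from a given set of weather factors, whether it will rain in a city (the label column is RainTomorrow).
Data Description
The dataset contains roughly 150,000 daily observations from multiple weather stations across Australia. The project randomly draws 10,000 records as the working sample. The features are: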
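Feature | Meaning |
---|---|
Date | Date of the observation |
Location | Name of the weather station that recorded the observation |
MinTemp | Minimum temperature, in degrees Celsius |
MaxTemp | Maximum temperature, in degrees Celsius |
Rainfall | Rainfall recorded for the day, in mm |
Evaporation | Class A pan evaporation in the 24 hours to 9 am (mm) |
Sunshine | Number of hours of bright sunshine during the day |
WindGustDir | Direction of the strongest wind gust in the 24 hours to midnight |
WindGustSpeed | Speed of the strongest wind gust in the 24 hours to midnight (km/h) |
WindDir9am | Wind direction at 9 am |
WindDir3pm | Wind direction at 3 pm |
WindSpeed9am | Wind speed averaged over the 10 minutes before 9 am (km/h) |
WindSpeed3pm | Wind speed averaged over the 10 minutes before 3 pm (km/h) |
Humidity9am | Humidity at 9 am (percent) |
Humidity3pm | Humidity at 3 pm (percent) |
Pressure9am | Atmospheric pressure reduced to mean sea level at 9 am (hPa) |
Pressure3pm | Atmospheric pressure reduced to mean sea level at 3 pm (hPa) |
Cloud9am | Fraction of the sky obscured by cloud at 9 am (oktas): 0 means a completely clear sky and 8 means completely overcast |
Cloud3pm | Fraction of the sky obscured by cloud at 3 pm (same 0 to 8 scale) |
Temp9am | Temperature at 9 am, in degrees Celsius |
Temp3pm | Temperature at 3 pm, in degrees Celsius |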
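Project Workflow
- Handle missing values
- Drop features irrelevant to the prediction
- Random sampling
- Encode the categorical variables
- Handle outliers
- Normalize the data
- Train the model
- Predict with the model
Project Code (Jupyter)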
import pandas as pd
import numpy as np
Load and explore the data
weather = pd.read_csv("weather.csv", index_col=0)
weather.head()
weather.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 142193 entries, 0 to 142192
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MinTemp 141556 non-null float64
1 MaxTemp 141871 non-null float64
2 Rainfall 140787 non-null float64
3 Evaporation 81350 non-null float64
4 Sunshine 74377 non-null float64
5 WindGustDir 132863 non-null object
6 WindGustSpeed 132923 non-null float64
7 WindDir9am 132180 non-null object
8 WindDir3pm 138415 non-null object
9 WindSpeed9am 140845 non-null float64
10 WindSpeed3pm 139563 non-null float64
11 Humidity9am 140419 non-null float64
12 Humidity3pm 138583 non-null float64
13 Pressure9am 128179 non-null float64
14 Pressure3pm 128212 non-null float64
15 Cloud9am 88536 non-null float64
16 Cloud3pm 85099 non-null float64
17 Temp9am 141289 non-null float64
18 Temp3pm 139467 non-null float64
19 RainTomorrow 142193 non-null object
dtypes: float64(16), object(4)
memory usage: 22.8+ MB
Drop features irrelevant to the prediction
weather.drop(["Date", "Location"],inplace=True, axis=1)
Drop rows with missing values and reset the index
weather.dropna(inplace=True)
weather.index = range(len(weather))
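The workflow lists a random-sampling step, and the cells that follow operate on a `weather_sample` frame, but the cell that creates it is not shown. A minimal sketch of what that step might look like, assuming a simple random sample of 10,000 rows without replacement. The `random_state` value and the toy stand-in for `weather` are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cleaned `weather` frame (hypothetical data).
rng = np.random.default_rng(0)
weather = pd.DataFrame({
    "MinTemp": rng.normal(12, 6, size=50_000),
    "RainTomorrow": rng.choice(["No", "Yes"], size=50_000, p=[0.78, 0.22]),
})

# Draw 10,000 rows without replacement; random_state is an arbitrary seed.
weather_sample = weather.sample(n=10_000, random_state=42)

# Reset to a 0..n-1 index so later column assignments align row by row.
weather_sample = weather_sample.reset_index(drop=True)

print(weather_sample.shape)
```

Resetting the index matters here because several later cells assign freshly built frames back into `weather_sample`, and pandas aligns those assignments on the index.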
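1. WindGustDir, WindDir9am and WindDir3pm are unordered (nominal) qualitative data: use OneHotEncoder
2. Cloud9am and Cloud3pm are ordered (ordinal) qualitative data: use OrdinalEncoder
3. RainTomorrow is the label: use LabelEncoder
For simplicity, of the three wind-direction features WindGustDir, WindDir9am and WindDir3pm, only the first one, the direction of the strongest gust, is kept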
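To make the difference between the three encoders concrete, here is a small sketch on made-up values (the toy arrays are assumptions, not rows from the dataset). It calls `.toarray()` on the one-hot result so that it does not depend on the encoder's dense-output parameter:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# Nominal feature: one 0/1 column per category, no order implied.
wind = np.array([["N"], ["E"], ["S"], ["E"]])
onehot = OneHotEncoder().fit_transform(wind).toarray()

# Ordinal feature: categories keep their sorted order as 0, 1, 2, ...
cloud = np.array([[0.0], [8.0], [3.0], [8.0]])
ordinal = OrdinalEncoder().fit_transform(cloud)

# Label: a 1-D target mapped to integers in sorted order (No -> 0, Yes -> 1).
rain = np.array(["No", "Yes", "No", "No"])
label = LabelEncoder().fit_transform(rain)

print(onehot)
print(ordinal.ravel())
print(label)
```

The one-hot columns follow the sorted category order (E, N, S), so the first row "N" becomes `[0, 1, 0]`.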
weather_sample.drop(["WindDir9am", "WindDir3pm"], inplace=True, axis=1)
Encode the categorical variables
from sklearn.preprocessing import OneHotEncoder,OrdinalEncoder,LabelEncoder
print(np.unique(weather_sample["RainTomorrow"]))
print(np.unique(weather_sample["WindGustDir"]))
print(np.unique(weather_sample["Cloud9am"]))
print(np.unique(weather_sample["Cloud3pm"]))
['No' 'Yes']
['E' 'ENE' 'ESE' 'N' 'NE' 'NNE' 'NNW' 'NW' 'S' 'SE' 'SSE' 'SSW' 'SW' 'W'
 'WNW' 'WSW']
[0. 1. 2. 3. 4. 5. 6. 7. 8.]
[0. 1. 2. 3. 4. 5. 6. 7. 8.]
# Check for class imbalance; it is relatively mild
weather_sample["RainTomorrow"].value_counts()
No 7750
Yes 2250
Name: RainTomorrow, dtype: int64
# Encode the label (No -> 0, Yes -> 1)
weather_sample["RainTomorrow"] = LabelEncoder().fit_transform(weather_sample["RainTomorrow"])
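# Encode Cloud9am and Cloud3pm; the encoder fitted on Cloud9am is reused for
# Cloud3pm, which works because both columns contain the same categories (0 to 8)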
oe = OrdinalEncoder().fit(weather_sample["Cloud9am"].values.reshape(-1, 1))
weather_sample["Cloud9am"] = pd.DataFrame(oe.transform(weather_sample["Cloud9am"].values.reshape(-1, 1)))
weather_sample["Cloud3pm"] = pd.DataFrame(oe.transform(weather_sample["Cloud3pm"].values.reshape(-1, 1)))
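# Encode WindGustDir
# scikit-learn >= 1.2 uses sparse_output; older releases use sparse=False
ohe = OneHotEncoder(sparse_output=False)
ohe.fit(weather_sample["WindGustDir"].values.reshape(-1, 1))
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
WindGustDir_df = pd.DataFrame(ohe.transform(weather_sample["WindGustDir"].values.reshape(-1, 1)), columns=ohe.get_feature_names_out())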
WindGustDir_df.tail()
Merge the data
weather_sample_new = pd.concat([weather_sample,WindGustDir_df],axis=1)
weather_sample_new.drop(["WindGustDir"], inplace=True, axis=1)
weather_sample_new
Reorder the columns so that the numeric and categorical variables are separated, which makes normalization easier
Cloud9am = weather_sample_new.iloc[:,12]
Cloud3pm = weather_sample_new.iloc[:,13]
weather_sample_new.drop(["Cloud9am"], inplace=True, axis=1)
weather_sample_new.drop(["Cloud3pm"], inplace=True, axis=1)
weather_sample_new["Cloud9am"] = Cloud9am
weather_sample_new["Cloud3pm"] = Cloud3pm
RainTomorrow = weather_sample_new["RainTomorrow"]
weather_sample_new.drop(["RainTomorrow"], inplace=True, axis=1)
weather_sample_new["RainTomorrow"] = RainTomorrow
weather_sample_new.head()
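Positional indexes such as 12 and 13 only work while the column order stays exactly as it is above. A hedged alternative is to build the new order from column names. The sketch below uses a toy frame with made-up values, and it assumes the one-hot columns carry the `x0_` prefix that the encoder generates for unnamed input:

```python
import pandas as pd

# Toy frame with a mixed column order (values are made up).
df = pd.DataFrame({
    "MinTemp": [10.0, 12.5],
    "Cloud9am": [1.0, 7.0],
    "Temp9am": [15.0, 16.0],
    "RainTomorrow": [0, 1],
    "x0_E": [1.0, 0.0],
    "x0_N": [0.0, 1.0],
})

categorical = [c for c in df.columns if c.startswith("x0_")] + ["Cloud9am"]
label = "RainTomorrow"
numeric = [c for c in df.columns if c not in categorical and c != label]

# Numeric columns first, then the categorical ones, with the label last.
df = df[numeric + categorical + [label]]
print(list(df.columns))
```

With the names collected this way, the later numeric/categorical split can slice on `len(numeric)` instead of a hard-coded position.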
To keep outliers from distorting the normalization, handle them first
# Inspect the data for outliers
weather_sample_new.describe([0.01,0.99])
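Normalization applies only to the numeric variables, so split the two groups
# Slice out the numeric and the categorical variables
# .copy() makes the numeric slice independent, so capping it in place does not trigger SettingWithCopyWarning
weather_sample_mv = weather_sample_new.iloc[:,0:14].copy()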
weather_sample_cv = weather_sample_new.iloc[:,14:33]
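Handle outliers by capping
## Cap outliers in the numeric variables at the given quantiles
def cap(df,quantile=[0.01,0.99]):
    for col in df:
        # Compute the lower and upper quantiles
        Q01,Q99 = df[col].quantile(quantile).values.tolist()
        # Replace values outside the range with the corresponding quantile
        if Q01 > df[col].min():
            df.loc[df[col] < Q01, col] = Q01
        if Q99 < df[col].max():
            df.loc[df[col] > Q99, col] = Q99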
cap(weather_sample_mv)
weather_sample_mv.describe([0.01,0.99])
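The same capping can also be written without an explicit loop. The sketch below is an alternative formulation, not the notebook's own code: it clips every column to its own 1% and 99% quantiles with `DataFrame.clip` and returns a new frame instead of modifying the input. The function name `cap_with_clip` and the toy data are assumptions for illustration:

```python
import pandas as pd

def cap_with_clip(df, lower_q=0.01, upper_q=0.99):
    """Return a copy with each column clipped to its own quantile range."""
    lower = df.quantile(lower_q)
    upper = df.quantile(upper_q)
    # axis=1 aligns the per-column bounds with the frame's columns.
    return df.clip(lower=lower, upper=upper, axis=1)

# Toy data with one extreme value per column (made-up numbers).
toy = pd.DataFrame({
    "a": [float(i) for i in range(100)] + [10_000.0],
    "b": [-500.0] + [float(i) for i in range(100)],
})
capped = cap_with_clip(toy)
print(capped.describe().loc[["min", "max"]])
```

Because the result is a new frame, the original data stays available for comparison, at the cost of one extra copy in memory.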
Normalize the data (StandardScaler standardizes each column to zero mean and unit variance)
from sklearn.preprocessing import StandardScaler
weather_sample_mv = pd.DataFrame(StandardScaler().fit_transform(weather_sample_mv))
weather_sample_mv
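For reference, `StandardScaler` computes z = (x - mean) / std for each column, using the population standard deviation. A small check on made-up numbers (the array is an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])

# Manual z-score: subtract the column mean, divide by the population
# standard deviation (ddof=0), which is what StandardScaler uses.
manual = (x - x.mean(axis=0)) / x.std(axis=0)

scaled = StandardScaler().fit_transform(x)
print(np.allclose(manual, scaled))
```

Because the mean and standard deviation are both sensitive to extreme values, capping outliers before this step keeps a few large readings from compressing the rest of the column.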
Merge the data again
weather_sample = pd.concat([weather_sample_mv, weather_sample_cv], axis=1)
weather_sample.head()
Split the features and the label
X = weather_sample.iloc[:,:-1]
y = weather_sample.iloc[:,-1]
print(X.shape)
print(y.shape)
(10000, 32)
(10000,)
Build the model and cross-validate
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score, recall_score
for kernel in ["linear","poly","rbf"]:
    accuracy = cross_val_score(SVC(kernel=kernel), X, y, cv=5, scoring="accuracy").mean()
    print("{}:{}".format(kernel,accuracy))
linear:0.8564
poly:0.8532
rbf:0.8531000000000001
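With 7,750 "No" rows out of 10,000, always predicting "No" would already score 0.775 accuracy, so accuracy alone says little about how well the rainy days are caught. The cell above imports `recall_score` and `roc_auc_score` but does not use them. One possible follow-up, sketched here on synthetic data rather than the real sample, is to report several metrics at once with `cross_validate`. Putting the scaler inside a pipeline is an additional assumption of this sketch: it refits the scaling on each training fold instead of on the full data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for X and y (about 22% positives).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.78, 0.22], random_state=0
)

# Scaling inside the pipeline is refit on each training fold.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))

scores = cross_validate(
    model, X, y, cv=5, scoring=["accuracy", "recall", "roc_auc"]
)
for name in ["accuracy", "recall", "roc_auc"]:
    print(name, round(scores["test_" + name].mean(), 4))
```

On the real data, recall on the "Yes" class is the number to watch, since a model can reach high accuracy while still missing many rainy days.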