
Missing Values: knowing what's missing, solutions, and an example

Table of Contents

  • Knowing what's missing
  • Solutions
    • 1. Simple: drop columns with missing values
    • 2. Better: impute the missing values (estimates)
    • 3. An extension of imputation
  • Example
    • Comparing model scores under different missing-value treatments

Knowing what's missing

import pandas as pd
           
# Count the missing values in each column
missing_val_count_by_column = data.isnull().sum()
missing_val_count_by_column  # a pandas Series
           
Suburb              0
Address             0
Rooms               0
Type                0
Price               0
Method              0
SellerG             0
Date                0
Distance            0
Postcode            0
Bedroom2            0
Bathroom            0
Car                62
Landsize            0
BuildingArea     6450
YearBuilt        5375
CouncilArea      1369
Lattitude           0
Longtitude          0
Regionname          0
Propertycount       0
dtype: int64
           
pandas.core.series.Series
           
# Show only the columns that actually have missing values
print(missing_val_count_by_column[missing_val_count_by_column > 0])
           
Car               62
BuildingArea    6450
YearBuilt       5375
CouncilArea     1369
dtype: int64
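The same counts can also be expressed as fractions with `isnull().mean()`, which makes it easier to judge how severe each column's gaps are. A minimal sketch on a small synthetic frame (the column names mimic melb_data, but the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for melb_data.
data = pd.DataFrame({
    "Car": [1.0, np.nan, 2.0, 1.0],
    "BuildingArea": [np.nan, np.nan, 120.0, 95.0],
    "Rooms": [3, 2, 4, 3],
})

missing_counts = data.isnull().sum()   # absolute number of NaNs per column
missing_ratio = data.isnull().mean()   # fraction of NaNs per column

print(missing_counts[missing_counts > 0])
print(missing_ratio[missing_ratio > 0])
```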
           
# Does the 'Car' column of data contain any missing values?
data['Car'].isnull().any()  # True if the column has at least one missing value
           
True
           
# How many missing values does the column have?
data['Car'].isnull().sum()
           
62
           
# Inspect exactly where the missing values in 'Car' are
data['Car'].isnull()
           
0        False
1        False
2        False
3        False
4        False
5        False
6        False
7        False
8        False
9        False
10       False
11       False
12       False
13       False
14       False
15       False
16       False
17       False
18       False
19       False
20       False
21       False
22       False
23       False
24       False
25       False
26       False
27       False
28       False
29       False
         ...  
13550     True
13551    False
13552    False
13553    False
13554    False
13555    False
13556    False
13557    False
13558    False
13559    False
13560    False
13561    False
13562    False
13563    False
13564    False
13565    False
13566    False
13567    False
13568    False
13569    False
13570    False
13571    False
13572    False
13573    False
13574    False
13575    False
13576    False
13577    False
13578    False
13579    False
Name: Car, Length: 13580, dtype: bool
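The boolean Series above doubles as a row mask: indexing the DataFrame with it pulls out exactly the rows where 'Car' is missing. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame; the real data would be melb_data.
data = pd.DataFrame({
    "Suburb": ["A", "B", "C"],
    "Car": [1.0, np.nan, 2.0],
})

# Boolean indexing: keep only the rows whose 'Car' value is missing.
rows_missing_car = data[data["Car"].isnull()]
print(rows_missing_car)
```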
           

Solutions

1. Simple: drop columns with missing values

This approach makes sense when most of the values in a column are missing, so little information is lost by discarding it.

If you have a training dataset and a test dataset and want to drop the same columns from both DataFrames:

col_with_missing = [col for col in original_data.columns 
                    if original_data[col].isnull().any()]  # keep col when the column has at least one missing value
reduced_original_data = original_data.drop(col_with_missing, axis=1)  # note: drop, not dropna
reduced_test_data = test_data.drop(col_with_missing, axis=1)
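When there is only a single DataFrame to clean (no matching test set), `dropna(axis=1)` achieves the same result in one call; for train/test pairs the explicit column list above is still needed so both frames lose the same columns. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'Car' has a gap, 'Rooms' is complete.
original_data = pd.DataFrame({
    "Rooms": [3, 2, 4],
    "Car": [1.0, np.nan, 2.0],
})

# dropna(axis=1) drops every column containing at least one NaN,
# equivalent to the explicit drop() pattern above for a single frame.
reduced = original_data.dropna(axis=1)
print(reduced.columns.tolist())
```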
           

2. Better: impute the missing values (estimates)

By default the following fills in each column's mean value:

from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer()  # defaults: missing_values=np.nan, strategy='mean'; for sparse matrices: missing_values=-1; other strategies: strategy='most_frequent'
data_with_imputed_values = my_imputer.fit_transform(original_data)
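One caveat worth knowing: `fit_transform` returns a NumPy array, so the column names are lost. A sketch, on made-up data, of wrapping the result back into a DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: 'Car' is missing one value.
original_data = pd.DataFrame({
    "Rooms": [3.0, 2.0, 4.0],
    "Car": [1.0, np.nan, 2.0],
})

my_imputer = SimpleImputer()  # strategy='mean'
imputed_array = my_imputer.fit_transform(original_data)  # plain NumPy array

# Wrap back into a DataFrame so the column names survive the transform.
imputed_df = pd.DataFrame(imputed_array, columns=original_data.columns)
print(imputed_df)
```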
           

3. An extension of imputation

Before imputing, this variant records which values were missing: for each column with gaps it adds a boolean '_was_missing' indicator column, then imputes as usual. The fact that a value was missing can itself carry signal the model can use.

# make copy to avoid changing original data (when Imputing)
new_data = original_data.copy()

# make new columns indicating what will be imputed
cols_with_missing = (col for col in new_data.columns 
                                 if new_data[col].isnull().any())
for col in cols_with_missing:
    new_data[col + '_was_missing'] = new_data[col].isnull()

# Imputation
my_imputer = SimpleImputer()
new_data = pd.DataFrame(my_imputer.fit_transform(new_data))
new_data.columns = original_data.columns
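Since scikit-learn 0.21, `SimpleImputer(add_indicator=True)` appends the same kind of "was missing" indicator columns automatically, one per feature that had gaps. A sketch on made-up data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: only 'Car' has a missing value.
original_data = pd.DataFrame({
    "Rooms": [3.0, 2.0, 4.0],
    "Car": [1.0, np.nan, 2.0],
})

# add_indicator=True appends one 0/1 indicator column per feature
# that contained missing values (here just 'Car').
imputer = SimpleImputer(add_indicator=True)
result = imputer.fit_transform(original_data)

print(result.shape)  # original 2 columns + 1 indicator column
```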
           

Example

import pandas as pd

melb_data= pd.read_csv(r'G:\kaggle\melb_data.csv')

# target
y = melb_data.Price

# drop the Price column
melb_predictors = melb_data.drop(['Price'], axis=1) 
# drop the non-numeric features
melb_numeric_predictors = melb_predictors.select_dtypes(exclude=['object'])
           

Comparing model scores under different missing-value treatments

# Approach 1 -- drop columns with missing values

# find the columns with missing values
col_with_missing = [col for col in melb_numeric_predictors 
                    if melb_numeric_predictors[col].isnull().any()]
# drop them
reduced_melb_numeric_predictors = melb_numeric_predictors.drop(col_with_missing, axis=1)

           
# Approach 2 -- impute the missing values, strategy='mean'
from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer()
melb_numeric_predictors_with_imputed_values = my_imputer.fit_transform(melb_numeric_predictors)
           
# The two treatments produce different training samples,
# so define a function that computes the model score for each one.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

def score(X, y):
    #split
    X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.3, random_state=0)
    #model
    melb_model= RandomForestRegressor()
    #fit
    clf= melb_model.fit(X_train, y_train)
    #score
    score= clf.score(X_test, y_test)
    
    return score
           
#test
score_drop_approach= score(reduced_melb_numeric_predictors, y)
score_impute_values= score(melb_numeric_predictors_with_imputed_values, y)
print("drop approach:",score_drop_approach)
print('impute approach:',score_impute_values)
           
('drop approach:', 0.7251907026905651)
('impute approach:', 0.74245443764218)
           

The experiment shows that, for this dataset, imputation scores better than simply dropping the columns with missing values.
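A single train/test split gives a fairly noisy score, so a comparison like the one above can flip with a different `random_state`; averaging over several folds with `cross_val_score` is more robust. A sketch on synthetic data (the real comparison would reuse the melb frames):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with a clear signal.
rng = np.random.RandomState(0)
X = pd.DataFrame({"a": rng.rand(100), "b": rng.rand(100)})
y = X["a"] * 3 + X["b"] + rng.rand(100) * 0.1

model = RandomForestRegressor(n_estimators=10, random_state=0)
# 5-fold CV averages five held-out R^2 scores instead of relying on one split.
cv_scores = cross_val_score(model, X, y, cv=5)
print(cv_scores.mean())
```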
