Data Preprocessing: A Data Cleaning Case Study

Recommended reading: https://zhuanlan.zhihu.com/p/111499325

https://mp.weixin.qq.com/s/jNoXHO4qU34gcha4zOGRLA

https://mp.weixin.qq.com/s/ra48vJTsQltydOtfoy5YHQ

Reference

Missing, messy, or duplicated data? A comprehensive data cleaning guide (qq.com)

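Data cleaning is the process of detecting and correcting (or deleting) corrupt or inaccurate records from a record set, table, or database. It identifies the parts of the data that are incomplete, inaccurate, or irrelevant, and then replaces, modifies, or deletes that dirty data.

To simplify data cleaning, this article presents a complete step-by-step guide to carrying out the process in Python. Readers can learn how to find and clean the following kinds of data:

Missing data;
Irregular data (outliers);
Unnecessary data: repetitive data, duplicate data, and so on;
Inconsistent data: capitalization, addresses, and so on.

This guide uses a knowledge tracing dataset (the ASSISTments skill builder data); you can substitute whatever data you need to work with.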

Data overview

# import packages
import pandas as pd
import numpy as np
import seaborn as sns

import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import matplotlib
plt.style.use('ggplot')
from matplotlib.pyplot import figure

%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (12,8)

pd.options.mode.chained_assignment = None

# read the data
df = pd.read_csv(
    'F://su//study//知識追蹤學習路線//code//Deep-Knowledge-Tracing-master//examples//data//ASSISTments_skill_builder_data.csv')

# shape and data types of the data
print(df.shape)
print(df.dtypes)
# select numeric columns
df_numeric = df.select_dtypes(include=[np.number])
numeric_cols = df_numeric.columns.values
print(numeric_cols)

# select non numeric columns
df_non_numeric = df.select_dtypes(exclude=[np.number])
non_numeric_cols = df_non_numeric.columns.values
print(non_numeric_cols)

(525534, 30)
order_id                  int64
assignment_id             int64
user_id                   int64
assistment_id             int64
problem_id                int64
original                  int64
correct                   int64
attempt_count             int64
ms_first_response         int64
tutor_mode               object
answer_type              object
sequence_id               int64
student_class_id          int64
position                  int64
type                     object
base_sequence_id          int64
skill_id                float64
skill_name               object
teacher_id                int64
school_id                 int64
hint_count                int64
hint_total                int64
overlap_time              int64
template_id               int64
answer_id               float64
answer_text              object
first_action            float64
bottom_hint             float64
opportunity             float64
opportunity_original    float64
dtype: object
['order_id' 'assignment_id' 'user_id' 'assistment_id' 'problem_id'
 'original' 'correct' 'attempt_count' 'ms_first_response' 'sequence_id'
 'student_class_id' 'position' 'base_sequence_id' 'skill_id' 'teacher_id'
 'school_id' 'hint_count' 'hint_total' 'overlap_time' 'template_id'
 'answer_id' 'first_action' 'bottom_hint' 'opportunity'
 'opportunity_original']
['tutor_mode' 'answer_type' 'type' 'skill_name' 'answer_text']

Missing data

Method 1: missing data heatmap

cols = df.columns[:30]  # first 30 columns
colours = ['#000099', '#ffff00']  # yellow = missing, blue = not missing
sns.heatmap(df[cols].isnull(), cmap=sns.color_palette(colours))

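The resulting heatmap shows the missing-data pattern of the first 30 features. The horizontal axis lists the feature names and the vertical axis the observations (rows); yellow marks missing data and blue marks non-missing data.

For example, in this heatmap the feature skill_id has missing values across many rows, while the feature skill_name shows scattered missing values.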


Method 2: missing data percentage list

When the dataset has many features, we can list the percentage of missing data for each feature.

# if it's a larger dataset and the visualization takes too long can do this.
# % of missing.
for col in df.columns:
    pct_missing = np.mean(df[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing*100)))

order_id - 0%
assignment_id - 0%
user_id - 0%
assistment_id - 0%
problem_id - 0%
original - 0%
correct - 0%
attempt_count - 0%
ms_first_response - 0%
tutor_mode - 0%
answer_type - 0%
sequence_id - 0%
student_class_id - 0%
position - 0%
type - 0%
base_sequence_id - 0%
skill_id - 13%
skill_name - 15%
teacher_id - 0%
school_id - 0%
hint_count - 0%
hint_total - 0%
overlap_time - 0%
template_id - 0%
answer_id - 91%
answer_text - 18%
first_action - 0%
bottom_hint - 85%
opportunity - 0%
opportunity_original - 15%
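
The same percentages can be computed without an explicit loop. The following is a minimal sketch on a made-up toy frame: the column names echo the dataset, but the values and the name `toy` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'skill_id': [10.0, np.nan, 12.0, np.nan],
    'skill_name': ['Addition', None, 'Division', 'Median'],
})

# isnull() yields a boolean frame; its column-wise mean is the fraction missing.
pct_missing = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(pct_missing.round(1))
```

Sorting in descending order puts the features with the most missing data at the top, which is convenient when there are many columns.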

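Method 3: missing data histogram

When there are many features, a missing data histogram is another effective technique.

To understand the missing-value pattern across observations in more depth, we can visualize it as a histogram.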

# first create missing indicator for features with missing data
for col in df.columns:
    missing = df[col].isnull()
    num_missing = np.sum(missing)

    if num_missing > 0:  
        print('created missing indicator for: {}'.format(col))
        df['{}_ismissing'.format(col)] = missing


# then based on the indicator, plot the histogram of missing values
ismissing_cols = [col for col in df.columns if 'ismissing' in col]
df['num_missing'] = df[ismissing_cols].sum(axis=1)

# count rows per number of missing values, ordered by that number;
# sort_index() avoids relying on the column names produced by reset_index(),
# which differ between pandas versions
df['num_missing'].value_counts().sort_index().plot.bar()


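How to handle missing data?

There is no one-size-fits-all solution. We have to study the specific feature and dataset and decide on the best way to handle the missing data accordingly.

The most commonly used methods for handling missing data are described below. In more complicated situations, though, we need to be creative and use more sophisticated methods, such as modeling the missing data.

Solution 1: drop the observation

In statistics, this method is called listwise deletion: any observation (row) that contains a missing value is dropped in its entirety.

We should do this only when we are sure the missing data carries no information. Otherwise, we should consider other solutions.

Other criteria can also be used.

For example, the referenced guide observes from its missing data histogram that only a small number of observations have more than 35 missing values, and so creates a new dataset, df_less_missing_rows, with those observations removed. That threshold is specific to the guide's dataset; for the data used here, choose a cutoff based on your own histogram.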
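
Dropping observations by missing-value count can be sketched as follows. The toy frame and the threshold `max_missing` are hypothetical and chosen only to make the example small; `df_less_missing_rows` is the name used in the text.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    'user_id': [1, 2, 3],
    'skill_id': [10.0, np.nan, np.nan],
    'answer_id': [5.0, np.nan, 7.0],
    'bottom_hint': [1.0, np.nan, np.nan],
})

# Number of missing values in each row.
toy['num_missing'] = toy.isnull().sum(axis=1)

max_missing = 2  # hypothetical threshold; pick one from your own histogram

# Keep only the rows at or below the threshold, then drop the helper column.
df_less_missing_rows = toy[toy['num_missing'] <= max_missing].drop(columns='num_missing')
print(df_less_missing_rows)
```

Filtering into a new frame rather than modifying the original keeps the full data available in case the threshold needs to be revisited.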

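Solution 2: drop the feature

As with Solution 1, we drop a feature only when we are sure it provides no useful information.

For example, in the referenced guide's missing data percentage list, the feature hospital_beds_raion has a high share of missing values (47%), so the whole feature is dropped. In the dataset used here, answer_id (91% missing) and bottom_hint (85% missing) are the comparable candidates.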
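
A minimal sketch of dropping features by their share of missing values follows. The cutoff `max_pct_missing` and the toy values are assumptions for illustration, not a recommendation from the source.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'correct': [1, 0, 1, 1],
    'answer_id': [np.nan, np.nan, np.nan, 7.0],
})

max_pct_missing = 0.5  # hypothetical cutoff: drop features missing in over half the rows

# Fraction of missing values per feature.
pct_missing = toy.isnull().mean()
cols_to_drop = pct_missing[pct_missing > max_pct_missing].index.tolist()

df_fewer_cols = toy.drop(columns=cols_to_drop)
print(cols_to_drop)
```

Whether a mostly empty feature is truly uninformative still has to be judged per feature; a field such as answer_id may be missing for a structural reason rather than at random.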

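Solution 3: impute the missing data

When the feature is a numeric variable, impute the missing data: take the mean or median of the non-missing values of the same feature and use it to replace the missing values.

When the feature is a categorical variable, fill the missing values with the mode (the most frequent value).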
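
The two imputation rules above can be sketched as follows, using the median for numeric features and the mode for the rest. The toy frame is invented for illustration, and the name `imputed` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    'ms_first_response': [1000.0, 3000.0, np.nan, 5000.0],
    'answer_type': ['algebra', None, 'algebra', 'choose_1'],
})

imputed = toy.copy()

# Numeric features: replace missing values with the median of the observed values.
for col in imputed.select_dtypes(include=[np.number]).columns:
    imputed[col] = imputed[col].fillna(imputed[col].median())

# Non-numeric features: replace missing values with the most frequent value.
for col in imputed.select_dtypes(exclude=[np.number]).columns:
    modes = imputed[col].mode(dropna=True)
    if not modes.empty:  # an all-missing column has no mode and is left as-is
        imputed[col] = imputed[col].fillna(modes.iloc[0])

print(imputed)
```

The median is used here instead of the mean because it is less sensitive to extreme values; either is consistent with the description above.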

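Irregular data (outliers)

Outliers are data points that differ markedly from the other observations. They may be genuine outliers, or they may be errors.

How to find outliers?

Depending on whether a feature is numeric or categorical, we use different methods to study its distribution and so detect outliers.

Method 1: histogram/box plot

When the feature is a numeric variable, use a histogram and a box plot to detect outliers.

The correct feature records whether the student answered the problem correctly and should only take the values 0 and 1; we can check that.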

df['correct'].hist(bins=100)

df.boxplot(column=['correct'])  # box plot
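
For a feature that should take only two values, a direct count is often easier to read than a plot. The following is a small sketch on invented data; the stray value 2 is planted deliberately to show what the check catches.

```python
import pandas as pd

# Toy stand-in for df['correct']; the value 2 is a planted error.
toy = pd.DataFrame({'correct': [1, 0, 1, 1, 0, 2]})

# Frequency of every distinct value, including missing ones.
print(toy['correct'].value_counts(dropna=False))

# Rows whose value falls outside the expected set {0, 1}.
valid = toy['correct'].isin([0, 1])
unexpected = toy[~valid]
print(unexpected)
```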


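How to handle outliers?

Although outliers are not hard to detect, we have to choose an appropriate way to handle them, and that depends heavily on the dataset and the goals of the project.

The options for handling outliers are similar to those for missing values: drop them, modify them, or keep them. (Readers can refer back to the earlier section on handling missing data for the corresponding solutions.)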
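
As an illustration of the drop and modify options, the sketch below flags outliers with the 1.5 × IQR rule that box plots conventionally use for their whiskers. That rule is an assumption introduced here, not something the source prescribes, and the toy values are invented.

```python
import pandas as pd

# Toy stand-in for a numeric feature; the last value is an obvious outlier.
toy = pd.DataFrame({'ms_first_response': [1200, 1500, 1800, 2100, 2400, 900000]})
col = toy['ms_first_response']

# Interquartile range and the conventional 1.5 * IQR fences (assumed rule).
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (col < lower) | (col > upper)

# Option 1: drop the flagged rows.
dropped = toy[~is_outlier]

# Option 2: modify them by clipping to the fences.
clipped = toy.assign(ms_first_response=col.clip(lower=lower, upper=upper))

print(is_outlier.sum(), len(dropped), clipped['ms_first_response'].max())
```

The third option, keeping the values, needs no code; it is the right choice when the extreme values are genuine and relevant to the project goal.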

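Unnecessary data

With missing data and outliers handled, we now turn to unnecessary data, which is more straightforward to deal with.

All data fed into the model should serve the goal of the project. Unnecessary data is data that adds no value.

There are several main types of unnecessary data; the first is described here.

Unnecessary data type 1: uninformative/repetitive

Sometimes a feature is uninformative because too many of its rows share the same value.

How to find repetitive data?

We can build a list of the features in which a high proportion of rows share the same value.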

num_rows = len(df.index)
low_information_cols = []  # features where a single value covers more than 95% of rows

for col in df.columns:
    cnts = df[col].value_counts(dropna=False)
    top_pct = (cnts/num_rows).iloc[0]
    
    if top_pct > 0.95:
        low_information_cols.append(col)
        print('{0}: {1:.5f}%'.format(col, top_pct*100))
        print(cnts)
        print()

tutor_mode: 99.93664%
tutor    525201
test        333
Name: tutor_mode, dtype: int64

type: 100.00000%
MasterySection    525534
Name: type, dtype: int64

first_action_ismissing: 99.99391%
False    525502
True         32
Name: first_action_ismissing, dtype: int64

opportunity_ismissing: 99.99391%
False    525502
True         32
Name: opportunity_ismissing, dtype: int64
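
One possible follow-up, assuming the flagged features are judged genuinely uninformative after inspection, is to drop them. The sketch below repeats the detection on an invented toy frame and then removes the flagged columns; the name `df_informative` is hypothetical.

```python
import pandas as pd

# Toy stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    'type': ['MasterySection'] * 40,
    'tutor_mode': ['tutor'] * 39 + ['test'],
    'correct': [0, 1] * 20,
})

threshold = 0.95  # same cutoff as in the loop above

# Share of rows taken by the most frequent value of each feature.
top_share = toy.apply(lambda s: s.value_counts(dropna=False, normalize=True).iloc[0])
low_information_cols = top_share[top_share > threshold].index.tolist()

df_informative = toy.drop(columns=low_information_cols)
print(low_information_cols)
```

A dominant value does not by itself make a feature useless: the rare rows (for example, tutor_mode equal to test) may be exactly the ones of interest, so the reason behind the repetition should be understood before dropping anything.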

