I. Introduction to CSV files

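Comma-separated values (CSV, sometimes also called character-separated values, because the delimiter does not have to be a comma) is a file format that stores tabular data (numbers and text) as plain text. A CSV file consists of any number of records separated by line breaks; each record consists of fields separated by some other character or string, most commonly a comma or a tab.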
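To make the record/field structure concrete, here is a minimal sketch that parses the same small table twice, once comma-separated and once tab-separated. The sample data and the helper name `parse_rows` are made up for this illustration.

```python
import csv
import io

def parse_rows(text, delimiter=','):
    """Split delimited text into a list of records, each a list of string fields."""
    # newline='' lets the csv module handle line endings itself
    return list(csv.reader(io.StringIO(text, newline=''), delimiter=delimiter))

comma_text = "id,name,age\n1,Ann,24\n2,Bob,31\n"
tab_text = "id\tname\tage\n1\tAnn\t24\n2\tBob\t31\n"

comma_rows = parse_rows(comma_text)
tab_rows = parse_rows(tab_text, delimiter='\t')

print(comma_rows[0])            # the header record
print(comma_rows == tab_rows)   # both texts describe the same table
```

Note that every field comes back as a string; converting to numbers is left to the caller.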

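In day-to-day work I often run into CSV files that store label information, and I also create CSV files to log training runs. So here is a summary of some everyday CSV operations, for my own future review ~~(reposted)~~.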

II. Basic libraries (csv, pandas)

(1) Basic operations with the csv library

1. Copy the first five rows of file a (existing) to file b (newly created)

import csv
path_a = r'E:\pythonwork\data_raw.csv'
path_b = r'E:\pythonwork\data_pro.csv'
# --------------------------------------------------------
# Approach 1
# With this approach, every open() needs a matching close()!
a = open(path_a, newline='')  # open
# 1. newline='' prevents blank lines between rows in the output
# 2. 'w' means write-only; the file is created if it does not exist (and overwritten if it does)
b = open(path_b, 'w', newline='')  # open
reader = csv.reader(a)
writer = csv.writer(b)
rows = [row for row in reader]
for row in rows[0:5]:
    writer.writerow(row)
a.close()  # close
b.close()  # close
# -------------------------------------------------------
# Approach 2: the with statement closes the files automatically
with open(path_a, newline='') as a:
    reader = csv.reader(a)
    rows = [row for row in reader]

with open(path_b, 'w', newline='') as b:
    writer = csv.writer(b)
    for row in rows[0:5]:
        writer.writerow(row)
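Both approaches above read the whole file into a list before taking the first five rows. For large files, one possible alternative is to stop reading after five records with `itertools.islice`. The sketch below is only an illustration: the helper name `copy_first_rows` and the temporary demo files are my own assumptions, not part of the original workflow.

```python
import csv
import os
import tempfile
from itertools import islice

def copy_first_rows(src_path, dst_path, n=5):
    """Copy the first n records of src_path to dst_path; return how many were written."""
    with open(src_path, newline='') as src, open(dst_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        count = 0
        # islice stops pulling records from the reader after n of them
        for row in islice(csv.reader(src), n):
            writer.writerow(row)
            count += 1
    return count

# Demo on a throwaway file with 8 records
tmp_dir = tempfile.mkdtemp()
src_file = os.path.join(tmp_dir, 'data_raw.csv')
dst_file = os.path.join(tmp_dir, 'data_pro.csv')
with open(src_file, 'w', newline='') as f:
    csv.writer(f).writerows([[i, i * i] for i in range(8)])

written = copy_first_rows(src_file, dst_file)
with open(dst_file, newline='') as f:
    copied = list(csv.reader(f))
print(written, copied)
```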

2. Append the last five rows of file a to the file b created in the previous step

import csv
path_a = r'E:\pythonwork\data_raw.csv'
path_b = r'E:\pythonwork\data_pro.csv'
a = open(path_a, newline='')  # open
# 'a+' means append (the file is also readable)
b = open(path_b, 'a+', newline='')  # open
reader = csv.reader(a)
writer = csv.writer(b)
rows = [row for row in reader]
for row in rows[-5:]:
    writer.writerow(row)
a.close()  # close
b.close()  # close
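Taking the last five rows by loading every row into a list is fine for small files. A possible alternative that keeps at most five records in memory is a `collections.deque` with `maxlen`. This is a sketch on made-up in-memory data; the helper name `last_rows` is hypothetical.

```python
import csv
import io
from collections import deque

def last_rows(lines, n=5):
    """Return the last n records from an iterable of CSV lines."""
    # A deque with maxlen discards the oldest record whenever a new one arrives
    return list(deque(csv.reader(lines), maxlen=n))

# Eight records: "0,0", "1,10", ..., "7,70"
sample = io.StringIO(''.join(f'{i},{i * 10}\n' for i in range(8)), newline='')
tail = last_rows(sample)
print(tail)
```

The returned rows could then be written with `csv.writer` to a file opened in append mode, as in the example above.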

3. Delete a column from file b

import csv
path_b = r'E:\pythonwork\data_pro.csv'
with open(path_b, 'r', newline='') as b:
    rows = [row for row in csv.reader(b)]
column_len = len(rows[0])
# Delete the middle column
target = column_len // 2
for row in rows:
    # Delete by position; list.remove() would instead delete the first field with an equal value
    del row[target]
with open(path_b, 'w', newline='') as b:
    writer = csv.writer(b)
    for row in rows:
        writer.writerow(row)

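In my limited experience, and this is a very subjective impression, the csv library only handles basic operations: working with rows is convenient, but working with columns is tedious. The pandas library is more flexible.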
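If the file has a header row, column operations with the csv library get somewhat less tedious with `csv.DictReader` and `csv.DictWriter`, which address fields by name instead of position. The following is a minimal sketch on made-up data; the helper name `drop_column` is hypothetical.

```python
import csv
import io

def drop_column(text, column):
    """Return CSV text with the named column removed (the first row is the header)."""
    reader = csv.DictReader(io.StringIO(text, newline=''))
    kept = [name for name in reader.fieldnames if name != column]
    out = io.StringIO(newline='')
    # extrasaction='ignore' makes DictWriter skip keys that are not in fieldnames
    writer = csv.DictWriter(out, fieldnames=kept, extrasaction='ignore',
                            lineterminator='\n')
    writer.writeheader()
    writer.writerows(reader)
    return out.getvalue()

sample = "id,name,age\n1,Ann,24\n2,Bob,31\n"
result = drop_column(sample, 'name')
print(result)
```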

(2) Basic operations with the pandas library

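Pandas is an extension library for the Python language, used for data analysis.
Pandas is an open-source, BSD-licensed library that provides high-performance, easy-to-use data structures and data analysis tools.
The name Pandas is derived from the terms "panel data" and "Python data analysis".
Pandas is a powerful toolkit for analyzing structured data; it is built on NumPy (which provides high-performance matrix operations).
Pandas can import data from various formats and sources, such as CSV, JSON, SQL, and Microsoft Excel.
Pandas can perform operations on all kinds of data, such as merging, reshaping, and selecting, as well as data cleaning and feature processing.
Pandas is widely used in data analysis fields such as academia, finance, and statistics.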

1. Create a CSV file and write the contents of the following table into it

Costumer id	Name	Age
222645	張三	24
215421	李四	31
215958	王五	19

# Check the Python version
import sys
sys.version


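import pandas as pd
customer_id = ['222645', '215421', '215958']  # renamed from id to avoid shadowing the built-in
name = ['張三', '李四', '王五']
age = ['24', '31', '19']
df = pd.DataFrame({'Costumer id': customer_id, 'Name': name, 'Age': age})
# df.to_csv('costumer-information.csv')  # this produces garbled characters
# The data contains Chinese, so the encoding has to be changed; the file is saved in the current directory
df.to_csv('costumer-information.csv', encoding='gbk', index=False)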
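If the file does not have to be GBK, another encoding that is commonly used for CSV files with Chinese text is `utf-8-sig`, which writes a byte order mark (BOM) that spreadsheet programs can use to detect UTF-8. The sketch below only checks that the BOM is written and that the file round-trips through pandas; the file name `demo.csv` and the temporary directory are assumptions for this illustration.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'Name': ['張三', '李四'], 'Age': [24, 31]})
out_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')

# 'utf-8-sig' writes a byte order mark (BOM) at the start of the file
df.to_csv(out_path, encoding='utf-8-sig', index=False)

with open(out_path, 'rb') as f:
    starts_with_bom = f.read(3) == b'\xef\xbb\xbf'

# Reading with the same encoding strips the BOM again
restored = pd.read_csv(out_path, encoding='utf-8-sig')
print(starts_with_bom, restored['Name'].tolist())
```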

2. Add a new column or row

# Add a row
add_row_df = pd.DataFrame({'Costumer id': ['222645'], 'Name': ['趙四'], 'Age': ['56']})
# new_df = pd.DataFrame({'Costumer id': '222645', 'Name': '趙四', 'Age': '56'})  # raises an error: scalar values need an index
# mode='a' appends to the existing file; header=False avoids writing the header row again
add_row_df.to_csv('costumer-information.csv', mode='a', encoding='gbk', header=False, index=False)

# Add a column: sex
add_col_df = pd.read_csv('costumer-information.csv', encoding='gbk')
add_col_df['sex'] = ['0', '1', '0', '0']  # one value per row
# Write all columns back, including the new one (no columns= argument is needed)
add_col_df.to_csv('costumer-information.csv', encoding='gbk', index=False, header=True)
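When the data is already loaded, the same row and column additions can be done in memory before writing the file once. A minimal sketch with made-up values, using `pd.concat` for the row:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B'], 'Age': [24, 31]})
new_row = pd.DataFrame({'Name': ['C'], 'Age': [56]})

# ignore_index=True renumbers the rows 0..n-1 after concatenation
combined = pd.concat([df, new_row], ignore_index=True)
# A new column needs exactly one value per row
combined['sex'] = ['0', '1', '0']
print(combined)
```

Assigning a list whose length differs from the number of rows raises a `ValueError`, which is a useful sanity check when adding a column.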

3. Deletion

For deletion operations, see the blog post 《Python pandas 删除指定行/列資料》 (Python pandas: deleting specified rows/columns).
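For quick reference, here is a small sketch of the most common deletion patterns with `DataFrame.drop` and boolean filtering. The data is made up and the snippet is not taken from the referenced post.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C'],
                   'Age': [24, 31, 19],
                   'sex': ['0', '1', '0']})

without_col = df.drop(columns=['sex'])   # delete a column by name
without_row = df.drop(index=[1])         # delete a row by index label
# Delete rows by condition: keep only the rows where Age is below 30
under_30 = df[df['Age'] < 30].reset_index(drop=True)

print(without_col, without_row, under_30, sep='\n\n')
```

`drop` returns a new DataFrame by default, so `df` itself is unchanged here.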

4. A worked example: data analysis on cross-border e-commerce data

The data can be downloaded here.

import pandas as pd
csvfile = 'ecommerce_data.csv'
df = pd.read_csv(csvfile, encoding='ISO-8859-1')  # an error is raised if the encoding is not specified; see the explanation linked below
df.head()  # show the first five rows


# Start the actual analysis
df = pd.read_csv(csvfile, encoding='ISO-8859-1', parse_dates=['InvoiceDate'])  # parse the timestamps as datetimes
# Drop the Description column, which is not needed
df.drop(['Description'], axis=1, inplace=True)
# Find rows where CustomerID is missing and fill them with a marker
df['CustomerID'] = df['CustomerID'].fillna('U')
# Add the total sales amount of each line item
df['amount'] = df['Quantity'] * df['UnitPrice']
# Split the timestamp into its parts with the .dt accessor
df['date'] = df['InvoiceDate'].dt.date
df['year'] = df['InvoiceDate'].dt.year
df['month'] = df['InvoiceDate'].dt.month
df['day'] = df['InvoiceDate'].dt.day
df['time'] = df['InvoiceDate'].dt.time

df.drop(['InvoiceDate'], axis=1, inplace=True)

# Remove duplicate entries
df = df.drop_duplicates()
df.describe()
# Quantity contains negative values, so amount does too; these represent returns


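# Handle anomalous values
# Find rows whose unit price is not positive
df1 = df.loc[df['UnitPrice'] <= 0]
# Compute the share of anomalous rows
print('Share of rows with an anomalous unit price: {}%'.format(df1.shape[0] / df.shape[0] * 100))
# Find the return records
df2 = df.loc[df['Quantity'] <= 0]
# Compute the share of return rows
print('Share of return rows: {}%'.format(df2.shape[0] / df.shape[0] * 100))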

# Total returned amount by month, for each year
import numpy as np
tt = pd.pivot_table(df2, index='year', columns='month', values='amount', aggfunc='sum')
tt
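To see what the year-by-month pivot table looks like without downloading the dataset, here is a toy version of the same call. The numbers are made up; only the column names `year`, `month`, and `amount` follow the analysis above.

```python
import pandas as pd

toy = pd.DataFrame({
    'year':   [2010, 2010, 2011, 2011, 2011],
    'month':  [12, 12, 1, 1, 2],
    'amount': [-10.0, -5.0, -2.0, -3.0, -7.5],
})

# One row per year, one column per month, each cell the sum of amount
table = pd.pivot_table(toy, index='year', columns='month',
                       values='amount', aggfunc='sum')
print(table)
```

Year/month combinations with no rows, such as January 2010 in this toy data, show up as NaN rather than 0.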


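# Keep only valid sales: positive unit price and positive quantity
df_use = df[(df['UnitPrice'] > 0) & (df['Quantity'] > 0)]
# Total sales amount by month, for each year
import numpy as np
pp = pd.pivot_table(df_use, index='year', columns='month', values='amount', aggfunc='sum')
# Ratio of returned amount to sales amount (absolute value)
withdraw = np.abs(tt / pp)
# Number of distinct orders placed by each customer
F_value = df_use.groupby('CustomerID')['InvoiceNo'].nunique()
# Total amount spent by each customer
M_value = df_use.groupby('CustomerID')['amount'].sum()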
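The two `groupby` calls are easier to follow on a tiny example. In the sketch below the order data is made up; only the column names match the analysis. Customer `c1` has three line items spread over two invoices, so the order count is 2, not 3.

```python
import pandas as pd

orders = pd.DataFrame({
    'CustomerID': ['c1', 'c1', 'c1', 'c2'],
    'InvoiceNo':  ['i1', 'i1', 'i2', 'i3'],
    'amount':     [10.0, 5.0, 20.0, 7.0],
})

# nunique() counts distinct invoices, so repeated line items of one order count once
f_value = orders.groupby('CustomerID')['InvoiceNo'].nunique()
# sum() adds up every line item of the customer
m_value = orders.groupby('CustomerID')['amount'].sum()
print(f_value, m_value, sep='\n\n')
```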

import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style='darkgrid')
# Histogram of total spend per customer, limited to customers who spent less than 5000
plt.hist(M_value[M_value < 5000], bins=100)
plt.xlabel('Amount')
plt.ylabel('Number of customers')
plt.show()
