Python For Data Analysis -- Pandas

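First of all, the author of pandas is also the author of this book.

With NumPy, the objects we work with are arrays (matrices).

pandas is built as a wrapper on top of NumPy. The objects it handles are two-dimensional tables (tabular, spreadsheet-like), and the difference from a matrix is that a table carries metadata.

Using this metadata as the index is more convenient, whereas NumPy only has integer indexes. The essence is the same, though, so most operations are shared between the two.

The two-dimensional table most people meet is a table in a relational database, with column names and row numbers. These are the metadata.

You can of course run statistics on such tables with abstract matrices, but pandas makes it more convenient.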

Series

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index.

Put simply, a Series is a dictionary, or a one-dimensional table. When no index is given explicitly, the integers 0 through n - 1 are added automatically as the index.
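The default and explicit index can be seen in a minimal sketch like the following. The values and labels are illustrative, not taken from the original post.

```python
import pandas as pd

# No index given: pandas labels the values 0 through n - 1.
obj = pd.Series([4, 7, -5, 3])
default_labels = list(obj.index)

# Explicit labels: values are then looked up by label, like dict keys.
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
value_at_a = obj2["a"]
```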

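The index can be replaced easily, which gives a Series with new labels.

Consider NumPy: no index is specified explicitly, yet data can still be fetched by integer index. The index here is essentially the same thing as NumPy's integer index.

So operations that work on NumPy arrays also apply to pandas objects.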
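As a rough illustration of both points, the sketch below relabels a Series and applies a few NumPy-style operations to it. The data and the new labels are made up for the example.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# Integer positions still work, as with a NumPy array.
first = obj.iloc[0]

# NumPy-style operations keep the labels attached to the values.
positive = obj[obj > 0]        # boolean filter
doubled = obj * 2              # scalar arithmetic
roots = np.sqrt(positive)      # a NumPy ufunc returns a Series

# Assigning to .index relabels the same values.
renamed = obj.copy()
renamed.index = ["w", "x", "y", "z"]
```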

Also, since a Series is essentially a dictionary, as said above, it can be initialised from a Python dict.
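A small sketch of dict initialisation, with illustrative keys and values. Passing `index=` together with a dict is shown as well, since labels without a matching key come out as NaN.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}

# The dict keys become the index.
obj3 = pd.Series(sdata)

# Passing an index selects and orders the keys; a label with no
# matching key ("California") gets NaN.
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)
```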

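DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).

If you have used R, the DataFrame will feel very familiar. pandas in fact emulates some of R's functionality to a degree.

So if Python lets you do statistics as conveniently as R does, why bother with R?

As said above, a Series is a dictionary or a one-dimensional table.

A DataFrame is a two-dimensional table, and can also be viewed as a dictionary of Series.

When only the column names are specified, the row labels are generated automatically.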
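A minimal sketch of building a DataFrame from a dict of equal-length lists. The `state`/`year`/`pop` data is illustrative.

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9],
}

# columns= fixes the column order; row labels default to 0 through n - 1.
frame = pd.DataFrame(data, columns=["year", "state", "pop"])

# A single column comes back as a Series, as in a dict of Series.
years = frame["year"]
```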

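Row labels can be specified as well. Here a debt column is added, but there is no data for it, so it is NaN.

Values can then be assigned to debt.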
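The sketch below walks through the same steps under assumed data: a `debt` column with no source data starts as NaN, and can then be filled by assigning a scalar, an array, or a Series (which aligns on the row labels).

```python
import numpy as np
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9],
}
frame2 = pd.DataFrame(
    data,
    columns=["year", "state", "pop", "debt"],
    index=["one", "two", "three", "four", "five"],
)

# "debt" is not a key of data, so the whole column starts as NaN.
all_missing = frame2["debt"].isna().all()

# Assign a scalar, then an array of matching length.
frame2["debt"] = 16.5
frame2["debt"] = np.arange(5.0)

# Assigning a Series aligns on the index; unmatched rows become NaN.
frame2["debt"] = pd.Series([-1.2, -1.5, -1.7], index=["two", "four", "five"])
```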

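To select a row, the book uses ix. Note that ix was removed in pandas 1.0; loc (by label) and iloc (by position) replace it.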
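Since `ix` is no longer available in current pandas, this sketch uses `loc` and `iloc` for row selection. The small frame is illustrative.

```python
import pandas as pd

frame2 = pd.DataFrame(
    {"year": [2000, 2001, 2002], "state": ["Ohio", "Ohio", "Nevada"]},
    index=["one", "two", "three"],
)

by_label = frame2.loc["three"]    # row whose label is "three"
by_position = frame2.iloc[2]      # third row, by integer position
```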

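A DataFrame can also be created from a nested dictionary. It is really a dictionary of Series, and a Series is itself a dictionary, hence a nested dictionary.

It can be transposed, just like a NumPy matrix.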
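A short sketch of the nested-dict constructor and transposition, using made-up population figures. The order of the row labels can differ between pandas versions, so the example does not rely on it.

```python
import pandas as pd

pop = {
    "Nevada": {2001: 2.4, 2002: 2.9},
    "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
}

# Outer keys become columns, inner keys become the row index.
# Nevada has no entry for 2000, so that cell is NaN.
frame3 = pd.DataFrame(pop)

# Transposing swaps rows and columns.
flipped = frame3.T
```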

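Now let's look at which convenient functions pandas provides on these data structures.

Reindexing

A critical method on pandas objects is reindex, which means to create a new object with the data conformed to a new index.

In other words, it changes the index.

A new label e can be added and filled with a default of 0.

The method parameter specifies how missing entries are filled.

Filling can go forward or backward.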
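The two filling options can be sketched as follows, assuming small illustrative Series. `fill_value` supplies a constant for new labels, while `method="ffill"` propagates the previous value forward.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

# "e" is not in the original index; fill_value supplies 0 instead of NaN.
filled = obj.reindex(["a", "b", "c", "d", "e"], fill_value=0)

# method="ffill" carries the last valid value forward into new labels;
# method="bfill" would pull the next valid value backward instead.
colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
forward = colors.reindex(range(6), method="ffill")
```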

For a two-dimensional table, reindex can be applied to the index and the columns at the same time.
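A minimal sketch of reindexing both axes in one call, on an assumed 3x3 frame. Labels that did not exist before are filled with NaN.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "c", "d"],
    columns=["Ohio", "Texas", "California"],
)

# New row "b" and new column "Utah" are filled with NaN;
# "Ohio" is dropped because it is not in the new column list.
both = frame.reindex(
    index=["a", "b", "c", "d"],
    columns=["Texas", "Utah", "California"],
)
```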

The main reindex arguments are index, method, fill_value, limit, level, and copy.

Dropping entries from an axis

Use axis to choose the dimension. For a two-dimensional table, rows are 0 and columns are 1.
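For instance, dropping rows and columns might look like this on an illustrative 4x4 frame. `drop` returns a new object and leaves the original untouched.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

rows_dropped = frame.drop(["Colorado", "Ohio"])   # axis=0 (rows) is the default
cols_dropped = frame.drop("two", axis=1)          # axis=1 drops a column
```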

Indexing, selection, and filtering

This is basically the same as in NumPy.
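A few of the common selection forms, sketched on assumed data. One difference from NumPy worth noting is that slicing by label includes the end point.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.0), index=["a", "b", "c", "d"])
by_label = obj["b"]
label_slice = obj["b":"c"]     # label slices include the end point
masked = obj[obj < 2]          # boolean mask, as in NumPy

frame = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)
two_cols = frame[["three", "one"]]       # select columns by name
big_rows = frame[frame["three"] > 5]     # filter rows by a condition
cell = frame.loc["Colorado", "two"]      # row label, column label
```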

Arithmetic and data alignment

Data alignment and automatic filling are among the more convenient features of pandas.

In [136]: df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))

In [137]: df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

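By default, values are added only where both DataFrames have the label; everywhere else the result is NaN.

In most cases, I think, you want to add whatever is present, that is, treat the missing side as 0.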
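Using the same `df1` and `df2` as above, the difference between `+` and `add` with `fill_value=0` can be sketched like this:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.0).reshape((3, 4)), columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.0).reshape((4, 5)), columns=list("abcde"))

# Plain addition: a label missing from either side gives NaN,
# so column "e" and row 3 are entirely NaN.
plain = df1 + df2

# fill_value=0: a value missing on one side is treated as 0.
filled = df1.add(df2, fill_value=0)
```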

Besides add, the same flexible methods exist for the other arithmetic operations: sub, div, and mul.

Function application and mapping

1. Element-wise: NumPy ufuncs (element-wise array methods) work fine with pandas objects.

Another element-wise option is applymap (renamed to DataFrame.map in pandas 2.1).
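Both element-wise forms are sketched below on illustrative data. Because `applymap` is deprecated in favour of `DataFrame.map` from pandas 2.1 on, the example picks whichever method the installed version offers; the helper name `fmt` is hypothetical.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    [[-1.5, 2.25], [0.5, -3.0]],
    index=["Utah", "Ohio"],
    columns=["b", "d"],
)

# A NumPy ufunc applied to the whole frame returns a DataFrame.
absolute = np.abs(frame)

def fmt(x):
    """Format one value with two decimals (hypothetical helper)."""
    return "%.2f" % x

# DataFrame.map exists from pandas 2.1; older versions only have applymap.
elementwise = frame.map if hasattr(frame, "map") else frame.applymap
formatted = elementwise(fmt)
```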

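2. A function can be applied to each row or each column with apply.

A more complex case: the function can return a Series instead of a scalar.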
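A sketch of both uses of `apply`, with made-up numbers. The helpers `spread` and `min_max` are hypothetical names; the second one returns a Series, which is assumed here to be the "more complex case" the text refers to.

```python
import pandas as pd

frame = pd.DataFrame(
    [[1.0, 5.0, 2.0], [4.0, 3.0, 9.0]],
    index=["r1", "r2"],
    columns=list("bde"),
)

def spread(x):
    """Range of a row or column (hypothetical helper)."""
    return x.max() - x.min()

per_column = frame.apply(spread)           # one result per column
per_row = frame.apply(spread, axis=1)      # one result per row

def min_max(x):
    """Return several values at once as a labelled Series."""
    return pd.Series([x.min(), x.max()], index=["min", "max"])

# Because min_max returns a Series, the result is a DataFrame.
summary = frame.apply(min_max)
```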

3. For a single row or column, that is, a Series, use map.
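For example, formatting one column with `Series.map` could look like this (illustrative data):

```python
import pandas as pd

frame = pd.DataFrame({"b": [1.0, 4.0], "e": [2.0, 9.0]}, index=["r1", "r2"])

# map applies the function to each element of the selected column.
formatted_e = frame["e"].map(lambda x: "%.2f" % x)
```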

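pandas provides many R-like statistical functions.

It also provides describe, similar to the one in R, which is very convenient.

describe can be run on non-numeric data as well.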
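A brief sketch of the reductions and of `describe` on numeric and non-numeric data. The values are illustrative; note that NaN is skipped by default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=list("abcd"),
    columns=["one", "two"],
)

col_sums = df.sum()                          # per column, NaN skipped
row_means = df.mean(axis=1, skipna=False)    # per row, NaN propagates
stats = df.describe()                        # count, mean, std, min, quartiles, max

# On non-numeric data describe reports count, unique, top, and freq.
obj = pd.Series(["a", "a", "b", "c"] * 4)
text_stats = obj.describe()
```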

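The summary methods include count, sum, mean, median, std, var, min, max, quantile, cumsum, and describe.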

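Correlation and covariance

Compute the correlation coefficient and the covariance between MSFT and IBM.

The full correlation matrix and covariance matrix can be computed as well.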
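The original post's stock data is not available here, so the sketch below uses a few made-up return values (not real market data) purely to show the calls: `Series.corr` and `Series.cov` for a pair, `DataFrame.corr` and `DataFrame.cov` for the full matrices.

```python
import pandas as pd

# Hypothetical daily returns, invented for illustration only.
returns = pd.DataFrame({
    "MSFT": [0.010, -0.020, 0.015, 0.005, -0.010],
    "IBM": [0.008, -0.015, 0.010, 0.002, -0.012],
    "AAPL": [0.020, -0.010, 0.005, -0.003, 0.001],
})

# Pairwise statistics between two columns.
pair_corr = returns["MSFT"].corr(returns["IBM"])
pair_cov = returns["MSFT"].cov(returns["IBM"])

# Full matrices over all columns.
corr_matrix = returns.corr()
cov_matrix = returns.cov()
```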

Unique values, value counts, and membership

In [217]: obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

In [218]: uniques = obj.unique()

In [219]: uniques

Out[219]: array([c, a, d, b], dtype=object)

In [220]: obj.value_counts()

Out[220]:
c    3
a    3
b    2
d    1

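pandas provides some utility functions for handling missing data: dropna, fillna, isnull, and notnull.

Of these, fillna is the more involved one.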
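A sketch of the main missing-data calls on an illustrative frame. Forward filling is written with `ffill` rather than `fillna(method=...)`, since the latter is deprecated in recent pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.0, 6.5, 3.0],
     [1.0, np.nan, np.nan],
     [np.nan, np.nan, np.nan],
     [np.nan, 6.5, 3.0]],
    columns=["a", "b", "c"],
)

complete_rows = df.dropna()              # keep rows with no NaN at all
not_all_empty = df.dropna(how="all")     # drop only rows that are entirely NaN

zeros = df.fillna(0)                         # one constant everywhere
per_column = df.fillna({"a": 0.5, "b": -1})  # a different constant per column
carried = df.ffill(limit=1)                  # forward fill, at most one step
```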

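Hierarchical indexing is an important feature of pandas enabling you to have multiple (two or more) index levels on an axis. Somewhat abstractly, it provides a way for you to work with higher dimensional data in a lower dimensional form.

Multi-level indexes can be used. In essence this is equivalent to adding a dimension, so a lower-dimensional structure emulates higher-dimensional data.

unstack and stack are supported for converting to and from the multi-dimensional layout.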
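A minimal sketch of a two-level index and the round trip through `unstack` and `stack`, on made-up values. Whether `stack` keeps the NaN placeholder depends on the pandas version, so the example drops it explicitly.

```python
import pandas as pd

# A Series with two index levels: outer letters, inner numbers.
data = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=[["a", "a", "b", "b", "c"], [1, 2, 1, 2, 1]],
)

# Partial indexing on the outer level returns the inner level.
outer = data["b"]

# unstack moves the inner level into columns; (c, 2) has no value, so NaN.
wide = data.unstack()

# stack moves the columns back into the index; dropna() discards the
# (c, 2) placeholder if stack kept it.
back = wide.stack().dropna()
```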

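pandas offers other features too, especially ETL features that make data processing convenient,

such as reading from and writing to various file formats,

cleaning, transform (based on map), merge (join), and so on.

This article is excerpted from Cnblogs; original publication date: 2014-08-12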
