Python for Data Analysis, 2nd Edition: Reproduction Notes (Part 5)

Chapter 6: Data Loading, Storage, and File Formats

pandas provides a number of functions for reading tabular data into a DataFrame object. Table 1 summarizes them; read_csv and read_table are likely the ones you will use most.

import pandas as pd
import numpy as np

df = pd.read_csv('examples/ex1.csv')
df

   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

# The shell command below does not work on my machine
#!cat examples/ex1.csv

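For data without a header row, you can either pass header=None so that pandas assigns default column names, or supply the names yourself through the names parameter. Suppose you want the message column to become the index of the DataFrame. You can either state explicitly that the column at position 4 should be the index, or pass the name "message" to the index_col parameter.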
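The two ways of choosing the index column can be compared side by side. This is a minimal sketch: the in-memory text is a hypothetical stand-in for examples/ex2.csv, built from the values printed in this post, so nothing is read from disk.

```python
import io

import pandas as pd

# Hypothetical stand-in for examples/ex2.csv: five fields per line, no header row.
raw = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"
names = ["a", "b", "c", "d", "message"]

# index_col accepts a column name...
by_name = pd.read_csv(io.StringIO(raw), names=names, index_col="message")

# ...or a zero-based column position (4 is the fifth column).
by_position = pd.read_csv(io.StringIO(raw), names=names, index_col=4)

print(by_name)
print(by_name.equals(by_position))
```

Both calls should produce the same frame, with message as the index and a to d as the remaining columns.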

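pd.read_csv('examples/ex2.csv', header=None)

   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo

pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])

   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo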

names = ['a', 'b', 'c', 'd', 'message']
pd.read_csv('examples/ex2.csv', names=names, index_col='message')

         a   b   c   d
message
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12

# Pass a list of column names to index_col to build a hierarchical index
parsed = pd.read_csv('examples/csv_mindex.csv',
                     index_col=['key1', 'key2'])
parsed

           value1  value2
key1 key2
one  a          1       2
     b          3       4
     c          5       6
     d          7       8
two  a          9      10
     b         11      12
     c         13      14
     d         15      16

# Some tables do not use a fixed delimiter to separate fields (for example, whitespace or some other pattern)
list(open('examples/ex3.txt'))

['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']

# Use a raw string for the regular expression so that \s is not treated as a string escape
result = pd.read_table('examples/ex3.txt', sep=r'\s+')
result

            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491
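The same whitespace-delimited parse can be tried without the file. This sketch feeds the first lines of ex3.txt shown above through io.StringIO; because the header has one field fewer than the data rows, pandas uses the first column as the index.

```python
import io

import pandas as pd

# First lines of examples/ex3.txt as printed above, held in memory.
text = (
    "            A         B         C\n"
    "aaa -0.264438 -1.026059 -0.619500\n"
    "bbb  0.927272  0.302904 -0.032399\n"
)

# r"\s+" matches any run of whitespace between fields.
result = pd.read_csv(io.StringIO(text), sep=r"\s+")

print(result)
print(result.index.tolist())
```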

# skiprows can skip the first, third, and fourth lines of the file
pd.read_csv('examples/ex4.csv', skiprows=[0, 2, 3])

   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

result = pd.read_csv("examples/ex5.csv")
print(result)
pd.isnull(result)

something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo

   something      a      b      c      d  message
0      False  False  False  False  False     True
1      False  False  False   True  False    False
2      False  False  False  False  False    False

Common options for pandas.read_csv and pandas.read_table
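A couple of these options are easy to try in isolation. The sketch below uses na_values to declare extra missing-value sentinels per column and usecols to load only some columns. The CSV text is an assumption: it is rebuilt from the printed ex5.csv output, and the sentinel choices ("foo", "two") are purely illustrative.

```python
import io

import pandas as pd

# Assumed file contents, rebuilt from the printed output of ex5.csv.
raw = (
    "something,a,b,c,d,message\n"
    "one,1,2,3,4,NA\n"
    "two,5,6,,8,world\n"
    "three,9,10,11,12,foo\n"
)

# Extra per-column sentinels, on top of the defaults such as "NA" and "".
sentinels = {"message": ["foo"], "something": ["two"]}

result = pd.read_csv(
    io.StringIO(raw),
    na_values=sentinels,
    usecols=["something", "a", "message"],  # load only these columns
)

print(result)
print(result.isna())
```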

Reading Text Files in Pieces

First, let's make the pandas display settings more compact.

pd.options.display.max_rows = 10
result = pd.read_csv('examples/ex6.csv')
result

           one       two     three      four key
0     0.467976 -0.038649 -0.295344 -1.824726   L
1    -0.358893  1.404453  0.704965 -0.200638   B
2    -0.501840  0.659254 -0.421691 -0.057688   G
3     0.204886  1.074134  1.388361 -0.982404   R
4     0.354628 -0.133116  0.283763 -0.837063   Q
...        ...       ...       ...       ... ...
9995  2.311896 -0.417070 -1.409599 -0.515821   L
9996 -0.479893 -0.650419  0.745152 -0.646038   E
9997  0.523331  0.787112  0.486066  1.093156   K
9998 -0.362559  0.598894 -1.843201  0.887292   G
9999 -0.096376 -1.012999 -0.657431 -0.573315

10000 rows × 5 columns

# To read only a few rows (and avoid reading the whole file), specify nrows:
pd.read_csv('examples/ex6.csv', nrows=5)

        one       two     three      four key
0  0.467976 -0.038649 -0.295344 -1.824726   L
1 -0.358893  1.404453  0.704965 -0.200638   B
2 -0.501840  0.659254 -0.421691 -0.057688   G
3  0.204886  1.074134  1.388361 -0.982404   R
4  0.354628 -0.133116  0.283763 -0.837063   Q

# To read a file in pieces, specify chunksize as a number of rows
chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)
chunker

<pandas.io.parsers.TextFileReader at 0xb66a590>

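chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)
# Explicit dtype: the default dtype of an empty Series changed from float64 to object in pandas 2.0
tot = pd.Series([], dtype='float64')
for piece in chunker:
    # Add this chunk's counts to the running total, treating keys absent on either side as 0
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)
tot[:10]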

E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64
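The aggregation pattern can be checked on a tiny input where the totals are easy to verify by hand. This is a sketch only: the eight-row table is made up and stands in for examples/ex6.csv, and the chunk size of 3 is chosen so that the last chunk is shorter than the others.

```python
import io

import pandas as pd

# Made-up data standing in for examples/ex6.csv: keys A, B, A, C, A, B, B, A.
raw = "key,value\n" + "".join(f"{k},{i}\n" for i, k in enumerate("ABACABBA"))

counts = pd.Series(dtype="float64")  # running total across chunks

# Each iteration yields a DataFrame of at most 3 rows.
for piece in pd.read_csv(io.StringIO(raw), chunksize=3):
    counts = counts.add(piece["key"].value_counts(), fill_value=0)

counts = counts.sort_values(ascending=False)
print(counts)
```

Only one chunk is held in memory at a time, which is the reason for preferring this over reading the whole file when it is large.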

Writing Data to Text Format

Using DataFrame's to_csv method, we can write the data out to a comma-separated file.

data = pd.read_csv('examples/ex5.csv')
data.to_csv('examples/out2020.csv')

import sys
data.to_csv(sys.stdout, sep='|')

|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo

# Missing values appear as empty strings in the output. You might want to denote them by some other sentinel value:
data.to_csv(sys.stdout, na_rep='NULL')

,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo

# Unless other options are set, both the row and column labels are written. Either can be disabled:
data.to_csv(sys.stdout, index=False, header=False)

one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo

# You can also write only a subset of the columns, in an order of your choosing:
data.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c'])

a,b,c
1,2,3.0
5,6,
9,10,11.0
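To see how na_rep on the writing side pairs with na_values on the reading side, here is a small round trip through an in-memory buffer. The frame is made up for illustration, and "NULL" is simply the sentinel used earlier in this post.

```python
import io

import pandas as pd

# Made-up frame with one missing value in each of two columns.
frame = pd.DataFrame(
    {"a": [1, 5], "c": [3.0, None], "message": [None, "world"]}
)

# Write without the index, marking missing values as NULL.
buffer = io.StringIO()
frame.to_csv(buffer, index=False, na_rep="NULL")
text = buffer.getvalue()
print(text)

# Read it back, telling pandas that NULL means "missing".
restored = pd.read_csv(io.StringIO(text), na_values=["NULL"])
print(restored.isna())
```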

Working with Delimited Formats

import csv
f = open('examples/ex7.csv')
reader = csv.reader(f)

# Iterating over the reader yields one list of string fields per line, with any quote characters removed:
for line in reader:
    print(line)
f.close()  # release the file handle once iteration is done

with open('examples/ex7.csv') as f:
    lines = list(csv.reader(f))
header, values = lines[0], lines[1:]
data_dict = {h: v for h, v in zip(header, zip(*values))}
data_dict

{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}

CSV dialect options
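Dialect options are usually bundled into a subclass of csv.Dialect and passed to csv.reader or csv.writer. The sketch below is one possible setup: the class name and the choice of a semicolon delimiter are my own assumptions, and an in-memory buffer replaces a real file.

```python
import csv
import io


class SemicolonDialect(csv.Dialect):
    """Hypothetical dialect: semicolon-delimited, minimal quoting."""

    delimiter = ";"
    quotechar = '"'
    doublequote = True
    skipinitialspace = False
    lineterminator = "\n"
    quoting = csv.QUOTE_MINIMAL


buffer = io.StringIO()
writer = csv.writer(buffer, dialect=SemicolonDialect)
writer.writerow(["a", "b", "c"])
# The second field contains the delimiter, so the writer quotes it.
writer.writerow(["1", "2;x", "3"])

written = buffer.getvalue()
print(written)

# Reading with the same dialect recovers the original fields.
rows = list(csv.reader(io.StringIO(written), dialect=SemicolonDialect))
print(rows)
```

Individual options such as delimiter can also be passed as keyword arguments to csv.reader and csv.writer without defining a class.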

JSON Data

JSON (JavaScript Object Notation) is a much more flexible data format than a tabular text format such as CSV.

obj = """
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},{"name": "Katie", "age": 38,"pets": ["Sixes", "Stache", "Cisco"]}]
}
"""

import json
result = json.loads(obj)
result

{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}

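# json.dumps, in turn, converts a Python object back to JSON:
asjson = json.dumps(result)
# How you convert a JSON object (or a list of objects) to a DataFrame or some other
# structure convenient for analysis is up to you. A simple approach is to pass a list
# of dicts (which were previously JSON objects) to the DataFrame constructor and
# select a subset of the data fields:
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])
siblings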

    name  age
0  Scott   30
1  Katie   38

# By default, pandas.read_json assumes that each object in the JSON array is a row in the table
data = pd.read_json('examples/example.json')
data

   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
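A round trip makes the "one object per row" assumption concrete. In this sketch the records mirror the values printed above; the real contents of examples/example.json are assumed, not known. The text is wrapped in io.StringIO because newer pandas versions deprecate passing a literal JSON string to read_json.

```python
import io
import json

import pandas as pd

# Assumed records, matching the table printed above.
records = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 4, "b": 5, "c": 6},
    {"a": 7, "b": 8, "c": 9},
]
text = json.dumps(records)

# Each object in the JSON array becomes one row.
data = pd.read_json(io.StringIO(text))
print(data)

# orient="records" writes the frame back as an array of objects.
as_records = data.to_json(orient="records")
print(as_records)
```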

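6.2 Binary Data Formats

One of the simplest ways to store data efficiently in binary format is Python's built-in pickle serialization. All pandas objects have a to_pickle method that writes the data to disk in pickle format.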

frame = pd.read_csv('examples/ex1.csv')
frame

   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

frame.to_pickle('examples/frame_pickle')
pd.read_pickle('examples/frame_pickle')

   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
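The round trip can be verified without leaving a file behind. This is a sketch with a made-up frame and a temporary directory; the file name frame_pickle is reused from above only for familiarity.

```python
import os
import tempfile

import pandas as pd

# Made-up frame for illustration.
frame = pd.DataFrame({"a": [1, 5, 9], "message": ["hello", "world", "foo"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame_pickle")
    frame.to_pickle(path)             # serialize to disk
    restored = pd.read_pickle(path)   # load it back

# Values, dtypes, and index should all survive the round trip.
print(restored.equals(frame))
```

Two caveats apply to pickle in general: the format is not guaranteed to stay readable across library versions, so it suits short-term storage best, and files from untrusted sources should not be unpickled.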

# To use ExcelFile, create an instance by passing the path to an xls or xlsx file:
xlsx = pd.ExcelFile('examples/ex1.xlsx')
pd.read_excel(xlsx, 'Sheet1')

   Unnamed: 0  a   b   c   d message
0           0  1   2   3   4   hello
1           1  5   6   7   8   world
2           2  9  10  11  12     foo

frame = pd.read_excel('examples/ex1.xlsx', 'Sheet1')
frame

   Unnamed: 0  a   b   c   d message
0           0  1   2   3   4   hello
1           1  5   6   7   8   world
2           2  9  10  11  12     foo

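# To write pandas data to Excel format, first create an ExcelWriter, then write to it
# with the pandas object's to_excel method. The with block saves and closes the file
# on exit (ExcelWriter.save() was removed in pandas 2.0):
with pd.ExcelWriter('examples/ex2020.xlsx') as writer:
    frame.to_excel(writer, sheet_name='Sheet1')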

# You can also skip the ExcelWriter and pass a file path straight to to_excel:
frame.to_excel('examples/ex2019.xlsx')

6.3 Interacting with Web APIs

Using the requests package

import requests
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
resp

<Response [200]>

data = resp.json()
data[0]['title']

"DOC: Fix the description of the 'day' field accessor in DatetimeArray"

issues = pd.DataFrame(data, columns=['number', 'title',
                                     'labels', 'state'])
issues

    number                                              title                                             labels  state
 0   31490  DOC: Fix the description of the 'day' field ac...                                                 []   open
 1   31489  ~ operator on Series with BooleanDtype casts t...                                                 []   open
 2   31488                    Unclosed file on EmptyDataError                                                 []   open
 3   31487  Maybe wrong default axis with operators (add, ...                                                 []   open
 4   31486  DOC: Parameter doc strings for Groupby.(sum|pr...                                                 []   open
..     ...                                                ...                                                ...    ...
25   31459  ENH: pd.cut should be able to return a Series ...  [{'id': 76812, 'node_id': 'MDU6TGFiZWw3NjgxMg=...   open
26   31458  Fix to_csv and to_excel links on read_csv, rea...                                                 []   open
27   31457  Timedelta multiplication crashes for large arrays  [{'id': 47223669, 'node_id': 'MDU6TGFiZWw0NzIy...   open
28   31456  BUG: Groupby.apply wasn't allowing for functio...  [{'id': 233160, 'node_id': 'MDU6TGFiZWwyMzMxNj...   open
29   31455  jobs failling with error raise RuntimeError("C...  [{'id': 307649777, 'node_id': 'MDU6TGFiZWwzMDc...   open

30 rows × 4 columns
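The labels column above holds nested lists of dicts, which is awkward to analyze. The sketch below is one way to tidy it and to make the request itself a little more defensive. The helper names fetch_json and issues_to_frame are hypothetical, and the sample records are made up to resemble the shape of the GitHub response; they are not real issue data.

```python
import pandas as pd


def fetch_json(url, timeout=10):
    """Fetch a URL and return the decoded JSON body (hypothetical helper)."""
    import requests  # imported here so the rest of the sketch runs offline

    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
    return resp.json()


def issues_to_frame(issues):
    """Keep four fields per issue and reduce each label dict to its name."""
    frame = pd.DataFrame(issues, columns=["number", "title", "labels", "state"])
    frame["labels"] = frame["labels"].apply(
        lambda labels: [label["name"] for label in labels]
    )
    return frame


# Made-up records shaped like the API response; extra keys are dropped.
sample = [
    {"number": 1, "title": "Example issue", "labels": [], "state": "open",
     "extra": "ignored"},
    {"number": 2, "title": "Another issue",
     "labels": [{"id": 10, "name": "Bug"}], "state": "open"},
]

frame = issues_to_frame(sample)
print(frame)
```

With network access, issues_to_frame(fetch_json(url)) would apply the same steps to the live response from the URL used above.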

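6.4 Interacting with Databases

In business settings, most data is probably not stored in text or Excel files. SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use, and a number of alternative databases have also become popular. The choice of database usually depends on the performance, data integrity, and scalability needs of the application.

Loading data from SQL into a DataFrame is fairly straightforward, and pandas has some functions that simplify the process. As an example, I will use a SQLite database (through Python's built-in sqlite3 driver).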

import sqlite3
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
c REAL, d INTEGER
);"""
con = sqlite3.connect('mydata.sqlite')
con.execute(query)
con.commit()

data = [('Atlanta', 'Georgia', 1.25, 6),
         ('Tallahassee', 'Florida', 2.6, 3),
         ('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
con.executemany(stmt, data)

<sqlite3.Cursor at 0xe49e5a0>

cursor = con.execute('select * from test')
rows = cursor.fetchall()
rows

[('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5)]

cursor.description

(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))

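# The column names are the first element of each tuple in cursor.description
pd.DataFrame(rows, columns=[x[0] for x in cursor.description])

             a           b     c  d
0      Atlanta     Georgia  1.25  6
1  Tallahassee     Florida  2.60  3
2   Sacramento  California  1.70  5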

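# Commit the pending INSERTs first; otherwise a separate connection, such as the
# SQLAlchemy engine below, sees an empty table
con.commit()

import sqlalchemy as sqla
db = sqla.create_engine('sqlite:///mydata.sqlite')
pd.read_sql('select * from test', db)

             a           b     c  d
0      Atlanta     Georgia  1.25  6
1  Tallahassee     Florida  2.60  3
2   Sacramento  California  1.70  5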
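The whole database walk-through can be rerun safely with an in-memory SQLite database, which avoids the "table already exists" error that a second run against mydata.sqlite would raise. This is a sketch under that assumption; it also passes the sqlite3 connection straight to pd.read_sql, which pandas accepts without SQLAlchemy.

```python
import sqlite3

import pandas as pd

# ":memory:" creates a throwaway database that disappears on close.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)"
)

data = [
    ("Atlanta", "Georgia", 1.25, 6),
    ("Tallahassee", "Florida", 2.6, 3),
    ("Sacramento", "California", 1.7, 5),
]
con.executemany("INSERT INTO test VALUES (?, ?, ?, ?)", data)
con.commit()  # make the rows durable and visible to other connections

# Manual route: fetch tuples, take column names from cursor.description.
cursor = con.execute("SELECT * FROM test")
manual = pd.DataFrame(
    cursor.fetchall(), columns=[col[0] for col in cursor.description]
)

# Shortcut: let pandas run the query and build the frame.
direct = pd.read_sql("SELECT * FROM test", con)
con.close()

print(direct)
```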
