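pandas - Read a small random sample from a large CSV file into a Python DataFrame
The CSV file I want to read does not fit into main memory. How can I read a few (~10K) random lines from it and run some simple statistics on the resulting DataFrame?
8 Answers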
55 votes
Assuming the CSV file has no header:
import pandas
import random
n = 1000000  # number of records in file
s = 10000  # desired sample size
filename = "data.txt"
skip = sorted(random.sample(range(n), n - s))  # range replaces Python 2's xrange
df = pandas.read_csv(filename, skiprows=skip, header=None)  # header=None: the file has no header row
It would be better if read_csv had a keeprows option, or if skiprows took a callback function instead of a list.
With a header and unknown file length:
import pandas
import random
filename = "data.txt"
with open(filename) as f:
    n = sum(1 for line in f) - 1  # number of records in file (excludes header)
s = 10000  # desired sample size
skip = sorted(random.sample(range(1, n + 1), n - s))  # the 0-indexed header will not be included in the skip list
df = pandas.read_csv(filename, skiprows=skip)
dlm answered 2020-02-09T17:34:30Z
31 votes
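@dlm's answer is great, but since v0.20.0 skiprows does accept a callable. The callable receives the row number as its argument.
If you can specify what percentage of the lines you want, rather than how many lines, you don't even need to get the file size and you only need to read through the file once. Assuming the header is on the first row: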
import pandas as pd
import random
filename = "data.txt"  # assumed here; the original snippet left filename undefined
p = 0.01  # 1% of the lines
# keep the header, then take only 1% of lines
# if random from [0,1] interval is greater than 0.01 the row will be skipped
df = pd.read_csv(
    filename,
    header=0,
    skiprows=lambda i: i > 0 and random.random() > p
)
Or, if you want to take every nth line:
n = 100 # every 100th line = 1% of the lines
df = pd.read_csv(filename, header=0, skiprows=lambda i: i % n != 0)
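If an exact sample size is needed rather than a percentage, one possible variation is to build the callable from a set of row numbers to keep. This is only a sketch: make_skiprows, n_rows, sample_size and seed are names made up for illustration, and it assumes the number of data rows is already known (for example from a line count).

```python
import random


def make_skiprows(n_rows, sample_size, seed=None):
    """Build a callable suitable for read_csv's skiprows argument.

    The callable keeps row 0 (the header) plus exactly sample_size
    randomly chosen data rows. n_rows counts data rows only.
    """
    rng = random.Random(seed)
    # Data rows are numbered 1..n_rows because row 0 is the header.
    keep = set(rng.sample(range(1, n_rows + 1), sample_size))
    # Return True (skip) for every non-header row that was not chosen.
    return lambda i: i > 0 and i not in keep


# Quick check without pandas: count the rows that would survive.
skip = make_skiprows(n_rows=1000, sample_size=10, seed=0)
kept = [i for i in range(1001) if not skip(i)]
print(len(kept))  # 11: the header row plus 10 sampled rows
```

The callable could then be passed as `pd.read_csv(filename, header=0, skiprows=make_skiprows(n, s))`. Storing the s rows to keep, instead of a list of the n - s rows to skip, keeps the helper's memory use small when the sample is a tiny fraction of the file.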
exp1orer answered 2020-02-09T17:34:59Z
20 votes
This is not in pandas, but it achieves the same result much faster through bash, without reading the entire file into memory:
shuf -n 100000 data/original.tsv > data/sample.tsv
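The shuf command shuffles the input, and the -n argument indicates how many lines we want in the output.
Relevant question: [https://unix.stackexchange.com/q/108581]
Benchmark on a 7-million-line CSV (2008):
Top answer: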
import random
import pandas
def pd_read():
    filename = "2008.csv"
    n = sum(1 for line in open(filename)) - 1  # number of records in file (excludes header)
    s = 100000  # desired sample size
    skip = sorted(random.sample(range(1, n + 1), n - s))  # the 0-indexed header will not be included in the skip list
    df = pandas.read_csv(filename, skiprows=skip)
    df.to_csv("temp.csv")
Timing for pandas:
%time pd_read()
CPU times: user 18.4 s, sys: 448 ms, total: 18.9 s
Wall time: 18.9 s
While using shuf:
time shuf -n 100000 2008.csv > temp.csv
real 0m1.583s
user 0m1.445s
sys 0m0.136s
So shuf is about 12x faster and, importantly, does not read the whole file into memory.
Bar answered 2020-02-09T17:35:50Z
10 votes
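Here is an algorithm that doesn't require counting the number of lines in the file beforehand, so you only need to read the file once.
Say you want m samples. First, the algorithm keeps the first m samples. When it sees the i-th sample (i > m), then with probability m/i it uses that sample to replace a randomly chosen sample that was already selected.
By doing so, for any i > m, we always have a subset of m samples randomly selected from the first i samples.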
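A short induction sketch of why each line ends up in the sample with the same probability (the step index i and item index j are notation introduced here for illustration): after the first m lines, every line is in the sample with probability m/m = 1. Suppose that after i - 1 lines each earlier line is in the sample with probability m/(i-1). Line i enters with probability m/i, and when it does it evicts one particular slot with probability 1/m, so an earlier line j survives step i with probability 1 - (m/i)(1/m):

```latex
\Pr[\text{line } j \text{ in sample after step } i]
  = \frac{m}{i-1}\left(1 - \frac{m}{i}\cdot\frac{1}{m}\right)
  = \frac{m}{i-1}\cdot\frac{i-1}{i}
  = \frac{m}{i},
  \qquad j < i,\ i > m .
```

The new line i is also kept with probability m/i, so all of the first i lines are equally likely to be in the sample.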
See the code below:
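import random
n_samples = 10
samples = []
with open("data.txt") as f:  # hypothetical file name; f was left undefined in the original
    for i, line in enumerate(f):
        if i < n_samples:
            # keep the first n_samples lines
            samples.append(line)
        elif random.random() < n_samples * 1. / (i + 1):
            # with probability n_samples / (i + 1), replace a random kept line
            samples[random.randint(0, n_samples - 1)] = line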
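For a CSV with a header row, the same idea might be wrapped as follows so that the header is never sampled or replaced. This is a minimal sketch under assumptions: reservoir_sample, its parameters and the in-memory demo data are made up for illustration, and it uses the common variant that draws one random index per line instead of two random numbers.

```python
import io
import random


def reservoir_sample(lines, m, seed=None):
    """Return (header, sample) after a single pass over lines.

    header is the first line; sample holds up to m of the remaining
    lines, each kept with equal probability.
    """
    rng = random.Random(seed)
    it = iter(lines)
    header = next(it, "")  # empty string if the input has no lines at all
    sample = []
    for i, line in enumerate(it, start=1):  # i = data lines seen so far
        if i <= m:
            sample.append(line)
        else:
            j = rng.randrange(i)  # uniform index in 0..i-1
            if j < m:  # happens with probability m / i
                sample[j] = line
    return header, sample


# Demo on in-memory data; a real call would pass an open file object instead.
data = io.StringIO("id,value\n" + "".join(f"{k},{k * k}\n" for k in range(1000)))
header, sample = reservoir_sample(data, m=5, seed=42)
print(header.strip(), len(sample))  # id,value 5
```

The result can then be handed to pandas with something like `pd.read_csv(io.StringIO(header + "".join(sample)))`, which parses only the sampled lines. Note that splitting on lines this way assumes no quoted field contains an embedded newline.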
desktable answered 2020-02-09T17:36:23Z
2 votes
The following code first reads the header, and then a random sample of the other lines:
import pandas as pd
import numpy as np
filename = 'hugedatafile.csv'
nlinesfile = 10000000
nlinesrandomsample = 10000
lines2skip = np.random.choice(np.arange(1,nlinesfile+1), (nlinesfile-nlinesrandomsample), replace=False)
df = pd.read_csv(filename, skiprows=lines2skip)
queise answered 2020-02-09T17:36:43Z
1 votes
from random import randint
class magic_checker:
    # Sentinel for iter(): compares equal once it has been compared target_count times
    def __init__(self, target_count):
        self.target = target_count
        self.count = 0
    def __eq__(self, x):
        self.count += 1
        return self.count >= self.target
min_target = 100000
max_target = min_target * 2
nlines = randint(100, 1000)
seek_target = randint(min_target, max_target)
with open("big.csv") as f:
    f.seek(seek_target)  # jump to a random byte offset
    f.readline()  # discard this line (it is probably partial)
    # read consecutive lines until the sentinel reports that enough were seen
    rand_lines = list(iter(lambda: f.readline(), magic_checker(nlines)))
# do something to process the lines you got returned .. perhaps just a split
print(rand_lines)
print(rand_lines[0].split(","))
I think something like this should work.
Joran Beasley answered 2020-02-09T17:37:03Z
1 votes
No pandas!
import random
from os import fstat
from sys import exit
# Binary mode: Python 3 only allows nonzero relative seeks on binary files
f = open('/usr/share/dict/words', 'rb')
# Number of lines to be read
lines_to_read = 100
# Minimum and maximum bytes that will be randomly skipped
min_bytes_to_skip = 10000
max_bytes_to_skip = 1000000
def is_EOF():
    return f.tell() >= fstat(f.fileno()).st_size
# To accumulate the read lines
sampled_lines = []
for n in range(lines_to_read):
    bytes_to_skip = random.randint(min_bytes_to_skip, max_bytes_to_skip)
    f.seek(bytes_to_skip, 1)
    # After skipping "bytes_to_skip" bytes, we can stop in the middle of a line
    # Skip current entire line
    f.readline()
    if not is_EOF():
        sampled_lines.append(f.readline().decode())
    else:
        # Go to the beginning of the file ...
        f.seek(0, 0)
        # ... and skip lines again
        f.seek(bytes_to_skip, 1)
        # If it has reached the EOF again
        if is_EOF():
            print("You have skipped more lines than your file has")
            print("Reduce the values of:")
            print("   min_bytes_to_skip")
            print("   max_bytes_to_skip")
            exit(1)
        else:
            f.readline()
            sampled_lines.append(f.readline().decode())
print(sampled_lines)
You'll get a sampled_lines list. What kind of statistics do you mean?
Vagner Guedes answered 2020-02-09T17:37:27Z
1 votes
Use subsample:
pip install subsample
subsample -n 1000 file.csv > file_1000_sample.csv
Zhongjun 'Mark' Jin answered 2020-02-09T17:37:48Z