pandas: Read a small random sample from a large CSV file into a Python DataFrame

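The CSV file I want to read does not fit into main memory. How can I read a few (~10K) random lines from it and do some simple statistics on the selected DataFrame?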

8 Answers

55 votes

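Assuming the CSV file has no header:

import pandas
import random

n = 1000000  # number of records in file
s = 10000  # desired sample size
filename = "data.txt"
skip = sorted(random.sample(range(n), n - s))
# header=None: the file has no header, so the first kept row must not become the column names
df = pandas.read_csv(filename, skiprows=skip, header=None)

It would be better if read_csv had a keeprows option, or if skiprows accepted a callback function instead of a list.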

With a header and unknown file length:

import pandas
import random

filename = "data.txt"
with open(filename) as f:
    n = sum(1 for line in f) - 1  # number of records in file (excludes header)
s = 10000  # desired sample size
skip = sorted(random.sample(range(1, n + 1), n - s))  # the 0-indexed header will not be included in the skip list
df = pandas.read_csv(filename, skiprows=skip)

dlm answered 2020-02-09T17:34:30Z

31 votes

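@dlm's answer is great, but since v0.20.0, skiprows does accept a callable. The callable receives the row number as its argument.

If you can specify the percentage of lines you want, rather than the number of lines, you don't even need to get the file size, and you only need to read through the file once. Assuming a header on the first row: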

import pandas as pd
import random

filename = "data.txt"  # placeholder path, as in the answer above
p = 0.01  # 1% of the lines

# keep the header, then take only 1% of lines
# if random from [0,1] interval is greater than 0.01 the row will be skipped
df = pd.read_csv(
    filename,
    header=0,
    skiprows=lambda i: i > 0 and random.random() > p
)

Alternatively, if you want to take every n-th line:

n = 100  # every 100th line = 1% of the lines
df = pd.read_csv(filename, header=0, skiprows=lambda i: i % n != 0)
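
The callable form can also return an exact number of rows when the row count is known, by storing the small set of lines to keep instead of the large list of lines to skip. A minimal sketch, assuming a header on line 0 and a row count obtained beforehand; `sample_rows`, `csv_text`, and the seed are illustrative names, not part of the answers above:

```python
import io
import random

import pandas as pd


def sample_rows(source, n_rows, sample_size, seed=None):
    """Read exactly `sample_size` random data rows; file line 0 is the header.

    `n_rows` is the number of data rows, counted beforehand.
    """
    rng = random.Random(seed)
    # Store the (small) set of lines to keep rather than the (large) list to skip.
    keep = set(rng.sample(range(1, n_rows + 1), sample_size))
    # The callable returns True for lines to skip; line 0 (the header) is never skipped.
    return pd.read_csv(source, header=0, skiprows=lambda i: i > 0 and i not in keep)


# Tiny in-memory stand-in for a large file: a header plus 1000 data rows.
csv_text = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(1000))
df = sample_rows(io.StringIO(csv_text), n_rows=1000, sample_size=50, seed=0)
print(len(df))
```

For a real file, pass the path instead of the `io.StringIO` object. The file is still scanned in full, but only the kept rows are parsed into the DataFrame.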

exp1orer answered 2020-02-09T17:34:59Z

20 votes

This is not in pandas, but it achieves the same result much faster through bash, without reading the whole file into memory:

shuf -n 100000 data/original.tsv > data/sample.tsv

The shuf command shuffles the input, and the -n argument indicates how many lines you want in the output.

Related question: [https://unix.stackexchange.com/q/108581]

Benchmark on a 7-million-line CSV (2008):

Top answer:

import random
import pandas

def pd_read():
    filename = "2008.csv"
    with open(filename) as f:
        n = sum(1 for line in f) - 1  # number of records in file (excludes header)
    s = 100000  # desired sample size
    skip = sorted(random.sample(range(1, n + 1), n - s))  # the 0-indexed header will not be included in the skip list
    df = pandas.read_csv(filename, skiprows=skip)
    df.to_csv("temp.csv")

Timing for pandas:

%time pd_read()

CPU times: user 18.4 s, sys: 448 ms, total: 18.9 s

Wall time: 18.9 s

Timing for shuf:

time shuf -n 100000 2008.csv > temp.csv

real 0m1.583s

user 0m1.445s

sys 0m0.136s

So shuf is about 12 times faster and, importantly, does not read the whole file into memory.

Bar answered 2020-02-09T17:35:50Z

10 votes

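Here is an algorithm that does not require counting the number of lines in the file beforehand, so you only need to read the file once.

Say you want m samples. First, the algorithm keeps the first m samples. When it sees the i-th sample (i > m), then with probability m/i it uses that sample to replace a randomly chosen sample that was already selected.

This way, for any i > m, we always have a subset of m samples selected at random from the first i samples.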
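
To see why every sample ends up equally likely, suppose each of the first i-1 samples is in the selection with probability m/(i-1), which holds trivially at i-1 = m. The i-th sample enters with probability m/i, and an earlier sample j survives step i unless the new sample is accepted and lands on its slot:

```latex
\Pr[\,j \text{ selected after } i \text{ samples}\,]
  = \frac{m}{i-1}\left(1 - \frac{m}{i}\cdot\frac{1}{m}\right)
  = \frac{m}{i-1}\cdot\frac{i-1}{i}
  = \frac{m}{i}, \qquad j < i,\ i > m .
```

By induction, every one of the first i samples is in the selection with probability m/i.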

See the code below:

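import random

n_samples = 10
samples = []

with open("data.csv") as f:  # placeholder path; any iterable of lines works
    for i, line in enumerate(f):
        if i < n_samples:
            # keep the first n_samples lines
            samples.append(line)
        elif random.random() < n_samples * 1. / (i + 1):
            # replace a randomly chosen kept line
            samples[random.randint(0, n_samples - 1)] = line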
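
The loop above treats every line alike, so a header line could be sampled or replaced like any other row. A minimal sketch of the same idea for a CSV with a header, using only the standard library; `reservoir_sample_csv`, the seed parameter, and the in-memory test data are illustrative assumptions:

```python
import csv
import io
import random


def reservoir_sample_csv(lines, k, seed=None):
    """Return (header, rows): the header plus k data rows chosen uniformly at random.

    `lines` is any iterable of CSV text lines (an open file works); it is read once.
    """
    rng = random.Random(seed)
    reader = csv.reader(lines)
    header = next(reader)  # assumes the first record is a header
    sample = []
    for i, row in enumerate(reader):
        if i < k:
            sample.append(row)  # fill the reservoir with the first k rows
        elif rng.random() < k / (i + 1):
            sample[rng.randrange(k)] = row  # replace a randomly chosen slot
    return header, sample


# Small in-memory stand-in for a large file: a header plus 500 data rows.
text = "id,value\n" + "".join(f"{i},{i * i}\n" for i in range(500))
header, rows = reservoir_sample_csv(io.StringIO(text), k=10, seed=1)
print(header, len(rows))
```

The result can be handed to `pandas.DataFrame(rows, columns=header)`. The fields are still strings at that point, so numeric columns would need converting.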

desktable answered 2020-02-09T17:36:23Z

2 votes

The following code first reads the header, then a random sample of the other lines:

import pandas as pd
import numpy as np

filename = 'hugedatafile.csv'
nlinesfile = 10000000
nlinesrandomsample = 10000
lines2skip = np.random.choice(np.arange(1, nlinesfile + 1), (nlinesfile - nlinesrandomsample), replace=False)
df = pd.read_csv(filename, skiprows=lines2skip)

queise answered 2020-02-09T17:36:43Z

1 vote

from random import randint

class magic_checker:
    # Sentinel for iter(): compares equal once it has been checked target_count times
    def __init__(self, target_count):
        self.target = target_count
        self.count = 0

    def __eq__(self, x):
        self.count += 1
        return self.count >= self.target

min_target = 100000
max_target = min_target * 2
nlines = randint(100, 1000)
seek_target = randint(min_target, max_target)

with open("big.csv", "rb") as f:  # binary mode, so seek() can jump to an arbitrary byte offset
    f.seek(seek_target)
    f.readline()  # discard this line (it is probably partial)
    rand_lines = [raw.decode() for raw in iter(f.readline, magic_checker(nlines))]

# do something to process the lines you got returned .. perhaps just a split
print(rand_lines)
print(rand_lines[0].split(","))

I think something like this should work.

Joran Beasley answered 2020-02-09T17:37:03Z

1 vote

No pandas!

import random
from os import fstat
from sys import exit

# Binary mode: Python 3 does not allow relative seeks on text-mode files
f = open('/usr/share/dict/words', 'rb')

# Number of lines to be read
lines_to_read = 100

# Minimum and maximum bytes that will be randomly skipped
min_bytes_to_skip = 10000
max_bytes_to_skip = 1000000

def is_EOF():
    return f.tell() >= fstat(f.fileno()).st_size

# To accumulate the read lines
sampled_lines = []

for n in range(lines_to_read):
    bytes_to_skip = random.randint(min_bytes_to_skip, max_bytes_to_skip)
    f.seek(bytes_to_skip, 1)
    # After skipping "bytes_to_skip" bytes, we can stop in the middle of a line
    # Skip current entire line
    f.readline()
    if not is_EOF():
        sampled_lines.append(f.readline().decode())
    else:
        # Go to the beginning of the file ...
        f.seek(0, 0)
        # ... and skip lines again
        f.seek(bytes_to_skip, 1)
        # If it has reached the EOF again
        if is_EOF():
            print("You have skipped more lines than your file has")
            print("Reduce the values of:")
            print("   min_bytes_to_skip")
            print("   max_bytes_to_skip")
            exit(1)
        else:
            f.readline()
            sampled_lines.append(f.readline().decode())

f.close()
print(sampled_lines)

You'll end up with a sampled_lines list. What kind of statistics do you mean?
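
As for the simple statistics the question asks about, here is a minimal sketch using only the standard library. It assumes the sampled lines are comma-separated with no header and that the column of interest is numeric; `column_summary` and `example_lines` are hypothetical names standing in for real data:

```python
import csv
import statistics


def column_summary(lines, column):
    """Summarize one numeric column of raw CSV lines (no header expected)."""
    values = []
    for row in csv.reader(lines):
        try:
            values.append(float(row[column]))
        except (IndexError, ValueError):
            continue  # skip short rows and non-numeric fields
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


# Hypothetical sampled lines standing in for `sampled_lines`.
example_lines = ["a,1.0\n", "b,2.0\n", "c,3.0\n", "d,oops\n"]
summary = column_summary(example_lines, column=1)
print(summary)
```

`statistics.stdev` needs at least two numeric values, so a very small or mostly non-numeric sample would raise an error here.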

Vagner Guedes answered 2020-02-09T17:37:27Z

1 vote

Use subsample:

pip install subsample
subsample -n 1000 file.csv > file_1000_sample.csv

Zhongjun 'Mark' Jin answered 2020-02-09T17:37:48Z
