Python: an efficient and practical way to iterate over a DataFrame and write rows containing large amounts of text to text files, row by row

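I have a large dataframe in which every row contains a lot of text data. I am trying to partition this dataframe on one of its columns (column 11) and then write the rows out to multiple files.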

import pandas as pd

partitioncount = 5
trainingDataFile = 'sometrainingDatFileWithHugeTextDataInEachColumn.tsv'
df = pd.read_table(trainingDataFile, sep='\t', header=None, encoding='utf-8')

# prepare output files and keep them open to append the dataframe rows
outputfiles = {}
filename = r"C:\Input_Partition"
for i in range(partitioncount):
    outputfiles[i] = open(filename + "_%s.tsv" % (i,), "a")

# loop through the dataframe and write to buckets/files
for index, row in df.iterrows():
    # partition on a hash function
    partition = hash(row[11]) % partitioncount
    outputfiles[partition].write("\t".join([str(num) for num in df.iloc[index].values]) + "\n")

This code results in an error:

IndexError                                Traceback (most recent call last)
in <module>()
---> 73     outputfiles[partition].write("\t".join([str(num) for num in df.iloc[index].values]) + "\n")

c:\python27\lib\site-packages\pandas\core\indexing.pyc in __getitem__(self, key)
   1326         else:
   1327             key = com._apply_if_callable(key, self.obj)
-> 1328             return self._getitem_axis(key, axis=0)
   1329
   1330     def _is_scalar_access(self, key):

c:\python27\lib\site-packages\pandas\core\indexing.pyc in _getitem_axis(self, key, axis)
   1747
   1748             # validate the location
-> 1749             self._is_valid_integer(key, axis)
   1750
   1751         return self._get_loc(key, axis=axis)

c:\python27\lib\site-packages\pandas\core\indexing.pyc in _is_valid_integer(self, key, axis)
   1636         l = len(ax)
   1637         if key >= l or key < -l:
-> 1638             raise IndexError("single positional indexer is out-of-bounds")
   1639         return True
   1640

IndexError: single positional indexer is out-of-bounds
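One way this error can arise, assuming the omitted per-row operations filtered or reordered rows so that the index labels are no longer 0..n-1: `iterrows()` yields index labels, while `iloc` expects integer positions. The sketch below uses a made-up three-row frame (the labels 10, 20, 30 and both helper names are hypothetical) to show the mismatch, and that reusing the `row` already yielded by `iterrows()` avoids the second lookup entirely.

```python
import pandas as pd

# Hypothetical frame whose index labels (10, 20, 30) differ from positions (0, 1, 2).
df = pd.DataFrame({"a": ["x", "y", "z"]}, index=[10, 20, 30])


def row_lines_via_iloc(frame):
    """Mimic the question's loop: treat the index label as a position."""
    return ["\t".join(str(v) for v in frame.iloc[label].values)
            for label, _ in frame.iterrows()]


def row_lines_via_row(frame):
    """Use the row that iterrows() already yields; no second lookup needed."""
    return ["\t".join(str(v) for v in row.values)
            for _, row in frame.iterrows()]


try:
    row_lines_via_iloc(df)
    raised = False
except IndexError:
    # iloc[10] is out of bounds for a frame with only three rows
    raised = True

lines = row_lines_via_row(df)
```

If the labels do happen to be 0..n-1, both versions give the same lines, but the second one still skips a redundant per-row lookup.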

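What is the most efficient and scalable way to iterate over the rows of a dataframe, do some operations on each row (not shown in the code above and not relevant to the problem at hand), and finally write each row (which contains a large amount of text data) to a text file?

Any help is appreciated!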
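For reference, a minimal sketch of the loop described in the question, under a few assumptions: `write_partitions`, the demo frame, and the temporary output directory are all hypothetical, and the key column is addressed by label rather than by position 11. It uses `itertuples` instead of `iterrows` (plain tuples are cheaper to build than a Series per row) and `contextlib.ExitStack` so that every output file is closed even if a write fails.

```python
import os
import tempfile
from contextlib import ExitStack

import pandas as pd


def write_partitions(frame, key_col, partition_count, prefix):
    """Append each row of `frame` to one of `partition_count` TSV files,
    chosen by hashing the value in `key_col`."""
    key_pos = frame.columns.get_loc(key_col)
    with ExitStack() as stack:
        files = [
            stack.enter_context(
                open("%s_%d.tsv" % (prefix, i), "a", encoding="utf-8"))
            for i in range(partition_count)
        ]
        # name=None yields plain tuples; index=False leaves the index out
        for row in frame.itertuples(index=False, name=None):
            bucket = hash(row[key_pos]) % partition_count
            files[bucket].write("\t".join(str(v) for v in row) + "\n")


# Tiny made-up frame; integer keys hash to themselves, so buckets are id % 2.
demo = pd.DataFrame({"id": [1, 2, 3, 4], "text": ["a", "b", "c", "d"]})
out_dir = tempfile.mkdtemp()
write_partitions(demo, "id", 2, os.path.join(out_dir, "part"))
```

Any per-row processing would go inside the `for` loop, just before the `write` call.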

partition = hash(row[11]) % partitioncount confuses me a bit. What is this supposed to do?

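It is just a hash function used to pick a bucket/file pseudo-randomly. It takes the value from column 11, hashes it (for randomization), and then applies modulo 5, which gives you at most 5 partitions.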
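One caveat worth illustrating: on Python 3 the built-in `hash()` of a string is salted per process unless `PYTHONHASHSEED` is set, so the same value may land in a different file on the next run. The sketch below swaps in `zlib.crc32` as a stable alternative. This is a substitution, not what the question's code does, and `stable_partition` is a hypothetical helper name.

```python
import zlib


def stable_partition(value, partition_count):
    """Map a value to a bucket in [0, partition_count) using the CRC32
    of its UTF-8 text, which is the same in every Python process."""
    digest = zlib.crc32(str(value).encode("utf-8"))
    return digest % partition_count


# Equal values always map to the same bucket.
buckets = [stable_partition(v, 5) for v in ["alpha", "beta", "alpha"]]
```

With pandas this could be applied as `df.iloc[:, 11].map(lambda v: stable_partition(v, partitioncount))`.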

IIUC you can do it this way:

filename = r'/path/to/output_{}.csv'
df.groupby(df.iloc[:, 11].map(hash) % partitioncount) \
  .apply(lambda g: g.to_csv(filename.format(g.name), sep='\t', index=False))

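That's cool, I learned something new. However, I have to create consistent partitions, which means I need to limit the number of examples per group: each group may have no more than 1k examples in any one partition. With this solution, all the examples of a group go into a single partition/file.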
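One possible reading of that requirement, sketched under assumptions: number the rows within each group with `groupby(...).cumcount()` and integer-divide by the cap, so that rows 0..cap-1 of every group go to partition 0, the next `cap` rows to partition 1, and so on. The helper name, the column names, and the tiny cap of 2 are made up for illustration. Note that the number of partitions then grows with the largest group instead of being fixed at 5.

```python
import pandas as pd


def assign_capped_partitions(frame, key_col, cap):
    """Return a Series of partition ids such that no key value has more
    than `cap` rows in any single partition."""
    position_in_group = frame.groupby(key_col).cumcount()
    return position_in_group // cap


# Made-up data: group "a" has 5 rows, group "b" has 2; the cap is 2 per partition.
df = pd.DataFrame({"key": ["a"] * 5 + ["b"] * 2, "text": list("abcdefg")})
df["partition"] = assign_capped_partitions(df, "key", cap=2)

# Rows per (key, partition) pair; none should exceed the cap.
counts = df.groupby(["key", "partition"]).size()
```

Each partition could then be written out with `groupby("partition")` and `to_csv`, as in the answer above.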