
json / pandas memory blow-up: reduce the memory usage of this pandas code that reads JSON log files and pickles the result

I can't figure out a way to reduce memory usage for this program further.

Basically, I'm reading from JSON log files into a pandas dataframe, but:

the list append function is what is causing the issue. It creates two different objects in memory, causing huge memory usage.

The .to_pickle method of pandas is also a huge memory hog; the biggest spike in memory comes when writing the pickle.

Here is my most efficient implementation to date:

import glob
import json
import pandas as pd

columns = ['eventName', 'sessionId', "eventTime", "items", "currentPage", "browserType"]
df = pd.DataFrame(columns=columns)

l = []
for i, file in enumerate(glob.glob("*.log")):
    print("Going through log file #%s named %s..." % (i + 1, file))
    with open(file) as myfile:
        l += [json.loads(line) for line in myfile]

    tempdata = pd.DataFrame(l)
    # drop any columns that are not in the whitelist above
    for column in tempdata.columns:
        if column not in columns:
            try:
                tempdata.drop(column, axis=1, inplace=True)
            except ValueError:
                print("oh no! We've got a problem with %s column! It don't exist!" % column)

    l = []
    # append builds a brand-new frame from df + tempdata each iteration,
    # so both copies live in memory at the same time
    df = df.append(tempdata, ignore_index=True)

    # very slow version, but is most memory efficient
    # length = len(df)
    # length_temp = len(tempdata)
    # for i in range(1, length_temp):
    #     update_progress((i * 100.0) / length_temp)
    #     for column in columns:
    #         df.at[length + i, column] = tempdata.at[i, column]

    tempdata = 0

print("Data Frame initialized and filled! Now Sorting...")
df.sort_values(by=["sessionId", "eventTime"], inplace=True)
print("Done Sorting... Changing indices...")
df.index = range(1, len(df) + 1)
print("Storing in Pickles...")
df.to_pickle('data.pkl')

Is there an easy way to reduce memory? The commented code does the job but takes 100-1000x longer. I'm currently at 45% memory usage at max during the .to_pickle part, 30% during the reading of the logs. But the more logs there are, the higher that number goes.

Solution

This answer covers general pandas DataFrame memory-usage optimization:

Pandas loads string columns as the object dtype by default. For columns of type object, try assigning the category dtype by passing a dictionary to the dtype parameter of read_csv. Memory usage decreases dramatically for columns with 50% or fewer unique values.
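A minimal sketch of this (the file name is a placeholder, and the column names are borrowed from the question):

import pandas as pd

# Declaring repetitive string columns as "category" at read time avoids
# first materializing them as full object columns and converting later.
df = pd.read_csv(
    "events.csv",
    dtype={"eventName": "category", "browserType": "category", "currentPage": "category"},
)

# An already-loaded object column can be converted the same way:
# df["eventName"] = df["eventName"].astype("category")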

Pandas reads numeric columns in as float64 by default. Use pd.to_numeric with downcast='float' to downcast float64 to float32 where possible (or cast with .astype('float16') if the precision loss is acceptable). This again saves you memory.
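A sketch of the downcast (the DataFrame here is just a stand-in):

import pandas as pd

df = pd.DataFrame({"items": [1.0, 2.0, 3.0]})  # float64 by default

# downcast="float" picks the smallest float dtype that can hold the values
# (float32 at minimum); "integer"/"unsigned" do the same for integer columns.
df["items"] = pd.to_numeric(df["items"], downcast="float")
print(df["items"].dtype)  # float32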

Load CSV data in chunk by chunk: process each chunk, then move on to the next. This can be done by passing a value to the chunksize parameter of read_csv.
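A sketch of chunked reading (the file name, chunk size, and aggregation are placeholders):

import pandas as pd

# Each chunk is an ordinary DataFrame of at most 100_000 rows; it can be
# filtered or aggregated and then discarded before the next chunk is loaded.
results = []
for chunk in pd.read_csv("events.csv", chunksize=100_000):
    results.append(chunk.groupby("sessionId").size())

# Combine the per-chunk counts (a session may span more than one chunk).
event_counts = pd.concat(results).groupby(level=0).sum()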