
Training becomes slow? Mysterious training slowdowns, and using tcmalloc with TensorFlow!

--------------------Preface------------------------

While training a video-classification model, I found that TensorFlow had inexplicably slowed to 2~5 sec/batch when it had consistently been 0.4 sec/batch. This reminded me of a similar situation from my earliest mxnet classification training (on the same training server), so I decided to investigate:

(1) Killed some zombie processes and extra parallel workers (e.g. im2rec). No effect, and CPU utilization was not high anyway, which ruled out CPU performance;

(2) Killed some system processes (kworker, falcon-agent, gpu-monitor, etc.) to rule out the impact of system processes on training performance;

(3) iostat -x showed I/O was not high. I also benchmarked the image data loader; its performance varied, but it was not the dominant bottleneck. A Python cProfile hot-spot analysis put the bottleneck in session.run(), and htop/top showed modest memory usage but a very large cache;

(4) top showed CPU utilization out of proportion to the thread count I had configured. I guessed multi-thread contention was the cause, but could not pin down exactly why;

(5) Searched GitHub for solutions, and tcmalloc worked (the simplest fix used to be a reboot, which I did not resort to this time ^-^). Summary below:

I. ----------------Overloading new and delete---------------------

First, let's clarify a C++ concept: operator overloading. Intuitively, everyone thinks of the overloadable operators as ++, --, +, -, *, /, (), [] and so on, but there are two special ones: operator new and operator delete. Both can be overloaded, and the main purpose of overloading them is better memory management (a memory pool).

Second, let's clarify the concept and use of a memory pool. The operating system's native memory management often falls short, especially in large client programs. Large programs call new (malloc) and delete (free) constantly, which costs time and causes memory fragmentation (google the concept if it is unfamiliar); and in a complex program, memory leaks are hard to locate and track down. Hence the memory pool, which usually overloads the new and delete operators. Anyone familiar with MFC development on Windows will recognize the following macro:

#ifdef _DEBUG
#define DEBUG_NEW new(THIS_FILE, __LINE__)
#else
#define DEBUG_NEW new
#endif

MFC overloads new so that each allocation's location and file path can be recorded, which makes memory leaks easy to spot. Another example:

#include <stdio.h>
#include <stdlib.h>

// Overload the global operator new
void *operator new(size_t size)
{
	printf("operator new: allocate %zu bytes\n", size);
	void *ptr = malloc(size); // a custom memory pool could hand out the block here
	return ptr;
}

// Overload the global operator delete
void operator delete(void *ptr)
{
	puts("operator delete");
	free(ptr);
}

int main()
{
	int *p = new int(0); // uses the overloaded operator new, not the system one
	delete p;
	return 0;
}

The tcmalloc and jemalloc we are discussing today work on a similar principle: if tcmalloc is selected, our program's allocations go through tcmalloc's memory-management machinery instead.
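As a tiny illustration of that principle, here is a minimal sketch of a class-level free-list pool built on overloaded operator new/delete. It is single-threaded and illustrative only; real allocators like tcmalloc apply the same recycling idea generically, per size class, and thread-safely. The Node class and its fields are hypothetical.

#include <cstddef>
#include <cstdlib>

// Toy fixed-type pool: operator new/delete recycle freed blocks on a free
// list instead of returning them to the OS. Not thread-safe; sketch only.
class Node {
public:
	static void* operator new(std::size_t size) {
		if (free_list_ != nullptr) {       // reuse a recycled block
			Node* block = free_list_;
			free_list_ = free_list_->next_;
			return block;
		}
		return std::malloc(size);          // otherwise fall back to malloc
	}
	static void operator delete(void* ptr) noexcept {
		Node* node = static_cast<Node*>(ptr);
		node->next_ = free_list_;          // push block onto the free list
		free_list_ = node;
	}
private:
	Node* next_ = nullptr;
	static Node* free_list_;
	double payload_[4];                    // stand-in for real per-node data
};

Node* Node::free_list_ = nullptr;

int main() {
	Node* a = new Node;  // free list empty: falls back to malloc
	delete a;            // block is cached on the free list
	Node* b = new Node;  // reuses a's block; no malloc call this time
	delete b;
	return 0;
}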

II. ---------------------What is tcmalloc?-----------------

tcmalloc is a memory allocator. It manages heap memory, primarily hooking malloc and free, to reduce the performance cost of frequent allocation and deallocation while effectively controlling memory fragmentation. The allocator in glibc is ptmalloc2, and tcmalloc is faster: a malloc/free pair takes ptmalloc about 300 ns, versus about 50 ns for tcmalloc. tcmalloc also stores small objects more compactly, needing less space. It is particularly optimized for multithreading: small-object allocation has essentially no lock contention, while large objects use fine-grained, efficient spinlocks. Thread-local caches that stay idle for a long time are reclaimed for use by other threads, which improves memory utilization under multithreading without wasting memory, something ptmalloc2 cannot do.

The following excerpt from the gperftools documentation captures the key points of tcmalloc:

TCMalloc : Thread-Caching Malloc

Motivation

TCMalloc is faster than the glibc 2.3 malloc (available as a separate library called ptmalloc2) and other mallocs that I have tested. ptmalloc2 takes approximately 300 nanoseconds to execute a malloc/free pair on a 2.8 GHz P4 (for small objects). The TCMalloc implementation takes approximately 50 nanoseconds for the same operation pair. Speed is important for a malloc implementation because if malloc is not fast enough, application writers are inclined to write their own custom free lists on top of malloc. This can lead to extra complexity, and more memory usage unless the application writer is very careful to appropriately size the free lists and scavenge idle objects out of the free list

TCMalloc also reduces lock contention for multi-threaded programs. For small objects, there is virtually zero contention. For large objects, TCMalloc tries to use fine grained and efficient spinlocks. ptmalloc2 also reduces lock contention by using per-thread arenas but there is a big problem with ptmalloc2's use of per-thread arenas. In ptmalloc2 memory can never move from one arena to another. This can lead to huge amounts of wasted space. For example, in one Google application, the first phase would allocate approximately 300MB of memory for its data structures. When the first phase finished, a second phase would be started in the same address space. If this second phase was assigned a different arena than the one used by the first phase, this phase would not reuse any of the memory left after the first phase and would add another 300MB to the address space. Similar memory blowup problems were also noticed in other applications.

Another benefit of TCMalloc is space-efficient representation of small objects. For example, N 8-byte objects can be allocated while using space approximately 8N * 1.01 bytes. I.e., a one-percent space overhead. ptmalloc2 uses a four-byte header for each object and (I think) rounds up the size to a multiple of 8 bytes and ends up using 16N bytes.
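If you want to sanity-check the malloc/free-pair numbers quoted above on your own hardware, a rough micro-benchmark like the following sketch will do: build it, run it once normally and once under LD_PRELOAD=/usr/lib/libtcmalloc.so, and compare. The iteration count and object size are arbitrary choices of mine; absolute numbers will differ from the 2.8 GHz P4 figures, so only the ratio is meaningful.

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Time a large number of small malloc/free pairs and report the average.
int main() {
	const int kIters = 1000000;
	const std::size_t kSize = 32;      // a "small object"
	auto start = std::chrono::steady_clock::now();
	for (int i = 0; i < kIters; ++i) {
		void* p = std::malloc(kSize);
		static_cast<char*>(p)[0] = 1;  // touch the block so the pair
		std::free(p);                  // is not optimized away
	}
	auto end = std::chrono::steady_clock::now();
	double ns = std::chrono::duration<double, std::nano>(end - start).count();
	std::printf("avg malloc/free pair: %.1f ns\n", ns / kIters);
	return 0;
}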


           

For more detail on tcmalloc, see the Google perf tools documentation:

http://goog-perftools.sourceforge.net/doc/tcmalloc.html

III. ---------------------Training becomes slow----------------

At some point, whether training with mxnet or with TensorFlow, you may be puzzled to observe the following:

training speed randomly slows down;

or, each batch periodically becomes slow;

or, training has always been very fast, but not today, even though no training code was changed~~;

or, GPU utilization is low today (first rule out an I/O bottleneck).

If you hit the situations above, it may be a cache problem, caused precisely by using the system's built-in memory allocation mechanism:

There may be some memory pressure caused by virtual address space fragmentation and high system buffer cache churn (reading large training datasets from the file system)

So let's try to fix it:

(1) htop: an effective tool for inspecting the cache:

[Figure: htop output. The cache occupies too much memory; the yellow bars of the Mem meter represent cache usage.]

[Figure: legend explaining htop's display.]

(2)Clear RAM Memory Cache

Linux provides a way to flush or clear the RAM cache.

How to Clear Cache in Linux?

Every Linux system has three options to clear the cache without interrupting any processes or services:

a. Clear PageCache only:

# sync; echo 1 > /proc/sys/vm/drop_caches

b. Clear dentries and inodes:

# sync; echo 2 > /proc/sys/vm/drop_caches

c. Clear PageCache, dentries and inodes:

# sync; echo 3 > /proc/sys/vm/drop_caches

sync flushes the file-system buffers; writing to drop_caches cleans the cache without killing any application or service.

If you have to clear the disk cache, the first command ("...echo 1 > ...") is the safest in enterprise and production, as it clears the PageCache only. Using the third option ("...echo 3 > ...") in production is not recommended unless you know what you are doing, as it clears the PageCache, dentries, and inodes.

For a more detailed explanation, see:

https://www.tecmint.com/clear-ram-memory-cache-buffer-and-swap-space-on-linux/

(3) Use tcmalloc when training:

TCMalloc seems to improve training speed and avoid the occasional slowdowns seen with the default allocator. You can enable it by installing it and setting LD_PRELOAD=/usr/lib/libtcmalloc.so:

LD_PRELOAD=/usr/lib/libtcmalloc.so python train.py
           

Now the training process allocates its buffers from tcmalloc's memory pool, which improves efficiency, speeds up training, and restores normal training speed.
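If you want to confirm the preload actually took effect, one option (my own addition, not part of the original workflow) is to scan the training process's /proc/<pid>/maps for the tcmalloc shared object, e.g. with a small checker like the sketch below. Note that the library file name varies by distro (libtcmalloc.so, libtcmalloc_minimal.so, ...).

#include <cstdio>
#include <fstream>
#include <string>

// Scan a running process's memory map for a mapped tcmalloc library.
// Usage: ./check_tcmalloc <pid>
int main(int argc, char** argv) {
	if (argc != 2) {
		std::fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 2;
	}
	std::ifstream maps("/proc/" + std::string(argv[1]) + "/maps");
	std::string line;
	while (std::getline(maps, line)) {
		if (line.find("tcmalloc") != std::string::npos) {
			std::printf("tcmalloc is mapped: %s\n", line.c_str());
			return 0;
		}
	}
	std::puts("tcmalloc is NOT loaded in this process");
	return 1;
}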

Note: when deploying C++ programs, if you run into heavy multi-threaded allocation/deallocation or frequent operations on large blocks of memory, you can likewise watch the cache and try tcmalloc as an improvement.
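For that C++ case, a minimal stress-test sketch like the following can show whether the allocator is the problem: it spawns several threads doing frequent short-lived allocations, the pattern where tcmalloc's per-thread caches usually help most. Thread count, block size, and iteration count are arbitrary choices; build with -pthread and compare wall time with and without the LD_PRELOAD above.

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

// Each worker churns through many small, short-lived allocations.
static void churn(int iters) {
	std::vector<void*> live;
	live.reserve(64);
	for (int i = 0; i < iters; ++i) {
		live.push_back(std::malloc(64));
		if (live.size() == 64) {                // free in batches, mimicking
			for (void* p : live) std::free(p);  // short object lifetimes
			live.clear();
		}
	}
	for (void* p : live) std::free(p);
}

int main() {
	const int kThreads = 8;
	const int kIters = 1000000;
	auto start = std::chrono::steady_clock::now();
	std::vector<std::thread> workers;
	for (int t = 0; t < kThreads; ++t) workers.emplace_back(churn, kIters);
	for (std::thread& w : workers) w.join();
	double ms = std::chrono::duration<double, std::milli>(
	                std::chrono::steady_clock::now() - start).count();
	std::printf("%d threads x %d alloc/free: %.0f ms\n", kThreads, kIters, ms);
	return 0;
}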

Also, tcmalloc cannot be used under JNI and similar setups: JNI may load the system runtime's allocator first and tcmalloc later, so memory allocated by one may be freed by the other, causing conflicts (remember: memory must be freed by the same library that allocated it).
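The rule in that last parenthesis can be made concrete with a small sketch: a library should export a matching create/destroy pair, so a block is always released by the same allocator that produced it, even when the host process (e.g. a JVM calling in via JNI) ended up linking a different malloc. The function names below are hypothetical.

#include <cstdlib>

// Allocation happens with THIS library's allocator...
extern "C" char* buffer_create(std::size_t size) {
	return static_cast<char*>(std::malloc(size));
}

// ...and deallocation goes through the same library, same allocator.
extern "C" void buffer_destroy(char* buf) {
	std::free(buf);
}

// Callers must always pair buffer_create() with buffer_destroy(), never
// release the buffer with their own free() or delete.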
