Server hardware configuration

Deep learning training / GPU server hardware configuration

Current configuration:

CPU

# Number of physical CPUs (sockets)
grep "physical id" /proc/cpuinfo | sort -u | wc -l
# Number of cores per physical CPU
grep "cpu cores" /proc/cpuinfo | uniq
# Number of logical CPUs
grep -c "^processor" /proc/cpuinfo
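The three counts above can also be gathered in a single pass. The following is a minimal sketch rather than part of the original setup: `cpu_summary` is a hypothetical helper name, and it assumes the usual `/proc/cpuinfo` layout of `key : value` lines.

```shell
# cpu_summary [FILE]: print socket, core and logical CPU counts.
# FILE defaults to /proc/cpuinfo (hypothetical helper, for illustration only).
cpu_summary() {
    local file="${1:-/proc/cpuinfo}"
    awk -F'[ \t]*:[ \t]*' '
        # Count each distinct physical id once (one per socket).
        /^physical id/ { if (!($2 in ids)) { ids[$2] = 1; sockets++ } }
        # "cpu cores" is the number of cores in one physical CPU.
        /^cpu cores/   { cores = $2 }
        # One "processor" entry exists per logical CPU.
        /^processor/   { logical++ }
        END {
            printf "sockets=%d cores_per_socket=%d logical_cpus=%d\n",
                   sockets, cores, logical
        }
    ' "$file"
}
```

Calling `cpu_summary` with no argument reads `/proc/cpuinfo`. If `logical_cpus` is larger than `sockets` times `cores_per_socket`, hyper-threading is most likely enabled.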
           

Memory modules

# Show memory module details
sudo dmidecode --type memory
           

Below is an excerpt of the output. It shows a maximum capacity of 384 GB and 6 slots:

Handle 0x003C, DMI type 16, 23 bytes
Physical Memory Array
	Location: System Board Or Motherboard
	Use: System Memory
	Error Correction Type: Multi-bit ECC
	Maximum Capacity: 384 GB
	Error Information Handle: Not Provided
	Number Of Devices: 6
           

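Details for a single slot:

Each populated slot holds a 32 GB module, and two of the six slots are populated. The machine has 4 × 6 such memory slots in total. Ideally each slot would hold 384 / 6 = 64 GB; at present the installed total is 2 × 4 × 32 = 256 GB.

The module's rated transfer rate to the CPU is 2667 MT/s (mega-transfers per second), although the output below shows it is currently configured to run at 2400 MT/s.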

Handle 0x003E, DMI type 17, 40 bytes
Memory Device
	Array Handle: 0x003C
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 72 bits
	Size: 32 GB
	Form Factor: DIMM
	Set: None
	Locator: P1_DIMMA1
	Bank Locator: P1_Node0_Channel0_Dimm1
	Type: DDR4
	Type Detail: Synchronous
	Speed: 2667 MT/s
	Manufacturer: Samsung
	Serial Number: 38ED2DAE
	Asset Tag: P1_DIMMA1_AssetTag (Date:18/15)
	Part Number: M393A4K40CB2-CTD    
	Rank: 2
	Configured Clock Speed: 2400 MT/s
	Minimum Voltage: Unknown
	Maximum Voltage: Unknown
	Configured Voltage: Unknown
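
To avoid adding up the slots by hand, the `Size:` lines can be tallied with a short filter. This is a rough sketch under stated assumptions: `mem_summary` is a hypothetical name, and it expects each slot to report either `Size: <number> GB`, `Size: <number> MB`, or a non-numeric value such as `No Module Installed` for an empty slot.

```shell
# mem_summary: read `dmidecode --type memory` output on stdin and report
# populated slots, empty slots and total installed memory in GB.
# (Hypothetical helper, for illustration only.)
mem_summary() {
    awk '
        /^[ \t]*Size:/ {
            if ($2 ~ /^[0-9]+$/) {
                populated++
                if ($3 == "GB") { total += $2 }
                else if ($3 == "MB") { total += $2 / 1024 }
            } else {
                # Non-numeric size, e.g. "No Module Installed".
                empty++
            }
        }
        END {
            printf "populated=%d empty=%d total_gb=%d\n", populated, empty, total
        }
    '
}
```

A typical call would be `sudo dmidecode --type memory | mem_summary`.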
           

The total memory size can also be checked with free:

$ free -h
           
IP | CPU | Memory (GB) | System disk (GB) | Data disk | GPU
204 | 2*Intel® Xeon® CPU E5-2650 v4 @ 2.20GHz (12 cores) | 256 | 787 | 3T/3T/1.2T | 10*2080Ti
199 | Intel® Xeon® CPU E5-2650 v4 @ 2.20GHz (12 cores) | 256 | 196 | 1007G | 8*2080Ti
198 | - | 256 | 800 | - | 10*1080Ti
29 | 2*Intel® Xeon® Gold 5118 CPU @ 2.30GHz (12 cores) | 256 | 393 | 484G/2.0T/4.6T | 8*2080Ti

Failed to initialize NVML: Driver/library version mismatch

Problem:

the driver was not installed correctly. This can happen if the previous driver was installed using the runfile installer and the new driver was installed using package manager, or vice versa. There are probably other scenarios as well.

Remove all previous package manager installs, and all previous runfile installer installs, then reinstall the driver.

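In our case, CUDA and the NVIDIA driver had first been installed from .run files, and nvidia-cuda-toolkit and cuda were later installed with apt. This caused a version conflict and, with it, the driver mismatch.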
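
One way to confirm this kind of mismatch is to compare the version of the loaded kernel module with the version of the user-space NVML library. The sketch below is an illustration, not the original author's procedure. The function names are hypothetical, `/proc/driver/nvidia/version` is where the loaded module normally reports itself, and the library directory `/usr/lib/x86_64-linux-gnu` is an assumption that fits typical 64-bit Ubuntu installs and may differ elsewhere.

```shell
# Print the version of the currently loaded NVIDIA kernel module.
kernel_module_version() {
    local file="${1:-/proc/driver/nvidia/version}"
    grep -m1 "^NVRM version" "$file" \
        | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' \
        | head -n1
}

# Print the version suffix of the first libnvidia-ml.so.X.Y file in DIR.
# The default DIR is an assumption for 64-bit Ubuntu.
library_version() {
    local dir="${1:-/usr/lib/x86_64-linux-gnu}"
    local f
    for f in "$dir"/libnvidia-ml.so.*.*; do
        [ -e "$f" ] || continue
        echo "${f##*/libnvidia-ml.so.}"
        return 0
    done
    return 1
}

# Compare both versions and print a one-line verdict.
check_nvml_versions() {
    local k l
    k=$(kernel_module_version "$1" 2>/dev/null) || true
    l=$(library_version "$2") || true
    if [ -n "$k" ] && [ "$k" = "$l" ]; then
        echo "match: $k"
    else
        echo "mismatch: kernel module ${k:-unknown}, library ${l:-unknown}"
    fi
}
```

On the affected machine, `check_nvml_versions` can be run with no arguments. If the versions differ only because the driver was just upgraded, a reboot may be enough; with mixed runfile and apt installs as described here, a clean reinstall is the safer route.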

Uninstalling:

Uninstalling CUDA

To uninstall a CUDA toolkit that was installed from a .run file:

cd /usr/local/cuda-xx.x/bin/
sudo ./cuda-uninstaller
sudo rm -rf /usr/local/cuda-xx.x
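
Since the `cuda-xx.x` placeholder depends on what is actually installed, it can help to list the toolkit directories first. A small sketch, assuming the default `/usr/local` install prefix; `list_cuda_installs` is a hypothetical helper, and older toolkits may ship a differently named uninstall script, in which case it reports `uninstaller=no`.

```shell
# list_cuda_installs [ROOT]: print each cuda-X.Y directory under ROOT and
# whether it contains an executable bin/cuda-uninstaller.
# ROOT defaults to /usr/local (assumed default install prefix).
list_cuda_installs() {
    local root="${1:-/usr/local}"
    local d
    for d in "$root"/cuda-*; do
        [ -d "$d" ] || continue
        if [ -x "$d/bin/cuda-uninstaller" ]; then
            echo "${d##*/} uninstaller=yes"
        else
            echo "${d##*/} uninstaller=no"
        fi
    done
}
```

Running `list_cuda_installs` prints one line per versioned toolkit directory; no output means no `cuda-*` directory was found under the prefix.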
           

To uninstall a CUDA toolkit that was installed with apt:

Use dpkg to check whether the corresponding packages have been removed completely:

dpkg -l
           

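Look for the matching version in the listing; in my case it was 9.1.85. If no package with that version remains, the toolkit has been removed completely.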
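
Scanning the full `dpkg -l` listing by eye is error-prone, so a filter can help. The following sketch is a convenience built on assumptions, not the author's exact check: `leftover_gpu_packages` is a hypothetical name, and matching on the substrings `cuda` and `nvidia` is a heuristic that may miss or over-match packages.

```shell
# leftover_gpu_packages: read `dpkg -l` output on stdin and print the
# status, name and version of packages whose name contains cuda or nvidia.
# Status "ii" means installed; "rc" means removed with config files left.
leftover_gpu_packages() {
    awk '$1 ~ /^(ii|rc)$/ && $2 ~ /cuda|nvidia/ { print $1, $2, $3 }'
}
```

A typical call is `dpkg -l | leftover_gpu_packages`. Empty output suggests nothing is left; any names it prints can be reviewed and then removed with `sudo apt-get --purge remove`.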

Uninstalling the NVIDIA driver

To uninstall an NVIDIA driver that was installed from a .run file:

sudo /usr/bin/nvidia-uninstall
           

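Uninstall every previously installed driver, including any installed with apt.

Installation

For installing CUDA and the NVIDIA driver, see:

Installing nvidia-430.64, cuda-10.1, cudnn-7.6.0 and anaconda on an Ubuntu server

References

Other people have run into the same problem and solved it in different ways; their approaches may be useful for reference: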

NVIDIA NVML Driver/library version mismatch [closed]

nvidia-smi returns the error message 'Failed to initialize NVML: Driver/library version mismatch'

The official documentation describes how to resolve conflicting installations:

Handle Conflicting Installation Methods

The official procedure for uninstalling CUDA and the NVIDIA driver (runfile installs):

Uninstallation
