Hardware Configuration of Deep Learning Training / GPU Servers

Current configuration:

CPU

# Number of physical CPUs (sockets)
grep "physical id" /proc/cpuinfo | sort -u | wc -l
# Number of cores per physical CPU
grep "cpu cores" /proc/cpuinfo | uniq
# Number of logical CPUs
grep -c "^processor" /proc/cpuinfo
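
The three counts above can also be gathered in one pass. The following is a minimal sketch rather than part of the original setup: `cpu_summary` is a hypothetical helper name, and it assumes the x86 `/proc/cpuinfo` layout with `physical id`, `cpu cores`, and `processor` fields.

```shell
# cpu_summary: hypothetical helper; prints socket, per-socket core, and logical CPU counts.
# Takes an optional path so it can be tried on a saved copy of /proc/cpuinfo.
cpu_summary() {
    local file sockets cores logical
    file="${1:-/proc/cpuinfo}"
    # Distinct "physical id" values correspond to distinct sockets.
    sockets=$(grep "physical id" "$file" | sort -u | awk 'END {print NR}')
    # "cpu cores" is reported per socket; take the first distinct value.
    cores=$(grep "cpu cores" "$file" | sort -u | awk -F': *' 'NR==1 {print $2}')
    # Each logical CPU has its own "processor" entry.
    logical=$(grep -c "^processor" "$file")
    echo "sockets=$sockets cores_per_socket=$cores logical_cpus=$logical"
}
```

Calling `cpu_summary` with no argument reads the live `/proc/cpuinfo`. If the logical count exceeds sockets times cores per socket, hyper-threading is most likely enabled.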
           

Memory modules

# Inspect the installed memory modules
sudo dmidecode --type memory
           

Below is an excerpt of the output. The maximum capacity is 384 GB and the number of slots is 6:

Handle 0x003C, DMI type 16, 23 bytes
Physical Memory Array
	Location: System Board Or Motherboard
	Use: System Memory
	Error Correction Type: Multi-bit ECC
	Maximum Capacity: 384 GB
	Error Information Handle: Not Provided
	Number Of Devices: 6
           

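Two of the six slots are populated, each with a 32 GB module.

There are 4 × 6 such memory slots in total. Since each array of six slots supports at most 384 GB, the ideal module size per slot is 384 / 6 = 64 GB. The current total is 2 × 4 × 32 = 256 GB.

The modules are rated at 2667 MT/s (mega-transfers per second) for transfers to the CPU, although the excerpt below shows them configured to run at 2400 MT/s.

Details for a single slot: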

Handle 0x003E, DMI type 17, 40 bytes
Memory Device
	Array Handle: 0x003C
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 72 bits
	Size: 32 GB
	Form Factor: DIMM
	Set: None
	Locator: P1_DIMMA1
	Bank Locator: P1_Node0_Channel0_Dimm1
	Type: DDR4
	Type Detail: Synchronous
	Speed: 2667 MT/s
	Manufacturer: Samsung
	Serial Number: 38ED2DAE
	Asset Tag: P1_DIMMA1_AssetTag (Date:18/15)
	Part Number: M393A4K40CB2-CTD    
	Rank: 2
	Configured Clock Speed: 2400 MT/s
	Minimum Voltage: Unknown
	Maximum Voltage: Unknown
	Configured Voltage: Unknown
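
The installed total worked out above can be cross-checked by summing the `Size:` fields of the `Memory Device` records. This is a minimal sketch, assuming sizes are reported in GB or MB as in the excerpt; `installed_memory_gb` is a hypothetical helper name, and slots without a numeric size are treated as empty.

```shell
# installed_memory_gb: hypothetical helper; reads `dmidecode --type memory` output on stdin
# and prints the number of populated slots and their combined size in GB.
installed_memory_gb() {
    awk '
        # Match only lines that begin with "Size:", not "Volatile Size:" and similar.
        /^[ \t]*Size:/ {
            if ($3 == "GB")      { total += $2;        populated++ }
            else if ($3 == "MB") { total += $2 / 1024; populated++ }
            # Any other value is counted as an empty slot and skipped.
        }
        END { printf "populated=%d total_gb=%d\n", populated, total }
    '
}
```

Typical use would be `sudo dmidecode --type memory | installed_memory_gb`; on the machine described here the total should agree with the 256 GB computed by hand.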
           

You can also use free to check the memory size:

$ free -h
           
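| IP | CPU | Memory (GB) | System disk (GB) | Data disks | GPU |
|---|---|---|---|---|---|
| 204 | 2*Intel® Xeon® CPU E5-2650 v4 @ 2.20GHz (12 cores) | 256 | 787 | 3T/3T/1.2T | 10*2080Ti |
| 199 | Intel® Xeon® CPU E5-2650 v4 @ 2.20GHz (12 cores) | 256 | 196 | 1007G | 8*2080Ti |
| 198 | | 256 | 800 | | 10*1080Ti |
| 29 | 2*Intel® Xeon® Gold 5118 CPU @ 2.30GHz (12 cores) | 256 | 393 | 484G/2.0T/4.6T | 8*2080Ti |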

Failed to initialize NVML: Driver/library version mismatch

Problem:

The driver was not installed correctly. This can happen if the previous driver was installed using the runfile installer and the new driver was installed using a package manager, or vice versa. There are probably other scenarios as well.

Remove all previous package manager installs, and all previous runfile installer installs, then reinstall the driver.

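We had previously installed CUDA and the NVIDIA driver from .run files, and later installed nvidia-cuda-toolkit and cuda with apt. This led to a version conflict, so the driver and its libraries no longer matched.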
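
One way to confirm this kind of conflict is to compare the version of the loaded kernel module with the version of the user-space NVML library. The sketch below is illustrative only: `loaded_module_version` and `library_version` are hypothetical helper names, and the library path in the example comment is an assumption that varies by distribution and install method.

```shell
# loaded_module_version: hypothetical helper; extracts the driver version from the
# contents of /proc/driver/nvidia/version supplied on stdin.
loaded_module_version() {
    grep '^NVRM version' | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# library_version: hypothetical helper; takes the path of a versioned NVML library file
# such as libnvidia-ml.so.430.64 and prints the version suffix.
library_version() {
    basename "$1" | sed 's/^libnvidia-ml\.so\.//'
}

# Example (the library path is an assumption and varies by distribution):
#   loaded_module_version < /proc/driver/nvidia/version
#   library_version "$(readlink -f /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1)"
```

If the two commands print different versions, the kernel module and the library come from different driver installs, which is consistent with the error above.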

Uninstalling:

Uninstalling CUDA

To uninstall CUDA that was installed from a .run file:

cd /usr/local/cuda-xx.x/bin/
sudo ./cuda-uninstaller
sudo rm -rf /usr/local/cuda-xx.x
           

To uninstall CUDA that was installed with apt:

Use dpkg to check whether the corresponding packages have been removed completely:

dpkg -l
           

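Look for the corresponding version; in my case the installed version was 9.1.85. Searching for that version confirmed that everything had been removed.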
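
Scanning the full `dpkg -l` listing by eye is error-prone, so the check can be narrowed to the relevant packages. A minimal sketch, assuming that matching on the package name is enough: `leftover_gpu_packages` is a hypothetical helper name, and it reports only entries in state `ii` (installed) or `rc` (removed, configuration files remain).

```shell
# leftover_gpu_packages: hypothetical helper; reads `dpkg -l` output on stdin and prints
# "status name version" for every ii/rc entry whose package name mentions cuda or nvidia.
leftover_gpu_packages() {
    awk '$1 ~ /^(ii|rc)$/ && $2 ~ /cuda|nvidia/ {print $1, $2, $3}'
}
```

Running `dpkg -l | leftover_gpu_packages` and getting no output suggests that no matching package is left. Entries marked `rc` can be cleared with `sudo dpkg --purge <name>`.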

Uninstalling the NVIDIA driver

To uninstall an NVIDIA driver that was installed from a .run file:

sudo /usr/bin/nvidia-uninstall
           

To uninstall all previously installed drivers, including those installed with apt:
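
For the apt side, one cautious approach is to print the removal command for review before running anything. This is a minimal sketch and not the command used in the original setup: `nvidia_purge_plan` is a hypothetical helper name, and selecting packages by the substring "nvidia" is an assumption that may catch more or fewer packages than intended.

```shell
# nvidia_purge_plan: hypothetical helper; reads `dpkg -l` output on stdin and prints,
# without running it, one apt-get command that would purge every installed package
# whose name contains "nvidia".
nvidia_purge_plan() {
    awk '
        $1 == "ii" && $2 ~ /nvidia/ { pkgs = pkgs " " $2 }
        END { if (pkgs != "") print "sudo apt-get remove --purge" pkgs }
    '
}
```

Running `dpkg -l | nvidia_purge_plan` prints nothing when no matching package is installed. Otherwise, check the printed package list before executing the command by hand.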

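Installation

To install CUDA and the NVIDIA driver, see:

Installing nvidia-430.64, cuda-10.1, cudnn-7.6.0 and anaconda on an Ubuntu server

References

Other people have run into the same problem and solved it in different ways, which may be useful for reference: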

NVIDIA NVML Driver/library version mismatch [closed]

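nvidia-smi returns the error message 'Failed to initialize NVML: Driver/library version mismatch'

NVIDIA's official documentation describes how to resolve conflicts between installation methods:

Handle Conflicting Installation Methods

The official way to uninstall CUDA and the NVIDIA driver (runfile installs):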

Uninstallation
