
PyTorch error: RuntimeError: cuda runtime error (710) : device-side assert triggered

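I have recently been studying the classification network MSD-Net and planned to run it on the CIFAR-10 and CIFAR-100 datasets to reproduce the results of the paper.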

GitHub: PyTorch version of MSD-Net

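Once the code was finished, training on CIFAR-10 went perfectly, with classification accuracy higher than both ResNet and DenseNet. Training on CIFAR-100, however, failed with the error in the title: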

C:/w/1/s/windows/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:106: block: [0,0,0], thread: [31,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=C:/w/1/s/windows/pytorch/aten/src\THCUNN/generic/ClassNLLCriterion.cu line=110 error=710 : device-side assert triggered
Traceback (most recent call last):
  File "C:/Users/15338/Desktop/pycharm_ssh/lzh/cifar_MSDNet.py", line 203, in <module>
    train(criterion, optimizer, trainloader)
  File "C:/Users/15338/Desktop/pycharm_ssh/lzh/cifar_MSDNet.py", line 91, in train
    loss += criterion(outputs[j], labels_var)
  File "C:\Pycharm Pro\Project1\lib\site-packages\torch\nn\modules\module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Pycharm Pro\Project1\lib\site-packages\torch\nn\modules\loss.py", line 916, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "C:\Pycharm Pro\Project1\lib\site-packages\torch\nn\functional.py", line 2009, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "C:\Pycharm Pro\Project1\lib\site-packages\torch\nn\functional.py", line 1838, in nll_loss
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (710) : device-side assert triggered at C:/w/1/s/windows/pytorch/aten/src\THCUNN/generic/ClassNLLCriterion.cu:110

           

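The error is raised while computing the loss: the assertion t >= 0 && t < n_classes in the log above fails when a target label falls outside the range of classes the network outputs. I found many similar reports online, such as RuntimeError: cuda runtime error (59) : device-side assert triggered. In those cases a label was below 0 or not below the number of classes, and simply shifting the labels (label = label - 1, or + 1) was enough to fix it.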

My labels, however, looked fine when printed: they range from 0 to 99 (CIFAR-100 has 100 classes).

print(labels_var.data)
==>tensor([47, 22, 50,  8, 24, 43, 25, 51, 60, 30, 54, 65, 58, 88, 20, 64, 83, 83,
           17, 60, 75, 68, 88, 24, 25, 65, 30, 99, 51, 95, 69, 49, 50,  7, 74, 66,
           33, 33,  0, 49, 74, 38, 39, 11, 12, 32, 74, 63, 25, 84, 94, 82, 98, 12,
           58, 15,  1, 77, 81, 22, 81, 11, 42, 94], device='cuda:0')
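
The printed labels cover only one side of the assertion t >= 0 && t < n_classes; the other side, n_classes, comes from the width of the network's output. As a minimal sketch in plain Python (the helper name out_of_range_labels is hypothetical and not part of PyTorch), the same check can be written as:

```python
def out_of_range_labels(labels, n_classes):
    """Return the labels that would fail `t >= 0 && t < n_classes`."""
    return [t for t in labels if not (0 <= t < n_classes)]


# A few labels taken from the batch printed above.
batch_labels = [47, 22, 50, 8, 24, 43, 25, 51]

# With a 100-way classifier every label is valid, so the list is empty.
print(out_of_range_labels(batch_labels, n_classes=100))

# With a 10-way classifier only label 8 passes the check.
print(out_of_range_labels(batch_labels, n_classes=10))
```

In the training loop this could be called as out_of_range_labels(labels_var.tolist(), outputs[j].shape[1]), assuming each outputs[j] has shape (batch_size, n_classes).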
           

After going through dozens of pages of Baidu results and forum posts, I finally found a lead in this article: sunflower_sara's post.

A careful look through the author's network definition revealed the problem:

if args.data.startswith('cifar100'):
    self.classifier.append(
        self._build_classifier_cifar(nIn * args.grFactor[-1], 100))
elif args.data.startswith('cifar10'):
    self.classifier.append(
        self._build_classifier_cifar(nIn * args.grFactor[-1], 10))
elif args.data == 'ImageNet':
    self.classifier.append(
        self._build_classifier_imagenet(nIn * args.grFactor[-1], 1000))
else:
    raise NotImplementedError
           

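The trailing 10, 100, and 1000 are the number of outputs of the classifier. The author's code defaults to cifar10, so the network was built with only 10 outputs, and training on CIFAR-100 failed because labels 10 to 99 are out of range for it. After changing cifar10 to cifar100 in args.py, everything ran fine.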
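
One detail of the snippet above is worth noting: the cifar100 branch has to come before the cifar10 branch, because the string 'cifar100' also starts with 'cifar10'. The sketch below mirrors that branch structure in a standalone function; num_classes_for is a hypothetical helper written for illustration, not a function from the MSD-Net repository.

```python
def num_classes_for(data_name):
    """Map a dataset name to its class count (hypothetical helper)."""
    # 'cifar100' must be tested first: it also starts with 'cifar10'.
    if data_name.startswith('cifar100'):
        return 100
    elif data_name.startswith('cifar10'):
        return 10
    elif data_name == 'ImageNet':
        return 1000
    raise NotImplementedError(data_name)


print('cifar100'.startswith('cifar10'))  # True
print(num_classes_for('cifar10'))        # 10
print(num_classes_for('cifar100'))       # 100
```

Comparing a value like this against the largest label in the dataset before training starts would surface the mismatch as an ordinary Python error instead of a device-side assert, assuming the dataset name is available as a string such as args.data.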

I am a Python beginner, so comments are welcome. Let's learn and improve together.
