ImageNet Training Records
AlexNet
| Work | Batch Size | Processor | Interconnect | Time | Top-1 Accuracy |
| --- | --- | --- | --- | --- | --- |
| You et al. | 512 | DGX-1 station | NVLink | 6 hours 10 mins | 58.80% |
| You et al. | 32K | CPU x 1024 | - | 11 mins | 58.60% |
| Jia et al. | 64K | Pascal GPU x 1024 | 100 Gbps | 5 mins | 58.80% |
| Jia et al. | 64K | Pascal GPU x 2048 | 100 Gbps | 4 mins | 58.70% |
| Sun et al. (DenseCommu) | 64K | Volta GPU x 512 | 56 Gbps | 2.6 mins | 58.70% |
| Sun et al. (SparseCommu) | 64K | Volta GPU x 512 | 56 Gbps | 1.5 mins | 58.20% |
ResNet50
| Work | Batch Size | Processor | Interconnect | Time | Top-1 Accuracy |
| --- | --- | --- | --- | --- | --- |
| Goyal et al. | 8K | Pascal GPU x 256 | 56 Gbps | 1 hour | 76.30% |
| Smith et al. | 16K | Full TPU Pod | - | 30 mins | 76.10% |
| Codreanu et al. | 32K | KNL x 1024 | - | 42 mins | 75.30% |
| You et al. | 32K | KNL x 2048 | - | 20 mins | 75.40% |
| Akiba et al. | 32K | Pascal GPU x 1024 | 56 Gbps | 15 mins | 74.90% |
| Jia et al. | 64K | Pascal GPU x 1024 | 100 Gbps | 8.7 mins | 76.20% |
| Jia et al. | 64K | Pascal GPU x 2048 | 100 Gbps | 6.6 mins | 75.80% |
| Mikami et al. | 68K | Volta GPU x 2176 | 200 Gbps | 3.7 mins | 75.00% |
| Ying et al. | 32K | TPU v3 x 1024 | - | 2.2 mins | 76.30% |
| Sun et al. | 64K | Volta GPU x 512 | 56 Gbps | 7.3 mins | 75.30% |
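Most of the large-batch entries above rely on the linear learning-rate scaling rule with gradual warmup popularized by Goyal et al. The sketch below illustrates that rule; the base learning rate, base batch size, and warmup length are illustrative assumptions, not the settings of any particular entry in the tables.

```python
# Illustrative sketch of the linear LR scaling rule with gradual warmup
# used in large-batch ImageNet training; all constants here are assumptions.
def scaled_lr(batch_size, base_lr=0.1, base_batch=256):
    """Linear scaling rule: the learning rate grows in proportion to the batch size."""
    return base_lr * batch_size / base_batch

def warmup_lr(step, warmup_steps, target_lr):
    """Gradual warmup: ramp the learning rate linearly up to target_lr."""
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr

target = scaled_lr(batch_size=8192)   # 0.1 * 8192 / 256 = 3.2
for step in range(8):
    print(step, round(warmup_lr(step, warmup_steps=5, target_lr=target), 3))
```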
Allreduce architectures:
1. Hierarchical allreduce: Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes (see the sketch after this list)
2. 2D-Torus by Sony: ImageNet/ResNet-50 Training in 224 Seconds
Chinese commentary: "224 seconds! Sony sets a new record for training ResNet-50 on ImageNet"
3. 2D-Torus by Google: Image Classification at Supercomputer Scale
Chinese commentary: "Google breaks the world record: ImageNet training done in 2 minutes"
4. Topology-aware: BlueConnect: Novel Hierarchical All-Reduce on Multi-tiered Network for Deep Learning
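As referenced in item 1 above, hierarchical allreduce reduces gradients within each node first, runs an allreduce only among node leaders, and then broadcasts the result back. The NumPy simulation below is an illustrative sketch of that communication pattern under made-up sizes, not the implementation from any of the papers listed.

```python
# Minimal NumPy simulation of hierarchical allreduce:
# intra-node reduce -> inter-node allreduce among leaders -> intra-node broadcast.
import numpy as np

def hierarchical_allreduce(grads, gpus_per_node):
    """grads: list of per-GPU gradient arrays, ordered node by node."""
    nodes = [grads[i:i + gpus_per_node] for i in range(0, len(grads), gpus_per_node)]
    # Step 1: reduce within each node onto that node's leader.
    leader_sums = [np.sum(node, axis=0) for node in nodes]
    # Step 2: allreduce across node leaders (a plain global sum here;
    # real systems use ring or tree allreduce over the network).
    global_sum = np.sum(leader_sums, axis=0)
    # Step 3: broadcast the result back to every GPU in every node.
    return [global_sum.copy() for _ in grads]

grads = [np.full(4, rank, dtype=np.float32) for rank in range(8)]  # 2 nodes x 4 GPUs
print(hierarchical_allreduce(grads, gpus_per_node=4)[0])           # [28. 28. 28. 28.]
```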
Acceleration-related:
1. AdamW and Super-convergence is now the fastest way to train neural nets (see the sketch after this list)
Chinese commentary: "The fastest way to train neural networks right now: the AdamW optimizer + super-convergence"
Paper: Fixing Weight Decay Regularization in Adam
2. SmoothOut: Smoothing Out Sharp Minima to Improve Generalization in Deep Learning
3. Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates
4. Cyclical Learning Rates for Training Neural Networks
Chinese commentary: https://blog.csdn.net/guojingjuan/article/details/53200776
5. MG-WFBP: Efficient Data Communication for Distributed Synchronous SGD Algorithms
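A hedged sketch of combining AdamW with a one-cycle (super-convergence) learning-rate schedule, the recipe advocated in items 1, 3, and 4 above, using PyTorch's built-in AdamW and OneCycleLR. The model, max_lr, weight decay, and step counts are placeholder assumptions, not values from the listed works.

```python
# Sketch: AdamW + one-cycle LR schedule (super-convergence style) in PyTorch.
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                       # stand-in for a real network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-2, total_steps=100)   # LR rises then anneals over one cycle

for step in range(100):
    x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                           # advance the cyclical LR schedule
```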
PPO-related:
1. Proximal Policy Optimization Algorithms
2. Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms?
3. An Empirical Model of Large-Batch Training
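To accompany item 1, a minimal sketch of PPO's clipped surrogate objective; the batch size and clipping epsilon below are illustrative assumptions.

```python
# Sketch of the PPO clipped surrogate loss (to be minimized).
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective, averaged over a batch."""
    ratio = torch.exp(log_probs_new - log_probs_old)              # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                  # maximize the surrogate

adv = torch.randn(64)
lp_old = torch.randn(64)
lp_new = lp_old + 0.1 * torch.randn(64)
print(ppo_clip_loss(lp_new, lp_old, adv))
```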
Others:
1. Gradient Harmonized Single-stage Detector
Chinese commentary: http://tongtianta.site/paper/8075
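A heavily simplified sketch of the gradient-harmonizing idea from item 1: down-weight examples whose gradient norm falls in a densely populated bin. The bin count and normalization here are illustrative assumptions and differ from the paper's exact recipe.

```python
# Simplified GHM-style reweighting: examples with common gradient norms get smaller weights.
import torch

def ghm_c_loss(logits, targets, bins=10):
    """Binary cross-entropy reweighted by inverse gradient density (illustrative)."""
    g = (torch.sigmoid(logits) - targets).abs().detach().clamp(0, 1 - 1e-6)
    bin_idx = (g * bins).long()                       # bin of each example's gradient norm
    counts = torch.bincount(bin_idx, minlength=bins).float()
    weights = g.numel() / (counts[bin_idx] * bins)    # rarer gradient norms get larger weight
    ce = torch.nn.functional.binary_cross_entropy_with_logits(
        logits, targets, reduction="none")
    return (weights * ce).mean()

logits = torch.randn(100)
targets = torch.randint(0, 2, (100,)).float()
print(ghm_c_loss(logits, targets))
```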