A Few Papers

ImageNet Training Records

AlexNet

| Work | Batch Size | Processor | Interconnect | Time | Top-1 Accuracy |
| --- | --- | --- | --- | --- | --- |
| You et al. | 512 | DGX-1 station | NVLink | 6 hours 10 mins | 58.80% |
| You et al. | 32K | CPU x 1024 | - | 11 mins | 58.60% |
| Jia et al. | 64K | Pascal GPU x 1024 | 100 Gbps | 5 mins | 58.80% |
| Jia et al. | 64K | Pascal GPU x 2048 | 100 Gbps | 4 mins | 58.70% |
| Sun et al. (DenseCommu) | 64K | Volta GPU x 512 | 56 Gbps | 2.6 mins | 58.70% |
| Sun et al. (SparseCommu) | 64K | Volta GPU x 512 | 56 Gbps | 1.5 mins | 58.20% |

ResNet-50

| Work | Batch Size | Processor | Interconnect | Time | Top-1 Accuracy |
| --- | --- | --- | --- | --- | --- |
| Goyal et al. | 8K | Pascal GPU x 256 | 56 Gbps | 1 hour | 76.30% |
| Smith et al. | 16K | Full TPU Pod | - | 30 mins | 76.10% |
| Codreanu et al. | 32K | KNL x 1024 | - | 42 mins | 75.30% |
| You et al. | 32K | KNL x 2048 | - | 20 mins | 75.40% |
| Akiba et al. | 32K | Pascal GPU x 1024 | 56 Gbps | 15 mins | 74.90% |
| Jia et al. | 64K | Pascal GPU x 1024 | 100 Gbps | 8.7 mins | 76.20% |
| Jia et al. | 64K | Pascal GPU x 2048 | 100 Gbps | 6.6 mins | 75.80% |
| Mikami et al. | 68K | Volta GPU x 2176 | 200 Gbps | 3.7 mins | 75.00% |
| Ying et al. | 32K | TPU v3 x 1024 | - | 2.2 mins | 76.30% |
| Sun et al. | 64K | Volta GPU x 512 | 56 Gbps | 7.3 mins | 75.30% |

Allreduce architectures:

1. hierarchical allreduce: Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes (a three-phase sketch follows this list)

2. 2D-Torus by Sony: ImageNet/ResNet-50 Training in 224 Seconds

  Chinese write-up: "224 seconds! The best ResNet-50-on-ImageNet training result yet; Sony spares no expense to break the record"

3. 2D-Torus by Google: Image Classification at Supercomputer Scale

   Chinese write-up: "Google sets a new world record! ImageNet training done in 2 minutes"

4. topology-aware: BlueConnect: Novel Hierarchical All-Reduce on Multi-tiered Network for Deep Learning
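
The hierarchical scheme in item 1 cuts inter-node traffic by reducing gradients inside each node first (over the fast intra-node fabric), running the allreduce only between one leader GPU per node, and broadcasting the result back within each node. Below is a minimal NumPy simulation of that three-phase structure; the function name and node layout are my own illustration, not the paper's implementation:

```python
import numpy as np

def hierarchical_allreduce(grads_per_gpu, gpus_per_node):
    """Simulate a three-phase hierarchical allreduce over gradient arrays.

    Phase 1: reduce inside each node (fast intra-node links, e.g. NVLink).
    Phase 2: allreduce across one leader GPU per node (slower network).
    Phase 3: broadcast the global sum back to every GPU in each node.
    """
    n = len(grads_per_gpu)
    assert n % gpus_per_node == 0
    nodes = [grads_per_gpu[i:i + gpus_per_node]
             for i in range(0, n, gpus_per_node)]

    # Phase 1: intra-node reduce to each node's leader.
    node_sums = [np.sum(node, axis=0) for node in nodes]

    # Phase 2: inter-node allreduce among leaders (a plain sum here;
    # real systems run ring allreduce over the slower fabric).
    global_sum = np.sum(node_sums, axis=0)

    # Phase 3: intra-node broadcast of the result.
    return [global_sum.copy() for _ in range(n)]

# 8 GPUs, 4 per node, each holding a toy gradient vector.
grads = [np.full(3, rank, dtype=np.float64) for rank in range(8)]
out = hierarchical_allreduce(grads, gpus_per_node=4)
print(out[0])  # [28. 28. 28.] == sum(0..7), identical on every GPU
```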

Acceleration-related

1. AdamW and Super-convergence is now the fastest way to train neural nets (a decoupled weight-decay sketch follows this list)

   Chinese write-up: "The fastest way to train a neural network right now: the AdamW optimizer plus super-convergence"

   Fixing Weight Decay Regularization in Adam

2. SmoothOut: Smoothing Out Sharp Minima to Improve Generalization in Deep Learning

3. Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates

4. Cyclical Learning Rates for Training Neural Networks (a schedule sketch follows this list)

    Chinese write-up: https://blog.csdn.net/guojingjuan/article/details/53200776

5. MG-WFBP: Efficient Data Communication for Distributed Synchronous SGD Algorithms
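
For item 1, the key change in Fixing Weight Decay Regularization in Adam is that weight decay acts on the weights directly instead of being folded into the gradient, where Adam's adaptive scaling would distort it. A single-parameter sketch of that decoupled update; the hyperparameter values are common illustrative defaults, not prescribed by the paper:

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update. The decay term weight_decay * w is applied to
    the weights directly rather than added to grad, so it is not scaled
    by Adam's adaptive denominator."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad**2     # second-moment estimate
    m_hat = m / (1 - beta1**t)                # bias corrections
    v_hat = v / (1 - beta2**t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

w, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 4):                         # a few steps on a fixed gradient
    w, m, v = adamw_step(w, np.array([0.5]), m, v, t)
print(w)
```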
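
Items 3 and 4 both rest on cycling the learning rate between a lower and an upper bound; super-convergence is essentially one large cycle with a very high peak. A sketch of the triangular schedule from the cyclical-learning-rates paper; the bounds and step size here are illustrative, not the paper's values:

```python
import numpy as np

def triangular_clr(iteration, base_lr=1e-4, max_lr=1e-2, step_size=2000):
    """Triangular cyclical learning rate: climb linearly from base_lr to
    max_lr over step_size iterations, descend back over the next
    step_size iterations, then repeat."""
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

for it in (0, 1000, 2000, 3000, 4000):
    print(it, triangular_clr(it))   # base_lr -> max_lr at 2000 -> back to base_lr
```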

PPO-related:

1. Proximal Policy Optimization Algorithms (a clipped-objective sketch follows this list)

2. Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms?

3. An Empirical Model of Large-Batch Training
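
The heart of item 1 is the clipped surrogate objective, L^CLIP = E[min(r_t A_t, clip(r_t, 1-ε, 1+ε) A_t)], which removes the incentive to push the policy ratio r_t far from 1 in a single update. A minimal NumPy sketch; the sample ratios and advantages below are made up for illustration:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate: take the pessimistic minimum of the unclipped
    and clipped terms, so moving the probability ratio outside
    [1 - eps, 1 + eps] earns no extra objective value."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()

ratio = np.array([0.7, 1.0, 1.5])       # pi_new(a|s) / pi_old(a|s) per sample
advantage = np.array([1.0, -0.5, 2.0])  # advantage estimates per sample
print(ppo_clip_objective(ratio, advantage))  # ~0.867: the 1.5 ratio is clipped to 1.2
```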

Other:

1. Gradient Harmonized Single-stage Detector

    Chinese write-up (translated title: "Gradient Harmonized Single-stage Detector"): http://tongtianta.site/paper/8075 (a reweighting sketch follows)
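
GHM's classification loss reweights examples by the inverse density of their gradient norms, so the flood of easy examples (and extreme outliers) stops dominating training. A simplified NumPy sketch of that reweighting with equal-width bins; the paper's exact density normalization differs slightly, and the inputs below are made up:

```python
import numpy as np

def ghm_weights(pred_prob, target, bins=10):
    """Gradient Harmonizing Mechanism (classification): down-weight
    examples whose gradient norm falls in a densely populated bin.
    For sigmoid cross-entropy the gradient norm is |p - y|."""
    g = np.abs(pred_prob - target)               # gradient norm per sample
    edges = np.linspace(0, 1, bins + 1)
    edges[-1] += 1e-6                            # include g == 1 in the last bin
    idx = np.digitize(g, edges) - 1              # bin index per sample
    counts = np.bincount(idx, minlength=bins)    # samples per bin (gradient density)
    n = len(g)
    weights = n / counts[idx].astype(float)      # harmonize: w_i = N / GD(g_i)
    return weights / weights.mean()              # normalize to mean 1

p = np.array([0.9, 0.88, 0.92, 0.1, 0.5])  # predicted probabilities
y = np.array([1.0, 1.0, 1.0, 1.0, 0.0])    # targets
print(ghm_weights(p, y))  # the three similar easy positives get down-weighted
```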