
Caffe Source Code Study Notes (3): common

I. Introduction

The only header that the SyncedMemory class includes is common.hpp. common is itself one of Caffe's basic pieces of infrastructure; its main job is managing global resources. So what counts as a global resource in Caffe?

From the code, they can be summarized as:

1. random number generation;

2. GPU device information;

3. parallel-training state such as solver_rank.

Caffe simply represents this global information with global static variables, which at first glance looks almost too naive.

So what is the problem? Simple: contention among threads for the global resources. For this kind of race I know two places to attack from:

1. blocking: while one thread is accessing the resource, put the other threads to sleep;

2. locking: turn the originally parallel accesses into serial ones.

Both of these, however, sacrifice multithreading efficiency (a rough lock-based sketch is shown below for contrast). Caffe instead brings in Boost's thread-local storage. Let's follow the code a couple of steps.
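
For contrast, a lock-based lazy singleton (the conventional fix) would look roughly like this. This is a generic C++ sketch of mine, not Caffe code, and the class name GlobalState is made up:

#include <mutex>

class GlobalState {
 public:
  // Every caller takes the same mutex, so all accesses are serialized.
  static GlobalState& Get() {
    std::lock_guard<std::mutex> lock(mu_);
    if (!instance_) instance_ = new GlobalState();
    return *instance_;
  }

 private:
  GlobalState() {}
  static std::mutex mu_;
  static GlobalState* instance_;
};

std::mutex GlobalState::mu_;
GlobalState* GlobalState::instance_ = nullptr;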

II. Source Code Analysis

First, the includes; there are quite a few…

#include <boost/shared_ptr.hpp>//boost
#include <gflags/gflags.h>
#include <glog/logging.h>

#include <climits>
#include <cmath>
#include <fstream>  // NOLINT(readability/streams)
#include <iostream>  // NOLINT(readability/streams)
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <utility>  // pair
#include <vector>

#include "caffe/util/device_alternate.hpp"

Next comes a pair of macros that convert a macro into a string literal:

// Convert macro to string
#define STRINGIFY(m) #m
#define AS_STRING(m) STRINGIFY(m)
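
A tiny example of why the two-step indirection matters, using a made-up macro VALUE:

// Hypothetical macro, for illustration only.
#define VALUE 3
// STRINGIFY(VALUE) expands to "VALUE" -- the argument is stringified as written.
// AS_STRING(VALUE) expands to "3"     -- VALUE is macro-expanded first, then stringified.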

Class instantiation: this is said to be related to keeping template declarations in headers while their definitions live in .cpp files (explicit instantiation). I'm not that familiar with templates, so this still needs further study; a small sketch of the idea follows the macros below:

// Instantiate a class with float and double specifications.
#define INSTANTIATE_CLASS(classname) \
  char gInstantiationGuard##classname; \
  template class classname<float>; \
  template class classname<double>

#define INSTANTIATE_LAYER_GPU_FORWARD(classname) \
  template void classname<float>::Forward_gpu( \
      const std::vector<Blob<float>*>& bottom, \
      const std::vector<Blob<float>*>& top); \
  template void classname<double>::Forward_gpu( \
      const std::vector<Blob<double>*>& bottom, \
      const std::vector<Blob<double>*>& top);

#define INSTANTIATE_LAYER_GPU_BACKWARD(classname) \
  template void classname<float>::Backward_gpu( \
      const std::vector<Blob<float>*>& top, \
      const std::vector<bool>& propagate_down, \
      const std::vector<Blob<float>*>& bottom); \
  template void classname<double>::Backward_gpu( \
      const std::vector<Blob<double>*>& top, \
      const std::vector<bool>& propagate_down, \
      const std::vector<Blob<double>*>& bottom)

#define INSTANTIATE_LAYER_GPU_FUNCS(classname) \
  INSTANTIATE_LAYER_GPU_FORWARD(classname); \
  INSTANTIATE_LAYER_GPU_BACKWARD(classname)

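A minimal sketch of what INSTANTIATE_CLASS enables, assuming a made-up layer class MyLayer (not real Caffe code): the template definition can stay out of the header, and the explicit instantiations emit the float and double versions the linker needs.

// my_layer.hpp -- users of the template only ever see the declaration.
template <typename Dtype>
class MyLayer {
 public:
  void Forward_cpu();   // defined in the .cpp file below
};

// my_layer.cpp -- the definition lives here, out of the header.
template <typename Dtype>
void MyLayer<Dtype>::Forward_cpu() { /* ... */ }

INSTANTIATE_CLASS(MyLayer);  // emits MyLayer<float> and MyLayer<double>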

Here you can see that common really does feel like infrastructure: it declares, for the whole system, the std names that Caffe uses everywhere:

// Common functions and classes from std that caffe often uses.
using std::fstream;
using std::ios;
using std::isnan;
using std::isinf;
using std::iterator;
using std::make_pair;
using std::map;
using std::ostringstream;
using std::pair;
using std::set;
using std::string;
using std::stringstream;
using std::vector;

Next is the Caffe class defined in common, which does the actual management of the global resources:

// Global initialization function, called from main(); it mainly initializes gflags and glog.
void GlobalInit(int* pargc, char*** pargv);


// The Caffe class: a singleton (one instance per thread)
class Caffe {
 public:
  ~Caffe();

  // Uses thread-local storage so that each thread has exactly one instance
  static Caffe& Get();

  enum Brew { CPU, GPU };


  // Random number class; the key part is the inner class Generator
  class RNG {
   public:
    RNG();
    explicit RNG(unsigned int seed);
    explicit RNG(const RNG&);
    RNG& operator=(const RNG&);
    void* generator();
   private:
    class Generator;
    shared_ptr<Generator> generator_;
  };

  // Accessors for the global resources: the thread-local RNG (created lazily here), plus the cublas/curand handles below
  inline static RNG& rng_stream() {
    if (!Get().random_generator_) {
      Get().random_generator_.reset(new RNG());
    }
    return *(Get().random_generator_);
  }
#ifndef CPU_ONLY
  inline static cublasHandle_t cublas_handle() { return Get().cublas_handle_; }
  inline static curandGenerator_t curand_generator() {
    return Get().curand_generator_;
  }
#endif

  // Returns the mode: running on CPU or GPU.
  inline static Brew mode() { return Get().mode_; }
  // The setters for the variables
  // Sets the mode. It is recommended that you don't change the mode halfway
  // into the program since that may cause allocation of pinned memory being
  // freed in a non-pinned way, which may cause problems - I haven't verified
  // i.e. the author advises against switching the mode midway, since pinned memory could then be freed as if it were pageable memory, which causes problems.
  inline static void set_mode(Brew mode) { Get().mode_ = mode; }
  // Sets the random seed of both boost and curand
  static void set_random_seed(const unsigned int seed);
  // Sets the device. Since we have cublas and curand stuff, set device also
  // requires us to reset those values.
  static void SetDevice(const int device_id);
  // Prints the current GPU status.
  static void DeviceQuery();
  // Check if specified device is available
  static bool CheckDevice(const int device_id);
  // Search from start_id to the highest possible device ordinal,
  // return the ordinal of the first available device.
  static int FindDevice(const int start_id = 0);
  // Parallel training
  inline static int solver_count() { return Get().solver_count_; }
  inline static void set_solver_count(int val) { Get().solver_count_ = val; }
  inline static int solver_rank() { return Get().solver_rank_; }
  inline static void set_solver_rank(int val) { Get().solver_rank_ = val; }
  inline static bool multiprocess() { return Get().multiprocess_; }
  inline static void set_multiprocess(bool val) { Get().multiprocess_ = val; }
  inline static bool root_solver() { return Get().solver_rank_ == 0; }

 protected:
#ifndef CPU_ONLY
  cublasHandle_t cublas_handle_;
  curandGenerator_t curand_generator_;
#endif
  shared_ptr<RNG> random_generator_;

  Brew mode_;

  // Parallel training
  int solver_count_;
  int solver_rank_;
  bool multiprocess_;

 private:
  // The constructor is private so that no additional instances can be created
  Caffe();

  DISABLE_COPY_AND_ASSIGN(Caffe);  // forbid copy construction and assignment
};

}  // namespace caffe
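
Since everything hangs off the per-thread singleton, the rest of Caffe only ever touches it through these static accessors, roughly like this (an illustrative snippet of mine, not a quote from the sources):

caffe::Caffe::set_mode(caffe::Caffe::GPU);   // run on the GPU
caffe::Caffe::SetDevice(0);                  // pick device 0 and (re)create the cuBLAS/cuRAND handles
if (caffe::Caffe::root_solver()) {
  LOG(INFO) << "solver rank " << caffe::Caffe::solver_rank() << " is the root solver";
}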

Implementation

First, the use of thread-local storage (TLS).

(1) The thread-local storage mechanism:

For a shared resource, TLS gives every thread its own copy; each thread then only ever touches its own copy, so no synchronization is needed.

static boost::thread_specific_ptr<Caffe> thread_instance_;  // a Boost smart pointer that implements the thread-local storage

Caffe& Caffe::Get() {
  if (!thread_instance_.get()) {  // create a new Caffe instance only if this thread does not have one yet
    thread_instance_.reset(new Caffe());
  }
  return *(thread_instance_.get());  // return a reference to the instance; get() only exposes the raw pointer, ownership stays with the thread_specific_ptr
}
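
A quick illustration of what this buys (my own snippet, not from Caffe): a second thread that calls Caffe::Get() lazily constructs its own instance, distinct from the one on the main thread, so no locking is needed.

#include <iostream>
#include <thread>

#include "caffe/common.hpp"

int main() {
  caffe::Caffe* main_instance = &caffe::Caffe::Get();  // instance owned by the main thread
  std::thread t([main_instance]() {
    // The worker thread gets its own, separate instance, so this prints 1.
    std::cout << (&caffe::Caffe::Get() != main_instance) << std::endl;
  });
  t.join();
  return 0;
}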

(2) Random numbers

1. The random seed is obtained from the system entropy source, or from the time as a fallback.

// Generate a random seed
int64_t cluster_seedgen(void) {
  int64_t s, seed, pid;
  FILE* f = fopen("/dev/urandom", "rb");
  if (f && fread(&seed, 1, sizeof(seed), f) == sizeof(seed)) {
    fclose(f);
    return seed;
  }

  LOG(INFO) << "System entropy source not available, "
              "using fallback algorithm to generate seed instead.";
  if (f)
    fclose(f);

  pid = getpid();
  s = time(NULL);
  seed = std::abs(((s * 181) * ((pid - 83) * 359)) % 104729);  // mix the time and the pid with a few prime constants
  return seed;
}
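
This entropy/time-based seed is what the default RNG constructor falls back on; for a reproducible run, a fixed seed can be pushed through the static setter instead (illustrative call, the value is arbitrary):

caffe::Caffe::set_random_seed(1701);  // reseeds the boost engine and, on GPU, the curand generator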

2. Random number generation

Caffe has the RNG class, and RNG in turn contains a Generator class;

the actual engine is boost::mt19937.

class Caffe::RNG::Generator {
 public:
  Generator() : rng_(new caffe::rng_t(cluster_seedgen())) {}//typedef boost::mt19937 rng_t;
  explicit Generator(unsigned int seed) : rng_(new caffe::rng_t(seed)) {}
  caffe::rng_t* rng() { return rng_.get(); }
 private:
  shared_ptr<caffe::rng_t> rng_;
};
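
As a rough sketch of how this engine is consumed elsewhere (my own example; the helper name sample_uniform is made up): the void* returned by Caffe::rng_stream().generator() is cast back to caffe::rng_t and fed to a Boost distribution.

#include <boost/random/uniform_real_distribution.hpp>

#include "caffe/common.hpp"
#include "caffe/util/rng.hpp"   // caffe::rng_t (boost::mt19937) is typedef'd here

// Hypothetical helper, for illustration only.
float sample_uniform(float lo, float hi) {
  caffe::rng_t* rng =
      static_cast<caffe::rng_t*>(caffe::Caffe::rng_stream().generator());
  boost::random::uniform_real_distribution<float> dist(lo, hi);
  return dist(*rng);
}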

(3) GPU device management

This part is mainly CUDA API calls.

void Caffe::SetDevice(const int device_id) {
  int current_device;
  CUDA_CHECK(cudaGetDevice(&current_device));  // current_device is the id of the device currently in use, not a device count
  if (current_device == device_id) {
    return;  // already on the requested device, nothing to do
  }

  // Otherwise switch devices and rebuild the cuBLAS handle and cuRAND generator.
  CUDA_CHECK(cudaSetDevice(device_id));
  if (Get().cublas_handle_) CUBLAS_CHECK(cublasDestroy(Get().cublas_handle_));
  if (Get().curand_generator_) {
    CURAND_CHECK(curandDestroyGenerator(Get().curand_generator_));
  }
  CUBLAS_CHECK(cublasCreate(&Get().cublas_handle_));
  CURAND_CHECK(curandCreateGenerator(&Get().curand_generator_,
      CURAND_RNG_PSEUDO_DEFAULT));
  CURAND_CHECK(curandSetPseudoRandomGeneratorSeed(Get().curand_generator_,
      cluster_seedgen()));
}

void Caffe::DeviceQuery() {
  cudaDeviceProp prop;//GPU property
  int device;
  if (cudaSuccess != cudaGetDevice(&device)) {
    printf("No cuda device present.\n");
    return;
  }
  CUDA_CHECK(cudaGetDeviceProperties(&prop, device));
  LOG(INFO) << "Device id:                     " << device;//设备ID
  LOG(INFO) << "Major revision number:         " << prop.major;
  LOG(INFO) << "Minor revision number:         " << prop.minor;
  LOG(INFO) << "Name:                          " << prop.name;
  LOG(INFO) << "Total global memory:           " << prop.totalGlobalMem;//全局内存大小
  LOG(INFO) << "Total shared memory per block: " << prop.sharedMemPerBlock;//每个block的共享内存的大小
  LOG(INFO) << "Total registers per block:     " << prop.regsPerBlock;//每个block的寄存器数量
  LOG(INFO) << "Warp size:                     " << prop.warpSize;//束的大小
  LOG(INFO) << "Maximum memory pitch:          " << prop.memPitch;//pitch用于二位数组
  LOG(INFO) << "Maximum threads per block:     " << prop.maxThreadsPerBlock;//,每个block的线程最大数
  LOG(INFO) << "Maximum dimension of block:    "
      << prop.maxThreadsDim[] << ", " << prop.maxThreadsDim[] << ", "
      << prop.maxThreadsDim[];//线程的维度
  LOG(INFO) << "Maximum dimension of grid:     "
      << prop.maxGridSize[] << ", " << prop.maxGridSize[] << ", "
      << prop.maxGridSize[];//block的维度
  LOG(INFO) << "Clock rate:                    " << prop.clockRate;
  LOG(INFO) << "Total constant memory:         " << prop.totalConstMem;//最大常量内存
  LOG(INFO) << "Texture alignment:             " << prop.textureAlignment;//纹理????
  LOG(INFO) << "Concurrent copy and execution: "
      << (prop.deviceOverlap ? "Yes" : "No");
  LOG(INFO) << "Number of multiprocessors:     " << prop.multiProcessorCount;//sm的数量
  LOG(INFO) << "Kernel execution timeout:      "
      << (prop.kernelExecTimeoutEnabled ? "Yes" : "No");
  return;
}

bool Caffe::CheckDevice(const int device_id) {

  // cudaSetDevice only sets the device id; it does not create a context,
  // so on its own it cannot tell whether the device is actually usable.
  // In a shared environment where the devices are set to EXCLUSIVE_PROCESS
  // or EXCLUSIVE_THREAD mode, cudaSetDevice() returns cudaSuccess
  // even if the device is exclusively occupied by another process or thread.
  // Cuda operations that initialize the context are needed to check
  // the permission. cudaFree(0) is one of those with no side effect,
  // except the context initialization.
  bool r = ((cudaSuccess == cudaSetDevice(device_id)) &&
            (cudaSuccess == cudaFree(0)));
  // reset any error that may have occurred.
  cudaGetLastError();
  return r;
}

int Caffe::FindDevice(const int start_id) {
  // This function finds the first available device by checking devices with
  // ordinal from start_id to the highest available value. In the
  // EXCLUSIVE_PROCESS or EXCLUSIVE_THREAD mode, if it succeeds, it also
  // claims the device due to the initialization of the context.
  int count = 0;
  CUDA_CHECK(cudaGetDeviceCount(&count));
  for (int i = start_id; i < count; i++) {
    if (CheckDevice(i)) return i;
  }
  return -1;
}
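
Putting the device functions together, a typical start-up sequence might look like this (my own illustrative snippet, not taken from the sources):

int dev = caffe::Caffe::FindDevice();          // first usable device, or -1 if none
if (dev >= 0) {
  caffe::Caffe::SetDevice(dev);                // also (re)creates the cuBLAS/cuRAND handles
  caffe::Caffe::set_mode(caffe::Caffe::GPU);
} else {
  caffe::Caffe::set_mode(caffe::Caffe::CPU);
}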

III. Summary

common manages Caffe's global resources: random number generation, GPU device information, and parallel-training state, with each thread holding its own Caffe instance through Boost thread-specific storage.