TensorFlow Architecture

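TensorFlow is designed to support large-scale distributed training and inference, while remaining flexible enough to support new models and system-level optimizations.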

TensorFlow is a cross-platform library. User-level code at different levels (the Python and C++ clients) is separated from the core runtime by a C API.

[Figure: TensorFlow architecture]

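Notes:

Client
- Defines the computation as a dataflow graph and initiates its execution through a tf.Session.
Distributed Master
- Prunes the graph down to the specific subgraph required by the arguments to Session.run().
- Partitions the subgraph into multiple pieces that run in different processes and on different devices.
- Distributes the pieces to the worker services.
- Initiates execution of the pieces by the worker services.
Worker Services (one per task)
- Schedule the execution of graph ops using kernel implementations appropriate to the available hardware (CPUs, GPUs).
- Send results to and receive results from other worker services.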

[Figure: TensorFlow architecture]

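The figure above shows how the components interact in a distributed setting. "PS" stands for parameter server: a task that stores and updates the model parameters. The other tasks work on optimizing these parameters and send their updates to the PS. This division of labor is useful for distributed training.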
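
The split between a parameter server and the other tasks can be illustrated with a toy, single-process sketch in plain Python. The `ParameterServer` class, the `worker_gradient` function, the learning rate, and the data are all hypothetical names and values chosen for illustration. In real TensorFlow the parameters live on PS tasks and updates travel over the network.

```python
class ParameterServer:
    """Holds the model parameters and applies updates sent by workers."""

    def __init__(self, w, b):
        self.params = {"w": w, "b": b}

    def pull(self):
        # Workers fetch a copy of the current parameters.
        return dict(self.params)

    def push(self, grads, lr):
        # Workers send gradients; the PS applies a gradient-descent update.
        for name, grad in grads.items():
            self.params[name] -= lr * grad


def worker_gradient(params, x, y):
    """Gradient of the squared error (w*x + b - y)**2 for one example."""
    err = params["w"] * x + params["b"] - y
    return {"w": 2 * err * x, "b": 2 * err}


ps = ParameterServer(w=0.0, b=0.0)
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # points on y = 2x + 1

for step in range(2000):
    x, y = data[step % len(data)]            # each step plays the role of one worker
    grads = worker_gradient(ps.pull(), x, y)
    ps.push(grads, lr=0.05)

print(ps.params)  # w approaches 2 and b approaches 1
```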

Client

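Users can write the client program that builds the computation graph in several languages, using either low-level or high-level APIs.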

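The client creates a tf.Session, which sends the graph definition to the distributed master as a tf.GraphDef protocol buffer. When the client evaluates a node in the graph, the evaluation triggers a call to the distributed master, which initiates the computation.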

[Figure: TensorFlow architecture]

As the figure above shows, suppose the client has defined a graph that computes S = w * x + b, where w, x, and b are all vectors.
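
To make the idea of a dataflow graph concrete, here is a toy model of the S = w * x + b graph in plain Python. It is a sketch under stated assumptions, not TensorFlow's data structures: the `GRAPH` dictionary, the `OPS` table, and the `run` function are hypothetical, and the multiplication is assumed to be element-wise.

```python
# Each node maps to (op, inputs). Placeholders are fed at run time.
GRAPH = {
    "w": ("placeholder", []),
    "x": ("placeholder", []),
    "b": ("placeholder", []),
    "mul": ("mul", ["w", "x"]),
    "S": ("add", ["mul", "b"]),
}

# Element-wise vector operations on Python lists.
OPS = {
    "mul": lambda a, b: [p * q for p, q in zip(a, b)],
    "add": lambda a, b: [p + q for p, q in zip(a, b)],
}


def run(graph, fetch, feeds):
    """Evaluate the node `fetch`, computing only the nodes it depends on."""
    cache = dict(feeds)

    def evaluate(name):
        if name not in cache:
            op, inputs = graph[name]
            if op == "placeholder":
                raise KeyError(f"no value fed for placeholder {name!r}")
            cache[name] = OPS[op](*[evaluate(i) for i in inputs])
        return cache[name]

    return evaluate(fetch)


result = run(GRAPH, "S", {"w": [1.0, 2.0], "x": [3.0, 4.0], "b": [0.5, 0.5]})
print(result)  # [3.5, 8.5]
```

The graph is only a description of the computation. Nothing is computed until `run` is asked for a node, which mirrors how a session evaluates a graph on request.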

Distributed master

[Figure: TensorFlow architecture]

As noted above, the master prunes the graph down to the required subgraph and partitions it into pieces for execution.
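
Pruning can be pictured as a backward reachability search from the nodes the client asked for. The following is a minimal sketch on a toy graph, assuming a graph is a dictionary from node name to input names. The `prune` function and the `unused` node are hypothetical and exist only for illustration.

```python
# Toy graph: node name -> list of input node names.
GRAPH = {
    "w": [],
    "x": [],
    "b": [],
    "mul": ["w", "x"],
    "S": ["mul", "b"],
    "unused": ["x"],  # not needed to compute S
}


def prune(graph, fetches):
    """Keep only the nodes reachable backwards from the fetched nodes."""
    needed = set()
    stack = list(fetches)
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            stack.extend(graph[node])
    return {node: inputs for node, inputs in graph.items() if node in needed}


subgraph = prune(GRAPH, ["S"])
print(sorted(subgraph))  # ['S', 'b', 'mul', 'w', 'x']
```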

Since the master sees the overall computation for a step, it applies standard optimizations such as common subexpression elimination and constant folding. It then coordinates execution of the optimized subgraphs across a set of tasks.
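
The two optimizations named above can be shown on a small expression graph. This is a simplified sketch, not TensorFlow's optimizer: the `optimize` function and the node encoding are hypothetical, values are scalars, and the duplicate check compares inputs in order, so it does not recognize `x + y` and `y + x` as the same expression.

```python
def optimize(graph):
    """Apply constant folding and common subexpression elimination.

    `graph` maps a node name to ("const", value), ("input",) or (op, in1, in2)
    and must list nodes in topological order (inputs before consumers).
    Returns the optimized graph and a map from removed nodes to their replacements.
    """
    fold = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    out, alias, seen = {}, {}, {}
    for name, node in graph.items():
        if node[0] in fold:
            op = node[0]
            # Redirect inputs that were removed as duplicates.
            args = tuple(alias.get(a, a) for a in node[1:])
            if all(out[a][0] == "const" for a in args):
                # Constant folding: compute the value ahead of time.
                node = ("const", fold[op](*(out[a][1] for a in args)))
            else:
                node = (op,) + args
                if node in seen:
                    # Common subexpression: reuse the earlier identical node.
                    alias[name] = seen[node]
                    continue
                seen[node] = name
        out[name] = node
    return out, alias


graph = {
    "two": ("const", 2),
    "three": ("const", 3),
    "six": ("mul", "two", "three"),  # constant expression, gets folded
    "x": ("input",),
    "a": ("add", "x", "six"),
    "b": ("add", "x", "six"),        # duplicate of "a", gets eliminated
    "y": ("mul", "a", "b"),
}
optimized, alias = optimize(graph)
print(optimized["six"])  # ('const', 6)
print(optimized["y"])    # ('mul', 'a', 'a')
```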

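The figure below shows one possible partition of the example graph. The master groups the model parameters w and b so that they can be placed together on the parameter server.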

[Figure: TensorFlow architecture]

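Where the partition cuts a graph edge, the master inserts send and receive nodes to pass information between the distributed tasks. In the figure below, the red dots are send nodes and the yellow dots are receive nodes: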
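
The insertion of send and receive nodes can be sketched on the example graph. The code below is a toy model in plain Python: the `partition` function, the node naming scheme, and the placement of x, the multiply, and the add on a single task named `worker` are assumptions made for illustration. Only the placement of w and b on the parameter server comes from the text.

```python
def partition(graph, placement):
    """Split a graph by task and insert send/recv nodes on cut edges.

    graph: node -> list of input nodes; placement: node -> task name.
    Returns task -> {node: inputs}. Each cross-task input is replaced by a
    local recv node, paired with a send node on the producer's task.
    """
    pieces = {task: {} for task in set(placement.values())}
    for node, inputs in graph.items():
        dst = placement[node]
        local_inputs = []
        for src_node in inputs:
            src = placement[src_node]
            if src == dst:
                local_inputs.append(src_node)
                continue
            send = f"send_{src_node}_to_{dst}"
            recv = f"recv_{src_node}_from_{src}"
            pieces[src][send] = [src_node]  # producer side of the cut edge
            pieces[dst][recv] = []          # consumer side, fed by the transport
            local_inputs.append(recv)
        pieces[dst][node] = local_inputs
    return pieces


GRAPH = {"w": [], "x": [], "b": [], "mul": ["w", "x"], "S": ["mul", "b"]}
PLACEMENT = {"w": "ps", "b": "ps", "x": "worker", "mul": "worker", "S": "worker"}

pieces = partition(GRAPH, PLACEMENT)
for task in sorted(pieces):
    print(task, pieces[task])
```

After partitioning, every piece refers only to nodes on its own task, so each piece can be shipped to its task and executed without knowledge of the rest of the graph.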

[Figure: TensorFlow architecture]

Once partitioning is complete, the master ships the graph pieces to the distributed tasks for execution, as shown in the figure below:

[Figure: TensorFlow architecture]

See the master's API definition and the corresponding generated interface.

Worker Service

The worker service in each task:

- Handles requests from the master.
- Schedules the execution of its local subgraph.
- Handles direct communication between tasks.
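
Scheduling a local subgraph means running each node only after its inputs are available. One common way to express that is a topological sort. The sketch below uses Kahn's algorithm on a toy subgraph. The `schedule` function and the node names are hypothetical, and this is not a description of how TensorFlow's executor is implemented.

```python
from collections import deque


def schedule(subgraph):
    """Return an execution order in which every node follows its inputs."""
    pending = {node: len(inputs) for node, inputs in subgraph.items()}
    consumers = {node: [] for node in subgraph}
    for node, inputs in subgraph.items():
        for src in inputs:
            consumers[src].append(node)

    # Nodes with no inputs are ready immediately.
    ready = deque(node for node, count in pending.items() if count == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for consumer in consumers[node]:
            pending[consumer] -= 1
            if pending[consumer] == 0:
                ready.append(consumer)

    if len(order) != len(subgraph):
        raise ValueError("subgraph contains a cycle")
    return order


local = {
    "recv_w": [],
    "x": [],
    "recv_b": [],
    "mul": ["recv_w", "x"],
    "S": ["mul", "recv_b"],
}
order = schedule(local)
print(order)  # ['recv_w', 'x', 'recv_b', 'mul', 'S']
```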

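The worker service implementation can execute thousands of subgraphs per second, which lets a large number of graph replicas take rapid, fine-grained training steps.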

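Send and Recv operations are also specialized for each pair of source and destination devices within a task. Transfers between a local CPU and GPU use the cudaMemcpyAsync() API to overlap computation with data transfer. Transfers between two local GPUs use peer-to-peer DMA.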

For communication between tasks, TensorFlow supports gRPC and RDMA.
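
The pairing of Send and Recv between tasks can be imitated inside one process with two threads and a queue. This is only an analogy under stated assumptions: the queue stands in for a network transport such as gRPC or RDMA, and `ps_task`, `worker_task`, `channel`, and `results` are hypothetical names.

```python
import queue
import threading

channel = queue.Queue()  # stands in for the transport between two tasks
results = {}


def ps_task():
    w = [1.0, 2.0]
    channel.put(("w", w))  # Send: hand the tensor to the transport


def worker_task():
    name, w = channel.get()  # Recv: block until the tensor arrives
    x = [3.0, 4.0]
    results["mul"] = [p * q for p, q in zip(w, x)]


# Start the receiver first to show that Recv simply waits for the matching Send.
threads = [threading.Thread(target=worker_task), threading.Thread(target=ps_task)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(results["mul"])  # [3.0, 8.0]
```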

As shown in the figure below:

[Figure: TensorFlow architecture]

Kernel Implementations

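TensorFlow includes more than 200 standard operations, covering mathematical, array manipulation, control flow, and state management operations. Each operation can have kernel implementations optimized for a variety of devices.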
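
The idea of one operation with several device-specific kernels can be sketched as a lookup table keyed by operation and device. The registry below is a toy in plain Python. `KERNELS`, `register_kernel`, and `run_op` are hypothetical names, and the "GPU" kernel is a stand-in that runs ordinary Python rather than device code.

```python
KERNELS = {}


def register_kernel(op, device):
    """Decorator that records a kernel for an (operation, device) pair."""
    def decorator(fn):
        KERNELS[(op, device)] = fn
        return fn
    return decorator


@register_kernel("Add", "CPU")
def add_cpu(a, b):
    return [p + q for p, q in zip(a, b)]


@register_kernel("Add", "GPU")
def add_gpu(a, b):
    # Placeholder: a real GPU kernel would launch device code here.
    return [p + q for p, q in zip(a, b)]


def run_op(op, device, *args):
    """Dispatch to the kernel registered for this operation and device."""
    try:
        kernel = KERNELS[(op, device)]
    except KeyError:
        raise NotImplementedError(f"no {op} kernel registered for {device}") from None
    return kernel(*args)


print(run_op("Add", "CPU", [1, 2], [3, 4]))  # [4, 6]
```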

Many of the operation kernels are implemented using Eigen::Tensor, which uses C++ templates to generate efficient parallel code for multicore CPUs and GPUs; however, we liberally use libraries like cuDNN where a more efficient kernel implementation is possible. We have also implemented quantization, which enables faster inference in environments such as mobile devices and high-throughput datacenter applications, and use the gemmlowp low-precision matrix library to accelerate quantized computation.
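
As a rough illustration of what quantization means, the sketch below maps floating-point values onto 8-bit integers with a linear scale and then maps them back. It is a generic textbook scheme under stated assumptions, not the scheme used by TensorFlow or gemmlowp, and the `quantize` and `dequantize` functions are hypothetical.

```python
def quantize(values, num_bits=8):
    """Map floats linearly onto integers in [0, 2**num_bits - 1]."""
    lo, hi = min(values), max(values)
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels or 1.0  # avoid a zero scale when all values are equal
    quantized = [round((v - lo) / scale) for v in values]
    return quantized, scale, lo


def dequantize(quantized, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [lo + scale * n for n in quantized]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale, lo = quantize(weights)
restored = dequantize(q, scale, lo)

# The round trip loses at most about half a quantization step per value.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q[0], q[-1], max_error <= scale / 2 + 1e-9)  # 0 255 True
```

Storing and multiplying small integers instead of 32-bit floats is what makes quantized inference cheaper, at the cost of the bounded rounding error shown here.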

Code

Kernel interface

Source: the Architecture page on the official TensorFlow website
