TensorFlow Architecture

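TensorFlow is designed for large-scale distributed training and inference, while remaining flexible enough to support new models and system-level optimizations.

TensorFlow is a cross-platform library. A C API separates the user-level code in different languages (the Python and C++ clients) from the core runtime.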

[Figure: TensorFlow architecture]

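Overview:

Client
- Defines the computation as a user-specified dataflow graph and executes it through a tf.Session.
Distributed Master
- Prunes the subgraph requested by the arguments to Session.run().
- Partitions the subgraph into multiple pieces that run in different processes and on different devices.
- Distributes the pieces to the worker services.
- Initiates execution of each piece through its worker service.
Worker Services (one per task)
- Schedule the execution of graph ops using kernel implementations suited to the available hardware (CPUs, GPUs).
- Send op results to, and receive them from, other worker services.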

[Figure: TensorFlow architecture]

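The figure above shows how the components interact in a distributed setting. "PS" stands for parameter server: a task that stores and updates the model parameters. The other tasks optimize these parameters and send their updates to the PS. This division of labor is useful for distributed training.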
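The parameter-server split can be sketched in a few lines of plain Python. This is only a toy model of the idea, not TensorFlow code: the `ParameterServer` class, its `pull`/`push` methods, the learning rate, and the tiny dataset are all assumptions made for illustration, and a single loop stands in for what would be many concurrent worker tasks.

```python
class ParameterServer:
    """Hypothetical PS task: stores the parameters and applies updates."""

    def __init__(self, params):
        self.params = dict(params)

    def pull(self):
        # Workers fetch a copy of the current parameters.
        return dict(self.params)

    def push(self, grads, lr=0.1):
        # Workers send gradients; the PS applies the update.
        for name, grad in grads.items():
            self.params[name] -= lr * grad


def worker_step(ps, x, target):
    """One SGD step on the loss 0.5 * (w * x + b - target) ** 2."""
    p = ps.pull()
    err = p["w"] * x + p["b"] - target
    ps.push({"w": err * x, "b": err})
    return err


ps = ParameterServer({"w": 0.0, "b": 0.0})
data = [(1.0, 3.0), (2.0, 5.0)]  # samples of y = 2x + 1 (made-up data)
for _ in range(2000):
    for x, target in data:
        worker_step(ps, x, target)

print(ps.params)  # w approaches 2 and b approaches 1
```

The point of the split is that only the PS owns the parameter state, so workers can come and go while exchanging nothing but parameter copies and updates.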

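Client

Users can build the client computation graph in several languages, using either low-level or high-level APIs.

The client creates a tf.Session and sends the graph definition to the distributed master as a tf.GraphDef protocol buffer. When the client evaluates the value of a node in the graph, the evaluation triggers a call to the distributed master, which starts the computation.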

[Figure: TensorFlow architecture]

As in the figure above, suppose the client has defined a graph that computes S = w * x + b, where w, x, and b are all vectors.
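To make the example graph concrete, here is a minimal pure-Python sketch of a dataflow graph for S = w * x + b. It is not TensorFlow code: the dictionary layout, the `evaluate` helper, the sample values, and the choice to treat `*` as an element-wise product are all assumptions for illustration.

```python
# Toy dataflow graph: node name -> (op, arguments...).
graph = {
    "w": ("const", [2.0, 3.0]),
    "x": ("const", [4.0, 5.0]),
    "b": ("const", [1.0, 1.0]),
    "mul": ("mul", "w", "x"),
    "S": ("add", "mul", "b"),
}


def evaluate(graph, name):
    """Recursively evaluate the node called `name`."""
    op, *args = graph[name]
    if op == "const":
        return args[0]
    left, right = (evaluate(graph, arg) for arg in args)
    if op == "mul":
        return [l * r for l, r in zip(left, right)]
    if op == "add":
        return [l + r for l, r in zip(left, right)]
    raise ValueError(f"unknown op: {op}")


result = evaluate(graph, "S")
print(result)  # [9.0, 16.0]
```

Nothing is computed when the graph is defined; values only appear when a node is evaluated, which mirrors the define-then-run split between building the graph and running it in a session.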

Distributed master

[Figure: TensorFlow architecture]

As noted above, the master prunes the graph down to the required subgraph and partitions it into pieces for execution.
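Pruning amounts to a backward reachability walk from the fetched nodes. The sketch below illustrates that idea in plain Python; the `prune` function and the `unused_summary` node are hypothetical, and the real master works on a GraphDef rather than a dictionary.

```python
def prune(graph, fetches):
    """Return the set of node names needed to compute `fetches`.

    `graph` maps each node name to the list of its input node names.
    """
    needed = set()
    stack = list(fetches)
    while stack:
        node = stack.pop()
        if node in needed:
            continue
        needed.add(node)
        stack.extend(graph[node])  # walk backwards along the inputs
    return needed


graph = {
    "w": [], "x": [], "b": [],
    "mul": ["w", "x"],
    "S": ["mul", "b"],
    "unused_summary": ["w"],  # hypothetical node that S does not depend on
}

subgraph = prune(graph, ["S"])
print(sorted(subgraph))
```

Only the nodes that S transitively depends on survive, so work attached to `unused_summary` is never scheduled for this step.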

Since the master sees the overall computation for a step, it applies standard optimizations such as common subexpression elimination and constant folding. It then coordinates execution of the optimized subgraphs across a set of tasks.
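The two optimizations named above can be illustrated on a toy graph. The following is a simplified sketch, not how TensorFlow implements them: the `(name, op, inputs)` node format, the `optimize` function, and the restriction to scalar `add` and `mul` ops are assumptions made to keep the example short.

```python
def optimize(nodes):
    """Apply constant folding and common subexpression elimination (CSE).

    `nodes` is a list of (name, op, inputs) in topological order, where op
    is "const" (inputs holds the value), "input", "add", or "mul".
    Returns (optimized node list, alias map for eliminated nodes).
    """
    consts = {}  # name -> known constant value
    seen = {}    # (op, inputs) -> name of the first node computing it
    alias = {}   # eliminated name -> surviving name
    out = []
    for name, op, inputs in nodes:
        if op == "const":
            consts[name] = inputs
            out.append((name, op, inputs))
            continue
        if op == "input":
            out.append((name, op, inputs))
            continue
        ins = tuple(alias.get(i, i) for i in inputs)
        # Constant folding: every input is already a known constant.
        if all(i in consts for i in ins):
            a, b = (consts[i] for i in ins)
            value = a + b if op == "add" else a * b
            consts[name] = value
            out.append((name, "const", value))
            continue
        # CSE: reuse an earlier node with the same op and inputs.
        key = (op, ins)
        if key in seen:
            alias[name] = seen[key]
            continue
        seen[key] = name
        out.append((name, op, ins))
    return out, alias


nodes = [
    ("two", "const", 2),
    ("three", "const", 3),
    ("x", "input", None),
    ("six", "mul", ("two", "three")),  # folds to the constant 6
    ("a", "add", ("x", "six")),
    ("b", "add", ("x", "six")),        # same expression as "a"
    ("y", "mul", ("a", "b")),
]
optimized, alias = optimize(nodes)
for node in optimized:
    print(node)
```

After the pass, `six` is a constant, `b` is gone, and `y` reads `a` twice. A fuller pass would also drop constants that are no longer referenced and treat `add(x, y)` and `add(y, x)` as the same expression.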

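The figure below shows one possible partition of the example graph. The master groups the model parameters w and b together so that they can be placed on the parameter server.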

[Figure: TensorFlow architecture]

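Where the partition cuts a graph edge, the master inserts send and receive nodes to pass data between the distributed tasks. In the figure below, the red dots are send nodes and the yellow dots are receive nodes: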

[Figure: TensorFlow architecture]
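The send/receive insertion can be sketched as a pass over the graph edges. This is an illustrative toy in plain Python: the `partition` function, the naming scheme for the inserted nodes, and the placement of x, mul, and S on a single worker task are assumptions, not the master's actual algorithm.

```python
def partition(edges, placement):
    """Split a graph across tasks, inserting send/recv pairs on cut edges.

    edges: list of (src, dst) node names.
    placement: node name -> task name.
    Returns a dict mapping each task name to its list of local edges.
    """
    parts = {task: [] for task in set(placement.values())}
    for src, dst in edges:
        src_task, dst_task = placement[src], placement[dst]
        if src_task == dst_task:
            parts[src_task].append((src, dst))
        else:
            # The edge crosses tasks: the producer side gets src -> send,
            # the consumer side gets recv -> dst.
            parts[src_task].append((src, f"send:{src}->{dst_task}"))
            parts[dst_task].append((f"recv:{src}<-{src_task}", dst))
    return parts


edges = [("w", "mul"), ("x", "mul"), ("mul", "S"), ("b", "S")]
placement = {"w": "ps", "b": "ps", "x": "worker", "mul": "worker", "S": "worker"}

parts = partition(edges, placement)
for task in sorted(parts):
    print(task, parts[task])
```

Each piece is now self-contained: every edge inside a piece connects two nodes on the same task, and all cross-task traffic goes through a matching send/recv pair.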

Once partitioning is complete, the master ships the graph pieces to the distributed tasks for execution, as shown in the figure below:

[Figure: TensorFlow architecture]

See the master's API definition and the corresponding generated interface.

Worker Service

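The worker service in each task:

- handles requests from the master;
- schedules the execution of its local subgraph;
- handles communication between tasks directly.

The worker service implementation can execute thousands of subgraphs per second, which lets a large number of graph replicas take rapid, fine-grained training steps.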

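Within a single task, Send and Recv operations are supported for each pair of source and destination devices. Transfers between a local CPU and GPU use the cudaMemcpyAsync() API to overlap computation with data transfer, and transfers between two local GPUs use peer-to-peer DMA.

For communication between tasks, TensorFlow supports gRPC and RDMA.

As shown in the figure below: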

[Figure: TensorFlow architecture]
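The pairing of Send and Recv can be pictured as a rendezvous on a shared key. The sketch below is a toy in-process model built on Python threads and queues; the `Rendezvous` class, its methods, and the key format are assumptions for illustration and say nothing about how the transports (cudaMemcpyAsync, DMA, gRPC, RDMA) are actually wired in.

```python
import queue
import threading
from collections import defaultdict


class Rendezvous:
    """Toy rendezvous: a Send and a Recv meet on a shared key."""

    def __init__(self):
        self._lock = threading.Lock()
        self._slots = defaultdict(lambda: queue.Queue(maxsize=1))

    def _slot(self, key):
        # Create the slot lazily, whichever side arrives first.
        with self._lock:
            return self._slots[key]

    def send(self, key, tensor):
        self._slot(key).put(tensor)

    def recv(self, key, timeout=5.0):
        # Blocks until the matching send has delivered a value.
        return self._slot(key).get(timeout=timeout)


rendezvous = Rendezvous()
received = []


def consumer():
    received.append(rendezvous.recv("ps->worker:w"))


thread = threading.Thread(target=consumer)
thread.start()
rendezvous.send("ps->worker:w", [2.0, 3.0])
thread.join()
print(received)  # [[2.0, 3.0]]
```

Because both sides agree only on the key, the producer and consumer can reach the rendezvous in either order.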

Related code:

WorkerService API definition

Worker interface

Remote rendezvous (for Send and Recv implementations)

Kernel Implementations

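TensorFlow includes more than 200 standard ops, covering mathematical, array manipulation, control flow, and state management operations. Each op has kernel implementations optimized for a variety of devices.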
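One way to picture "one op, several device-specific kernels" is a registry keyed by op name and device type. The sketch below is a hypothetical pure-Python analogue: `KERNELS`, `register_kernel`, and `run_op` are made-up names, and the "GPU" kernel is a stand-in that runs on the CPU.

```python
KERNELS = {}  # (op name, device type) -> kernel function


def register_kernel(op, device):
    """Decorator that registers a kernel for an (op, device) pair."""
    def decorator(fn):
        KERNELS[(op, device)] = fn
        return fn
    return decorator


@register_kernel("Add", "CPU")
def add_cpu(a, b):
    return [x + y for x, y in zip(a, b)]


@register_kernel("Add", "GPU")
def add_gpu(a, b):
    # Stand-in only: a real GPU kernel would launch device code here.
    return [x + y for x, y in zip(a, b)]


def run_op(op, device, *inputs):
    """Look up the kernel for this op on this device and run it."""
    try:
        kernel = KERNELS[(op, device)]
    except KeyError:
        raise NotImplementedError(f"no kernel for {op} on {device}") from None
    return kernel(*inputs)


print(run_op("Add", "CPU", [1, 2], [3, 4]))  # [4, 6]
```

The graph only names the op; which implementation runs is decided at execution time from the device the node was placed on.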

Many of the operation kernels are implemented using Eigen::Tensor, which uses C++ templates to generate efficient parallel code for multicore CPUs and GPUs; however, we liberally use libraries like cuDNN where a more efficient kernel implementation is possible. We have also implemented quantization, which enables faster inference in environments such as mobile devices and high-throughput datacenter applications, and use the gemmlowp low-precision matrix library to accelerate quantized computation.

Code:

Kernel interface

Source: the Architecture page on the official TensorFlow website
