
Multiprocessing (parallel computer architecture)

  • Reference: Computer Architecture (6th Edition)

Contents

  • Multiprocessing
    • Classification of computer architecture
    • Challenges to Parallel Programming
  • SIMD (Data-Level Parallelism)
    • Vector Processor
      • Why Vector Processors?
      • Basic Vector Architecture
      • Vector Supercomputers
      • Multimedia Extensions (aka SIMD extensions)
      • The basic structure of a vector-register architecture
        • VMIPS
        • Vector Code Example
        • Automatic Code Vectorization
        • Vector Arithmetic Execution (deep pipeline + multiple independent memory banks)
        • Vector Stripmining
        • Vector Stride
        • Vector Chaining
        • Multiple Lanes
    • Array Processor
    • *GPU
  • MIMD (Thread-Level Parallelism)
    • Communication Models
      • Shared Memory Model
      • Message Passing
    • MIMD Memory Architecture: Centralized (SMP) vs. Distributed
    • The Flynn-Johnson classification of computer systems
    • Typical parallel computer architectures
      • SMP (Symmetric Multiprocessor)
      • COW (Cluster of Workstations)
      • MPP (Massively Parallel Processor)
      • Cluster vs. MPP
    • Interconnection Networks
  • Cache Coherence and Coherence Protocol
    • What Is Multiprocessor Cache Coherence?
      • Cache Coherence problem
      • Coherent Memory Model
        • Coherent Memory System
        • Memory Consistency Model
    • Basic Schemes for Enforcing Coherence
      • Cache Coherence Protocols (HW)
    • Snooping Coherence Protocols
      • Write Invalidate, Write Update
      • Basic Implementation Techniques (Write Invalidate)
        • Broadcast Medium Transactions (e.g., bus)
        • Locate up-to-date copy of data
    • An Example Protocol (Snoopy, Invalidate)
      • Write-through Cache Protocol
      • Write Back Cache Protocol

Multiprocessing

Coprocessor: a secondary processor that offloads work from the main CPU

Classification of computer architecture

Flynn’s Taxonomy:

  • Basic idea: a computer's operation consists of executing an instruction stream and processing a data stream. Flynn classifies architectures by the multiplicity of these two streams
    • Instruction stream: the sequence of instructions executed by the machine
    • Data stream: the sequence of data items referenced by the instruction stream (inputs and intermediate results)
    • Multiplicity: the maximum number of instructions or data items that can be in the same execution stage at the system's bottleneck component (i.e., how many data items one instruction can operate on at once)
  • SISD: single instruction stream, single data stream; how a uniprocessor operates
    • the model of the von Neumann architecture
  • SIMD (Data-Level Parallelism): a single instruction stream running on one machine, but each instruction can operate on multiple data items at once
    • vector processors, array processors
  • MISD: does not exist in practice
  • MIMD (Thread/Process-Level Parallelism): multiple instruction streams → multiprogramming → multiple processors; each instruction stream processes its own data
    • Multi-computers: clusters
    • Multi-processors: multiple CPUs in one machine, or multiple cores (multi-core)
      • Reason for multicores: physical limitations of single cores cause significant heat dissipation and data synchronization problems
      • e.g., the Intel Core 2 dual-core processor, with CPU-local L1 caches plus a shared, on-die L2 cache (sharing the L2 cache among cores makes data synchronization somewhat easier to handle)

SISD v. SIMD v. MIMD

PU: processing unit; Instruction Pool: instruction cache; Data Pool: data cache

Challenges to Parallel Programming

The first challenge is the fraction of a program that is inherently sequential

  • addressed primarily via new algorithms that have better parallel performance

Example

  • Suppose we want an 80× speedup from 100 processors. What fraction of the original program can be sequential?

    a. 10% b. 5% c. 1% d. <1%

  • Amdahl’s Law answer: <1%
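The arithmetic behind that answer can be checked directly from Amdahl's Law; the sketch below (variable names are mine) solves for the parallel fraction f that yields an 80× speedup on 100 processors:

```python
def speedup(parallel_fraction, processors):
    # Amdahl's Law: S = 1 / ((1 - f) + f / p)
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / processors)

# Solve 80 = 1 / ((1 - f) + f / 100) for f:
#   f = (1 - 1/80) / (1 - 1/100)
f = (1 - 1 / 80) / (1 - 1 / 100)
sequential = 1 - f
print(f"sequential fraction = {sequential:.3%}")  # about 0.25%, i.e. < 1%
```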

The second challenge is the long latency to remote memory

  • addressed both by the architect and by the programmer; for example, reduce the frequency of remote accesses by
    • Caching shared data (HW)
    • Restructuring the data layout to make more accesses local (SW)

Example

  • Suppose a 32-CPU MP, 2 GHz clock, 200 ns remote memory access, all local accesses hitting in the memory hierarchy, and a base CPI of 0.5. What is the performance impact if 0.2% of instructions involve a remote access?

    a. 1.5X b. 2.0X c. 2.6X

  • CPI equation
    • Remote access cost = 200 ns × 2 GHz = 400 clock cycles
    • CPI = Base CPI + Remote request rate × Remote request cost = 0.5 + 0.2% × 400 = 0.5 + 0.8 = 1.3
    • With no communication (all references local) the MP is 1.3/0.5 = 2.6× faster than with 0.2% of instructions making remote accesses
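The same computation as a few lines of arithmetic (parameters taken from the example above):

```python
clock_ghz = 2.0      # 2 GHz clock
remote_ns = 200.0    # remote memory access latency
base_cpi = 0.5
remote_rate = 0.002  # 0.2% of instructions go remote

remote_cycles = remote_ns * clock_ghz          # 200 ns * 2 cycles/ns = 400
cpi = base_cpi + remote_rate * remote_cycles   # 0.5 + 0.8 = 1.3
slowdown = cpi / base_cpi                      # 2.6x slower than all-local
print(remote_cycles, cpi, slowdown)
```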

SIMD (Data-Level Parallelism)

Vector Processor

Why Vector Processors?

  • A single vector instruction specifies a great deal of work—it is equivalent to executing an entire loop. Because an entire loop is replaced by a vector instruction whose behavior is predetermined, control hazards that would normally arise from the loop branch are nonexistent.
  • The computation of each result (for one element) in the vector is independent of the computation of other results in the same vector and so hardware does not have to check for data hazards within a vector instruction. Hardware need only check for data hazards between two vector instructions once per vector operand, not once for every element within the vectors.
  • Vector instructions that access memory have a known access pattern (an array's layout in memory is fixed, so the memory accesses are fixed as well).

Basic Vector Architecture

vector-register processors

  • In a vector-register processor, all vector operations—except load and store—are among the vector registers (the vector registers must therefore be large enough)

memory-memory vector processors

  • In a memory-memory vector processor, all vector operations are memory to memory.

Vector Memory-Memory vs. Vector Register Machines

A V suffix on an instruction mnemonic denotes a vector instruction
  • Vector memory-memory architectures (VMMA) require greater main memory bandwidth, why?
    • All operands must be read in and out of memory
  • VMMAs make it difficult to overlap execution of multiple vector operations, why?
    • Must check dependencies on memory addresses
  • VMMAs incur greater startup latency → all major vector machines since the Cray-1 have had vector-register architectures (we ignore memory-memory vector machines from now on)
    • Scalar code was faster on the CDC Star-100 (memory-memory) for vectors < 100 elements
    • For the Cray-1 (vector-register), the vector/scalar break-even point was around 2 elements

Vector Supercomputers

Epitomized by Cray-1, 1976: Scalar Unit + Vector Extensions

  • Load/Store Architecture, Vector Registers, Vector Instructions, Hardwired Control, Interleaved Memory System, Highly Pipelined Functional Units (in fact the per-element computations of a vector instruction are not done all in parallel but in a deep pipeline; the pipeline can run fast because no intra-vector dependence checking is needed)

Vector Programming Model

Stride: needed, for example, when reading a two-dimensional array by columns

Multimedia Extensions (aka SIMD extensions)

What current CPUs integrate are generally multimedia extensions, which resemble vector operations
  • Very short vectors added to existing ISAs for microprocessors. Use existing 64-bit registers split into 2×32b, 4×16b, or 8×8b lanes (newer designs have wider registers)
  • Single instruction operates on all elements within register
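Subword parallelism of this kind can be imitated in software (sometimes called SWAR). The sketch below, with names and constants of my own choosing, treats one 64-bit integer as eight 8-bit lanes and adds all lanes in a single pass, letting no carry cross a lane boundary:

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
LOW7 = 0x7F7F7F7F7F7F7F7F   # the low 7 bits of every 8-bit lane

def packed_add8(a, b):
    # Add the low 7 bits of every lane, then patch the top bit of each
    # lane back in; carries out of a lane are discarded (mod-256 lanes).
    low = (a & LOW7) + (b & LOW7)
    return (low ^ ((a ^ b) & ~LOW7)) & MASK64

# lane 0: 250 + 10 wraps to 4; lane 1: 10 + 20 = 30 -- one call, many adds
print(hex(packed_add8(0x0AFA, 0x140A)))  # 0x1e04
```

A real SIMD extension does the same thing in one hardware instruction instead of this bit-twiddling.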

Multimedia Extensions versus Vectors

  • Limited instruction set: no vector length control, no strided load/store or scatter/gather, unit-stride loads must be aligned to 64/128-bit boundary
  • Limited vector register length: requires superscalar dispatch to keep multiply/add/load units busy, loop unrolling to hide latencies increases register pressure
  • Trend towards fuller vector support in microprocessors: better support for misaligned memory accesses; support for double-precision (64-bit) floating point; the Intel AVX spec (announced April 2008), with 256b vector registers (expandable up to 1024b)

The basic structure of a vector-register architecture

VMIPS


Primary Components of VMIPS

  • Vector registers — VMIPS has eight vector registers, and each holds 64 elements. Each vector register must have at least two read ports and one write port.
  • Vector functional units — Each unit is fully pipelined and can start a new operation on every clock cycle.
    • In VMIPS, vector operations use the same names as MIPS operations, but with the letter “V” appended.
  • Vector load-store unit — The VMIPS vector loads and stores are fully pipelined, so that words can be moved between the vector registers and memory with a bandwidth of 1 word per clock cycle, after an initial latency.
  • A set of scalar registers — Scalar registers can also provide data as input to the vector functional units, as well as compute addresses to pass to the vector load-store unit.

Vector Code Example

VLR: vector length register
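To make the "one instruction replaces a loop" point concrete, here is a back-of-envelope dynamic instruction count for DAXPY (Y = a·X + Y), the standard VMIPS example; the per-iteration and per-strip counts below are rough assumptions, not measurements:

```python
def scalar_instructions(n, per_iteration=6):
    # Scalar MIPS executes roughly six instructions per element:
    # load x[i], multiply, load y[i], add, store y[i], loop branch/index.
    return per_iteration * n

def vector_instructions(n, mvl=64, per_strip=6):
    # VMIPS needs about six vector instructions per MVL-element strip:
    # LV, MULVS.D, LV, ADDV.D, SV, plus loop overhead when stripmining.
    strips = (n + mvl - 1) // mvl
    return per_strip * strips

print(scalar_instructions(64), vector_instructions(64))  # 384 vs. 6
```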

Automatic Code Vectorization

  • Scalar Sequential Code:
  • Vectorized Code: vectorization is a massive compile-time reordering of operation sequencing → it requires extensive loop-dependence analysis

Vector Arithmetic Execution (deep pipeline + multiple independent memory banks)

  • A single vector instruction launches the memory accesses for all elements of the vector and processes them in pipelined fashion
    • To produce a result every clock cycle, multiple memory banks are used
      • Allow multiple loads and stores per clock cycle
      • Allow independent management of different memory addresses
      • Example: assume 8 memory banks and a 6-cycle memory-bank busy time to deliver a data item; the hardware overlaps multiple data requests
    • Use deep pipeline (=> fast clock) to execute element operations. Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)
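The 8-bank example can be checked with a toy simulation (my own model, assuming round-robin unit-stride accesses and one new request per cycle): with B banks and a T-cycle busy time, bank i is revisited every B cycles, so the stream runs stall-free exactly when B ≥ T.

```python
def stall_cycles(n_accesses, banks, busy):
    # Unit-stride accesses go round-robin over the banks; a request must
    # wait until its bank has finished its previous busy-time cycles.
    free_at = [0] * banks   # cycle at which each bank becomes free again
    t = 0                   # cycle at which the next request issues
    stalls = 0
    for i in range(n_accesses):
        b = i % banks
        if free_at[b] > t:          # bank still busy: stall
            stalls += free_at[b] - t
            t = free_at[b]
        free_at[b] = t + busy
        t += 1
    return stalls

print(stall_cycles(64, 8, 6))       # 0: eight banks hide a 6-cycle busy time
print(stall_cycles(64, 4, 6) > 0)   # too few banks => the stream stalls
```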

Vector Stripmining

  • Problem: Vector registers have finite length
    • The solution is to create a vector-length register (VLR), which controls the length of any vector operation. The value in the VLR, however, cannot be greater than the length of the vector registers— maximum vector length (MVL).
    • If the vector is longer than the maximum length, stripmining is used. (Break loops into pieces that fit into vector registers)
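The bookkeeping can be sketched as follows (a minimal model: the first strip takes the odd-sized n mod MVL piece, and every later strip is a full MVL, as in the classic stripmined loop):

```python
MVL = 64  # maximum vector length (VMIPS vector registers hold 64 elements)

def strip_lengths(n, mvl=MVL):
    # The successive values loaded into VLR when an n-element loop is
    # stripmined: one short first strip, then full-MVL strips.
    first = n % mvl
    return ([first] if first else []) + [mvl] * (n // mvl)

print(strip_lengths(200))  # [8, 64, 64, 64]
```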

Vector Stride

  • In matrix multiply we could vectorize the multiplication of each row of B with each column of C. When an array is allocated memory, it is linearized and must be laid out in either row-major or column-major order. This linearization means that either the elements of a row or the elements of a column are not adjacent in memory (one of the two is always non-contiguous)
  • This distance separating elements that are to be gathered into a single register is called the stride.
  • The vector stride, like the vector starting address, can be put in a general-purpose register. Then the VMIPS instruction LVWS (load vector with stride) can be used to fetch the vector into a vector register. Likewise, when a nonunit stride vector is being stored, SVWS (store vector with stride) can be used.
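A quick sanity check of where column elements land in a row-major layout (array shape and names are illustrative):

```python
def address(i, j, cols):
    # Offset of element (i, j) in a row-major matrix with `cols` columns.
    return i * cols + j

cols = 100
column_5 = [address(i, 5, cols) for i in range(4)]  # walk down column 5
print(column_5)  # [5, 105, 205, 305]: successive elements sit cols apart,
                 # so a strided load would use a stride of 100 elements here
```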

Vector Chaining

  • Vector version of register bypassing: the concept of forwarding extended to vector registers, making a sequence of dependent vector operations run faster (because a vector's element operations are pipelined, as soon as one element of a vector instruction's result is produced it can be forwarded to the next vector instruction)
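A rough timing model shows what chaining buys. Assume two dependent vector instructions of length n with pipeline latencies l1 and l2 (the numbers below are illustrative, not Cray-1 values): unchained, the second must wait for the first to drain; chained, it trails the first by only its own latency.

```python
def unchained_cycles(n, l1, l2):
    # Second instruction starts only after the first completes entirely.
    return (l1 + n) + (l2 + n)

def chained_cycles(n, l1, l2):
    # Second instruction consumes each element as soon as it appears.
    return l1 + l2 + n

print(unchained_cycles(64, 7, 6), chained_cycles(64, 7, 6))  # 141 vs. 77
```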

Multiple Lanes

  • Increase the peak performance of a vector machine by adding more parallel execution units (lanes)

Array Processor

Array processors are SIMD machines

Basic idea:

  • A single control unit (not multiple cores) provides the signals to drive many Processing Elements, which run in lockstep: every PE executes the same instruction on different data → SIMD
    • PE organization: each PE's components are driven by the control unit, which is in turn driven by the instruction/control bus

*GPU

GPU vs. CPU architecture

  • The GPU Devotes More Transistors to Data Processing.
  • CPU: more resources devoted to caching and control

What makes CUDA fast

CUDA (Compute Unified Device Architecture) → a programming interface
  • Computing coherence
    • Single-program, multiple-data (SPMD) execution model (not exactly SIMD): each core executes a piece of the program as a tiny thread placed on an ALU (parallelism comes from thousands of cores); also called SIMT (single instruction, multiple threads)
    • Massive parallel computing resources: thousands of cores so far; thousands of threads on the fly
  • Hiding memory latency: raise the compute/communication ratio, coalesce accesses to adjacent memory addresses, switch threads quickly

Suitable applications

  • GPUs pay off only on highly data-parallel tasks: large amounts of data stored in a regular, grid-like layout, with essentially the same processing applied to each element
    • e.g., image processing, physical simulation (such as computational fluid dynamics), engineering and financial modeling and analysis, search, sorting

Unsuitable applications (these require redesigning the algorithms and data structures, or batching the work)

  • computations over complex data structures such as trees, correlation matrices, linked lists, or spatial subdivision structures
  • programs dominated by serial or transactional processing
  • applications with very little parallelism, e.g., only a handful of parallel threads
  • programs with millisecond-level real-time requirements

MIMD (Thread-Level Parallelism)

  • “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”
    • Parallel Architecture = Computer Architecture + Communication Architecture

Communication Models

Shared Memory Model

Multi-processors: based on shared memory
  • Communication occurs through a shared virtual address space (via loads and stores) ⇒ low overhead for communication
    • A single address space does not imply a single physical memory. The shared address space can be implemented with one physically shared memory (centralized), or with distributed memories plus hardware/software support (distributed)

Shared-memory multiprocessors are either

  • UMA (Uniform Memory Access time): shared address space, centralized memory MP
  • NUMA (Non-Uniform Memory Access time): shared address space, distributed memory MP

Message Passing

Multi-computers: based on message passing
  • Communication occurs by explicitly passing messages among the processors. (distributed memory system)
    • Each processor has its own private local memory, which only that processor can access; other processors cannot access it directly
    • The address space can consist of multiple private address spaces. Each processor-memory module is essentially a separate computer

MIMD Memory Architecture: Centralized (SMP) vs. Distributed

2 classes of multiprocessors with respect to memory:

  • (1) Centralized Memory Multiprocessor (also called a symmetric multiprocessor (SMP), because the single main memory has a symmetric relationship to all processors: every processor accesses memory in exactly the same way)
    • a few dozen processor chips and cores
    • small enough to share a single, centralized memory (a much larger memory would make the hardware too complex to manage) ⇒ needs larger caches
    • can scale to a few dozen processors by using a switch and many memory banks. Although scaling beyond that is technically conceivable, it becomes less attractive as the number of processors sharing the centralized memory increases
  • (2) Physically Distributed-Memory Multiprocessor
    • a larger number of chips and cores
    • driven by bandwidth (BW) demands (memory is distributed among the processors)
The interconnection network can be Ethernet, fiber, or another high-speed network

The Flynn-Johnson classification of computer systems

The horizontal axis is the memory architecture; the vertical axis is the communication model.
(Exam point: for example, be able to explain why the classification is drawn this way.)

Typical parallel computer architectures

SMP (Symmetric Multiprocessor)

  • Centralized Memory Multiprocessors are also called SMPs because single main memory has a symmetric relationship to all processors

COW (Cluster of Workstations)

  • A computer cluster is a group of coupled computers that work together closely so that in many respects they can be viewed as though they are a single computer.
  • The components of a cluster are commonly, but not always, connected to each other through fast local area networks.
    • MPI is a widely available communications library that enables parallel programs to be written in C, Python, …

Cluster categorizations

  • High-availability (HA) clusters
    • operate by having redundant nodes, which are used to provide service when system components fail
    • e.g., failover clusters
  • Load-balancing clusters
    • operate by distributing the workload evenly over multiple back-end nodes
  • Grid/Cloud computing
    • grid computing is a technology closely related to cluster computing. The key differences are that grids connect collections of computers that do not fully trust each other or that are geographically dispersed (more like a computing utility than a single computer)
    • grids support more heterogeneous collections

MPP (Massively Parallel Processor)

  • An MPP is a large-scale parallel system built from hundreds or thousands of processors. It is used mainly for computation-dominated workloads such as scientific computing and engineering simulation, and today also widely in commercial and network applications
    • hard to develop and expensive; a symbol of national technological strength (supercomputers)
    • uses a high-performance proprietary interconnect that delivers messages with low latency and high bandwidth (the biggest difference from a COW)

Cluster vs. MPP

Architectural differences
  • (1) Cluster nodes are complete computers, which may be homogeneous or heterogeneous. Each node has its own disk and a full resident operating system, and generally enough autonomy to run on its own outside the cluster. MPP nodes usually have no disk and host only an operating-system kernel
  • (2) An MPP uses the vendor's proprietary (or patented) high-speed interconnect, attached to each node's memory bus (tightly coupled); a cluster usually uses commodity high-speed LANs or system-area networks, attached to each node's I/O bus (loosely coupled)

Interconnection Networks

Processor-to-Memory Interconnection Networks
  • Interconnection network: a network of switching elements arranged in a topology with a control scheme, connecting the nodes of a computer system to one another
  • Interconnection function: the permutation relation between the network's inputs (processors) and outputs (memory banks)

Classification of interconnection networks

  • Static networks: nodes are connected by fixed paths that remain unchanged during program execution
  • Dynamic networks: built from switching elements whose connection state can change on demand under program control (suited to large interconnects). The main dynamic networks are buses, crossbar switches, and multistage switching networks

Connecting Multiple Computers

  • Shared Media vs. Switched (“point-to-point”)

Switching scheme

  • Circuit switching
  • Packet switching

Multistage Network

  • To build large networks, crossbar switches can be cascaded into multistage interconnection networks. By controlling the individual crossbar units, any input can be connected to any output, so information at every input can be delivered to any output

Cache Coherence and Coherence Protocol

Multiprocessor cache coherence (this problem can only arise under shared-memory models)
  • Symmetric shared-memory machines usually support the caching of both shared and private data.
    • Private data are used by a single processor
      • When a private item is cached, its location is migrated to the cache, reducing the average access time as well as the memory bandwidth required. Because no other processor uses the data, the program behavior is identical to that in a uniprocessor.
    • Shared data are used by multiple processors, essentially providing communication among the processors through reads and writes of the shared data
      • When shared data are cached, the shared value may be replicated in multiple caches. In addition to the reduction in access latency and required memory bandwidth, this replication also provides a reduction in contention that may exist for shared data items that are being read by multiple processors simultaneously.
      • Caching of shared data, however, introduces a new problem: cache coherence.

What Is Multiprocessor Cache Coherence?

Cache Coherence problem

  • Because the view of memory held by two different processors is through their individual caches, the processors could end up seeing different values for the same memory location
Notice that the coherence problem exists because we have both a global state, defined primarily by the main memory, and a local state, defined by the individual caches, which are private to each processor core. Thus, in a multi-core where some level of caching may be shared (e.g., an L3), although some levels are private (e.g., L1 and L2), the coherence problem still exists and must be solved.

Coherent Memory Model

  • Definition: Reading an address should return the last value written to that address
  • This simple definition contains two different aspects:
    • (1) Coherence defines what values can be returned by a read
    • (2) Consistency determines when a written value will be returned by a read
    • Coherence and consistency are complementary: coherence defines the behavior of reads and writes to the same memory location, while consistency defines the behavior of reads and writes with respect to accesses to other memory locations.

Coherent Memory System

  • Preserve program order: if processor P writes X and then reads X, with no intervening write to X by another processor, the read always returns the value written
  • Coherent view of memory: if one processor writes X and another processor later reads X, with no other write to X in between, the read returns the written value (if a processor could continuously read an old data value, we would clearly say that memory was incoherent)
  • Write serialization: writes to the same location are serialized, so any two writes to the same location by any two processors are seen in the same order by all processors
    • For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1.

Memory Consistency Model

  • Although the three properties just described are sufficient to ensure coherence, the question of when a written value will be seen is also important.
    • To see why, observe that we cannot require that a read of X instantaneously see the value written to X by some other processor.
    • The issue of exactly when a written value must be seen by a reader is defined by a memory consistency model

Memory Consistency Model

  • (1) A write does not complete (and allow the next write to occur) until all processors have seen the effect of that write
  • (2) The processor does not change the order of any write with respect to any other memory access
    • if a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A
  • These restrictions allow the processor to reorder reads, but force it to finish writes in program order
We will rely on this assumption until we reach Section 5.6, where we will see exactly the implications of this definition, as well as the alternatives.

Basic Schemes for Enforcing Coherence

  • In a coherent multiprocessor, the caches provide both migration and replication of shared data items.
    • Migration – data can be moved to a local cache and used there in a transparent fashion ⇒ reduces both the latency and the bandwidth demand of accessing remotely allocated shared data
    • Replication – for shared data that are being simultaneously read, the caches make a copy in each local cache ⇒ reduces not only access latency but also contention for read-shared data items

Cache Coherence Protocols (HW)

  • Key to implementing a cache coherence protocol is tracking the state of any sharing of a data block. The state of any cache block is kept using status bits associated with the block, similar to the valid and dirty bits kept in a uniprocessor cache
    • (1) Directory based — the sharing status of a block of physical memory is kept in just one location, the directory (a potential bottleneck)
    • (2) Snooping — Every cache with a copy of data also has a copy of sharing status of block, but no centralized state is kept
      • All caches are accessible via some broadcast medium (a bus or switch)
      • All cache controllers monitor or snoop on the medium to determine whether or not they have a copy of a block that is requested on a bus or switch access

Snooping Coherence Protocols

Write Invalidate, Write Update

  • The cache controller “snoops” all transactions on the shared medium (bus or switch) ⇒ on any relevant transaction (i.e., the cache contains that block), it takes action to ensure coherence
    • (1) Write invalidate (exclusive access): exclusive access ensures that no other readable or writable copies of an item exist when the write occurs; all other cached copies of the item are invalidated
    • (2) Write update: update all the cached copies of a data item when that item is written ⇒ uses more broadcast-medium bandwidth ⇒ all recent MPUs use write invalidate
The discussion below focuses on write invalidate.

Basic Implementation Techniques (Write Invalidate)

Broadcast Medium Transactions (e.g., bus)

  • To perform an invalidate, the processor acquires bus access and broadcasts the address to be invalidated on the bus. All processors snoop on the bus, watching the addresses. The processors check whether the address on the bus is in their cache. If so, the corresponding data are invalidated.
  • The bus also enforces write serialization: if two processors attempt to write shared blocks at the same time, their attempts to broadcast an invalidate operation are serialized when they arbitrate for the bus.
    • One implication of this scheme is that a write to a shared data item cannot actually complete until it obtains bus access.
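The serialization argument can be sketched as a toy model (class names and structure are mine, not a real protocol implementation): every write must win the bus and broadcast an invalidate first, so simultaneous writes to one block are forced into a single order that all caches observe.

```python
class Bus:
    def __init__(self, caches):
        self.caches = caches
        self.order = []  # the one globally visible transaction order

    def broadcast_invalidate(self, writer_id, block):
        # Winning arbitration == appending here; every other cache snoops
        # the address and drops its copy of the block.
        self.order.append((writer_id, block))
        for c in self.caches:
            if c.pid != writer_id:
                c.lines.discard(block)

class Cache:
    def __init__(self, pid):
        self.pid = pid
        self.lines = set()

    def write(self, block, bus):
        bus.broadcast_invalidate(self.pid, block)  # must get the bus first
        self.lines.add(block)

caches = [Cache(0), Cache(1)]
bus = Bus(caches)
for c in caches:
    c.lines.add("X")        # both caches start with a copy of block X
caches[0].write("X", bus)   # P0 wins arbitration first
caches[1].write("X", bus)
print(bus.order)                           # [(0, 'X'), (1, 'X')]
print([sorted(c.lines) for c in caches])   # only the last writer holds X
```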

Locate up-to-date copy of data

Write-through cache

  • all written data are always sent to the memory ⇒ the up-to-date copy can always be fetched from memory
    • write-through is simpler if there is enough memory bandwidth

Write-back cache

  • Can use same snooping mechanism:
    • supply value: Snoop every address placed on the bus. If a processor has dirty (newest) copy of requested cache block, it provides it in response to a read request and aborts the memory access
  • Write-back needs lower memory bandwidth ⇒ supports larger numbers of faster processors ⇒ most multiprocessors use write-back

An Example Protocol (Snoopy, Invalidate)

  • A snooping coherence protocol is usually implemented by incorporating a finite-state controller in each node

Write-through Cache Protocol

  • 2 states per block in each cache (Valid / Invalid)

Write Back Cache Protocol

  • Each cache block is in one state:
    • Shared: the block can be read
    • Exclusive: this cache has the only copy; it is writeable and dirty
    • Invalid: the block contains no data (as in a uniprocessor cache)

Write-Back State Machine - CPU

  • State machine for CPU requests for each cache block
Writes to clean blocks are treated as misses (a signal is sent on the bus to invalidate other copies)

Write-Back State Machine- Bus request

  • State machine for bus requests for each cache block

Write-back State Machine-III

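Since the state-machine figures did not survive, here is a minimal sketch of the combined machine for one cache block, using the three states above; the transitions are my reading of the protocol (a write seen on the bus invalidates; a snooped read of an Exclusive block forces a write-back and a downgrade to Shared):

```python
INVALID, SHARED, EXCLUSIVE = "Invalid", "Shared", "Exclusive"

def on_cpu(state, op):
    # Requests from this block's own CPU ("read" or "write").
    if op == "read":
        return SHARED if state == INVALID else state  # read miss -> Shared
    if op == "write":
        return EXCLUSIVE  # writes to clean blocks are misses: invalidate others
    raise ValueError(op)

def on_bus(state, op):
    # Transactions snooped from other processors.
    if op == "write":  # another CPU's write miss / invalidate for this block
        return INVALID
    if op == "read" and state == EXCLUSIVE:
        return SHARED  # supply the dirty copy, write back, downgrade
    return state

state = INVALID
state = on_cpu(state, "read")    # -> Shared
state = on_cpu(state, "write")   # -> Exclusive
state = on_bus(state, "read")    # -> Shared (we supplied the dirty block)
print(state)  # Shared
```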
