Table of Contents

Overview
Primary-Backup Replication
- Failure recovery
- Replication strategies
  - Replicated state machine
    - Challenges
    - Replication levels
VM FT
- Overview
- Fault-tolerance mechanism
  - Time Interrupts
  - Network Packets
  - non-deterministic instructions
  - output

Overview

These are course notes for MIT 6.824, Lecture 4.

Primary-Backup Replication

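Failure recovery

Distributed systems commonly face the following kinds of failures:

fail-stop: a server stops working for some reason
Bugs: the software or hardware contains bugs
environment: earthquakes, power outages

Replication handles only the first and third kinds, and the third only when the replicas fail independently, for example because they sit in different locations. It cannot mask bugs, because every replica runs the same faulty logic.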

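Replication strategies

There are two common replication strategies.

State transfer: the primary executes the service and periodically sends a copy of its state to the backup.
Replicated state machine: clients send operations to the primary; the primary forwards the operations, together with their execution order, to the backup; the backup executes the same operations as the primary.

State transfer is the simpler strategy, but shipping the state consumes a lot of resources. A replicated state machine transfers only a small amount of data, but keeping the primary's and the backup's execution consistent requires more complex machinery.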
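The contrast between the two strategies can be made concrete with a minimal sketch. The `Counter` service and the helper names `state_transfer` and `replicate_ops` are hypothetical, introduced only for illustration; the sketch assumes the service is fully deterministic.

```python
import copy


class Counter:
    """A tiny deterministic service (hypothetical): named integer counters."""

    def __init__(self):
        self.state = {}

    def apply(self, op):
        # An operation is a (name, delta) pair; applying it is deterministic.
        name, delta = op
        self.state[name] = self.state.get(name, 0) + delta
        return self.state[name]


def state_transfer(primary, backup):
    # State transfer: ship a full copy of the primary's state.
    backup.state = copy.deepcopy(primary.state)


def replicate_ops(backup, ops):
    # Replicated state machine: ship only the operations, in order.
    for op in ops:
        backup.apply(op)


ops = [("x", 1), ("y", 5), ("x", 2)]

primary = Counter()
for op in ops:
    primary.apply(op)

backup_a = Counter()
state_transfer(primary, backup_a)

backup_b = Counter()
replicate_ops(backup_b, ops)

print(backup_a.state == primary.state, backup_b.state == primary.state)
```

Both backups end up with the primary's state. The first pays with the size of the state, the second with the requirement that every replica applies the same operations in the same order.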

Replicated state machine

Challenges

This replication approach usually has to answer the following questions:

What state to replicate?
Does primary have to wait for backup?
When to cut over to backup?
Are anomalies visible at cut-over?
How to bring a replacement backup up to speed?

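Replication levels

application level: as in GFS, replicate only application-specific data, such as database tables; very efficient.
machine level: replicate every change that happens on the server, including RAM, registers, and interrupts; more complex.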

VM FT

Overview

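The primary forwards every external input it receives to the backup VM, which keeps the two consistent.
This information travels as log entries over the logging channel.
The primary and the backup share an external disk server.
Only the primary communicates with the disk server; the backup's output is discarded by the VMM.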

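Fault-tolerance mechanism

In VM FT, the following events can cause the primary and the backup to diverge:

External inputs (e.g., network packets), which are usually handled with DMA plus an interrupt
Timer interrupts
Instructions whose result is not a function of the VM's state, such as reading the current time or the current device ID
Multi-core parallelism (out of scope; these notes assume a single core throughout)

Divergence between the primary and the backup can cause very serious problems. Suppose the VM runs the GFS master, and a primary chunkserver asks to renew its lease just before the 60 s lease expires. In the primary VM, the timer interrupt arrives after the renew-lease message, so the master renews the lease. If, in the backup VM, the timer interrupt arrives before the renew-lease message, the lease expires there. If the primary VM then fails and the backup VM takes over, it believes there is no primary chunkserver and grants the lease again, which results in split brain.

Therefore, the backup VM and the primary VM must see every event in the same order and at the same point in the CPU instruction stream.

A log entry in VM FT may contain the following fields: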

instruction sequence number
type
data
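As a rough illustration, the three fields could be modeled as below. The class name `LogEntry`, the `EntryType` values, and the example payloads are assumptions made for this sketch, not the actual VM FT log format.

```python
from dataclasses import dataclass
from enum import Enum


class EntryType(Enum):
    # Hypothetical event kinds, matching the cases discussed in these notes.
    TIMER_INTERRUPT = "timer_interrupt"
    NETWORK_PACKET = "network_packet"
    NONDET_RESULT = "nondet_result"


@dataclass(frozen=True)
class LogEntry:
    instruction_number: int  # where in the instruction stream the event occurred
    type: EntryType          # what kind of event it was
    data: bytes = b""        # payload: packet bytes, instruction result, ...


log = [
    LogEntry(1200, EntryType.NETWORK_PACKET, b"renew lease"),
    LogEntry(1500, EntryType.TIMER_INTERRUPT),
]

# The backup replays entries in instruction order.
for entry in sorted(log, key=lambda e: e.instruction_number):
    print(entry.instruction_number, entry.type.value, entry.data)
```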

Time Interrupts

The following shows how VM FT handles a timer interrupt.

Primary:

FT fields the timer interrupt
FT reads instruction number from CPU
FT sends “timer interrupt at instruction X” on logging channel
FT delivers interrupt to primary, and resumes it (this relies on CPU support to interrupt after the X’th instruction)

Backup:

ignores its own timer hardware
FT sees log entry before backup gets to instruction X
FT tells CPU to interrupt (to FT) at instruction X
FT mimics a timer interrupt to backup
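The steps above can be sketched with a toy guest, reusing the lease scenario from earlier. Everything here is a simplified assumption: `run_guest` is a hypothetical stand-in for a VM, instruction numbers are small integers, and the "handler" is one line of Python rather than real interrupt delivery.

```python
def run_guest(timer_fires_after, total_instructions=10, renew_at=5):
    """Toy guest (hypothetical): processes a renew-lease request at
    instruction `renew_at`; the timer handler expires the lease unless
    the renewal has already been processed."""
    renewed = False
    lease_valid = True
    for pc in range(1, total_instructions + 1):
        if pc == renew_at:
            renewed = True
        if pc == timer_fires_after:
            # The interrupt is delivered right after instruction `pc`.
            lease_valid = renewed
    return lease_valid


# Primary: FT fields the interrupt, reads the instruction number, logs it.
log_entry = {"instruction": 7, "type": "timer"}
primary_lease = run_guest(timer_fires_after=log_entry["instruction"])

# A backup that used its own timer hardware could fire elsewhere and diverge.
naive_backup_lease = run_guest(timer_fires_after=3)

# A backup that replays the log entry interrupts at the same instruction X.
replayed_backup_lease = run_guest(timer_fires_after=log_entry["instruction"])

print(primary_lease, naive_backup_lease, replayed_backup_lease)  # True False True
```

Only the replaying backup agrees with the primary, which is why the backup ignores its own timer hardware.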

Network Packets

The following shows how VM FT handles a network packet.

Primary:

FT tells NIC to copy packet data into FT’s private “bounce buffer”
At some point NIC does DMA, then interrupts
FT gets the interrupt
FT pauses the primary
FT copies the bounce buffer into the primary’s memory
FT simulates a NIC interrupt in primary
FT sends the packet data and the instruction # to the backup

Backup:

FT gets data and instruction # from log stream
FT tells CPU to interrupt (to FT) at instruction X
FT copies the data to backup memory, simulates NIC interrupt in backup

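The bounce buffer is a buffer private to FT. Because the NIC does its DMA into this buffer rather than directly into guest memory, the input data appears in memory at the same point in time on the primary VM and on the backup VM.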
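A minimal sketch of why the copy point matters, under heavy simplification: `run_guest` is a hypothetical toy guest that reads its packet buffer at a fixed instruction, and `copy_after` stands for the instruction after which the packet data lands in guest memory.

```python
def run_guest(copy_after, packet, total_instructions=6, read_at=4):
    """Toy guest (hypothetical): reads its packet buffer at `read_at`.
    The packet is placed in guest memory right after `copy_after`."""
    guest_memory = b""
    observed = None
    for pc in range(1, total_instructions + 1):
        if pc == read_at:
            observed = guest_memory
        if pc == copy_after:
            guest_memory = packet
    return observed


packet = b"increment"

# Primary: FT pauses the guest after instruction 2, copies the bounce buffer
# into guest memory, and logs the packet data with that instruction number.
log_entry = (2, packet)
primary_view = run_guest(copy_after=log_entry[0], packet=log_entry[1])

# Backup: FT copies the logged data at the same instruction number.
backup_view = run_guest(copy_after=log_entry[0], packet=log_entry[1])

# Without a bounce buffer, DMA could land at an uncontrolled point, e.g. after
# the guest has already read the buffer.
uncontrolled_view = run_guest(copy_after=5, packet=packet)

print(primary_view == backup_view, uncontrolled_view)  # True b''
```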

non-deterministic instructions

The following shows how VM FT handles non-deterministic instructions.

Primary:

FT sets up the CPU to interrupt if primary executes such an instruction
FT executes the instruction and records the result
sends result and instruction # to backup

Backup:

FT reads log entry, sets up for interrupt at instruction #
FT then supplies value that the primary got
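The record-and-supply pattern above can be sketched as follows, using a clock read as the non-deterministic instruction. `PrimaryFT`, `BackupFT`, and `guest` are hypothetical names for this illustration; a real hypervisor traps the instruction in hardware rather than through a method call.

```python
import time


class PrimaryFT:
    """Executes the non-deterministic instruction and records its result."""

    def __init__(self):
        self.instruction = 0
        self.log = []

    def read_clock(self):
        self.instruction += 1
        value = time.time()  # the real, non-deterministic result
        self.log.append((self.instruction, value))
        return value


class BackupFT:
    """Never runs the instruction itself; supplies the primary's value."""

    def __init__(self, log):
        self.instruction = 0
        self.pending = list(log)

    def read_clock(self):
        self.instruction += 1
        logged_instruction, value = self.pending.pop(0)
        # The logged value must be consumed at the same instruction number.
        assert logged_instruction == self.instruction
        return value


def guest(vm):
    # The same guest code runs on both sides.
    start = vm.read_clock()
    end = vm.read_clock()
    return (start, end)


primary = PrimaryFT()
primary_result = guest(primary)
backup_result = guest(BackupFT(primary.log))

print(primary_result == backup_result)  # True
```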

output

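Both the primary VM and the backup VM produce output, but only the primary VM's output takes effect; the backup VM's output is discarded.

Consider a DB server to make VM FT's output mechanism concrete. Suppose a DB server runs on the primary VM, stores the value 10, and supports an increment operation. A client sends an increment to the primary VM. The primary forwards the input to the backup VM, updates its own value to 11, and returns the result to the client. The backup VM receives the log entry, also executes the increment, changes its own value to 11, and produces output, which the hypervisor discards.

Problem: suppose the primary crashes right after sending the output, and the logging channel fails so that the backup VM never receives the log entry. The backup VM takes over, but the value in its memory is 10 rather than 11, which is inconsistent with what the client has already seen.

Solution: the output rule. The primary must receive the backup's ack before it produces output.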
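A minimal sketch of the output rule, using the DB server example. The `Primary` and `Backup` classes, the `channel_up` flag, and the synchronous ack are all simplifying assumptions; in this toy model a withheld output is represented by returning `None`.

```python
class Backup:
    def __init__(self):
        self.value = 10
        self.log = []

    def receive(self, entry):
        # Buffer the log entry and acknowledge it.
        self.log.append(entry)
        return True

    def replay(self):
        # Apply every buffered entry, as the backup does before going live.
        for entry in self.log:
            if entry == "increment":
                self.value += 1
        self.log.clear()


class Primary:
    def __init__(self, backup, channel_up=True):
        self.value = 10
        self.backup = backup
        self.channel_up = channel_up  # hypothetical switch to simulate a broken channel

    def handle_increment(self):
        acked = self.channel_up and self.backup.receive("increment")
        self.value += 1
        if not acked:
            return None  # output rule: no ack, so the output is withheld
        return self.value


# Normal case: the backup acks, the client sees 11, and the backup reaches 11.
backup = Backup()
reply = Primary(backup).handle_increment()
backup.replay()
print(reply, backup.value)  # 11 11

# Broken channel: the client never sees 11, so a backup at 10 is still consistent.
lagging_backup = Backup()
withheld = Primary(lagging_backup, channel_up=False).handle_increment()
lagging_backup.replay()
print(withheld, lagging_backup.value)  # None 10
```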

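Scenario 1: the primary fails before receiving the ack

FT flow: the backup VM becomes the primary once it has replayed the last log entry, and sends the output to the client; communication proceeds normally.

Scenario 2: the primary fails after sending the output

FT flow: after taking over, the backup VM produces the output again, so the output is emitted twice. For a TCP connection, the backup has the same state as the primary and therefore uses the same sequence numbers, so the second copy is handled as a duplicate packet. For disk reads and writes, the repeated write overwrites the same location. Neither case causes a problem.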
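Why duplicated output is harmless can be sketched with a toy receiver and a toy disk. This is not real TCP: the hypothetical `Receiver` accepts only the next in-order segment and drops everything else, which is enough to show the duplicate being discarded.

```python
class Receiver:
    """Toy in-order receiver (hypothetical), keyed on sequence numbers."""

    def __init__(self):
        self.next_seq = 0
        self.delivered = []

    def on_segment(self, seq, payload):
        if seq != self.next_seq:
            return False  # already seen (or out of order in this toy): drop
        self.delivered.append(payload)
        self.next_seq += len(payload)
        return True


client = Receiver()
reply = (0, b"value=11")  # same sequence number on primary and backup

accepted_from_primary = client.on_segment(*reply)
accepted_from_backup = client.on_segment(*reply)  # resent after takeover

print(accepted_from_primary, accepted_from_backup, client.delivered)

# Disk writes: repeating the same write to the same block is idempotent.
disk = {}
disk[42] = b"value=11"  # written by the primary
disk[42] = b"value=11"  # written again by the backup after takeover
print(disk)
```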
