Under this model, each data partition has a single master and multiple slaves.
All update requests have to go to the master, where the update is applied and then asynchronously propagated to the slaves. Notice that there is a window of data loss if the master crashes before it propagates the update to any slave, so some systems wait synchronously for the update to be propagated to at least one slave.
Read requests can go to any replica if the client can tolerate some degree of staleness; this is how the read workload gets distributed among many replicas. If the client cannot tolerate staleness for certain data, those reads also need to go to the master.
The master-slave model works very well in general when the application has a high read/write ratio. It also works well when updates are spread evenly across the key range. This is why it is the predominant model of data replication.
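To make the routing rules above concrete, here is a minimal sketch of a master-slave partition with asynchronous propagation. The class and method names are hypothetical, and in-memory dicts stand in for real storage and replication:

```python
import random

class MasterSlavePartition:
    """Illustrative sketch: one master, several slaves, asynchronous propagation."""

    def __init__(self, num_slaves=3):
        self.master = {}                      # authoritative copy
        self.slaves = [{} for _ in range(num_slaves)]
        self.pending = []                     # updates not yet propagated to the slaves

    def write(self, key, value):
        # All writes go to the master first; propagation happens later.
        self.master[key] = value
        self.pending.append((key, value))

    def propagate(self):
        # Asynchronous replication: apply the queued updates to every slave.
        for key, value in self.pending:
            for slave in self.slaves:
                slave[key] = value
        self.pending.clear()

    def read(self, key, tolerate_stale=True):
        if tolerate_stale:
            # Stale reads may hit any slave, spreading the read load.
            return random.choice(self.slaves).get(key)
        # Strongly consistent reads must go back to the master.
        return self.master.get(key)

p = MasterSlavePartition()
p.write("x", 1)
print(p.read("x", tolerate_stale=False))  # 1, served by the master
print(p.read("x"))                        # possibly None until propagate() runs
p.propagate()
print(p.read("x"))                        # 1, now visible on the slaves
```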
The master-slave model is the most traditional and simplest model.
Writes: all write operations go through the master, which returns as soon as its own write succeeds and then asynchronously propagates the update to each slave node. For stronger reliability, the master can also wait until it has propagated the update to at least one slave before returning.
Reads: if stale data is acceptable, read from any node; if not, all reads must also go through the master.
Drawbacks: the master is a single point of failure and can become overloaded.
Workarounds: see Google's designs, GFS and Bigtable.
Since the master-slave model is fairly mature and simple, the hard problem, and the hot research topic of recent years, lies elsewhere: in a distributed, decentralized design with no fixed master, how do we guarantee consistency?
The two-phase commit (2PC) protocol
The traditional 2PC protocol is used to guarantee the atomicity of distributed transactions: data stored across multiple nodes must either all be updated successfully or all fail together.
Therefore, in the first phase the coordinator sends a prepare request to make sure every data replica is currently ready for the update; only after receiving responses from all replicas does it begin the second phase, the actual commit.
This is more involved than the master-based approach: it is not enough for the master alone to agree; every node must agree before the commit can proceed.
To provide "strict consistency", we can use a traditional 2PC protocol to bring all replicas to the same state at every update.
Let's say there are N replicas of a piece of data. When the data is updated, there is a "prepare" phase in which the coordinator asks every replica to confirm whether it is ready to perform the update. Each replica writes the data to a log file and, on success, responds to the coordinator. After gathering positive responses from all replicas, the coordinator initiates the second, "commit" phase and asks every replica to commit. Each replica then writes another log entry to confirm the update.
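A minimal sketch of this flow, assuming a hypothetical Replica class whose in-memory log list stands in for the real on-disk log file:

```python
class Replica:
    """Illustrative replica: logs a pending update, then commits or aborts it."""

    def __init__(self, name):
        self.name = name
        self.log = []          # stand-in for the on-disk log file
        self.data = {}
        self.pending = None

    def prepare(self, key, value):
        self.log.append(("prepare", key, value))   # first log write
        self.pending = (key, value)
        return True                                # vote "yes, ready to commit"

    def commit(self):
        key, value = self.pending
        self.log.append(("commit", key))           # second log write confirms the update
        self.data[key] = value
        self.pending = None

    def abort(self):
        self.log.append(("abort",))
        self.pending = None

def two_phase_commit(replicas, key, value):
    # Phase 1 ("prepare"): every replica must vote yes; any "no" aborts the update.
    votes = [r.prepare(key, value) for r in replicas]
    if not all(votes):
        for r in replicas:
            r.abort()
        return False
    # Phase 2 ("commit"): only after all positive votes does the coordinator commit.
    for r in replicas:
        r.commit()
    return True

replicas = [Replica(f"r{i}") for i in range(3)]
print(two_phase_commit(replicas, "x", 1))   # True; every log now holds prepare + commit
```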
Notice that there are scalability issues, as the coordinator has to wait synchronously for quite a few network round trips and disk I/Os to complete.
On the other hand, if any one of the replicas crashes, the update fails. The more replicas there are, the higher the chance that at least one of them is down, so replication hurts availability rather than helping it. This makes traditional 2PC an unpopular choice for high-throughput transactional systems.
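A quick back-of-the-envelope calculation shows the effect; the 1% per-replica failure probability is an assumed figure used only for illustration:

```python
# Probability that a 2PC update succeeds when all N replicas must be up,
# assuming independent failures with per-replica failure probability p (hypothetical value).
p = 0.01
for n in (1, 3, 5, 10):
    success = (1 - p) ** n
    print(f"N={n:2d}: update succeeds with probability {success:.3f}")
# N= 1: 0.990, N= 3: 0.970, N= 5: 0.951, N=10: 0.904 -- more replicas, lower availability.
```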
The biggest problem with the 2PC protocol is that it does not account for node failures: the failure of any single node causes the protocol to block.
The obvious instinct is that the blocking problem can be solved with a timeout, but it is not that simple: the heart of the problem is that you cannot distinguish a process that has died from one that is merely running extremely slowly, which makes failure handling in an asynchronous system nearly impossible.
For an asynchronous system, distributed consensus cannot be guaranteed even if only a single process may fail; this is the famous FLP result.
People came to see that a distributed algorithm has two properties, safety and liveness: 2PC is very strong on safety but lacks liveness.
With uniform consensus, all processes must agree on a value, even the faulty ones: a transaction should only commit if all RMs (resource managers) are prepared to commit. Most forms of consensus are only concerned with having the non-faulty processes agree, so uniform consensus is harder than general consensus.
My personal understanding: this means agreement can still be reached when nodes or processes fail, without the blocking that 2PC suffers from.
Paxos is used to solve the uniform consensus problem.
The core of Paxos is quorum-based 2PC. Since a distributed environment cannot require every node to respond normally,
Paxos only needs a majority to respond in order to reach a decision, thereby avoiding the blocking caused by the failure of any single node.
But then the question is: how do the nodes that did not respond (because of failures, network problems, and so on) stay consistent?
The answer is that reaching any decision requires acceptance by a majority, and any two majority sets must intersect in at least one node.
Furthermore, any node can accept only one proposal (unless it carries the same value), so once a decision has been reached, no new decision with a different value can be reached, even if some nodes fail.
Thus even after a failed node wakes up, it can simply learn the decision from the other majority nodes and stay consistent.
This is why it is called quorum-based 2PC; in essence it comes down to R + W > N.
Also, when a majority response cannot be obtained within some period of time, the proposer can abandon the current proposal at any point and issue a new proposal with a higher number, which further avoids blocking.
Traditional 2PC is just a special case of Paxos (where W = N and R = 1).
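A tiny sketch of the quorum arithmetic behind this argument (illustrative only; it brute-forces the overlap check rather than relying on R + W > N):

```python
from itertools import combinations

def quorums_intersect(n, w, r):
    """Check that every write quorum of size w overlaps every read quorum of size r."""
    nodes = range(n)
    return all(set(wq) & set(rq)
               for wq in combinations(nodes, w)
               for rq in combinations(nodes, r))

# R + W > N guarantees overlap, so any read quorum contains at least one node
# that accepted the latest committed write.
print(quorums_intersect(5, 3, 3))   # True  (3 + 3 > 5: majority quorums)
print(quorums_intersect(5, 3, 2))   # False (3 + 2 = 5: the quorums can miss each other)
print(quorums_intersect(5, 5, 1))   # True  (traditional 2PC: W = N, R = 1)
```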
A more efficient way is to use quorum-based 2PC (e.g., Paxos).
In this model, the coordinator only needs to update W replicas (rather than all N replicas) synchronously. The coordinator still writes to all N replicas but waits for positive acknowledgments from only W of the N to confirm. This is much more efficient from a probabilistic standpoint.
As you can see, quorum-based 2PC can be considered a general 2PC protocol in which the traditional 2PC is the special case W = N and R = 1. The general quorum-based model lets us pick W and R according to our trade-off between the read and write workload ratio.
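A minimal sketch of such a quorum write: the coordinator sends the update to all N replicas in parallel but confirms as soon as any W of them acknowledge. The replica objects and their write() method (returning True on acknowledgment) are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def quorum_write(replicas, key, value, w):
    """Write to all N replicas, but confirm once any W of them acknowledge."""
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(r.write, key, value) for r in replicas]
    acks = 0
    try:
        for future in as_completed(futures):
            try:
                ok = future.result()
            except Exception:
                ok = False               # a crashed or unreachable replica is a missing ack
            if ok:
                acks += 1
                if acks >= w:
                    return True          # quorum reached; the remaining writes keep running
    finally:
        pool.shutdown(wait=False)        # do not block on slow or failed replicas
    return False                         # fewer than W replicas acknowledged
```

Reads work symmetrically: query R replicas and take the value with the latest version, which R + W > N guarantees includes the most recent committed write.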
This article is excerpted from cnblogs; original publication date: 2013-04-03.