amazon’s dynamo [9] and facebook’s cassandra [13], relax the consistency model,and offer only eventual consistency.
others such as hbase [1] and bigtable [4] offer strong consistency only for operations touching a single partition, but not across the database as a whole.
dynamo和cassandra都是基于vector clocks和gossip算法,強調高可用性,可以達到eventual consistency
而hbase,可以提供單分區的強一緻性,但對于跨分區,或跨資料庫,就無法保證;
for example, in a geo-replicated service, a user may first be routed to a local datacenter and perform some action such as sending an email message.
if they then reload their browser and are routed to another datacenter backend, they would still expect to see their sent message in their email outbox.
這個例子比較典型,要達到異地的一緻性
一般要解決分布式問題的一緻性問題的最簡單的辦法就是用synchronized clocks across all servers,但這基本是達不到的
the traditional approach to global consistency has been to use logical clocks.
logical clocks, such as lamport clocks [15] or vector clocks [10, 20].
是以傳統的做法就是用邏輯時鐘
<a href="https://www.zhihu.com/question/19994133">https://www.zhihu.com/question/19994133</a>
貼個回答,寫的不錯
一、就是現在以leslie lamport 提出的邏輯時鐘的方法,重新定義了一種分布式的全序關系。vector clock就是以邏輯時鐘為基礎上發展而來的。 二、使用實體時鐘,就是google spanner使用gps 加原子鐘進行時間同步,不同節點之間的時間是不能精确同步的,隻能把不同節點之間的時間同步到一個範圍之内,spanner一個節點上獲得一個時間戳,不是一個時間值,而是一個時間值加上一個誤差範圍,也就是一個時間窗,如果兩個時間窗有重合,就無法比較大小,進而無法比較出來時間窗代表的事件的先後關系。而時間同步算法就是讓分布式各節點之間的時間的誤差盡量的小,這樣取時間戳這個動作效率是最高的,隻在本節點取一下實體時間即可,但是為了同步時間,各個節點肯定在周期與其它節點進行通訊來同步實體時間(google時間同步的算法好像沒有開源)。時間同步算法隻能讓時間盡量的精确。分布式資料庫裡面是不能直接使用ntp來同步時間的,因為ntp時間同步協定裡面允許節點的時間能夠回退,而資料庫裡面要求本地的時鐘必須是遞增的。 三、還有一種混合邏輯時鐘(logical physical clocks),也是由邏輯時鐘演變而來,但是時鐘的值比較接近于實體時鐘,而且不依賴于下面實體時鐘的同步的算法,它的時間戳都可以在本地取,而且取出來是一個時間值,而不像spanner一樣,取出來是一個時間窗。詳細可以參見<b>logical physical clocks </b><b>and consistent snapshots in globally distributed databases</b>這篇論文
關于lamport或向量時鐘,
lamport時鐘就是邏輯時鐘,即不依賴于實體時鐘,而依賴event之間的因果關系來定義的一種偏序
而vector時鐘是,邏輯時鐘的一種實作,具體的邏輯看下文,
<a href="http://www.cnblogs.com/fxjwind/archive/2012/06/06/2537963.html">why vector clock are easy or hard?</a>
摘自上面的知乎回答
“dynamo paper是七、八年前寫的了,現在的amazon dynamo早已摒棄了version vector,而采用了synchronous replication(類似paxos的protocol),每一個partition會有一個leader負責寫入。其實version vector并不scale,因為對于一個key來說,随着writer數量的增加,version vector數量成指數級的增長”
邏輯時鐘的缺點,
first, they require that all clients must propagate clock data to achieve consistent views,
and second, the assigned timestamps have no relation to physical time.
spanner introduced commit-wait, a way of ensuring physical-time based consistent global state by forcing operations to wait long enough so that all participants agree that the operation’s timestamp has passed based on worst case synchronization error bounds.
while innovative, the system performance becomes highly dependent on the quality of the time synchronization infrastructure, and thus may have unacceptable performance absent specialized hardware such as atomic clocks and gps receivers.
spanner利用原子鐘和gps接收器,實作了一個較為精确的時鐘,這個時鐘叫做truetime,每次調用truetime api傳回的是一個時間區間,而不是一個具體的值,這個truetime保證的是真實時間(absolute time/real time)一定在這個區間内,這個區間範圍通常大約14ms,甚至更小。
是以隻有每個transaction的時間區間,都不重合,那麼就可以比較容易的做到全局有序;
是以這裡需要commit-wait,

可以看到,我在開始transaction的時候,取一次true time,[t1.earliest, t1.latest]
那麼什麼時候,我可以把這個transaction commit掉,即釋放掉鎖,即别人可以開始新的transaction
當我再取一次true time,[t2.earliest, t2.latest],當t2.earliest > t1.latest的時候,即可
因為這樣兩個transaction的時間區間就不可能重疊了,當然代價就是每個transaction都有等待大概2個error bound的時間
當然,spanner的實作依賴硬體的infra裝置,普通使用者無法複制;
hybridtime(ht), a hybrid between physical and logical clocks and we show how ht can be used to achieve the same semantics as spanner, but with good performance even with commonly available time synchronization.
like spanner, our approach relies on physical time measurements with bounded error to assign hybridtime timestamps to events that occur in the system.
however, unlike spanner, our approach does not usually require the error to be waited out, thus allowing for usage in common deployment scenarios where clocks are synchronized through common protocols such as the network time protocol, in which clock synchronization error is often higher than with spanner’s truetime.
the trade-off is that, in order to avoid commit-wait, hybridtime requires that timestamps be propagated across machines to achieve the same consistency semantics as spanner.
contrary to vector clocks, which can expand as the number of participants in the cluster grows, hybridtime timestamps have constant and small size.
hybridtime說是physical and logical clocks的結合,和spanner一樣,也可以使用帶error bound的實體時間
但由于他沒有spanner的硬體設施,是以做不到低延遲的時間同步,是以他依然使用邏輯時鐘來達到一緻性;并且他解決了vector clocks随着client的數量增加而會size膨脹的問題
hybridtime clocks follow similar update rules to lamport clocks, but the time values are not purely logical:
each time value has both a logical component, which helps in guaranteeing the same properties as a lamport clocks, and a physical component which allows the event to be associated with a physical point-in-time.
one of spanner’s key benefits is that is externally consistent, which is defined as fully linearizable, even in the presence of hidden channels.
additionally we use the term globally consistent to describe a system which provides the same linearizability semantics, provided that there are no hidden channels present.
區分為,是否有hidden channels?意思是clients間存在因果或先後關系,比如通過發message,都是這個因果關系對于系統是不可知的,比如a write,然後發送消息給b,然後b write,那麼因果關系上,b write一定後于a write,但是對于資料庫而言他并不知道,是以并不能保證a write先寫入;
spanner是可以保證externally consistent, 由于他通過 commit wait,來保證每個transaction之間一定是全局有序的
相對的globally consistent,更簡單些,因為并沒有hidden channels
spanner’s key innovation is that timestamps assigned by the system can be used to achieve external consistency, but also have physical meaning.
hybridtime is always globally consistent, and through selective application of commit-wait is externally consistent.
spanner的主要創新在于實作external一緻性的,同時還保留了實體時間的
而hybridtime,預設情況下是可以實作globally consistent,即偏序,因為他本身就是使用lamport時鐘,并且當你選擇commit-wait時,也可以保證externally consistent;
hybridtime assumptions
1. hybridtime assumes that machines have a reasonably accurate physical clock, represented by the pci(e) function, which outputs the numeric timestamp returned by the physical clock as read by process i for event e, that is able to provide absolute time measurements (usually in milli- or microseconds since 1 january 1970).
2. keeps the physical clocks across different servers synchronized with regard to a reference server, the “reference” time, represented by the pcref(e) function which outputs the numeric timestamp returned by the “reference” process for event e;
additionally, we assume that such a substrate is able to provide an error bound along with each time measurement, denoted by the ei(e) function, which outputs the numeric value ε error of process i at the time e occurred
3. we make no assumptions on the actual accuracy of the clocks, i.e. the physical timestamps returned by server’s clocks may have an arbitrarily large but finite error, as long as this error’s bound is known
說白了,假設
有個相對靠譜的實體時鐘,一個理想的參照時鐘,以及他們之間相差的,error bound
最後,我們并不假設這個error bound會很小,隻要有限就可以
假設1,我們有有限的error bound
假設2, 程序級别的實體時鐘單調遞增,注意是程序級别
基于以上假設,提出hybridtime clock and protocol
hybridtime clock and protocol
hybridtime clock(htc) is a pair (physical,logical) where the first component is a representation of the physical time at which the event occurred and the second component is a logical sequence number.
定義其實顯而易見。。。
algorithm 2 depicts the htc algorithm.
htc算法如上,兩部分,now和update
now就是取目前的hybridtime clock
upate就是根據in 來更新目前的clock
algorithm 2 implements a lamport clock, with the additional advantage that generated timestamps have physical meaning and are accurate representations of physical time within a bound error.
算法本身,其實就是lamport clock,隻是增加了physical clock的部分
給出一個例子,
to order the events timestamped using the hybridtime clock algorithm we use definition 1.
definition 1. hct(e) < hct( f ) is defined as the lexicographical ordering of the timestamp two-tuple (physical,logical)
theorem 1. the hybritime clock happened-before relation forms a total order of events
theorem 2. for any event in a causal chain f , the physical component of a htc timestamp approximates the “real” time the event occurred, with a error defined and bounded by
implementation
no consistency - in this mode there are no external consistency guarantees, transactions are assigned timestamps from each server’s physical clock and no guarantee is made that reads are consistent or repeatable.
直接用local的實體時間,不保證一緻性
hybridtime consistency - in this mode our implementation guarantees the global consistency as spanner, absent hidden channels, but using hybridtime instead of commit-wait.
clients choosing this consistency mode on writes must make sure that the timestamp that is received from the server is propagated to other servers and/or clients.
within the same client process, timestamps are automatically propagated on behalf of the user.
這個其實和邏輯時鐘,沒有差別,就是保證偏序
commit-wait consistency - in this mode our implementation guarantees the same external consistency semantics as spanner by also using commit-wait in the way described in the original paper.
however instead of using truetime, which is a proprietary and private api, we implemented commit-wait on top of the widely used network time protocol (ntp). hence, in this consistency mode we support hidden channels.
這個估計沒啥用,在沒有spanner硬體保證的情況下,commit-wait,性能不能忍吧