ZooKeeper Principles and Using Curator

I have recently been planning to build distributed cluster state consistency control on top of ZooKeeper. I did not understand ZooKeeper's internals well, so this was a good opportunity to learn. I found a few articles online and paste them here first; once I have read the official documentation thoroughly I will come back and add my own notes.

----------------------------------------------------------------------

Recently I built rule management and cluster management for our company's risk-control system on top of ZK, and gained a much deeper understanding of ZK and Curator. Below is a record of the pitfalls I ran into.

1. Curator has two notification mechanisms: one wraps ZooKeeper's native watcher, and the other is Curator's own listener. The pitfalls:

  a. A listener only receives events from a client in the same thread; it does not work across threads or processes, and the operation must be issued with inBackground() for the listener to fire.

  b. A watcher wraps ZooKeeper's native watcher and does work across processes, but note that it will not fire when the operation is issued with inBackground().

2. The ZooKeeper watcher defines four node events (plus None):

public enum EventType {
    None(-1),
    NodeCreated(1),
    NodeDeleted(2),
    NodeDataChanged(3),
    NodeChildrenChanged(4);
}

Another pitfall: how do you get the event you actually want?

  a. To watch for NodeCreated, NodeDeleted, or NodeDataChanged, use checkExists or getData. checkExists is recommended, because getData fails with an error if the node has not been created yet.

  b. To watch for NodeChildrenChanged you can only use getChildren, but note that it does not watch nested descendants: watching /test/1 will not see changes under /test/1/2/3, only changes to /test/1/2. Also, the path reported in every event is always the path you watched; do not expect it to tell you which child changed.

Here are some useful references:

http://blog.csdn.net/lzx1104/article/details/6968802

http://liuqunying.blog.51cto.com/3984207/1407455

http://www.ibm.com/developerworks/cn/opensource/os-cn-zookeeper/

https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html#sc_advancedConfiguration

https://github.com/Netflix/curator

The ZooKeeper Data Model

ZooKeeper has a hierarchical name space, much like a distributed file system. The only difference is that each node in the namespace can have data associated with it as well as children. It is like having a file system that allows a file to also be a directory. Paths to nodes are always expressed as canonical, absolute, slash-separated paths; there are no relative references. Any unicode character can be used in a path subject to the following constraints:

A ZK node can hold data as well as child nodes; the Unicode characters not supported in paths are as follows:

  • The null character (\u0000) cannot be part of a path name. (This causes problems with the C binding.)
  • The following characters can't be used because they don't display well, or render in confusing ways: \u0001 - \u001F and \u007F - \u009F.
  • The following characters are not allowed: \ud800 - \uF8FF, \uFFF0 - \uFFFF.
  • The "." character can be used as part of another name, but "." and ".." cannot alone be used to indicate a node along a path, because ZooKeeper doesn't use relative paths. The following would be invalid: "/a/b/./c" or "/a/b/../c".
  • The token "zookeeper" is reserved.
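These constraints can be captured in a short validity check. Below is a minimal sketch with illustrative names of my own; the real client performs a similar check in org.apache.zookeeper.common.PathUtils.validatePath:

```java
// Sketch of the path-character rules above. Class and method names are
// illustrative, not ZooKeeper's actual implementation.
class ZkPathRules {
    static boolean isPathCharValid(char c) {
        if (c == '\u0000') return false;                  // breaks the C binding
        if (c >= '\u0001' && c <= '\u001f') return false; // renders badly
        if (c >= '\u007f' && c <= '\u009f') return false; // renders badly
        if (c >= '\ud800' && c <= '\uf8ff') return false; // surrogates / private use
        if (c >= '\ufff0') return false;                  // specials block up to the max char
        return true;
    }

    // "." and ".." are invalid as whole path segments: ZooKeeper has no relative paths.
    static boolean isSegmentValid(String segment) {
        if (segment.isEmpty() || segment.equals(".") || segment.equals("..")) return false;
        for (int i = 0; i < segment.length(); i++) {
            if (!isPathCharValid(segment.charAt(i))) return false;
        }
        return true;
    }
}
```

Note that "." is still legal as part of a longer name (e.g. "/a/b.txt"); only the bare "." and ".." segments are rejected.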

ZNodes

Every node in a ZooKeeper tree is referred to as a znode. Znodes maintain a stat structure that includes version numbers for data changes, acl changes. The stat structure also has timestamps. The version number, together with the timestamp, allows ZooKeeper to validate the cache and to coordinate updates. Each time a znode's data changes, the version number increases. For instance, whenever a client retrieves data, it also receives the version of the data. And when a client performs an update or a delete, it must supply the version of the data of the znode it is changing. If the version it supplies doesn't match the actual version of the data, the update will fail. (This behavior can be overridden. For more information see... )[tbd...]

Every node in the ZooKeeper tree is a znode. A znode maintains a stat structure that contains version numbers for data changes and ACL (Access Control List) changes, plus timestamps. The version numbers and timestamps together let ZooKeeper validate its cache and coordinate updates consistently.

Whenever a znode's data changes, its version number increments. For example, when a client reads data it also receives the data's version; when the client later updates or deletes, it must supply that version. If the supplied version does not match the current one, the update fails. [This is similar to optimistic locking in a database.]
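The version-checked update behaves like a compare-and-swap. Here is a minimal in-memory sketch of that optimistic-locking behaviour; the names are illustrative, and this is not the ZooKeeper API (where the check happens inside setData(path, data, expectedVersion)):

```java
// In-memory sketch of the version-checked (optimistic-locking) update
// described above. Not the ZooKeeper API; names are illustrative.
class VersionedNode {
    private byte[] data;
    private int version = 0;

    VersionedNode(byte[] initial) { this.data = initial; }

    int getVersion() { return version; }
    byte[] getData() { return data; }

    // Succeeds only if the caller supplies the current version.
    synchronized boolean setData(byte[] newData, int expectedVersion) {
        if (expectedVersion != version) return false; // stale version: update fails
        data = newData;
        version++; // every data change bumps the version
        return true;
    }
}
```

A writer holding a stale version gets a failure and must re-read before retrying, exactly as with a database optimistic lock.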

Note

In distributed application engineering, the word node can refer to a generic host machine, a server, a member of an ensemble, a client process, etc. In the ZooKeeper documentation, znodes refer to the data nodes. Servers refer to machines that make up the ZooKeeper service; quorum peers refer to the servers that make up an ensemble; client refers to any host or process which uses a ZooKeeper service.

Znodes are the main entity that a programmer accesses. They have several characteristics that are worth mentioning here.

Watches

Clients can set watches on znodes. Changes to that znode trigger the watch and then clear the watch. When a watch triggers, ZooKeeper sends the client a notification. More information about watches can be found in the section ZooKeeper Watches.

Data Access

The data stored at each znode in a namespace is read and written atomically. Reads get all the data bytes associated with a znode and a write replaces all the data. Each node has an Access Control List (ACL) that restricts who can do what.

ZooKeeper was not designed to be a general database or large object store. Instead, it manages coordination data. This data can come in the form of configuration, status information, rendezvous, etc. A common property of the various forms of coordination data is that they are relatively small: measured in kilobytes. The ZooKeeper client and the server implementations have sanity checks to ensure that znodes have less than 1M of data, but the data should be much less than that on average. Operating on relatively large data sizes will cause some operations to take much more time than others and will affect the latencies of some operations because of the extra time needed to move more data over the network and onto storage media. If large data storage is needed, the usual pattern of dealing with such data is to store it on a bulk storage system, such as NFS or HDFS, and store pointers to the storage locations in ZooKeeper.

Ephemeral Nodes

ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created the znode is active. When the session ends the znode is deleted. Because of this behavior ephemeral znodes are not allowed to have children.

Sequence Nodes -- Unique Naming

When creating a znode you can also request that ZooKeeper append a monotonically increasing counter to the end of path. This counter is unique to the parent znode. The counter has a format of %010d -- that is 10 digits with 0 (zero) padding (the counter is formatted in this way to simplify sorting), i.e. "<path>0000000001". See Queue Recipe for an example use of this feature. Note: the counter used to store the next sequence number is a signed int (4bytes) maintained by the parent node, the counter will overflow when incremented beyond 2147483647 (resulting in a name "<path>-2147483647").
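The suffix format can be reproduced with String.format; a minimal sketch with an illustrative method name:

```java
// Sketch of the sequential-node suffix: a zero-padded, 10-digit counter
// appended to the requested path. The method name is illustrative.
class SequenceSuffix {
    static String sequentialName(String path, int counter) {
        // The counter is a signed 4-byte int kept by the parent znode, so
        // incrementing past 2147483647 overflows to a negative suffix.
        return path + String.format("%010d", counter);
    }
}
```

The fixed width is what makes a plain lexicographic sort of the children equal to creation order, which queue and lock recipes rely on.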

Time in ZooKeeper

ZooKeeper tracks time multiple ways:

  • Zxid

    Every change to the ZooKeeper state receives a stamp in the form of a zxid (ZooKeeper Transaction Id). This exposes the total ordering of all changes to ZooKeeper. Each change will have a unique zxid and if zxid1 is smaller than zxid2 then zxid1 happened before zxid2.

  • Version numbers

    Every change to a node will cause an increase to one of the version numbers of that node. The three version numbers are version (number of changes to the data of a znode), cversion (number of changes to the children of a znode), and aversion (number of changes to the ACL of a znode).

  • Ticks

    When using multi-server ZooKeeper, servers use ticks to define timing of events such as status uploads, session timeouts, connection timeouts between peers, etc. The tick time is only indirectly exposed through the minimum session timeout (2 times the tick time); if a client requests a session timeout less than the minimum session timeout, the server will tell the client that the session timeout is actually the minimum session timeout.

  • Real time

    ZooKeeper doesn't use real time, or clock time, at all except to put timestamps into the stat structure on znode creation and znode modification.
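The tick-based minimum session timeout above amounts to a simple floor. A sketch under that rule, with illustrative names (real servers also cap the maximum, by default at 20 times the tick time; only the floor is discussed in the text):

```java
// Sketch of the session-timeout floor described above: the server never
// grants a session timeout below 2 * tickTime.
class SessionTimeouts {
    static int negotiate(int requestedMs, int tickTimeMs) {
        int minSessionTimeout = 2 * tickTimeMs; // the floor exposed by the server
        return Math.max(requestedMs, minSessionTimeout);
    }
}
```

So with the common tickTime of 2000 ms, a client asking for a 1-second session is silently given 4 seconds.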

ZooKeeper Stat Structure

The Stat structure for each znode in ZooKeeper is made up of the following fields:

  • czxid -- creation zxid

    The zxid of the change that caused this znode to be created.

  • mzxid -- last-modified zxid

    The zxid of the change that last modified this znode.

  • ctime -- creation time

    The time in milliseconds from epoch when this znode was created.

  • mtime -- last-modified time

    The time in milliseconds from epoch when this znode was last modified.

  • version

    The number of changes to the data of this znode.

  • cversion

    The number of changes to the children of this znode.

  • aversion

    The number of changes to the ACL of this znode.

  • ephemeralOwner

    The session id of the owner of this znode if the znode is an ephemeral node. If it is not an ephemeral node, it will be zero.

  • dataLength

    The length of the data field of this znode.

  • numChildren

    The number of children of this znode.

ZooKeeper Watches

All of the read operations in ZooKeeper - getData(), getChildren(), and exists() - have the option of setting a watch as a side effect. Here is ZooKeeper's definition of a watch: a watch event is one-time trigger, sent to the client that set the watch, which occurs when the data for which the watch was set changes. There are three key points to consider in this definition of a watch:

All ZooKeeper read operations, getData(), getChildren(), and exists(), can set a watch. ZooKeeper's definition: a watch event is a one-time trigger, sent to the client that set the watch, fired when the data the watch was set on changes.

  • One-time trigger

    One watch event will be sent to the client when the data has changed. For example, if a client does a getData("/znode1", true) and later the data for /znode1 is changed or deleted, the client will get a watch event for /znode1. If /znode1 changes again, no watch event will be sent unless the client has done another read that sets a new watch.

一個watch event将被發送到client當data變化, 例如, 一個client調用getData("/znode1", true), 當/znode1的資料發生變化, 如果znode再次發生變化, 将不會有event發送, 除非client再次擷取資料并設定新的watch

  • Sent to the client

    This implies that an event is on the way to the client, but may not reach the client before the successful return code to the change operation reaches the client that initiated the change. Watches are sent asynchronously to watchers. ZooKeeper provides an ordering guarantee: a client will never see a change for which it has set a watch until it first sees the watch event. Network delays or other factors may cause different clients to see watches and return codes from updates at different times. The key point is that everything seen by the different clients will have a consistent order.

ZooKeeper guarantees the ordering of watch events: a client never sees a change it has set a watch on before seeing the watch event itself, which protects against out-of-order observations caused by network delay or other asynchrony.

  • The data for which the watch was set

    This refers to the different ways a node can change. It helps to think of ZooKeeper as maintaining two lists of watches: data watches and child watches. getData() and exists() set data watches. getChildren() sets child watches. Alternatively, it may help to think of watches being set according to the kind of data returned. getData() and exists() return information about the data of the node, whereas getChildren() returns a list of children. Thus, setData() will trigger data watches for the znode being set (assuming the set is successful). A successful create() will trigger a data watch for the znode being created and a child watch for the parent znode. A successful delete() will trigger both a data watch and a child watch (since there can be no more children) for a znode being deleted as well as a child watch for the parent znode.

ZooKeeper maintains two lists of watches, data watches and child watches: getData() and exists() set data watches, and getChildren() sets child watches.

It helps to think in terms of the data each call returns: getData() and exists() return information about the node itself, while getChildren() returns the list of children. So a successful setData() triggers the znode's data watch; a successful create() triggers the created znode's data watch and the parent's child watch; a successful delete() triggers both the data watch and the child watch of the deleted znode, plus the parent's child watch.
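The trigger rules for setData(), create(), and delete() can be modelled with two watch tables. The following is an illustrative in-memory simulation of the semantics described above, not ZooKeeper's implementation; all names are my own:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative in-memory simulation of the data-watch / child-watch
// bookkeeping described above; not ZooKeeper's implementation.
class WatchTable {
    private final Map<String, Set<String>> dataWatches = new HashMap<>();
    private final Map<String, Set<String>> childWatches = new HashMap<>();
    final List<String> fired = new ArrayList<>();

    // getData()/exists() set data watches; getChildren() sets child watches.
    void watchData(String path, String client) {
        dataWatches.computeIfAbsent(path, p -> new HashSet<>()).add(client);
    }
    void watchChildren(String path, String client) {
        childWatches.computeIfAbsent(path, p -> new HashSet<>()).add(client);
    }

    private void fire(Map<String, Set<String>> table, String path, String event) {
        Set<String> clients = table.remove(path); // one-time trigger: the watch is cleared
        if (clients != null) {
            for (String c : clients) fired.add(c + ":" + event + ":" + path);
        }
    }

    private static String parent(String path) {
        int i = path.lastIndexOf('/');
        return i <= 0 ? "/" : path.substring(0, i);
    }

    // A successful create() fires the node's data watch and the parent's child watch.
    void create(String path) {
        fire(dataWatches, path, "NodeCreated");
        fire(childWatches, parent(path), "NodeChildrenChanged");
    }

    // A successful setData() fires the node's data watch only.
    void setData(String path) {
        fire(dataWatches, path, "NodeDataChanged");
    }

    // A successful delete() fires the node's data and child watches,
    // plus the parent's child watch.
    void delete(String path) {
        fire(dataWatches, path, "NodeDeleted");
        fire(childWatches, path, "NodeChildrenChanged");
        fire(childWatches, parent(path), "NodeChildrenChanged");
    }
}
```

Two details fall out of the data structures: because fire() removes the table entry, every watch is a one-time trigger, and because each path maps to a set of clients, a client registered through both exists and getData appears once and is notified once per event.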

Watches are maintained locally at the ZooKeeper server to which the client is connected. This allows watches to be lightweight to set, maintain, and dispatch. When a client connects to a new server, the watch will be triggered for any session events. Watches will not be received while disconnected from a server. When a client reconnects, any previously registered watches will be reregistered and triggered if needed. In general this all occurs transparently. There is one case where a watch may be missed: a watch for the existence of a znode not yet created will be missed if the znode is created and deleted while disconnected.

Watches are maintained on the ZooKeeper server the client is connected to. A client reconnect does not invalidate watches (previously registered watches are re-registered and triggered if needed), and all of this is transparent to the client, with one exception: a watch on the existence of a znode that has not yet been created will be missed if that znode is created and then deleted while the connection is down.

Semantics of Watches

We can set watches with the three calls that read the state of ZooKeeper: exists, getData, and getChildren. The following list details the events that a watch can trigger and the calls that enable them:

  • Created event:

    Enabled with a call to exists.

  • Deleted event:

    Enabled with a call to exists, getData, and getChildren.

  • Changed event:

    Enabled with a call to exists and getData.

  • Child event:

    Enabled with a call to getChildren.

Remove Watches

We can remove the watches registered on a znode with a call to removeWatches. Also, a ZooKeeper client can remove watches locally even if there is no server connection by setting the local flag to true. The following list details the events which will be triggered after the successful watch removal.

  • Child Remove event:

    Watcher which was added with a call to getChildren.

  • Data Remove event:

    Watcher which was added with a call to exists or getData.

What ZooKeeper Guarantees about Watches

With regard to watches, ZooKeeper maintains these guarantees:

  • Watches are ordered with respect to other events, other watches, and asynchronous replies. The ZooKeeper client library ensures that everything is dispatched in order.
  • A client will see a watch event for a znode it is watching before seeing the new data that corresponds to that znode.
  • The order of watch events from ZooKeeper corresponds to the order of the updates as seen by the ZooKeeper service.

Things to Remember about Watches

  • Watches are one time triggers; if you get a watch event and you want to get notified of future changes, you must set another watch.
  • Because watches are one time triggers and there is latency between getting the event and sending a new request to get a watch you cannot reliably see every change that happens to a node in ZooKeeper. Be prepared to handle the case where the znode changes multiple times between getting the event and setting the watch again. (You may not care, but at least realize it may happen.)
  • A watch object, or function/context pair, will only be triggered once for a given notification. For example, if the same watch object is registered for an exists and a getData call for the same file and that file is then deleted, the watch object would only be invoked once with the deletion notification for the file.
  • When you disconnect from a server (for example, when the server fails), you will not get any watches until the connection is reestablished. For this reason session events are sent to all outstanding watch handlers. Use session events to go into a safe mode: you will not be receiving events while disconnected, so your process should act conservatively in that mode.

1. Watch events are ordered, and a watch is a one-time trigger.

2. The same watch object set multiple times on the same znode (for example via both exists and getData) fires only once per notification.

3. If the connection between the client and ZooKeeper is lost, the client receives no watch events until the connection is re-established; this is why session events are sent to all outstanding watch handlers.

4. Because watches are one-shot, there is a window between receiving an event and sending the request that re-sets the watch, so you cannot reliably observe every change to a node. Be prepared to handle that case, or at least be aware that it can happen.

Gotchas: Common Problems and Troubleshooting

So now you know ZooKeeper. It's fast, simple, your application works, but wait ... something's wrong. Here are some pitfalls that ZooKeeper users fall into:

  1. If you are using watches, you must look for the connected watch event. When a ZooKeeper client disconnects from a server, you will not receive notification of changes until reconnected. If you are watching for a znode to come into existence, you will miss the event if the znode is created and deleted while you are disconnected.

When using watches you must pay attention to connection state: if the client's connection breaks, you receive no events until it reconnects, and if you are watching for a znode's existence, you will miss the event if the znode is created and deleted while you are disconnected.

  2. You must test ZooKeeper server failures. The ZooKeeper service can survive failures as long as a majority of servers are active. The question to ask is: can your application handle it? In the real world a client's connection to ZooKeeper can break. (ZooKeeper server failures and network partitions are common reasons for connection loss.) The ZooKeeper client library takes care of recovering your connection and letting you know what happened, but you must make sure that you recover your state and any outstanding requests that failed. Find out if you got it right in the test lab, not in production - test with a ZooKeeper service made up of several servers and subject them to reboots.

You must test ZooKeeper server failures and verify that your application can handle them.

  3. The list of ZooKeeper servers used by the client must match the list of ZooKeeper servers that each ZooKeeper server has. Things can work, although not optimally, if the client list is a subset of the real list of ZooKeeper servers, but not if the client lists ZooKeeper servers not in the ZooKeeper cluster.

The ZooKeeper server list used by the client must match the list configured on the ZooKeeper servers themselves; otherwise the client may end up listing servers that are not part of the ZooKeeper cluster.

  4. Be careful where you put that transaction log. The most performance-critical part of ZooKeeper is the transaction log. ZooKeeper must sync transactions to media before it returns a response. A dedicated transaction log device is key to consistent good performance. Putting the log on a busy device will adversely affect performance. If you only have one storage device, put trace files on NFS and increase the snapshotCount; it doesn't eliminate the problem, but it can mitigate it.
  5. Set your Java max heap size correctly. It is very important to avoid swapping. Going to disk unnecessarily will almost certainly degrade your performance unacceptably. Remember, in ZooKeeper, everything is ordered, so if one request hits the disk, all other queued requests hit the disk.

    To avoid swapping, try to set the heapsize to the amount of physical memory you have, minus the amount needed by the OS and cache. The best way to determine an optimal heap size for your configurations is to run load tests. If for some reason you can't, be conservative in your estimates and choose a number well below the limit that would cause your machine to swap. For example, on a 4G machine, a 3G heap is a conservative estimate to start with.

Setting the Java max heap size correctly is essential for avoiding swapping; unnecessary disk access will severely degrade performance. Because everything in ZooKeeper is ordered, if one request hits the disk, all queued requests behind it hit the disk too.

To avoid swapping, set the heap size to the physical memory minus what the OS and cache need. The best approach is to run load tests; if you cannot, be conservative, for example a 3 GB heap on a 4 GB machine (roughly 3/4).
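The rule of thumb above is simple arithmetic; a trivial sketch with illustrative names:

```java
// Sketch of the conservative heap-sizing rule of thumb above:
// physical memory minus what the OS and file-system cache need.
class HeapSizing {
    static long conservativeHeapMb(long physicalMb, long osAndCacheMb) {
        return physicalMb - osAndCacheMb;
    }
}
```

On a 4096 MB machine with roughly 1 GB reserved for the OS and cache this yields about 3 GB, i.e. start the JVM with -Xmx3g.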

Reposted from: https://www.cnblogs.com/zhwbqd/p/3969161.html