现象
zookeeper版本为3.4.3, hbase版本为0.94.7。
按照zk的设计,一台机器down了之后应该仍然可以工作,但实际上应用中并不如此。
Zookeeper一台机器在生产环境中被挪走,客户端始终无法连接HBase。
问题排查
抛出如下异常:
Caused by: java.net.UnknownHostException: ops-new-launch-7237.iad7.amazon.com
at java.net.InetAddress.getAllByName0(InetAddress.java:1259)
at java.net.InetAddress.getAllByName(InetAddress.java:1171)
at java.net.InetAddress.getAllByName(InetAddress.java:1105)
at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:60)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:440)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:375)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.<init>(RecoverableZooKeeper.java:98)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.connect(ZKUtil.java:127)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:153)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:127)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:1395)
进入源码 http://grepcode.com/file/repo1.maven.org/maven2/org.apache.zookeeper/zookeeper/3.4.3/org/apache/zookeeper/ZooKeeper.java#440
public ZooKeeper(String connectString, int sessionTimeout, Watcher watcher,
boolean canBeReadOnly)
throws IOException
{
LOG.info("Initiating client connection, connectString=" + connectString
+ " sessionTimeout=" + sessionTimeout + " watcher=" + watcher);
watchManager.defaultWatcher = watcher;
ConnectStringParser connectStringParser = new ConnectStringParser(
connectString);
HostProvider hostProvider = new StaticHostProvider(
connectStringParser.getServerAddresses());
cnxn = new ClientCnxn(connectStringParser.getChrootPath(),
hostProvider, sessionTimeout, this, watchManager,
getClientCnxnSocket(), canBeReadOnly);
cnxn.start();
}
可以发现,在解析hostname的IP时候抛出的UnknownhostException, 并没有retry处理。
结论
zookeeper的并不是无条件容忍host的down,如果host从dns挪走的情况,它也不能处理。