
ZooKeeper Troubleshooting

Symptoms

The ZooKeeper version is 3.4.3 and the HBase version is 0.94.7.

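By design, a ZooKeeper ensemble should keep working after one machine goes down, as long as a majority of servers remains. In practice, that is not what happened.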

One ZooKeeper machine was removed from the production environment, and from then on clients could not connect to HBase at all.

Investigation

The client throws the following exception:

Caused by: java.net.UnknownHostException: ops-new-launch-7237.iad7.amazon.com
	at java.net.InetAddress.getAllByName0(InetAddress.java:1259)
	at java.net.InetAddress.getAllByName(InetAddress.java:1171)
	at java.net.InetAddress.getAllByName(InetAddress.java:1105)
	at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:60)
	at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:440)
	at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:375)
	at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.<init>(RecoverableZooKeeper.java:98)
	at org.apache.hadoop.hbase.zookeeper.ZKUtil.connect(ZKUtil.java:127)
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:153)
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:127)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:1395)      

Looking at the source: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.zookeeper/zookeeper/3.4.3/org/apache/zookeeper/ZooKeeper.java#440

public ZooKeeper(String connectString, int sessionTimeout, Watcher watcher,
        boolean canBeReadOnly)
    throws IOException
{
    LOG.info("Initiating client connection, connectString=" + connectString
            + " sessionTimeout=" + sessionTimeout + " watcher=" + watcher);

    watchManager.defaultWatcher = watcher;

    ConnectStringParser connectStringParser = new ConnectStringParser(
            connectString);
    // ZooKeeper.java:440 in the stack trace: the StaticHostProvider
    // constructor calls InetAddress.getAllByName on the server addresses.
    HostProvider hostProvider = new StaticHostProvider(
            connectStringParser.getServerAddresses());
    cnxn = new ClientCnxn(connectStringParser.getChrootPath(),
            hostProvider, sessionTimeout, this, watchManager,
            getClientCnxnSocket(), canBeReadOnly);
    cnxn.start();
}

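The UnknownHostException is thrown while the hostnames are being resolved to IP addresses, and nothing catches or retries it. The exception propagates straight out of the ZooKeeper constructor, so a single unresolvable host in the connect string prevents the client from being created at all, even if the other servers are healthy.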
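One possible client-side workaround is to drop unresolvable hosts from the connect string before handing it to the ZooKeeper constructor. The following is only a minimal sketch under stated assumptions, not something taken from ZooKeeper or HBase: the class `ResolvableQuorum`, its `Resolver` hook, and the `filter` methods are hypothetical names introduced for illustration, and the code assumes entries of the form `host[:port]` with an optional chroot suffix.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of ZooKeeper or HBase): removes hosts that
// do not resolve in DNS from a ZooKeeper connect string.
final class ResolvableQuorum {

    /** Hypothetical hook so that name resolution can be replaced in tests. */
    interface Resolver {
        void resolve(String host) throws UnknownHostException;
    }

    /** Filters the connect string using the JVM's normal name resolution. */
    static String filter(String connectString) {
        return filter(connectString, host -> InetAddress.getAllByName(host));
    }

    static String filter(String connectString, Resolver resolver) {
        // Split off an optional chroot suffix such as "/hbase".
        int slash = connectString.indexOf('/');
        String hosts = slash >= 0 ? connectString.substring(0, slash) : connectString;
        String chroot = slash >= 0 ? connectString.substring(slash) : "";

        List<String> kept = new ArrayList<>();
        for (String entry : hosts.split(",")) {
            String server = entry.trim();
            if (server.isEmpty()) {
                continue;
            }
            // Assumes host[:port]; bracketed IPv6 literals are not handled.
            int colon = server.lastIndexOf(':');
            String host = colon >= 0 ? server.substring(0, colon) : server;
            try {
                resolver.resolve(host);
                kept.add(server);
            } catch (UnknownHostException e) {
                // Skip this server instead of failing the whole client.
                System.err.println("Skipping unresolvable ZooKeeper host: " + host);
            }
        }
        if (kept.isEmpty()) {
            throw new IllegalArgumentException(
                    "No resolvable hosts in: " + connectString);
        }
        return String.join(",", kept) + chroot;
    }
}
```

The filtered string would then be passed as `connectString` to the constructor shown above. The trade-off is that a skipped host stays skipped until the client is recreated, so this only masks the DNS problem on the client side; it does not replace fixing the quorum configuration.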

Conclusion

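ZooKeeper does not tolerate the loss of a host unconditionally. If the host is removed from DNS, rather than simply being down, the 3.4.3 client cannot handle it.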