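Problems Encountered in Hypertable Operations

1. failed expectation: insert_result.second

Version: before 0.9.7.5

Problem description:

While automatic failover was still in progress, a machine that had previously failed was manually brought back into the cluster, and errors similar to the following appeared in the Master log: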

(/root/src/hypertable/src/cc/Hypertable/Master/RangeServerConnectionManager.cc:50) Contains rs5 host=dlxa125 local=0.0.0.0:0 public=*.*.*.*:38060

FATAL Hypertable.Master : (/root/src/hypertable/src/cc/Hypertable/Master/RangeServerConnectionManager.cc:52) failed expectation: insert_result.second

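Solution:

There are two options:

1) The community fixed this problem in version 0.9.7.5; upgrade to that version or later.

2) Alternatively, wait until automatic failover has completed before manually bringing the failed machine back. Bringing it back while failover is still running is not recommended: it triggers this assertion and offers no benefit.

2. failed expectation: m_trailer.filter_items_estimate

Version: 0.9.6.5

Problem description:

Errors similar to the following appeared in the RangeServer log, and they persisted after a restart: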

FATAL Hypertable.RangeServer : (/root/hypertable/src/cc/Hypertable/RangeServer/CellStoreV6.cc:502) failed expectation: m_trailer.filter_items_estimate

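Solution:

The community fixed this problem in version 0.9.7.0; upgrade to that version or later.

3. Bug in the LIMIT clause of SELECT

Version: before 0.9.7.12

Problem description:

LIMIT 1 returns an empty result set, but LIMIT 2 returns a non-empty one.

Solution:

This is a bug (issue 1175). The community fixed it in version 0.9.7.14; upgrade to that version or later.

4. Ingest speed drops after upgrading from 0.9.7.11 to 0.9.7.14

Version: 0.9.7.13 to 0.9.7.16

Problem description:

After upgrading from 0.9.7.11 to 0.9.7.14, ingest speed dropped, whereas upgrading to 0.9.7.12 left ingest speed normal.

Solution:

This is a bug (issue 1179) that affects versions 0.9.7.13 through 0.9.7.16. The community fixed it in version 0.9.7.17; upgrade to that version or later.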
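When checking whether a cluster falls inside an affected range like the one above, version strings have to be compared numerically, because a plain string comparison would sort 0.9.7.9 after 0.9.7.12. The following is a minimal sketch in Python; the function names are illustrative and are not part of Hypertable.

```python
def parse_version(text):
    """Turn '0.9.7.14' into (0, 9, 7, 14) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


# Affected range for issue 1179 as reported above (inclusive on both ends).
FIRST_AFFECTED = parse_version("0.9.7.13")
LAST_AFFECTED = parse_version("0.9.7.16")


def affected_by_issue_1179(version):
    """Return True if the given version string lies in the affected range."""
    return FIRST_AFFECTED <= parse_version(version) <= LAST_AFFECTED


for candidate in ("0.9.7.12", "0.9.7.14", "0.9.7.17"):
    print(candidate, affected_by_issue_1179(candidate))
```

The same helper can be reused for the other version ranges in this article by swapping the two boundary constants.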

5. failed expectation: split_row.compare(end_row)<0 && split_row.compare(start_row) > 0

Version: 0.9.7.12

Problem description:

Errors similar to the following appeared in the RangeServer log, and they persisted after a restart:

(root/src/hypertable/src/cc/Hypertable/RangeServer/range.cc 1028) failed expectation: split_row.compare(end_row)<0 && split_row.compare(start_row) > 0

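Solution:

This is a bug (issue 1193). It is triggered when automatic failover is disabled and a few RangeServer nodes go down: most ingest is blocked, but some machines can still accept writes, and once that written data is persisted to the commit log it triggers the bug. There are two options:

1) The community fixed this problem in version 0.9.7.17; upgrade to that version or later.

2) If upgrading is not possible, manually delete all commit log files of the affected RangeServer. This works around the bug, but at the cost of losing some data.

6. The DfsBroker process holds too many ports, preventing other processes from starting

Version: 0.9.7.12

Problem description:

MapReduce programs on the RangeServer machines could not start; analysis showed that they could not obtain a port at startup. Assuming the DfsBroker process ID is 5000, the following command shows how many ports the process is holding: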

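# netstat -antup | grep 5000 | wc -l

42911 // the number of ports we observed in use at the time

Solution:

This is a bug (issue 1197), but it is unclear whether the community has fixed it, because the problem is hard to reproduce. Restarting the DfsBroker resolves it.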
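One caveat about the count above: grep 5000 also matches lines where 5000 appears in a port number or inside another PID such as 15000, so the figure can be inflated. The sketch below counts only lines whose PID/Program column belongs to the given PID. It is written in Python, works on captured netstat -antup text, and uses a made-up sample; the function name is illustrative.

```python
def count_sockets_for_pid(netstat_output, pid):
    """Count netstat lines whose PID/Program column belongs to the given PID."""
    prefix = f"{pid}/"
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # The PID/Program column comes after the two address columns; TCP
        # lines also carry a State column before it, UDP lines may not.
        if any(field.startswith(prefix) for field in fields[5:]):
            count += 1
    return count


# Made-up sample in the layout printed by `netstat -antup`.
SAMPLE = """\
tcp        0      0 10.0.0.1:38030     10.0.0.2:50010     ESTABLISHED 5000/java
tcp        0      0 10.0.0.1:5000      10.0.0.3:41234     ESTABLISHED 7321/python
tcp        0      0 10.0.0.1:45000     10.0.0.2:50010     ESTABLISHED 15000/java
udp        0      0 0.0.0.0:68         0.0.0.0:*                      812/dhclient
"""

print(count_sockets_for_pid(SAMPLE, 5000))
```

On this sample a plain substring match for 5000 hits three lines, while only one socket actually belongs to PID 5000.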

7. failed expectation: *ptr == Key::AUTO_TIMESTAMP || *ptr == Key::HAVE_TIMESTAMP

Version: 0.9.7.12

Problem description:

Errors similar to the following appeared in the RangeServer log:

(/root/src/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1958)failed expectation: *ptr == Key::AUTO_TIMESTAMP || *ptr == Key::HAVE_TIMESTAMP

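Solution:

The community's explanation of this error can be summarized in three points:

1) Trigger: the error is raised when a cell is inserted with a timestamp of TIMESTAMP_NULL (a constant defined in the source code).

2) The error is currently raised in the wrong place: it should be raised on the client when the write is issued, not on the RangeServer.

3) This will be corrected in 0.9.7.17.

However, when I tried to reproduce the error using the trigger described above, it did not occur, and I still saw the error in 0.9.7.17. The community acknowledged that the 0.9.7.17 change was incomplete and promised to fix the omission in 0.9.8.0.

Although the problem has not been fully resolved, it does not reappear on every RangeServer restart, so restarting the RangeServer can be used as a workaround.

8. failed expectation: m_last_collection_time

Version: 0.9.7.17

Problem description:

After reducing the TTL of a table, for example changing table TAB_TEST from a TTL of 10 days to 5 days by executing Alter table TAB_TEST modify (F1 TTL=5 day), the following error appears in the RangeServer log: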

FATAL Hypertable.RangeServer :(/root/src/hypertable/src/cc/Hypertable/RangeServer/AccessGroupGarbageTracker.cc:132)failed expectation: m_last_collection_time

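Solution:

According to the community, executing the ALTER statement above adds a new access group named default to the table, and subsequent garbage collection on that access group triggers this assert. The community fixed the problem in 0.9.7.18; upgrading to that version or later avoids it.

9. Error reading directory entries for DFS directory: /hypertable/servers/rs*/log/tab_id/endrow

Version: before 0.9.7.0

Problem description:

The RangeServer dies again shortly after starting, and the log contains messages similar to the following: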

ERROR Hypertable.RangeServer :local_recover (/root/src/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:908): Hypertable::Exception: Error reading directory entries for DFS directory: /hypertable/servers/rs8/log/3/19/mhXJ7oobjVGJEjtY-1356744895 - DFS BROKER file not found

at void Hypertable::CommitLogReader::load_fragments(Hypertable::String, Hypertable::CommitLogFileInfo*) (/root/src/hypertable/src/cc/Hypertable/Lib/CommitLogReader.cc:204)

at virtual void Hypertable::DfsBroker::Client::readdir(const Hypertable::String&, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char>>>>&) (/root/src/hypertable/src/cc/DfsBroker/Lib/Client.cc:621): Error reading directory entries for DFS directory: /hypertable/servers/rs8/log/3/19/mhXJ7oobjVGJEjtY-1356744895

at virtual void Hypertable::DfsBroker::Client::readdir(const Hypertable::String&, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char>>>>&) (/root/src/hypertable/src/cc/DfsBroker/Lib/Client.cc:615): File hdfs://namenode:8020/hypertable/servers/rs8/log/3/19/mhXJ7oobjVGJEjtY-1356744895 does not exist.

Solution:

Manually create an empty directory (sudo -u hdfs hadoop fs -mkdir /hypertable/servers/rs8/log/3/19/mhXJ7oobjVGJEjtY-1356744895), then restart the RangeServer service.
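Because the missing directory name differs for every occurrence, it can be convenient to pull it straight out of the log line. The Python sketch below extracts the path from the first ERROR line shown above and prints the corresponding mkdir command; it only builds the command and does not run it. The function name is illustrative.

```python
import re
import shlex

# Matches the first line of the error shown above and captures the DFS path.
MISSING_DIR = re.compile(
    r"Error reading directory entries for DFS directory: (\S+)"
)


def mkdir_command(log_line):
    """Return the mkdir command (as an argument list) for the missing path."""
    match = MISSING_DIR.search(log_line)
    if match is None:
        return None
    return ["sudo", "-u", "hdfs", "hadoop", "fs", "-mkdir", match.group(1)]


LOG_LINE = (
    "Hypertable::Exception: Error reading directory entries for DFS directory: "
    "/hypertable/servers/rs8/log/3/19/mhXJ7oobjVGJEjtY-1356744895 "
    "- DFS BROKER file not found"
)

command = mkdir_command(LOG_LINE)
print(shlex.join(command))
```

shlex.join requires Python 3.8 or later. Review the printed command before running it.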

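10. How to delete temporary tables

Version: all current versions

Problem description:

When a query result is large, temporary tables are created, for example tmp/08eb3540-8ba1-44a8-ae9c-9c778ef5814d. They are removed when the query ends normally, but if the query ends abnormally they remain indefinitely.

Solution:

Use the following commands to delete a temporary table: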

use tmp;

drop table "08eb3540-8ba1-44a8-ae9c-9c778ef5814d"; // i.e. put the table name in quotes
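If many temporary tables have accumulated, the statements can be generated rather than typed by hand. The Python sketch below takes a list of table names (how the list is obtained is left to the reader) and emits a drop statement only for names that look like the UUID-style names shown above, so that an ordinary table is not dropped by accident. The function name and the UUID filter are assumptions of this sketch, not Hypertable behaviour.

```python
import re

# Temporary tables in the example above are named with a lowercase UUID.
UUID_NAME = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def drop_statements(table_names):
    """Build HQL statements that drop UUID-named tables in the tmp namespace."""
    statements = ["use tmp;"]
    for name in table_names:
        if UUID_NAME.fullmatch(name):
            # The table name must be quoted, as noted above.
            statements.append(f'drop table "{name}";')
    return statements


names = ["08eb3540-8ba1-44a8-ae9c-9c778ef5814d", "users"]
print("\n".join(drop_statements(names)))
```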

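11. How to increase the number of cells returned by scanner_get_cell

Version: all current versions

Problem description:

When scanner_get_cell is used to receive query results, a large result has to be received over multiple calls. To reduce the number of calls, the number of cells received per call can be increased.

Solution:

Change the following two settings in hypertable.cfg, then restart Hypertable:

Hypertable.RangeServer.Scanner.BufferSize=2000000 // default is 1M

ThriftBroker.NextThreshold=256000 // default is 128K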

12. RANGE SERVER row overflow

Version: all current versions

Problem description:

Errors similar to the following appeared in the RangeServer log:

ERROR Hypertable.RangeServer : operator() (/root/src/hypertable/src/cc/Hypertable/RangeServer/MaintenanceQueue.h:162): Hypertable::Exception: Unable to determine split row for range 2/9[9999999103-5088A416Bzz..D0---0---0---0---0-000] - RANGE SERVER row overflow

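Solution:

This problem is caused by a poorly designed row key, or by the ingest program generating unexpected rows. Hypertable currently requires that a single row not span ranges, so in either of these cases one row may accumulate too many cells and overflow a range, which produces this error. There are two options:

1) Set Hypertable.RangeServer.Range.RowSize.Unlimited to true and restart Hypertable. This is not recommended; prefer the second option whenever possible.

2) Design rows sensibly and make sure the ingest program generates the expected rows.

Note that increasing Hypertable.RangeServer.Range.SplitSize does not solve this problem, because changing that setting does not affect data that is already in place.

13. ThriftBroker throws the exception "org.apache.thrift.transport.TTransportException: Frame size (144130118) larger than max length (16384000)!"

Version: all current versions

Problem description:

When querying through HQL or a Thrift client, this exception is thrown if the result set is large (for example, millions of rows).

Solution:

1) When using the hql_exec API, set the unbuffered parameter to true.

2) Adjust framesize and timeout when connecting to the ThriftBroker. The default framesize is 20M and the default timeout is 1600000 ms. For example, the following Java code sets framesize to 20M:

ThriftClient tc = ThriftClient.create("localhost", 38080, 1600000, true, 20 * 1024 * 1024);

3) For MapReduce, set the hypertable.mapreduce.thriftclient.framesize option in the configuration file.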
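To judge how far framesize would have to be raised, the frame size reported in the exception can be converted to whole mebibytes. The Python sketch below does only that arithmetic on the message text; the function name is illustrative.

```python
import math
import re

# Captures the two byte counts from the TTransportException message.
FRAME_ERROR = re.compile(
    r"Frame size \((\d+)\) larger than max length \((\d+)\)"
)


def required_frame_mib(message):
    """Return the reported frame size rounded up to whole MiB, or None."""
    match = FRAME_ERROR.search(message)
    if match is None:
        return None
    frame_bytes = int(match.group(1))
    return math.ceil(frame_bytes / (1024 * 1024))


MESSAGE = (
    "org.apache.thrift.transport.TTransportException: "
    "Frame size (144130118) larger than max length (16384000)!"
)

print(required_frame_mib(MESSAGE))
```

For the message above the result is 138, so the framesize argument would need to be at least 138 * 1024 * 1024. A frame that large is a hint that the unbuffered option in 1) may be the better fix.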

14. Hypertable::Exception: decoding scan spec - SERIALIZATION input buffer overrun

Version: all current versions

Problem description:

Errors similar to the following appeared in the RangeServer log:

ERROR Hypertable.RangeServer : run (/root/src/hypertable/src/cc/Hypertable/RangeServer/RequestHandlerCreateScanner.cc:65): Hypertable::Exception: decoding scan spec - SERIALIZATION input buffer overrun

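Solution:

This error usually means that the client and the server are running different Hypertable versions. For example, during a cluster upgrade the cluster may already be on 0.9.7.17 while the ingest program still uses the 0.9.7.12 library. Recompiling the client program against the new version, or switching the client to the new jar, resolves it.

15. Memory usage surges on most RangeServers during GC compaction, leaving the cluster unable to serve requests

Version: 0.9.7.12

Problem description:

As described in the title.

Solution:

The community acknowledged that the earlier GC compaction algorithm was somewhat muddled and substantially reworked the logic in 0.9.7.17. Upgrading to that version or later resolves the problem.

16. java.io.IOException: All datanodes *.*.*.*:50010 are bad. Aborting

Version: all current versions

Problem description:

Errors similar to the following appeared in the RangeServer log: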

ERROR Hypertable.RangeServer : run_compaction (/root/src/hypertable/src/cc/Hypertable/RangeServer/AccessGroup.cc:790): 2/0[1065795555-52B2C735BKA..10658000---52ABC9F3BKA](IDX) Hypertable::Exception: Problem writing to DFS file '/hypertable/tables/2/0/IDX/YStGMh3ejopuOguz/cs717' : java.io.IOException: All datanodes *.*.*.*:50010 are bad. Aborting... - DFS BROKER i/o error

at virtual void Hypertable::CellStoreV6::add(const Hypertable::Key&, Hypertable::ByteString) (/root/src/hypertable/src/cc/Hypertable/RangeServer/CellStoreV6.cc:478)

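Solution:

Although this clearly looks like an HDFS access error, the underlying problem is fairly complex. Two situations are currently known to cause it:

1) A volume on an HDFS DataNode has failed, so access to that volume fails. Repairing the failed volume resolves it.

2) Certain special permissions in the operating system or HDFS have been changed. The consequences are severe: in extreme cases every write to Hypertable reports this error. If the changed permission cannot be identified, even cleandb does not help, and the only way out is to start over by reinstalling the operating system. The way to prevent this is to manage cluster permissions strictly.

17. java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Message missing required fields: callId, status

Version: all current versions

Problem description:

Errors similar to the following appeared in the Master log: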

ERROR Hypertable.Master : main (/root/src/hypertable/src/cc/Hypertable/Master/main.cc:384): Hypertable::Exception: Error checking existence of DFS path: /hypertable/servers/master/log/mml - DFS BROKER i/o error

at virtual bool Hypertable::DfsBroker::Client::exists(const String&) (/root/src/hypertable/src/cc/DfsBroker/Lib/Client.cc:679)

at virtual bool Hypertable::DfsBroker::Client::exists(const String&) (/root/src/hypertable/src/cc/DfsBroker/Lib/Client.cc:673): java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Message missing required fields: callId, status

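Solution:

This is caused by an incompatibility between Hypertable and CDH, usually because the CDH version in use is too new. Fixing it requires the following changes:

1) In src\java\Core\org\hypertable\DfsBroker\hadoop\2\HadoopBroker.java under the Hypertable source directory, change

DFSClient.DFSDataInputStream in = (DFSClient.DFSDataInputStream)mFilesystem.open(path)

to

HdfsDataInputStream in = (HdfsDataInputStream)mFilesystem.open(path)

then compile the file and replace the class of the same name in "hypertable-<version>.jar".

2) In the lib/java subdirectory of the Hypertable installation directory, replace all CDH jars with the jars of the new CDH version, then restart Hypertable.

18. java.io.IOException: Failed to add a datanode

Version: all current versions

Problem description:

Errors similar to the following appeared in the RangeServer log: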

java.io.IOException: Failed to add a datanode. User may turn off this feature by setting dfs.client.block.write.replace-datanode-on-failure.policy in

Solution:

This problem usually occurs in small HDFS clusters (five DataNodes or fewer). Add the following configuration fragment to hdfs-site.xml on every DataNode:

<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>false</value>
</property>

The fragment can also be added through Cloudera Manager, using the HDFS Client Advanced Configuration Safety Valve setting. In Cloudera Manager 4 it is found under Client->advanced; in Cloudera Manager 5, under Gateway Default Group->advanced. After the change, run Actions->Deploy Client Configuration. In practice, deploying from the HDFS service page had no effect, whereas deploying from the YARN service page did.
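Since the deployment step may silently fail to take effect, it helps to confirm that the property actually reached the client configuration. The Python sketch below parses hdfs-site.xml content and reports the value of the property; the sample XML is made up, and reading the real file from its path is left to the reader.

```python
import xml.etree.ElementTree as ET

PROPERTY = "dfs.client.block.write.replace-datanode-on-failure.enable"


def read_properties(xml_text):
    """Return a dict of name -> value for every <property> in the XML text."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.iter("property")
    }


# Made-up hdfs-site.xml content containing the fragment from above.
SAMPLE = """<configuration>
  <property>
    <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
    <value>false</value>
  </property>
</configuration>"""

properties = read_properties(SAMPLE)
print(PROPERTY, "=", properties.get(PROPERTY))
```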

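19. Master MetaLog local backup file and HDFS file are inconsistent

Version: all current versions

Problem description:

The Master fails to start, and errors similar to the following appear in the log: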

ERROR Hypertable.Master : main (/root/src/hypertable/src/cc/Hypertable/Master/main.cc:384): Hypertable::Exception: MetaLog file '/hypertable/servers/master/log/mml/2' has length 7730400 backup file '/dinglicom/hypertable/0.9.7.17/run/log_backup/mml/master_38050/2' length is 7724704 - METALOG backup file mismatch at void Hypertable::MetaLog::Reader::verify_backup(int32_t) (/root/src/hypertable/src/cc/Hypertable/Lib/MetaLogReader.cc:129)

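Solution:

This problem occurs because the Master MetaLog stored on HDFS is out of sync with the local backup file, so the two must be synchronized manually. To do so, replace the local backup file with the HDFS MetaLog file of the same name mentioned in the error message, then restart the cluster. In this example, /hypertable/servers/master/log/mml/2 replaces /dinglicom/hypertable/0.9.7.17/run/log_backup/mml/master_38050/2. To be safe, back up the files before replacing them.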
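The two paths can be taken directly from the error message. The Python sketch below parses the message and prints two commands: one that moves the stale local backup aside and one that fetches the HDFS copy into its place. It only prints the commands. The keep_dir location and the .old suffix are assumptions of this sketch; the old file is moved out of the backup directory rather than renamed in place, in case extra files there confuse the Master.

```python
import posixpath
import re
import shlex

# Captures the HDFS path, its length, the backup path, and its length.
MISMATCH = re.compile(
    r"MetaLog file '([^']+)' has length (\d+) "
    r"backup file '([^']+)' length is (\d+)"
)


def resync_commands(message, keep_dir):
    """Build commands that replace the local backup with the HDFS MetaLog."""
    match = MISMATCH.search(message)
    if match is None:
        return []
    hdfs_path, _, backup_path, _ = match.groups()
    saved = posixpath.join(keep_dir, posixpath.basename(backup_path) + ".old")
    return [
        ["mv", backup_path, saved],                         # keep the old copy
        ["hadoop", "fs", "-get", hdfs_path, backup_path],   # fetch HDFS copy
    ]


MESSAGE = (
    "MetaLog file '/hypertable/servers/master/log/mml/2' has length 7730400 "
    "backup file '/dinglicom/hypertable/0.9.7.17/run/log_backup/mml/"
    "master_38050/2' length is 7724704 - METALOG backup file mismatch"
)

for command in resync_commands(MESSAGE, "/tmp/mml_old"):
    print(shlex.join(command))
```

The keep_dir directory must already exist, and the commands should be reviewed before they are run.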
