天天看點

伺服器強制關閉、異常斷電等導緻clickhouse資料損壞Suspiciously many broken parts to remove

文章目錄

    • 問題
      • 現象
      • 原因
      • 解決
    • 其他處理方式
      • 單表配置方式
      • 指令行方式
      • 配置檔案方式
      • 驗證配置是否生效
    • 參考

問題

現象

  • 機房斷電,伺服器重新開機服務啟動後,發現資料寫入報錯,檢視clickhouse報錯日志
  • 關鍵提示為

    TOO_MANY_PARTS

    DB::Exception: Suspiciously many (12 parts, 0.00 B in total) broken parts to remove while maximum allowed broken parts count is 10.

    ,大概意思是太多資料碎片,要移除的損壞碎片為12,但允許的最大數量為10
  • 後面也給了修複意見

    You can change the maximum value with merge tree setting 'max_suspicious_broken_parts' in <merge_tree> configuration section or in table settings in .sql file (don't forget to return setting back to default value)

    。大緻意思是,修改表參數

    max_suspicious_broken_parts

    ,但是别忘了恢複後改回來
  • 也指明了出錯的表的part

    Cannot attach table radar.signal_status from metadata file /var/lib/clickhouse/store/422/4222e684-3a04-4de6-bacc-879c855ef94c/signal_status.sql from query ATTACH TABLE radar.signal_status

0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0xaebed1a in /usr/bin/clickhouse
1. DB::Exception::Exception<unsigned long&>(int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, unsigned long&) @ 0xaf096b2 in /usr/bin/clickhouse
2. DB::MergeTreeData::delayInsertOrThrowIfNeeded(Poco::Event*) const @ 0x15488280 in /usr/bin/clickhouse
3. ? @ 0x15c0a4e9 in /usr/bin/clickhouse
4. DB::ExceptionKeepingTransform::work() @ 0x15c09c94 in /usr/bin/clickhouse
5. DB::ExecutionThreadContext::executeTask() @ 0x15a60ca3 in /usr/bin/clickhouse
6. DB::PipelineExecutor::executeStepImpl(unsigned long, std::__1::atomic<bool>*) @ 0x15a54b7e in /usr/bin/clickhouse
7. DB::PipelineExecutor::executeStep(std::__1::atomic<bool>*) @ 0x15a54280 in /usr/bin/clickhouse
8. DB::PushingPipelineExecutor::start() @ 0x15a68d14 in /usr/bin/clickhouse
9. DB::SystemLog<DB::MetricLogElement>::flushImpl(std::__1::vector<DB::MetricLogElement, std::__1::allocator<DB::MetricLogElement> > const&, unsigned long) @ 0x14d9b618 in /usr/bin/clickhouse
10. DB::SystemLog<DB::MetricLogElement>::savingThreadFunction() @ 0x14d99ab5 in /usr/bin/clickhouse
11. ? @ 0xaf80cb1 in /usr/bin/clickhouse
12. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0xaf62837 in /usr/bin/clickhouse
13. ? @ 0xaf662fd in /usr/bin/clickhouse
14. start_thread @ 0x7ea5 in /usr/lib64/libpthread-2.17.so
15. __clone @ 0xfe96d in /usr/lib64/libc-2.17.so
 (version 22.2.2.1)
2022.12.08 15:07:08.647866 [ 18188 ] {} <Error> void DB::SystemLog<DB::MetricLogElement>::flushImpl(const std::vector<LogElement> &, uint64_t) [LogElement = DB::MetricLogElement]: Code: 252. DB::Exception: Too many parts (300). Merges are processing significantly slower than inserts. (TOO_MANY_PARTS), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0xaebed1a in /usr/bin/clickhouse
1. DB::Exception::Exception<unsigned long&>(int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, unsigned long&) @ 0xaf096b2 in /usr/bin/clickhouse
2. DB::MergeTreeData::delayInsertOrThrowIfNeeded(Poco::Event*) const @ 0x15488280 in /usr/bin/clickhouse
3. ? @ 0x15c0a4e9 in /usr/bin/clickhouse
4. DB::ExceptionKeepingTransform::work() @ 0x15c09c94 in /usr/bin/clickhouse
5. DB::ExecutionThreadContext::executeTask() @ 0x15a60ca3 in /usr/bin/clickhouse
6. DB::PipelineExecutor::executeStepImpl(unsigned long, std::__1::atomic<bool>*) @ 0x15a54b7e in /usr/bin/clickhouse
7. DB::PipelineExecutor::executeStep(std::__1::atomic<bool>*) @ 0x15a54280 in /usr/bin/clickhouse
8. DB::PushingPipelineExecutor::start() @ 0x15a68d14 in /usr/bin/clickhouse
9. DB::SystemLog<DB::MetricLogElement>::flushImpl(std::__1::vector<DB::MetricLogElement, std::__1::allocator<DB::MetricLogElement> > const&, unsigned long) @ 0x14d9b618 in /usr/bin/clickhouse
10. DB::SystemLog<DB::MetricLogElement>::savingThreadFunction() @ 0x14d99ab5 in /usr/bin/clickhouse
11. ? @ 0xaf80cb1 in /usr/bin/clickhouse
12. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0xaf62837 in /usr/bin/clickhouse
13. ? @ 0xaf662fd in /usr/bin/clickhouse
14. start_thread @ 0x7ea5 in /usr/lib64/libpthread-2.17.so
15. __clone @ 0xfe96d in /usr/lib64/libc-2.17.so
 (version 22.2.2.1)
2022.12.08 15:07:09.448149 [ 18117 ] {} <Error> Application: DB::Exception: Suspiciously many (12 parts, 0.00 B in total) broken parts to remove while maximum allowed broken parts count is 10. You can change the maximum value with merge tree setting 'max_suspicious_broken_parts' in <merge_tree> configuration section or in table settings in .sql file (don't forget to return setting back to default value): Cannot attach table `radar`.`signal_status` from metadata file /var/lib/clickhouse/store/422/4222e684-3a04-4de6-bacc-879c855ef94c/signal_status.sql from query ATTACH TABLE radar.signal_status UUID '85937e46-5e81-4d48-bcfc-dbd2a738c086' (`time_stamp` DateTime COMMENT '信控上報時間(目前燈色開始時間)', `intersection_number` Int32 COMMENT '交叉口編号', `pattern_number` Int16 COMMENT '方案編号', `working_mode` String COMMENT '信号機工作模式', `stage_index` Int8 COMMENT '目前階段序号', `phase_number` Int8 COMMENT '目前執行相位', `stage_status` String COMMENT '目前階段燈色', `len_lamp_light` Int16 COMMENT '目前燈色的持續時間', `cycle` Int16 COMMENT '方案周期(秒)', `green_movement` Array(String) COMMENT '目前相位綠燈可通行進口流向', `green_lane` Array(String) COMMENT '目前相位綠燈可通行進口車道編号', `stage_end_time` DateTime COMMENT '目前燈色結束時間', `cycle_start_time` DateTime COMMENT '本周期開始時間') ENGINE = MergeTree PARTITION BY toYYYYMM(time_stamp) PRIMARY KEY time_stamp ORDER BY (time_stamp, intersection_number) TTL time_stamp + toIntervalMonth(6) SETTINGS index_granularity = 8192, old_parts_lifetime = 300
           

原因

  • 這個是發生在機器斷電場景下的報錯,查找原因是說因為寫入資料造成的中繼資料和資料不一緻問題
  • clickhouse在重新開機服務的時候會重新加載MergeTree表引擎資料,資料可能存在損壞情況
  • clickhouse配置原因

    配置參數當中包含一個參數max_suspicious_broken_parts,預設值是10,可選值範圍是任意正整數,如果單個分區中的損壞部分數量超過max_suspicious_broken_parts 配置的值,則拒絕自動修複或者拒絕删除損壞部分的資料,并且服務啟動時候直接報錯退出

  • 目前需要盡量避免該錯誤以免服務啟動失敗,推薦把該參數配置為1000或者更大的值

解決

  • 我是按照提示,找到對應檔案

    /var/lib/clickhouse/store/422/4222e684-3a04-4de6-bacc-879c855ef94c/signal_status.sql

  • 修改了表配置,設定max_suspicious_broken_parts=20(大于損壞檔案數量)
  • 重新開機資料庫就好了

其他處理方式

單表配置方式

在建立MergeTree表的時候特别配置一下max_suspicious_broken_parts參數

CREATE TABLE foo
(
    `A` Int64
)
ENGINE = MergeTree
ORDER BY tuple()
SETTINGS max_suspicious_broken_parts = 1000;
           

指令行方式

使用ALTER TABLE … MODIFY SETTING指令修改

ALTER TABLE foo
    MODIFY SETTING max_suspicious_broken_parts = 1000;

-- 恢複預設值
-- reset to default (use value from system.merge_tree_settings)
ALTER TABLE foo
    RESET SETTING max_suspicious_broken_parts;
           

配置檔案方式

如果服務起不來了,就隻能使用這個方式解決

  • 建立檔案max_suspicious_broken_parts.xml寫入如下内容
<?xml version="1.0"?>
<yandex>
     <merge_tree>
         <max_suspicious_broken_parts>1000</max_suspicious_broken_parts>
     </merge_tree>
</yandex>
           
  • clickhouse的配置檔案推薦放置在/etc/clickhouse-server/config.d/檔案夾下生效
  • 如果是在Ubuntu或者Centos上面以DEB或RPM安裝包的形式啟動的,需要把該檔案放到/etc/clickhouse-server/config.d/,最後重新開機clickhouse就可以了
  • 如果是docker compose方式啟動

    修改compose.yaml配置如下,主要也是把對應檔案挂載進入容器内部相應位置

services:
  clickhouse:
    image: clickhouse/clickhouse-server
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
    restart: always
    container_name: demo-clickhouse
    environment:
    - CLICKHOUSE_USER=demo
    - CLICKHOUSE_PASSWORD=demo-pass
    - CLICKHOUSE_DB=demo
    ports:
      - "8123:8123"
      - "9000:9000"
    volumes:
      - ./max_suspicious_broken_parts.xml:/etc/clickhouse-server/config.d/max_suspicious_broken_parts.xml
      - demo-clickhouse:/var/lib/clickhouse
    healthcheck:
      test: 'wget -O - http://127.0.0.1:8123 || exit 1'
     
volumes:
  demo-clickhouse: {}
           

驗證配置是否生效

  • 連接配接clickhouse,查詢
SELECT *
FROM system.merge_tree_settings
WHERE name LIKE '%max_suspicious_broken_parts%'
           

參考

  • 解決clickhouse伺服器啟動異常Suspiciously many broken parts to remove
  • 官方文檔:MergeTree table settings

繼續閱讀