Author: cchouqiang
Background
The database cluster is a RawKV cluster (no TiDB nodes) running v5.0.4, and the plan is to upgrade it to v6.1.0. The upgrade has to reuse the existing cluster machines, with no spare machines available, and the rollback time must be kept as short as possible.
Reason for the upgrade:
In v5.0.4 the disaster-recovery setup is implemented with Learner replicas, so the DR copy still lives inside the same cluster. To implement disaster recovery with TiCDC instead, the cluster has to be upgraded to v6.1.0.
Environment preparation
Download address for the v6.1.0 packages:
https://pingcap.com/zh/product-community
Upgrade steps
1) Stop application access to the existing RawKV cluster (referred to below as cluster A)
Watch the monitoring until there are no remaining gRPC connections, and upgrade the Java client to version 3.3 (since there are no TiDB nodes, TiKV is accessed through the Java client); a quick connection check is sketched below.
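One way to confirm there are no lingering client connections is to look at the TiKV gRPC port on each TiKV node. A minimal sketch, assuming the service ports 20160/20161 from the topology below; note that connections from other TiKV/PD nodes inside the cluster are expected, so only connections coming from application hosts matter here:
# run on every TiKV node; look for established connections whose peer is an application host
ss -tnp | grep -E ':(20160|20161) '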
2) Back up the data of the existing cluster (cluster A) with the v5.0.4 BR
br backup raw --pd $PD_ADDR \
-s "local:///nas" \
--ratelimit 128 \
--format hex \
--cf default
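Before shutting cluster A down, it is worth confirming that the backup actually landed on the NAS. A minimal sketch, assuming /nas is a shared mount visible from the node that ran BR and from every TiKV node:
# the backup directory should contain a backupmeta file plus the generated SST files
ls -lh /nas | head -20
du -sh /nas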
3) Shut down cluster A
tiup cluster stop $cluster_name
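Confirm that every component of cluster A has actually stopped before deploying cluster B on the same machines:
tiup cluster display $cluster_name   # no component should still show status Up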
4) Deploy a new v5.0.4 cluster, referred to as cluster B
Cluster B's PD, TiKV, monitoring and other components use deployment directories different from the original cluster's, while the ports stay the same and the topology mirrors cluster A.
Deployment topology file deploy.yaml:
global:
user: "tidb"
ssh_port: 22
deploy_dir: "/home/tidb_deploy"
data_dir: "/home/tidb_data"
os: linux
arch: amd64
monitored:
node_exporter_port: 9100
blackbox_exporter_port: 9115
deploy_dir: "/home/tidb_deploy/monitored-9100"
data_dir: "/home/tidb_data/monitored-9100"
log_dir: "/home/tidb_deploy/monitored-9100/log"
server_configs:
tidb: {}
tikv:
pd.enable-forwarding: true
raftdb.defaultcf.hard-pending-compaction-bytes-limit: 4096GB
raftdb.defaultcf.compression-per-level: ["lz4","lz4","lz4","lz4","lz4","lz4","lz4"]
raftdb.defaultcf.soft-pending-compaction-bytes-limit: 4096GB
raftdb.defaultcf.level0-slowdown-writes-trigger: 32
raftdb.defaultcf.level0-stop-writes-trigger: 1024
raftdb.defaultcf.max-write-buffer-number: 16
raftdb.defaultcf.write-buffer-size: 256MB
raftdb.max-background-jobs: 4
raftdb.max-sub-compactions: 2
raftdb.use-direct-io-for-flush-and-compaction: true
raftstore.apply-pool-size: 6
raftstore.hibernate-regions: false
raftstore.pd-heartbeat-tick-interval: 5s
raftstore.raft-base-tick-interval: 200ms
raftstore.raft-election-timeout-ticks: 5
raftstore.raft-heartbeat-ticks: 1
raftstore.raft-max-inflight-msgs: 2048
raftstore.raft-store-max-leader-lease: 800ms
raftstore.store-pool-size: 4
readpool.storage.use-unified-pool: true
rocksdb.defaultcf.block-size: 8KB
rocksdb.defaultcf.hard-pending-compaction-bytes-limit: 4096GB
rocksdb.defaultcf.max-bytes-for-level-base: 1GB
rocksdb.defaultcf.soft-pending-compaction-bytes-limit: 4096GB
rocksdb.defaultcf.level0-slowdown-writes-trigger: 32
rocksdb.defaultcf.level0-stop-writes-trigger: 1024
rocksdb.defaultcf.max-write-buffer-number: 16
rocksdb.defaultcf.write-buffer-size: 1024MB
rocksdb.defaultcf.compression-per-level: ["lz4","lz4","lz4","lz4","lz4","lz4","lz4"]
rocksdb.max-background-jobs: 16
rocksdb.max-sub-compactions: 4
rocksdb.rate-bytes-per-sec: 500MB
rocksdb.use-direct-io-for-flush-and-compaction: true
server.grpc-raft-conn-num: 3
storage.enable-ttl: true
storage.ttl-check-poll-interval: 24h
log-level: info
readpool.unified.max-thread-count: 20
storage.block-cache.capacity: "100GB"
server.grpc-concurrency: 4
pd:
election-interval: 1s
lease: 1
tick-interval: 200ms
max-replicas: 5
replication.location-labels: ["zone","rack","host"]
pd_servers:
- host: xx.xx.xx.xx
ssh_port: 22
name: pd_ip-6
client_port: 2379
peer_port: 2380
deploy_dir: /home/tidb_deploy/pd-2379
data_dir: /home/tidb_data/pd-2379
log_dir: /home/tidb_deploy/pd-2379/log
arch: amd64
os: linux
tikv_servers:
- host: xx.xx.xx.xx
port: 20160
status_port: 20180
deploy_dir: "/home/tidb_deploy/tikv-20160"
data_dir: "/home/tidb_data/tikv-20160"
log_dir: "/home/tidb_deploy/tikv-20160/log"
config:
server.labels:
rack: az1
host: ip-7
arch: amd64
os: linux
- host: xx.xx.xx.xx
port: 20161
status_port: 20181
deploy_dir: "/home/tidb_deploy/tikv-20160"
data_dir: "/home/tidb_data/tikv-20160"
log_dir: "/home/tidb_deploy/tikv-20160/log"
config:
server.labels:
rack: az1
host: ip-8
arch: amd64
os: linux
- host: xx.xx.xx.xx
port: 20160
status_port: 20180
deploy_dir: "/home/tidb_deploy/tikv-20160"
data_dir: "/home/tidb_data/tikv-20160"
log_dir: "/home/tidb_deploy/tikv-20160/log"
config:
server.labels:
rack: az1
host: ip-9
arch: amd64
os: linux
monitoring_servers:
- host: xx.xx.xx.xx
ssh_port: 22
port: 9090
deploy_dir: "/home/tidb_deploy/prometheus-8249"
data_dir: "/home/tidb_data/prometheus-8249"
log_dir: "/home/tidb_deploy/prometheus-8249/log"
arch: amd64
os: linux
grafana_servers:
- host: xx.xx.xx.xx
port: 3000
deploy_dir: /home/tidb_deploy/grafana-3000
arch: amd64
os: linux
To deploy the cluster, install the tiup tool on a machine that does not already manage cluster A, because a single tiup control machine cannot deploy a second cluster that uses the same ports.
$ scp -r ~/.tiup tidb@xx.xx.xx.xx:~/.tiup
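# (optional) validate the topology and target machines before deploying;
# tiup cluster check reports port/directory conflicts and OS-level issues
$ tiup cluster check ./deploy.yaml --user tidb -p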
$ tiup cluster deploy backup v5.0.4 ./deploy.yaml --user tidb -p -i /home/tidb/.ssh/id_rsa
$ tiup cluster start backup
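Before restoring data into it, confirm that cluster B came up cleanly:
$ tiup cluster display backup   # every PD/TiKV/monitoring component should show status Up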
5) Restore cluster A's backup data into cluster B
br restore raw \
--pd "${PDIP}:2379" \
--storage "local:///nas" \
--ratelimit 128 \
--format hex --log-file restore.log
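After the restore finishes, a quick sanity check is to look at what PD reports for each store; a sketch using pd-ctl (the exact counts depend on the data set):
# every TiKV store should report a non-trivial region_count after the restore
tiup ctl:v5.0.4 pd -u "${PDIP}:2379" store | grep -E '"(address|region_count)"'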
6) Upgrade cluster B to RawKV 6.1
# First upgrade the tiup and TiUP Cluster component versions
tar xzvf tidb-community-server-v6.1.0-linux-amd64.tar.gz
sh tidb-community-server-v6.1.0-linux-amd64/local_install.sh
source /home/tidb/.bash_profile
tiup update cluster
# Check the tiup versions after the update
tiup --version
tiup cluster --version
# Perform an offline upgrade: the whole cluster has to be stopped first.
$ tiup cluster stop backup
# Add the --offline flag to the upgrade command to perform the offline upgrade.
$ tiup cluster upgrade backup v6.1.0 --offline
# The cluster is not started automatically after the upgrade; start it with the start command.
$ tiup cluster start backup
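A quick way to confirm the offline upgrade took effect:
$ tiup cluster display backup | head -5   # the header should now show Cluster version: v6.1.0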
7) After the upgrade, it is best to restart the cluster once more and check the TiKV versions
tiup ctl:v6.1.0 pd -u xx.xx.xx.xx:2379 store | grep version
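The output should contain one version line per TiKV store, all reporting the new version, roughly:
"version": "6.1.0",
"version": "6.1.0",
"version": "6.1.0",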
8) Resume application access to cluster B
Have the application side run its validation.
9) Clean up cluster A once the new cluster is verified
After application validation passes, clean up cluster A:
tiup cluster destroy tidb-cluster    # destroy the original cluster A (named tidb-cluster)
tiup cluster reload backup           # push the configuration out and restart cluster B's services
tiup cluster rename backup tidb-v6   # rename cluster B to tidb-v6
Rollback plan
If a rollback becomes necessary for any reason, use the following plan (a command sketch follows the list):
1. Stop all application access;
2. Destroy the RawKV 6.1 cluster (cluster B);
3. Start the original RawKV 5.0.4 cluster (cluster A);
   If startup fails with errors about missing service files, this can be resolved with tiup cluster enable tidb-cluster.
4. Resume application access to the cluster (cluster A).
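Expressed as commands, using the cluster names from this article (backup for cluster B, tidb-cluster for cluster A), a minimal rollback sketch:
tiup cluster destroy backup           # remove the RawKV 6.1 cluster (cluster B)
tiup cluster enable tidb-cluster      # per the note above, run this if startup complains about missing service files
tiup cluster start tidb-cluster       # bring the original v5.0.4 cluster (cluster A) back up
tiup cluster display tidb-cluster     # confirm everything is Up before re-pointing the application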
Summary and reflections
When upgrading a database, data safety comes first, while downtime also has to be kept to a minimum.