天天看点

greenplum 节点故障修复文档

greenplum集群说明

greenplum整个集群是由多台服务器组合而成,任何一台服务都有可能发生软件或硬件故障,我们一起来模拟一下任何一个节点或服务器故障后,greenplumn的容错及恢复方法

master  服务器:l-test5
standby 服务器:l-test6
segment 服务器:(primary + mirror):l-test[7-12]
 
greenplum集群说明:1 master  +  1 standby + 24 primary segments + 24 mirror segments           

正常集群状态

在master查看数据库当前的状态
[gpadmin@l-test5 ~]$ gpstate
20180209:16:23:43:004860 gpstate:l-test5:gpadmin-[INFO]:-Starting gpstate with args: 
20180209:16:23:43:004860 gpstate:l-test5:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3'
20180209:16:23:43:004860 gpstate:l-test5:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3) on x86_64-pc-linux-gnu, compi
led by GCC gcc (GCC) 6.2.0, 64-bit compiled on Jan 22 2018 18:15:33'
20180209:16:23:43:004860 gpstate:l-test5:gpadmin-[INFO]:-Obtaining Segment details from master...
20180209:16:23:43:004860 gpstate:l-test5:gpadmin-[INFO]:-Gathering data from segments...
.. 
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-Greenplum instance status summary
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Master instance                                           = Active
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Master standby                                            = l-test6
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Standby master state                                      = Standby host passive
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total segment instance count from metadata                = 48
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Primary Segment Status
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total primary segments                                    = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total primary segment valid (at master)                   = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total primary segment failures (at master)                = 0
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number of postmaster.pid files missing              = 0
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number of postmaster.pid files found                = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number of postmaster.pid PIDs missing               = 0
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number of postmaster.pid PIDs found                 = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number of /tmp lock files missing                   = 0
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number of /tmp lock files found                     = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number postmaster processes missing                 = 0
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number postmaster processes found                   = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Mirror Segment Status
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total mirror segments                                     = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total mirror segment valid (at master)                    = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total mirror segment failures (at master)                 = 0
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number of postmaster.pid files missing              = 0
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number of postmaster.pid files found                = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number of postmaster.pid PIDs missing               = 0
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number of postmaster.pid PIDs found                 = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number of /tmp lock files missing                   = 0
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number of /tmp lock files found                     = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number postmaster processes missing                 = 0
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number postmaster processes found                   = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number mirror segments acting as primary segments   = 0
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number mirror segments acting as mirror segments    = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-----------------------------------------------------
           

master 服务器故障

当master节点故障后,我们需要激活standby节点作为新的master节点(如果服务器允许,把vip也切换到standby服务器)

在激活standby节点的可以直接指定新的standby节点,也可以等原master服务器恢复后,指定原master节点为standby节点

这里我就不关服务器了,直接模拟master节点故障的状态
关闭master节点

[gpadmin@l-test5 ~]$ gpstop -a -m 
20180211:15:16:24:022773 gpstop:l-test5:gpadmin-[INFO]:-Starting gpstop with args: -a -m
20180211:15:16:24:022773 gpstop:l-test5:gpadmin-[INFO]:-Gathering information and validating the environment...
20180211:15:16:24:022773 gpstop:l-test5:gpadmin-[INFO]:-Obtaining Greenplum Master catalog information
20180211:15:16:24:022773 gpstop:l-test5:gpadmin-[INFO]:-Obtaining Segment details from master...
20180211:15:16:24:022773 gpstop:l-test5:gpadmin-[INFO]:-Greenplum Version: 'postgres (Greenplum Database) 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3'
20180211:15:16:24:022773 gpstop:l-test5:gpadmin-[INFO]:-There are 0 connections to the database
20180211:15:16:24:022773 gpstop:l-test5:gpadmin-[INFO]:-Commencing Master instance shutdown with mode='smart'
20180211:15:16:24:022773 gpstop:l-test5:gpadmin-[INFO]:-Master host=l-test5
20180211:15:16:24:022773 gpstop:l-test5:gpadmin-[INFO]:-Commencing Master instance shutdown with mode=smart
20180211:15:16:24:022773 gpstop:l-test5:gpadmin-[INFO]:-Master segment instance directory=/export/gp_data/master/gpseg-1
20180211:15:16:25:022773 gpstop:l-test5:gpadmin-[INFO]:-Attempting forceful termination of any leftover master process
20180211:15:16:25:022773 gpstop:l-test5:gpadmin-[INFO]:-Terminating processes for segment /export/gp_data/master/gpseg-1

[gpadmin@l-test5 ~]$ psql
psql: could not connect to server: No such file or directory
    Is the server running locally and accepting
    connections on Unix domain socket "/tmp/.s.PGSQL.65432"?

激活standby节点

[gpadmin@l-test6 ~]$ gpactivatestandby -a -d /export/gp_data/master/gpseg-1/
20180211:15:19:48:016771 gpactivatestandby:l-test6:gpadmin-[INFO]:------------------------------------------------------
20180211:15:19:48:016771 gpactivatestandby:l-test6:gpadmin-[INFO]:-Standby data directory    = /export/gp_data/master/gpseg-1
20180211:15:19:48:016771 gpactivatestandby:l-test6:gpadmin-[INFO]:-Standby port              = 65432
20180211:15:19:48:016771 gpactivatestandby:l-test6:gpadmin-[INFO]:-Standby running           = yes
20180211:15:19:48:016771 gpactivatestandby:l-test6:gpadmin-[INFO]:-Force standby activation  = no
.
.
.

切换服务器vip

原master节点服务器卸载vip:
[root@l-test5 ~]# 
[root@l-test5 ~]# ip a d 10.0.0.1/32 brd + dev bond0

-----------
原standby节点服务器挂载vip:

[root@l-test6 ~]# 
[root@l-test6 ~]# ip a a  10.0.0.1/32 brd + dev bond0 && arping -q -c 3 -U -I bond0 10.0.0.1

指定新的standby节点

我们指定原master节点为新的standby节点服务器

需要先删除原master的数据文件,然后重新执行初始化standby节点即可

[gpadmin@l-test6 ~]$ gpinitstandby -a -s l-test5
20180211:15:28:56:019106 gpinitstandby:l-test6:gpadmin-[INFO]:-Validating environment and parameters for standby initialization...
20180211:15:28:56:019106 gpinitstandby:l-test6:gpadmin-[INFO]:-Checking for filespace directory /export/gp_data/master/gpseg-1 on l-test5
20180211:15:28:56:019106 gpinitstandby:l-test6:gpadmin-[ERROR]:-Filespace directory already exists on host l-test5
20180211:15:28:56:019106 gpinitstandby:l-test6:gpadmin-[ERROR]:-Failed to create standby
20180211:15:28:56:019106 gpinitstandby:l-test6:gpadmin-[ERROR]:-Error initializing standby master: master data directory exists
----------注意,这步是在原master节点操作-------
[gpadmin@l-test5 ~]$ cd /export/gp_data/master/
[gpadmin@l-test5 /export/gp_data/master]$ ll
total 4
drwx------ 17 gpadmin gpadmin 4096 Feb 11 15:17 gpseg-1
[gpadmin@l-test5 /export/gp_data/master]$ rm -rf gpseg-1/
[gpadmin@l-test5 /export/gp_data/master]$ ll
total 0
[gpadmin@l-test5 /export/gp_data/master]$ 
----------

[gpadmin@l-test6 ~]$ gpinitstandby -a -s l-test5
20180211:15:30:55:019592 gpinitstandby:l-test6:gpadmin-[INFO]:-Validating environment and parameters for standby initialization...
20180211:15:30:55:019592 gpinitstandby:l-test6:gpadmin-[INFO]:-Checking for filespace directory /export/gp_data/master/gpseg-1 on l-test5
20180211:15:30:55:019592 gpinitstandby:l-test6:gpadmin-[INFO]:------------------------------------------------------
20180211:15:30:55:019592 gpinitstandby:l-test6:gpadmin-[INFO]:-Greenplum standby master initialization parameters
20180211:15:30:55:019592 gpinitstandby:l-test6:gpadmin-[INFO]:------------------------------------------------------
20180211:15:30:55:019592 gpinitstandby:l-test6:gpadmin-[INFO]:-Greenplum master hostname               = l-test6
20180211:15:30:55:019592 gpinitstandby:l-test6:gpadmin-[INFO]:-Greenplum master data directory         = /export/gp_data/master/gpseg-1
20180211:15:30:55:019592 gpinitstandby:l-test6:gpadmin-[INFO]:-Greenplum master port                   = 65432
20180211:15:30:55:019592 gpinitstandby:l-test6:gpadmin-[INFO]:-Greenplum standby master hostname       = l-test5
20180211:15:30:55:019592 gpinitstandby:l-test6:gpadmin-[INFO]:-Greenplum standby master port           = 65432
20180211:15:30:55:019592 gpinitstandby:l-test6:gpadmin-[INFO]:-Greenplum standby master data directory = /export/gp_data/master/gpseg-1
20180211:15:30:55:019592 gpinitstandby:l-test6:gpadmin-[INFO]:-Greenplum update system catalog         = On
.
.
           

standby 服务器故障

当standby节点服务器恢复后,需要将standby节点删除,然后重新初始化一下standby服务器即可

在master服务器上查看整个greenplum集群状态,我们发现[WARNING]:-Standby master state

[gpadmin@l-test5 ~]$ gpstate
20180209:16:32:19:007208 gpstate:l-test5:gpadmin-[INFO]:-Starting gpstate with args: 
20180209:16:32:19:007208 gpstate:l-test5:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3'
20180209:16:32:19:007208 gpstate:l-test5:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3) on x86_64-pc-linux-gnu, compi
led by GCC gcc (GCC) 6.2.0, 64-bit compiled on Jan 22 2018 18:15:33'
20180209:16:32:19:007208 gpstate:l-test5:gpadmin-[INFO]:-Obtaining Segment details from master...
20180209:16:32:19:007208 gpstate:l-test5:gpadmin-[INFO]:-Gathering data from segments...
20180209:16:32:21:007208 gpstate:l-test5:gpadmin-[INFO]:-Greenplum instance status summary
20180209:16:32:21:007208 gpstate:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180209:16:32:21:007208 gpstate:l-test5:gpadmin-[INFO]:-   Master instance                                           = Active
20180209:16:32:21:007208 gpstate:l-test5:gpadmin-[INFO]:-   Master standby                                            = l-test6
20180209:16:32:21:007208 gpstate:l-test5:gpadmin-[WARNING]:-Standby master state                                      = Standby host DOWN   
.
.
[gpadmin@l-test5 ~]$ gpstate -f
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:-Starting gpstate with args: -f
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3'
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3) on x86_64-pc-linux-gnu, compi
led by GCC gcc (GCC) 6.2.0, 64-bit compiled on Jan 22 2018 18:15:33'
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:-Obtaining Segment details from master...
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:-Standby master details
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:-----------------------
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:-   Standby address          = l-test6
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:-   Standby data directory   = /export/gp_data/master/gpseg-1
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:-   Standby port             = 65432
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[WARNING]:-Standby PID              = 10480                            Have lock file /tmp/.s.PGSQL.65432.lock but no process running on port 65432   <<<<<<<<
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[WARNING]:-Standby status           = Status could not be determined   <<<<<<<<
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:--------------------------------------------------------------
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:--pg_stat_replication
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:--------------------------------------------------------------
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:-No entries found.
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:--------------------------------------------------------------

1、删除故障的standby节点

[gpadmin@l-test5 ~]$ gpinitstandby -r -a
20180209:16:52:04:011917 gpinitstandby:l-test5:gpadmin-[INFO]:------------------------------------------------------
20180209:16:52:04:011917 gpinitstandby:l-test5:gpadmin-[INFO]:-Warm master standby removal parameters
20180209:16:52:04:011917 gpinitstandby:l-test5:gpadmin-[INFO]:------------------------------------------------------
20180209:16:52:04:011917 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum master hostname               = l-test5
20180209:16:52:04:011917 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum master data directory         = /export/gp_data/master/gpseg-1
20180209:16:52:04:011917 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum master port                   = 65432
20180209:16:52:04:011917 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum standby master hostname       = l-test6
20180209:16:52:04:011917 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum standby master port           = 65432
20180209:16:52:04:011917 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum standby master data directory = /export/gp_data/master/gpseg-1
20180209:16:52:18:011917 gpinitstandby:l-test5:gpadmin-[INFO]:-Removing standby master from catalog...
20180209:16:52:18:011917 gpinitstandby:l-test5:gpadmin-[INFO]:-Database catalog updated successfully.
20180209:16:52:18:011917 gpinitstandby:l-test5:gpadmin-[INFO]:-Removing filespace directories on standby master...
20180209:16:52:18:011917 gpinitstandby:l-test5:gpadmin-[INFO]:-Successfully removed standby master

2、重新初始化standby节点
[gpadmin@l-test5 ~]$ gpinitstandby -s l-test6 -a
20180209:16:59:08:013723 gpinitstandby:l-test5:gpadmin-[INFO]:-Validating environment and parameters for standby initialization...
20180209:16:59:08:013723 gpinitstandby:l-test5:gpadmin-[INFO]:-Checking for filespace directory /export/gp_data/master/gpseg-1 on l-test6
20180209:16:59:08:013723 gpinitstandby:l-test5:gpadmin-[INFO]:------------------------------------------------------
20180209:16:59:08:013723 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum standby master initialization parameters
20180209:16:59:08:013723 gpinitstandby:l-test5:gpadmin-[INFO]:------------------------------------------------------
20180209:16:59:08:013723 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum master hostname               = l-test5
20180209:16:59:08:013723 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum master data directory         = /export/gp_data/master/gpseg-1
20180209:16:59:08:013723 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum master port                   = 65432
20180209:16:59:08:013723 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum standby master hostname       = l-test6
20180209:16:59:08:013723 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum standby master port           = 65432
20180209:16:59:08:013723 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum standby master data directory = /export/gp_data/master/gpseg-1
20180209:16:59:08:013723 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum update system catalog         = On
.
.

3、检查standby 的配置信息

[gpadmin@l-test5 ~]$ gpstate -f
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:-Starting gpstate with args: -f
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3'
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3) on x86_64-pc-linux-gnu, compi
led by GCC gcc (GCC) 6.2.0, 64-bit compiled on Jan 22 2018 18:15:33'
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:-Obtaining Segment details from master...
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:-Standby master details
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:-----------------------
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:-   Standby address          = l-test6
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:-   Standby data directory   = /export/gp_data/master/gpseg-1
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:-   Standby port             = 65432
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:-   Standby PID              = 19057
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:-   Standby status           = Standby host passive
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:--------------------------------------------------------------
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:--pg_stat_replication
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:--------------------------------------------------------------
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:--WAL Sender State: streaming
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:--Sync state: sync
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:--Sent Location: 0/14000000
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:--Flush Location: 0/14000000
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:--Replay Location: 0/14000000
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:--------------------------------------------------------------
           

segment 节点故障

当一个primary segment节点故障,那么它所对应的mirror segment节点会接替primary的状态,继续保证整个集群的数据完整性

当一个mirror segment节点出现故障,它不会影响整个集群的可用性,但是需要尽快修复,保证所有的primary segment都有备份

如果primary segment 和 它所对应的mirror segment 节点都出现故障,那么greenplum认为集群数据不完整,整个集群将不再提供服务,直到primary segment 或 mirror segment恢复

修复segment节点,需要重启greenplum集群

primary segment节点和mirror segment节点的故障修复方式是一样的,这里以mirror节点故障为例
关闭一个节点后集群状态

[gpadmin@l-test7 ~]$ pg_ctl -D /export/gp_data/mirror/data4/gpseg23 stop -m fast
waiting for server to shut down.... done
server stopped

--------master节点执行-------
[gpadmin@l-test5 ~]$ gpstate 
20180211:16:17:12:005738 gpstate:l-test5:gpadmin-[INFO]:-Starting gpstate with args: 
20180211:16:17:12:005738 gpstate:l-test5:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3'
20180211:16:17:12:005738 gpstate:l-test5:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3) on x86_64-pc-linux-gnu, compi
led by GCC gcc (GCC) 6.2.0, 64-bit compiled on Jan 22 2018 18:15:33'
20180211:16:17:12:005738 gpstate:l-test5:gpadmin-[INFO]:-Obtaining Segment details from master...
20180211:16:17:12:005738 gpstate:l-test5:gpadmin-[INFO]:-Gathering data from segments...
.
.
.
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[INFO]:-   Mirror Segment Status
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[INFO]:-   Total mirror segments                                     = 24
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[INFO]:-   Total mirror segment valid (at master)                    = 23
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[WARNING]:-Total mirror segment failures (at master)                 = 1                      <<<<<<<<
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[WARNING]:-Total number of postmaster.pid files missing              = 1                      <<<<<<<<
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[INFO]:-   Total number of postmaster.pid files found                = 23
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[WARNING]:-Total number of postmaster.pid PIDs missing               = 1                      <<<<<<<<
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[INFO]:-   Total number of postmaster.pid PIDs found                 = 23
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[WARNING]:-Total number of /tmp lock files missing                   = 1                      <<<<<<<<
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[INFO]:-   Total number of /tmp lock files found                     = 23
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[WARNING]:-Total number postmaster processes missing                 = 1                      <<<<<<<<
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[INFO]:-   Total number postmaster processes found                   = 23
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[INFO]:-   Total number mirror segments acting as primary segments   = 0
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[INFO]:-   Total number mirror segments acting as mirror segments    = 24
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[INFO]:-----------------------------------------------------


[gpadmin@l-test5 ~]$ gpstate -m
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[INFO]:-Starting gpstate with args: -m
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3'
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3) on x86_64-pc-linux-gnu, compi
led by GCC gcc (GCC) 6.2.0, 64-bit compiled on Jan 22 2018 18:15:33'
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[INFO]:-Obtaining Segment details from master...
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[INFO]:--------------------------------------------------------------
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[INFO]:--Current GPDB mirror list and status
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[INFO]:--Type = Spread
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[INFO]:--------------------------------------------------------------
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[INFO]:-   Mirror              Datadir                                Port    Status    Data Status    
.
.
.
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[INFO]:-   l-test12   /export/gp_data/mirror/data3/gpseg22   50002   Passive   Synchronized
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[WARNING]:-l-test7    /export/gp_data/mirror/data4/gpseg23   50003   Failed                   <<<<<<<<
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[INFO]:--------------------------------------------------------------
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[WARNING]:-1 segment(s) configured as mirror(s) have failed

 
[gpadmin@l-test5 ~]$ gpstate -e
20180211:16:17:35:006045 gpstate:l-test5:gpadmin-[INFO]:-Starting gpstate with args: -e
20180211:16:17:35:006045 gpstate:l-test5:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3'
20180211:16:17:35:006045 gpstate:l-test5:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3) on x86_64-pc-linux-gnu, compi
led by GCC gcc (GCC) 6.2.0, 64-bit compiled on Jan 22 2018 18:15:33'
20180211:16:17:35:006045 gpstate:l-test5:gpadmin-[INFO]:-Obtaining Segment details from master...
20180211:16:17:35:006045 gpstate:l-test5:gpadmin-[INFO]:-Gathering data from segments...
.. 
20180211:16:17:37:006045 gpstate:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180211:16:17:37:006045 gpstate:l-test5:gpadmin-[INFO]:-Segment Mirroring Status Report
20180211:16:17:37:006045 gpstate:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180211:16:17:37:006045 gpstate:l-test5:gpadmin-[INFO]:-Primaries in Change Tracking
20180211:16:17:37:006045 gpstate:l-test5:gpadmin-[INFO]:-   Current Primary    Port    Change tracking size   Mirror             Port
20180211:16:17:37:006045 gpstate:l-test5:gpadmin-[INFO]:-   l-test9   40003   128 bytes              l-test7   50003

关闭整个greenplum集群

[gpadmin@l-test5 ~]$ gpstop -a -M fast
.
.
.
20180211:16:25:00:008223 gpstop:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180211:16:25:00:008223 gpstop:l-test5:gpadmin-[INFO]:-   Segments stopped successfully                              = 48
20180211:16:25:00:008223 gpstop:l-test5:gpadmin-[INFO]:-   Segments with errors during stop                           = 0
20180211:16:25:00:008223 gpstop:l-test5:gpadmin-[INFO]:-   
20180211:16:25:00:008223 gpstop:l-test5:gpadmin-[WARNING]:-Segments that are currently marked down in configuration   = 1    <<<<<<<<
20180211:16:25:00:008223 gpstop:l-test5:gpadmin-[INFO]:-            (stop was still attempted on these segments)
20180211:16:25:00:008223 gpstop:l-test5:gpadmin-[INFO]:-----------------------------------------------------
.
.
.

以restricted方式启动数据库
[gpadmin@l-test5 ~]$ gpstart -a -R 
.
. 
20180211:16:25:22:008571 gpstart:l-test5:gpadmin-[INFO]:-Process results...
20180211:16:25:22:008571 gpstart:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180211:16:25:22:008571 gpstart:l-test5:gpadmin-[INFO]:-   Successful segment starts                                            = 47
20180211:16:25:22:008571 gpstart:l-test5:gpadmin-[INFO]:-   Failed segment starts                                                = 0
20180211:16:25:22:008571 gpstart:l-test5:gpadmin-[INFO]:-   Skipped segment starts (segments are marked down in configuration)   = 0
20180211:16:25:22:008571 gpstart:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180211:16:25:22:008571 gpstart:l-test5:gpadmin-[INFO]:-Successfully started 47 of 47 segment instances 
20180211:16:25:22:008571 gpstart:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180211:16:25:22:008571 gpstart:l-test5:gpadmin-[INFO]:-Starting Master instance l-test5 directory /export/gp_data/master/gpseg-1 in RESTRICTED mode
20180211:16:25:23:008571 gpstart:l-test5:gpadmin-[INFO]:-Command pg_ctl reports Master l-test5 instance active
20180211:16:25:23:008571 gpstart:l-test5:gpadmin-[INFO]:-Starting standby master
20180211:16:25:23:008571 gpstart:l-test5:gpadmin-[INFO]:-Checking if standby master is running on host: l-test6  in directory: /export/gp_data/master/gpseg-1
20180211:16:25:24:008571 gpstart:l-test5:gpadmin-[INFO]:-Database successfully started

开始修复故障节点

[gpadmin@l-test5 ~]$ gprecoverseg -a
.
.
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-Greenplum instance recovery parameters
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:----------------------------------------------------------
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-Recovery type              = Standard
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:----------------------------------------------------------
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-Recovery 1 of 1
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:----------------------------------------------------------
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-   Synchronization mode                        = Incremental
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-   Failed instance host                        = l-test7
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-   Failed instance address                     = l-test7
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-   Failed instance directory                   = /export/gp_data/mirror/data4/gpseg23
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-   Failed instance port                        = 50003
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-   Failed instance replication port            = 51003
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-   Recovery Source instance host               = l-test9
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-   Recovery Source instance address            = l-test9
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-   Recovery Source instance directory          = /export/gp_data/primary/data4/gpseg23
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-   Recovery Source instance port               = 40003
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-   Recovery Source instance replication port   = 41003
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-   Recovery Target                             = in-place
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:----------------------------------------------------------
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-1 segment(s) to recover
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-Ensuring 1 failed segment(s) are stopped
. 
.

检查集群修复状态

一直等到Data Status 这个属性全部都是Synchronized即可进行下一步操作

[gpadmin@l-test5 ~]$ gpstate -m
.
20180211:16:25:51:009237 gpstate:l-test5:gpadmin-[INFO]:-   Mirror              Datadir                                Port    Status    Data Status       
20180211:16:25:51:009237 gpstate:l-test5:gpadmin-[INFO]:-   l-test11   /export/gp_data/mirror/data1/gpseg0    50000   Passive   Synchronized
.
.
20180211:16:25:51:009237 gpstate:l-test5:gpadmin-[INFO]:-   l-test7    /export/gp_data/mirror/data4/gpseg23   50003   Passive   Resynchronizing
20180211:16:25:51:009237 gpstate:l-test5:gpadmin-[INFO]:--------------------------------------------------------------

[gpadmin@l-test5 ~]$ 
[gpadmin@l-test5 ~]$ gpstate -m
.
.
20180211:16:31:54:010763 gpstate:l-test5:gpadmin-[INFO]:--------------------------------------------------------------
20180211:16:31:54:010763 gpstate:l-test5:gpadmin-[INFO]:-   Mirror              Datadir                                Port    Status    Data Status    
20180211:16:31:54:010763 gpstate:l-test5:gpadmin-[INFO]:-   l-test11   /export/gp_data/mirror/data1/gpseg0    50000   Passive   Synchronized
.
.
20180211:16:31:54:010763 gpstate:l-test5:gpadmin-[INFO]:-   l-test7    /export/gp_data/mirror/data4/gpseg23   50003   Passive   Synchronized
20180211:16:31:54:010763 gpstate:l-test5:gpadmin-[INFO]:--------------------------------------------------------------

重启greenplum集群

[gpadmin@l-test5 ~]$ gpstop -a -r
.
.
20180211:16:36:43:012012 gpstop:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180211:16:36:43:012012 gpstop:l-test5:gpadmin-[INFO]:-   Segments stopped successfully      = 48
20180211:16:36:43:012012 gpstop:l-test5:gpadmin-[INFO]:-   Segments with errors during stop   = 0
20180211:16:36:43:012012 gpstop:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180211:16:36:43:012012 gpstop:l-test5:gpadmin-[INFO]:-Successfully shutdown 48 of 48 segment instances 
20180211:16:36:43:012012 gpstop:l-test5:gpadmin-[INFO]:-Database successfully shutdown with no errors reported
20180211:16:36:43:012012 gpstop:l-test5:gpadmin-[INFO]:-Cleaning up leftover gpmmon process
20180211:16:36:44:012012 gpstop:l-test5:gpadmin-[INFO]:-No leftover gpmmon process found
20180211:16:36:44:012012 gpstop:l-test5:gpadmin-[INFO]:-Cleaning up leftover gpsmon processes
20180211:16:36:44:012012 gpstop:l-test5:gpadmin-[INFO]:-No leftover gpsmon processes on some hosts. not attempting forceful termination on these hosts
20180211:16:36:44:012012 gpstop:l-test5:gpadmin-[INFO]:-Cleaning up leftover shared memory
20180211:16:36:44:012012 gpstop:l-test5:gpadmin-[INFO]:-Restarting System...