天天看点

PostgreSQL PITR 任意时间点恢复过程中如何手工得到recovery需要的下一个WAL文件名 - 默认情况下restore_command自动获取

标签

PostgreSQL , recovery , recovery.conf , restore_command , timeline , 时间线 , next wal , PITR , 时间点恢复

https://github.com/digoal/blog/blob/master/201903/20190305_01.md#%E8%83%8C%E6%99%AF 背景

PostgreSQL数据库支持PITR时间点恢复。默认情况下,只需要配置目标是时间点,resotre_command即可,PG会自动调用resotre_command去找需要的WAL文件。

一个典型的recovery.conf配置如下:

#---------------------------------------------------------------------------  
# ARCHIVE RECOVERY PARAMETERS  
#---------------------------------------------------------------------------  
#  
# restore_command  
#  
# specifies the shell command that is executed to copy log files  
# back from archival storage.  The command string may contain %f,  
# which is replaced by the name of the desired log file, and %p,  
# which is replaced by the absolute path to copy the log file to.  
#  
# This parameter is *required* for an archive recovery, but optional  
# for streaming replication.  
#  
# It is important that the command return nonzero exit status on failure.  
# The command *will* be asked for log files that are not present in the  
# archive; it must return nonzero when so asked.  
#  
# NOTE that the basename of %p will be different from %f; do not  
# expect them to be interchangeable.  
#  
restore_command = 'cp /data01/digoal/wal/%f %p'  
  
#---------------------------------------------------------------------------  
# RECOVERY TARGET PARAMETERS  
#---------------------------------------------------------------------------  
#  
# By default, recovery will rollforward to the end of the WAL log.  
# If you want to stop rollforward at a specific point, you  
# must set a recovery target.  
#  
# You may set a recovery target either by transactionId, by name, by  
# timestamp or by WAL location (LSN). Recovery may either include or  
# exclude the transaction(s) with the recovery target value (i.e.,  
# stop either just after or just before the given target,  
# respectively).  
#  
#  
#recovery_target_name = ''      # e.g. 'daily backup 2011-01-26'  
#  
recovery_target_time = '2019-03-05 20:52:16.294366+08'      # e.g. '2004-07-14 22:39:00 EST'  
#  
#recovery_target_xid = ''  
#  
#recovery_target_lsn = ''       # e.g. '0/70006B8'  
#  
#recovery_target_inclusive = true  
  
recovery_target_timeline = 'latest'  
  
# If recovery_target_action = 'pause', recovery will pause when the  
# recovery target is reached. The pause state will continue until  
# pg_wal_replay_resume() is called. This setting has no effect if  
# no recovery target is set. If hot_standby is not enabled then the  
# server will shutdown instead, though you may request this in  
# any case by specifying 'shutdown'.  
#  
#recovery_target_action = 'pause'  
  
#---------------------------------------------------------------------------  
# STANDBY SERVER PARAMETERS  
#---------------------------------------------------------------------------  
#  
# standby_mode  
#  
# When standby_mode is enabled, the PostgreSQL server will work as a  
# standby. It will continuously wait for the additional XLOG records, using  
# restore_command and/or primary_conninfo.  
#  
standby_mode = on  
           

恢复目标支持:

1、时间

2、自定义还原点名字

3、事务ID

4、WAL LSN

《PostgreSQL PITR THREE recovery target MODE: name,xid,time USE CASE - 2》 《PostgreSQL PITR THREE recovery target MODE: name,xid,time USE CASE - 1》 《PostgreSQL recovery target introduce》

https://github.com/digoal/blog/blob/master/201903/20190305_01.md#%E6%8E%A5%E4%B8%8B%E6%9D%A5%E7%9A%84%E9%97%AE%E9%A2%98%E5%A6%82%E6%9E%9C%E6%97%A0%E6%B3%95%E7%9B%B4%E6%8E%A5%E9%80%9A%E8%BF%87restore_command%E8%8E%B7%E5%8F%96%E6%96%87%E4%BB%B6%E5%8F%88%E5%BD%93%E5%A6%82%E4%BD%95%E5%91%A2 接下来的问题,如果无法直接通过restore_command获取文件,又当如何呢?

restore_command = 'cp /data01/digoal/wal/%f %p'  
           

https://github.com/digoal/blog/blob/master/201903/20190305_01.md#%E6%96%B9%E6%B3%951%E9%80%9A%E8%BF%87restore_command%E5%90%90%E5%87%BA%E9%9C%80%E8%A6%81%E7%9A%84%E6%96%87%E4%BB%B6%E5%90%8D 方法1,通过restore_command吐出需要的文件名

recovery.conf  
  
restore_command = 'cp /data01/digoal/wal/%f %p || echo "`date +%F%T` %f" >> /tmp/needwalfile;'  
           

当找不到WAL文件时,就会吐到/tmp/needwalfile

cat /tmp/needwalfile  
  
2019-03-0522:11:28 000000010000005D000000B2  
2019-03-0522:11:28 00000002.history  
2019-03-0522:11:33 000000010000005D000000B2  
2019-03-0522:11:33 00000002.history  
2019-03-0522:11:38 000000010000005D000000B2  
2019-03-0522:11:38 00000002.history  
           

将文件拷贝到restore_command配置的/data01/digoal/wal目录,restore_command命令将继续。

优先拷贝history文件(走到新的时间线),原理参考末尾引用文档。

通过pg_is_wal_replay_paused()函数得到当前实例是否已经到达目标还原点。如果返回T,表示已到达,则不再需要给PG新的文件。

postgres=# select pg_is_wal_replay_paused();  
 pg_is_wal_replay_paused   
-------------------------  
 f  
(1 row)  
           

https://github.com/digoal/blog/blob/master/201903/20190305_01.md#%E6%96%B9%E6%B3%952%E9%80%9A%E8%BF%87log%E6%96%87%E4%BB%B6%E5%BE%97%E5%88%B0%E9%9C%80%E8%A6%81%E7%9A%84wal%E6%96%87%E4%BB%B6%E5%90%8D 方法2,通过log文件得到需要的WAL文件名

配置PG的LOG文件,一样能得到上面的内容。

postgresql.conf  
  
# - Where to Log -  
  
log_destination = 'csvlog'              # Valid values are combinations of  
                                        # stderr, csvlog, syslog, and eventlog,  
                                        # depending on platform.  csvlog  
                                        # requires logging_collector to be on.  
  
# This is used when logging to stderr:  
logging_collector = on          # Enable capturing of stderr and csvlog  
                                        # into log files. Required to be on for  
                                        # csvlogs.  
                                        # (change requires restart)  
  
# These are only used if logging_collector is on:  
log_directory = 'log'                   # directory where log files are written,  
                                        # can be absolute or relative to PGDATA  
#log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'        # log file name pattern,  
log_filename='pg.log'  
                                        # can include strftime() escapes  
#log_file_mode = 0600                   # creation mode for log files,  
                                        # begin with 0 to use octal notation  
log_truncate_on_rotation = on           # If on, an existing log file with the  
                                        # same name as the new log file will be  
                                        # truncated rather than appended to.  
                                        # But such truncation only occurs on  
                                        # time-driven rotation, not on restarts  
                                        # or size-driven rotation.  Default is  
                                        # off, meaning append to existing files  
                                        # in all cases.  
log_rotation_age = 1d                   # Automatic rotation of logfiles will  
                                        # happen after that time.  0 disables.  
##log_rotation_size = 10MB              # Automatic rotation of logfiles will  
                                        # happen after that much log output.  
                                        # 0 disables.  
           

当找不到WAL文件时,就会吐到$PGDATA/log/pg.log

digoal@pg11-test-> cat pg.log  
  
2019-03-05 22:14:00.167 CST [38155] LOG:  00000: ending log output to stderr  
2019-03-05 22:14:00.167 CST [38155] HINT:  Future log output will go to log destination "csvlog".  
2019-03-05 22:14:00.167 CST [38155] LOCATION:  PostmasterMain, postmaster.c:1298  
cp: cannot stat ‘/data01/digoal/wal/00000002.history’: No such file or directory  
cp: cannot stat ‘/data01/digoal/wal/000000010000005D000000B2’: No such file or directory  
cp: cannot stat ‘/data01/digoal/wal/00000002.history’: No such file or directory  
cp: cannot stat ‘/data01/digoal/wal/000000010000005D000000B2’: No such file or directory  
cp: cannot stat ‘/data01/digoal/wal/00000002.history’: No such file or directory  
cp: cannot stat ‘/data01/digoal/wal/000000010000005D000000B2’: No such file or directory  
cp: cannot stat ‘/data01/digoal/wal/00000002.history’: No such file or directory  
           
postgres=# select pg_is_wal_replay_paused();  
 pg_is_wal_replay_paused   
-------------------------  
 f  
(1 row)  
           

https://github.com/digoal/blog/blob/master/201903/20190305_01.md#%E6%96%B9%E6%B3%953%E4%BF%AE%E6%94%B9%E5%86%85%E6%A0%B8%E9%80%9A%E8%BF%87udf%E6%94%AF%E6%8C%81 方法3,修改内核,通过UDF支持

例如,直接从UDF中获取当前startup进程需要的WAL文件名和TL history文件名。

src/backend/access/transam/xlog.c

UDF支持的弊端:当数据库还没有进入一致状态时,并不能连接到数据库执行查询,另外如果没有开启hot_standby模式,也不能连到恢复中的从库进行查询。使用场景受限。

开启数据库的hot_standby模式,确保可以在恢复过程中,连接到数据库进行UDF查询。

# These settings are ignored on a master server.  
  
hot_standby = on                        # "off" disallows queries during recovery  
                                        # (change requires restart)  
#max_standby_archive_delay = 30s        # max delay before canceling queries  
                                        # when reading WAL from archive;  
                                        # -1 allows indefinite delay  
#max_standby_streaming_delay = 30s      # max delay before canceling queries  
                                        # when reading streaming WAL;  
                                        # -1 allows indefinite delay  
wal_receiver_status_interval = 1s       # send replies at least this often  
                                        # 0 disables  
#hot_standby_feedback = off             # send info from standby to prevent  
                                        # query conflicts  
#wal_receiver_timeout = 60s             # time that receiver waits for  
                                        # communication from master  
                                        # in milliseconds; 0 disables  
#wal_retrieve_retry_interval = 5s       # time to wait before retrying to  
                                        # retrieve WAL after a failed attempt  
           

https://github.com/digoal/blog/blob/master/201903/20190305_01.md#%E5%85%B6%E4%BB%96%E7%9F%A5%E8%AF%86%E7%82%B9 其他知识点

时间点恢复,如何取下一个文件。

将数据库恢复配置为hot_standby模式,允许在数据库恢复过程中,连接到数据库。获取需要的信息。

1、当返回database is in startup mode,表示无法连接数据库时,说明还需要日志文件,数据库才能到一致性点允许连接,此时,除了从前面说的LOG文件中获得需要的文件,实际上进程也会突出对应的内容。

例如

digoal   25596 25594  0 19:54 ?        00:00:00 postgres: startup   recovering 000000010000005D000000A7  
  
digoal    20   0 16.684g   1860   1276 S   0.0  0.0   0:02.59 postgres: startup   waiting for 000000010000005D000000B2    
           

日志的内容如下

cp: cannot stat ‘/data01/digoal/waltest/000000010000005D000000A7’: No such file or directory  
cp: cannot stat ‘/data01/digoal/waltest/00000002.history’: No such file or directory  
           

当我们将需要的归档拷贝到对应目录后,

digoal@pg11-test-> cp wal/000000010000005D000000A7 waltest/  
           

当我们将需要的归档拷贝到对应目录后,需要的WAL文件向前推移,日志的内容如下

cp: cannot stat ‘/data01/digoal/waltest/000000010000005D000000A8’: No such file or directory  
cp: cannot stat ‘/data01/digoal/waltest/00000002.history’: No such file or directory  
           

2、当可以连接恢复中的数据库后,可以通过一些系统函数,查看到数据库的一些信息

2.1、查看当前数据库正在replay 的wal LSN

postgres=# select pg_last_wal_replay_lsn();  
 pg_last_wal_replay_lsn   
------------------------  
 5D/A7FFFFE0  
(1 row)    
           

2.2、查看当前数据库的恢复是否pause,(如果是自动pause的,说明已经到达设置的还原点)

postgres=# select pg_is_wal_replay_paused();  
 pg_is_wal_replay_paused   
-------------------------  
 f  
(1 row)  
           

2.3、查看lsn对应的wal文件,不允许在standby实例中执行,如果能执行的话,可以直接从当前数据库正在replay 的wal LSN得到WAL文件名。

postgres=# select * from pg_walfile_name('5D/A7FFFFE0');  
ERROR:  55000: recovery is in progress  
HINT:  pg_walfile_name() cannot be executed during recovery.  
LOCATION:  pg_walfile_name, xlogfuncs.c:521  
           

2.4、当前数据库的时间线 (history)

postgres=# select * from pg_control_checkpoint();  
-[ RECORD 1 ]--------+-------------------------  
checkpoint_lsn       | 5D/A7000028  
redo_lsn             | 5D/A7000028  
redo_wal_file        | 000000010000005D000000A7  
timeline_id          | 1  
prev_timeline_id     | 1  
full_page_writes     | t  
next_xid             | 0:1286297007  
next_oid             | 1912406  
next_multixact_id    | 1  
next_multi_offset    | 0  
oldest_xid           | 101420357  
oldest_xid_dbid      | 13285  
oldest_active_xid    | 0  
oldest_multi_xid     | 1  
oldest_multi_dbid    | 1910618  
oldest_commit_ts_xid | 0  
newest_commit_ts_xid | 0  
checkpoint_time      | 2019-03-05 19:44:51+08  
           

2.5、控制文件内容

postgres=# select * from pg_control_system();  
-[ RECORD 1 ]------------+-----------------------  
pg_control_version       | 1100  
catalog_version_no       | 201809051  
system_identifier        | 6636510237226062864  
pg_control_last_modified | 2019-03-05 19:54:56+08  
           

2.6、当前实例如果重启,需要的最早的REDO。

postgres=# select * from pg_control_recovery();

-[ RECORD 1 ]-----------------+------------

min_recovery_end_lsn | 5D/A7FFFFE0

min_recovery_end_timeline | 1

backup_start_lsn | 0/0

backup_end_lsn | 0/0

end_of_backup_record_required | f

2.7、从当前wal目录中,获取到最大的WAL文件名,通常会是当前需要的WAL或者上一个已经REPLAY万的WAL文件。

postgres=# select * from pg_ls_waldir() order by 1 desc limit 1;  
           name           |   size   |      modification        
--------------------------+----------+------------------------  
 000000010000005D000000A7 | 16777216 | 2019-03-05 19:48:53+08  
(1 row)  
           

https://github.com/digoal/blog/blob/master/201903/20190305_01.md#%E6%97%B6%E9%97%B4%E7%82%B9%E6%81%A2%E5%A4%8D%E6%89%8B%E5%B7%A5%E6%8B%B7%E8%B4%9Dwal%E6%96%87%E4%BB%B6%E7%9A%84%E6%B5%81%E7%A8%8B---%E9%80%9A%E5%B8%B8%E4%B8%8D%E9%9C%80%E8%A6%81%E6%89%8B%E5%B7%A5%E6%8B%B7%E8%B4%9D%E5%8F%AA%E8%A6%81%E6%8C%87%E5%AE%9Arestore_command%E8%AE%A9%E6%95%B0%E6%8D%AE%E5%BA%93%E8%87%AA%E5%B7%B1%E6%9D%A5%E5%8D%B3%E5%8F%AF 时间点恢复,手工拷贝wal文件的流程 - 通常不需要手工拷贝,只要指定restore_command让数据库自己来即可

配置recovery.conf

1、配置恢复目标

2、配置restore_command命令,打印下一个需要的WAL文件以及HISTORY文件,输出到某个文件中。参考方法1。

3、配置pause

4、配置打开hot_standby

5、从restore_command命令输出到某个文件中得到。下一个需要的WAL文件以及HISTORY文件。

优先拷贝history文件,防止走错时间线。

如果history文件确实存在并拷贝成功,下一个拷贝的文件是.partial文件,千万不要搞错。

6、通过pg_is_wal_replay_paused判断是否停止

postgres=# select pg_is_wal_replay_paused();  
 pg_is_wal_replay_paused   
-------------------------  
 f  
(1 row)  
           

如果返回T,表示已经到达还原点,不需要在拷贝文件。

https://github.com/digoal/blog/blob/master/201903/20190305_01.md#%E5%8F%82%E8%80%83 参考

《PostgreSQL 时间点恢复(PITR)在异步流复制主从模式下,如何避免主备切换后PITR恢复(备库、容灾节点、只读节点)走错时间线(timeline , history , partial , restore_command , recovery.conf)》

https://github.com/digoal/blog/blob/master/201903/20190305_01.md#%E5%85%8D%E8%B4%B9%E9%A2%86%E5%8F%96%E9%98%BF%E9%87%8C%E4%BA%91rds-postgresql%E5%AE%9E%E4%BE%8Becs%E8%99%9A%E6%8B%9F%E6%9C%BA 免费领取阿里云RDS PostgreSQL实例、ECS虚拟机

PostgreSQL PITR 任意时间点恢复过程中如何手工得到recovery需要的下一个WAL文件名 - 默认情况下restore_command自动获取