PostgreSQL 9.1 streaming replication restore_command: special meaning of exit code 255?


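I have PostgreSQL 9.1.3 streaming replication set up on Ubuntu 10.04.2 LTS (primary and standby). Replication was initialized with a streamed base backup. The restore_command script tries to fetch the required WAL archive from a remote archive location using rsync.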

When the restore_command script fails with an exit code other than 255, everything works as described in the documentation:

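At startup, the standby begins by restoring all WAL available in the archive location, calling restore_command. Once it reaches the end of the WAL available there and restore_command fails, it tries to restore any WAL available in the pg_xlog directory. If that fails, and streaming replication has been configured, the standby tries to connect to the primary server and start streaming WAL from the last valid record found in the archive or pg_xlog. If that fails or streaming replication is not configured, or if the connection is later disconnected, the standby goes back to step 1 and tries to restore the file from the archive again. This loop of retries from the archive, pg_xlog, and via streaming replication goes on until the server is stopped or failover is triggered by a trigger file.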

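However, when the restore_command script fails with exit code 255 (because the script returns the exit code of the failed rsync call), the server process dies with the following error: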

2012-05-09 23:21:30 CEST - @  LOG:  database system was interrupted; last known up at     2012-05-09 23:21:25 CEST
2012-05-09 23:21:30 CEST - @  LOG:  entering standby mode
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(601) [Receiver=3.0.7]
2012-05-09 23:21:30 CEST - @  FATAL:  could not restore file "00000001000000000000003D" from archive: return code 65280
2012-05-09 23:21:30 CEST - @  LOG:  startup process (PID 8184) exited with exit code 1
2012-05-09 23:21:30 CEST - @  LOG:  aborting startup due to startup process failure

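So my question is: is this a bug, does exit code 255 have a special meaning that is missing from the otherwise excellent documentation, or am I missing something else here?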
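The "return code 65280" in the 9.1 log above is the raw wait status of the child process, not the exit code itself. The following is a minimal sketch of how that number breaks down, assuming the conventional Linux status layout (exit code in the high byte, terminating signal in the low 7 bits); the function name decode_wait_status is made up for illustration.

```python
def decode_wait_status(status: int) -> dict:
    """Split a raw wait()-style status, assuming the conventional Linux layout."""
    return {
        # Low 7 bits: number of the signal that killed the child, 0 if none.
        "signal": status & 0x7F,
        # High byte: exit code, meaningful only when "signal" is 0.
        "exit_code": (status >> 8) & 0xFF,
    }


decoded = decode_wait_status(65280)
print(decoded)  # {'signal': 0, 'exit_code': 255}
```

So 65280 is 255 << 8: the script exited normally with code 255, which is the rsync exit code shown two lines earlier in the log. The 9.5 log later in this thread reports the same situation directly as "child process exited with exit code 255".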

On the primary, your WAL files live in the pg_xlog/ directory. As long as the WAL files are there, PostgreSQL can send them to the standby on demand.

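Typically you also have a local WAL archive location. Once PostgreSQL has moved files there, they can no longer be delivered online to the standby, and the standby expects them to arrive from the WAL archive location via restore_command.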

If the primary and the standby are configured with different WAL archive locations, and the standby is unreachable for some time, you end up with a gap.

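In your case, this might mean that:

  • 00000001000000000000003D has already been archived by the primary PostgreSQL
  • the standby's restore_command does not see it in the configured source location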

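You might consider copying the missing WAL files from the primary to the standby manually, using scp or rsync. You may also want to review your WAL archive locations and make sure both servers point to the same place.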


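EDIT: grepping the sources for restore_command, only access/transam/xlog.c references it. At the end of the function RestoreArchivedFile (line 3115 of the 9.1.3 sources), there is a check on whether restore_command exited normally or received a signal.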

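In the first case, the message is classified as DEBUG2. If restore_command received a signal other than SIGTERM (which, I guess, cannot be handled properly), a FATAL error is reported. The same holds for all exit codes greater than 125.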

I can't tell you why, though.

I suggest asking on .

This looks like an rsync problem I ran into temporarily while using NFS (with rpcbind/rstatd on port 837):

This solved it for me:

service rpcbind stop

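I ran into the same problem while creating a hot standby (postgres 9.5). Streaming was working (I had seeded the standby via pg_basebackup using the same credentials that were later used in the standby's recovery.conf).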

After taking the basebackup, I set up the following recovery.conf:

standby_mode = 'on'
primary_conninfo = 'host=ip.of.master port=5432 user=pgstandby password=password'
recovery_target_timeline = 'latest'
restore_command = 'sftp -q user@ip.of.wal.archive.host:data/master_wal_archive/%f "%p"'
trigger_file = '/srv/pgsql/9.5/data/trigger'
Starting the server would yield:

2016-03-08 12:34:58.981 UTC  (/)LOG:  database system was interrupted; last known up at 2016-03-08 12:26:10 UTC
Couldn't read packet: Connection reset by peer
2016-03-08 12:34:59.525 UTC  (/)FATAL:  could not restore file "00000002.history" from archive: child process exited with exit code 255
2016-03-08 12:34:59.526 UTC  (/)LOG:  startup process (PID 26636) exited with exit code 1
2016-03-08 12:34:59.526 UTC  (/)LOG:  aborting startup due to startup process failure
If I removed the restore_command line from recovery.conf, the standby would start up fine and begin streaming WAL from the primary.

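I eventually narrowed the problem down to not having added the standby postgres user's public key to the authorized_keys file on the WAL archive host. I had also forgotten to add the WAL archive host's server fingerprint to the standby postgres user's known_hosts file.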

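Both of these mistakes (I assume) caused the sftp restore_command to exit with code 255. As tscho said, the Postgres documentation suggests that if restore_command exits with any non-zero value, Postgres simply goes on to try streaming from the primary instead of refusing to start. In practice, that does not seem to be the case when the exit code is above a certain number (perhaps 125, as vyegorov's source grepping suggests?).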


With the two SSH problems fixed, the standby started up fine with restore_command present in recovery.conf.

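Here is the source comment explaining why this behavior was chosen for high exit statuses of the command process, together with the current code that implements it: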

    /*
     * Remember, we rollforward UNTIL the restore fails so failure here is
     * just part of the process... that makes it difficult to determine
     * whether the restore failed because there isn't an archive to restore,
     * or because the administrator has specified the restore program
     * incorrectly.  We have to assume the former.
     *
     * However, if the failure was due to any sort of signal, it's best to
     * punt and abort recovery.  (If we "return false" here, upper levels will
     * assume that recovery is complete and start up the database!) It's
     * essential to abort on child SIGINT and SIGQUIT, because per spec
     * system() ignores SIGINT and SIGQUIT while waiting; if we see one of
     * those it's a good bet we should have gotten it too.
     *
     * On SIGTERM, assume we have received a fast shutdown request, and exit
     * cleanly. It's pure chance whether we receive the SIGTERM first, or the
     * child process. If we receive it first, the signal handler will call
     * proc_exit, otherwise we do it here. If we or the child process received
     * SIGTERM for any other reason than a fast shutdown request, postmaster
     * will perform an immediate shutdown when it sees us exiting
     * unexpectedly.
     *
     * Per the Single Unix Spec, shells report exit status > 128 when a called
     * command died on a signal.  Also, 126 and 127 are used to report
     * problems such as an unfindable command; treat those as fatal errors
     * too.
     */
    if (WIFSIGNALED(rc) && WTERMSIG(rc) == SIGTERM)
        proc_exit(1);

    signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

    ereport(signaled ? FATAL : DEBUG2,
            (errmsg("could not restore file \"%s\" from archive: %s",
                    xlogfname, wait_result_to_str(rc))));
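To make the decision in the C code above easier to follow, here is a small Python model of it. This is only an illustrative sketch: the function name classify_restore_result and its string return values are invented here, and the model covers only the failure path that the quoted snippet handles.

```python
import signal


def classify_restore_result(exit_code=None, term_signal=None):
    """Model the outcome the quoted C code picks for a failed restore_command.

    Pass term_signal if the child was killed by a signal,
    otherwise pass its (non-zero) exit_code.
    """
    if term_signal is not None:
        if term_signal == signal.SIGTERM:
            # Treated as a fast shutdown request: the startup process exits.
            return "proc_exit"
        # Any other signal aborts recovery.
        return "FATAL"
    # Normal exit: 126 and above are treated like a signal, the rest are
    # taken to mean "no such file in the archive".
    return "FATAL" if exit_code > 125 else "DEBUG2"


for code in (1, 125, 126, 255):
    print(code, classify_restore_result(exit_code=code))
```

Under this model, exit codes 1 through 125 let recovery move on to the next WAL source, while 126 through 255 (including the 255 that rsync and sftp returned in this thread) abort startup.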

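I'm not going to post this as an answer right now, because I don't have time to check the source code and can't confirm it until later or tomorrow, but as I recall, when WAL files are applied during recovery, a non-zero exit code below 255 means "failed, but keep trying", while 255 (or higher) means "serious failure; give up". You may need to adjust your script to return a lower exit code when rsync fails.

@kgrittn: Thanks, I had thought of doing that, but I couldn't find any documentation on a special meaning of exit code 255, and I didn't know where to look for it in the source code.

Ugh, it took a while, but this issue surfaced again and I had to deal with it, and my comment here was cited, so I looked it up and posted a detailed answer. I will look into getting something into the documentation.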
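The suggestion above, returning a lower exit code when the fetch command fails, could be sketched as a small wrapper like the one below. Everything in it is an assumption made for illustration: the name run_fetch, the choice to map codes above 127 to 1, and the choice to keep 126/127 and signals fatal. It is not something PostgreSQL ships.

```python
import subprocess
import sys


def run_fetch(argv):
    """Run the fetch command and return an exit code suitable for restore_command."""
    try:
        code = subprocess.run(argv).returncode
    except FileNotFoundError:
        # A missing program is a configuration error: keep it fatal (127).
        return 127
    if code < 0:
        # Killed by a signal: report it shell-style (128 + signal number),
        # which PostgreSQL still treats as fatal.
        return 128 - code
    if code > 127:
        # For example 255 from a failed rsync/ssh/sftp connection:
        # report an ordinary failure instead.
        return 1
    return code


if __name__ == "__main__":
    if len(sys.argv) > 1:
        sys.exit(run_fetch(sys.argv[1:]))
```

The wrapper would be called from restore_command with the real fetch command and its arguments appended. Note the trade-off spelled out in the source comment quoted earlier: a low exit code tells PostgreSQL that the file is simply not in the archive. On a standby that only means it retries from the other sources, but in a plain archive recovery it can end recovery early, so hiding connection failures this way is a judgment call.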