PostgreSQL: "incorrect resource manager data checksum in record at 2/XYZ" + "terminating walreceiver process due to administrator command"


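I'm running a streaming replication environment with PostgreSQL 9.1 (1 master, 3 slaves). Everything worked fine for approximately 2 months. Yesterday, replication to one of the slaves failed, and the log on that slave showed: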

LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
FATAL:  terminating walreceiver process due to administrator command
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
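The slave was no longer in sync with the master. After two hours, during which the log gained a new line every 5 seconds, I restarted the slave database server: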

LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
LOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
FATAL:  terminating connection due to administrator command
FATAL:  terminating connection due to administrator command
LOG:  shutting down
LOG:  database system is shut down
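The same message repeating at one unchanged position is the visible sign that the standby is stuck on a single WAL record. A minimal sketch of how such a log could be scanned for this pattern follows; the function name `stuck_positions` and the `threshold` parameter are hypothetical, chosen only for illustration, and the regular expression assumes the message format shown in the excerpts above.

```python
import re
from collections import Counter

# Matches the standby log line quoted above and captures the WAL position
# (two hexadecimal numbers separated by a slash).
CHECKSUM_RE = re.compile(
    r"incorrect resource manager data checksum in record at "
    r"([0-9A-Fa-f]+/[0-9A-Fa-f]+)"
)


def stuck_positions(log_text, threshold=3):
    """Return {position: count} for every WAL position that the log
    reports at least `threshold` times (threshold is an arbitrary choice)."""
    counts = Counter(CHECKSUM_RE.findall(log_text))
    return {pos: n for pos, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    sample = "\n".join(
        ["LOG:  incorrect resource manager data checksum in record at 61/DA2710A7"] * 8
        + ["FATAL:  terminating walreceiver process due to administrator command"]
    )
    print(stuck_positions(sample))
```

A check like this only detects the symptom; it says nothing about where the bad record came from.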
The new log file on the slave contains:

LOG:  database system was shut down in recovery at 2016-02-29 05:12:11 CET
LOG:  entering standby mode
LOG:  redo starts at 61/D92C10C9
LOG:  consistent recovery state reached at  61/DA2710A7
LOG:  database system is ready to accept read only connections
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
LOG:  streaming replication successfully connected to primary
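Now the slave is in sync with the master, but the checksum entry is still there. I also checked the network logs: the network was available the whole time.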
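The positions in these log lines are WAL locations written as two hexadecimal numbers, `high/low`. The sketch below shows one way to turn them into byte offsets and to work out which WAL segment file would hold the failing record. It assumes the default 16 MB segment size and timeline 1, neither of which is stated in the post; the helper names are hypothetical.

```python
# Assumption: default WAL segment size of 16 MB (not confirmed by the post).
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(lsn):
    """Convert a 'high/low' hexadecimal WAL location into one integer byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_distance(start, end):
    """Number of bytes between two WAL locations (end minus start)."""
    return parse_lsn(end) - parse_lsn(start)


def segment_file_name(lsn, timeline=1):
    """Name of the WAL segment file containing `lsn`.

    Assumption: timeline 1; the real timeline ID must be taken from the server.
    """
    high, low = lsn.split("/")
    log_id = int(high, 16)
    segment_no = int(low, 16) // WAL_SEGMENT_SIZE
    return "%08X%08X%08X" % (timeline, log_id, segment_no)


if __name__ == "__main__":
    # Distance between "redo starts at" and the failing record: 16449502 bytes.
    print(lsn_distance("61/D92C10C9", "61/DA2710A7"))
    # Segment that would contain the failing record: 0000000100000061000000DA
    print(segment_file_name("61/DA2710A7"))
```

The same arithmetic can be applied to the values returned by `pg_current_xlog_location()` on the master and `pg_last_xlog_receive_location()` or `pg_last_xlog_replay_location()` on a standby to estimate how far a slave is behind.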

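My questions are:

  • Does anybody know why the WAL receiver was terminated?
  • Why did PostgreSQL not retry the replication?
  • What can I do to prevent this from happening in the future?

Thanks, everyone.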

    EDIT:

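    The database servers run on SLES 11 with ext3. I found an article about poor performance on SLES 11 with large amounts of RAM, but I'm not sure whether it applies, since my machines have only 8 GB of RAM.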

    Any help would be appreciated.

    EDIT (2):

    The PostgreSQL version is 9.1.5. PostgreSQL 9.1.6 seems to include a fix for a similar problem:

    Fix persistence marking of shared buffers during WAL replay (Jeff Davis)
    
    This mistake can result in buffers not being written out during checkpoints, resulting in data corruption if the server later crashes without ever having written those buffers. Corruption can occur on any server following crash recovery, but it is significantly more likely to occur on standby slave servers since those perform much more WAL replay.
    
    Source:

    Could this be the solution? Should I upgrade to PostgreSQL 9.1.6, and will everything then run smoothly?

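    In case someone stumbles upon this question: I ended up reinstalling the database from backed-up data and setting up replication again. I never really figured out what went wrong.

    "Never really figured out what went wrong"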

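    I've had the same error, except that in my case the slave never synced completely in the first place.

    Then the master had some kernel errors (heat problems in the server case?). Because it would not shut down cleanly, the server had to be powered off. While it was still shutting down, the slave reported: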

    LOG:  incorrect resource manager data checksum in record at 1/63663CB0
    

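    After restarting both the master and the slave, the situation did not change: the same log entry appears every 5 seconds.

    I suspect that the WAL files on the master were corrupted and that the corruption was then propagated to the slave. Can any PostgreSQL people who see this thread confirm this?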
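One way to probe that suspicion is to copy the affected WAL segment from both machines to one place and compare the two copies byte for byte. The sketch below does this with a SHA-256 digest. The function names and file paths are hypothetical, and which segment to compare has to be worked out from the position in the log line.

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def segments_match(master_copy, slave_copy):
    """True if the two copies of a WAL segment are byte-for-byte identical."""
    return file_sha256(master_copy) == file_sha256(slave_copy)


if __name__ == "__main__":
    # Hypothetical paths: copies of the same segment taken from each server.
    print(segments_match("/tmp/master/000000010000000100000063",
                         "/tmp/slave/000000010000000100000063"))
```

Identical copies would be consistent with the bad record already existing on the master. Differing copies are harder to interpret: the comparison alone does not say which side is damaged, and a segment that the slave is still receiving is expected to differ.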