Linux Pacemaker child process terminated

Environment: OS: RHEL 7.5; Pacemaker: 1.1.19-8.el7_6.1.x86_64; corosync: 2.4.3-2.el7_5.1.x86_64

8-node setup, with one PostgreSQL 10 resource and one standby node. Primary database node: DB1; secondary database node: DB2

Problem: PostgreSQL stopped suddenly. In the corosync log on DB1:

Dec 17 13:26:08 [435476] DB1       crmd:    error: crm_ipc_read:   Connection to pengine failed
Dec 17 13:26:08 [435476] DB1       crmd:    error: mainloop_gio_callback:  Connection to pengine[0x55e53f095dc0] closed (I/O condition=25)
Dec 17 13:26:08 [435476] DB1       crmd:     crit: pe_ipc_destroy: Connection to the Policy Engine failed | pid=-1 uuid=562cd7f3-9093-4626-adec-cd8b376e1b13
Dec 17 13:26:08 [435476] DB1       crmd:     info: register_fsa_error_adv: Resetting the current action list
Dec 17 13:26:09 [435459] DB1 pacemakerd:  warning: pcmk_child_exit:        The pengine process (435475) terminated with signal 9 (core=0)
Dec 17 13:26:09 [435476] DB1       crmd:   notice: save_cib_contents:      Saved Cluster Information Base to /var/lib/pacemaker/pengine/pe-core-562cd7f3-9093-4626-adec-cd8b376e1b13.bz2 after Policy Engine crash
Dec 17 13:26:09 [435459] DB1 pacemakerd:   notice: pcmk_process_exit:      Respawning failed child process: pengine
Dec 17 13:26:09 [435459] DB1 pacemakerd:     info: start_child:    Using uid=189 and group=189 for process pengine
Dec 17 13:26:09 [435459] DB1 pacemakerd:     info: start_child:    Forked child 126688 for process pengine
Dec 17 13:26:09 [435459] DB1 pacemakerd:     info: mcp_cpg_deliver:        Ignoring process list sent by peer for local node
Dec 17 13:26:09 [435459] DB1 pacemakerd:     info: mcp_cpg_deliver:        Ignoring process list sent by peer for local node
Dec 17 13:26:09 [126688] DB1    pengine:     info: crm_log_init:   Changed active directory to /var/lib/pacemaker/cores
Dec 17 13:26:09 [126688] DB1    pengine:     info: qb_ipcs_us_publish:     server name: pengine
Dec 17 13:26:09 [126688] DB1    pengine:     info: main:   Starting pengine
Dec 17 13:26:09 [435476] DB1       crmd:    error: do_log: Input I_ERROR received in state S_POLICY_ENGINE from save_cib_contents
Dec 17 13:26:09 [435471] DB1        cib:     info: cib_process_ping:       Reporting our current digest to DB1: 9a96a41da603b14d0edab0b6bc962009 for 3.1248.21817663 (0x56045abbe830 0)
Dec 17 13:26:09 [435476] DB1       crmd:  warning: do_state_transition:    State transition S_POLICY_ENGINE -> S_RECOVERY | input=I_ERROR cause=C_FSA_INTERNAL origin=save_cib_contents
Dec 17 13:26:09 [435476] DB1       crmd:  warning: do_recover:     Fast-tracking shutdown in response to errors
Dec 17 13:26:09 [435476] DB1       crmd:  warning: do_election_vote:       Not voting in election, we're in state S_RECOVERY
Dec 17 13:26:09 [435476] DB1       crmd:     info: do_dc_release:  DC role released
Dec 17 13:26:09 [435476] DB1       crmd:     info: do_te_control:  Transitioner is now inactive
Dec 17 13:26:09 [435476] DB1       crmd:    error: do_log: Input I_TERMINATE received in state S_RECOVERY from do_recover
Dec 17 13:26:09 [435476] DB1       crmd:     info: do_state_transition:    State transition S_RECOVERY -> S_TERMINATE | input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover
Dec 17 13:26:09 [435476] DB1       crmd:     info: do_shutdown:    Disconnecting STONITH...
Dec 17 13:26:09 [435476] DB1       crmd:     info: tengine_stonith_connection_destroy:     Fencing daemon disconnected
Dec 17 13:26:09 [435476] DB1       crmd:     info: stop_recurring_actions: Cancelling op 420 for AsapIP (AsapIP:420)
Dec 17 13:26:09 [435473] DB1       lrmd:     info: cancel_recurring_action:        Cancelling ocf operation AsapIP_monitor_10000
Dec 17 13:26:09 [435476] DB1       crmd:     info: stop_recurring_actions: Cancelling op 378 for Ping (Ping:378)
The log says pengine was killed with signal 9 (a quick check for one possible cause is sketched after the excerpt below). On checking, there is no crash report in /var/lib/pacemaker/cores. The cluster did recover from this incident:

Dec 17 13:26:09 [126690] DB1       crmd:     info: get_cluster_type:       Verifying cluster type: 'corosync'
Dec 17 13:26:09 [126690] DB1       crmd:     info: get_cluster_type:       Assuming an active 'corosync' cluster
Dec 17 13:26:09 [126690] DB1       crmd:     info: do_log: Input I_STARTUP received in state S_STARTING from crmd_init
Dec 17 13:26:09 [126690] DB1       crmd:     info: do_cib_control: CIB connection established
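Since signal 9 cannot be caught and core=0 means no core dump was written, the kill almost certainly came from outside the process; one external source worth ruling out is the kernel OOM killer. A minimal check, assuming the node has not been rebooted since the incident so the kernel ring buffer and journal still cover that window:

# Look for OOM-killer activity around the time pengine (PID 435475) was killed
dmesg -T | grep -iE 'out of memory|oom|killed process'

# Same check via the journal's kernel messages (narrow with --since/--until if needed)
journalctl -k | grep -iE 'oom|killed process'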
After that, another node, APP1, became the DC, as seen in that node's corosync log. It took the following actions:

Dec 17 13:26:12 [379816] APP1    pengine:   notice: LogAction:   * Promote    Postgresql9:0     ( Slave -> Master DB2 )
Dec 17 13:26:12 [379816] APP1    pengine:     info: LogActions: Leave   Postgresql9:1   (Stopped)
As far as I understand, the new DC was trying to promote PostgreSQL on node DB2.

But before that action completed, it issued the following actions instead:

Dec 17 13:26:15 [379816] APP1    pengine:   notice: LogAction:   * Stop       Postgresql9:0     (              Master DB1 )   due to node availability
Dec 17 13:26:15 [379816] APP1    pengine:   notice: LogAction:   * Promote    Postgresql9:1     (     Slave -> Master DB2 )
It asked to stop postgres on DB1 and to promote postgres on DB2.

But according to DB1's corosync log, DB1 never attempted to stop postgresql when the DC (APP1) requested it. After this point, the same actions show up in every subsequent LogAction entry.
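One way to see why the DC kept scheduling the same transition, and what state it believed DB1 was in, is to replay the saved scheduler inputs from APP1. This is only a sketch, assuming the pe-input files from that time window are still present under /var/lib/pacemaker/pengine/ on APP1; NNN below is a placeholder for the actual file number:

# Replay a saved policy-engine input from the DC and show the transition it computes
crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-NNN.bz2

# Show the allocation scores behind the promote/stop decisions
crm_simulate -s -x /var/lib/pacemaker/pengine/pe-input-NNN.bz2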

Unfortunately, I don't have /var/log/messages.
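Without /var/log/messages the systemd journal is the main fallback; on RHEL 7 it is volatile by default, so one small step (a sketch, assuming default journald settings) is to make it persistent so that any recurrence gets captured:

# Creating /var/log/journal switches journald to persistent storage on restart
mkdir -p /var/log/journal
systemctl restart systemd-journald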

I have three questions: a) What are the possible reasons for pengine being forcibly killed (signal 9)? b) Even though it was killed, Pacemaker was able to recover; so why did the DC try to stop postgres on DB1? c) Even after the repeated LogActions to stop postgres on DB1, DB1 never executed any action. What could be the possible reasons?

Thanks in advance for your help.