Linux Pacemaker child process terminated
Environment:
OS: Red Hat 7.5
Pacemaker: 1.1.19-8.el7_6.1.x86_64
corosync: 2.4.3-2.el7_5.1.x86_64
8-node setup; one resource is PostgreSQL 10, with one standby node.
Primary database node: DB1
Secondary database node: DB2

Problem:
PostgreSQL stopped suddenly.
In the corosync log on DB1:
Dec 17 13:26:08 [435476] DB1 crmd: error: crm_ipc_read: Connection to pengine failed
Dec 17 13:26:08 [435476] DB1 crmd: error: mainloop_gio_callback: Connection to pengine[0x55e53f095dc0] closed (I/O condition=25)
Dec 17 13:26:08 [435476] DB1 crmd: crit: pe_ipc_destroy: Connection to the Policy Engine failed | pid=-1 uuid=562cd7f3-9093-4626-adec-cd8b376e1b13
Dec 17 13:26:08 [435476] DB1 crmd: info: register_fsa_error_adv: Resetting the current action list
Dec 17 13:26:09 [435459] DB1 pacemakerd: warning: pcmk_child_exit: The pengine process (435475) terminated with signal 9 (core=0)
Dec 17 13:26:09 [435476] DB1 crmd: notice: save_cib_contents: Saved Cluster Information Base to /var/lib/pacemaker/pengine/pe-core-562cd7f3-9093-4626-adec-cd8b376e1b13.bz2 after Policy Engine crash
Dec 17 13:26:09 [435459] DB1 pacemakerd: notice: pcmk_process_exit: Respawning failed child process: pengine
Dec 17 13:26:09 [435459] DB1 pacemakerd: info: start_child: Using uid=189 and group=189 for process pengine
Dec 17 13:26:09 [435459] DB1 pacemakerd: info: start_child: Forked child 126688 for process pengine
Dec 17 13:26:09 [435459] DB1 pacemakerd: info: mcp_cpg_deliver: Ignoring process list sent by peer for local node
Dec 17 13:26:09 [435459] DB1 pacemakerd: info: mcp_cpg_deliver: Ignoring process list sent by peer for local node
Dec 17 13:26:09 [126688] DB1 pengine: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
Dec 17 13:26:09 [126688] DB1 pengine: info: qb_ipcs_us_publish: server name: pengine
Dec 17 13:26:09 [126688] DB1 pengine: info: main: Starting pengine
Dec 17 13:26:09 [435476] DB1 crmd: error: do_log: Input I_ERROR received in state S_POLICY_ENGINE from save_cib_contents
Dec 17 13:26:09 [435471] DB1 cib: info: cib_process_ping: Reporting our current digest to DB1: 9a96a41da603b14d0edab0b6bc962009 for 3.1248.21817663 (0x56045abbe830 0)
Dec 17 13:26:09 [435476] DB1 crmd: warning: do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY | input=I_ERROR cause=C_FSA_INTERNAL origin=save_cib_contents
Dec 17 13:26:09 [435476] DB1 crmd: warning: do_recover: Fast-tracking shutdown in response to errors
Dec 17 13:26:09 [435476] DB1 crmd: warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
Dec 17 13:26:09 [435476] DB1 crmd: info: do_dc_release: DC role released
Dec 17 13:26:09 [435476] DB1 crmd: info: do_te_control: Transitioner is now inactive
Dec 17 13:26:09 [435476] DB1 crmd: error: do_log: Input I_TERMINATE received in state S_RECOVERY from do_recover
Dec 17 13:26:09 [435476] DB1 crmd: info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE | input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover
Dec 17 13:26:09 [435476] DB1 crmd: info: do_shutdown: Disconnecting STONITH...
Dec 17 13:26:09 [435476] DB1 crmd: info: tengine_stonith_connection_destroy: Fencing daemon disconnected
Dec 17 13:26:09 [435476] DB1 crmd: info: stop_recurring_actions: Cancelling op 420 for AsapIP (AsapIP:420)
Dec 17 13:26:09 [435473] DB1 lrmd: info: cancel_recurring_action: Cancelling ocf operation AsapIP_monitor_10000
Dec 17 13:26:09 [435476] DB1 crmd: info: stop_recurring_actions: Cancelling op 378 for Ping (Ping:378)
It says pengine was killed by signal 9.
I checked, and there is no crash report in /var/lib/pacemaker/cores.
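To make it easier to spot this kind of event in a long log, here is a minimal sketch that pulls the pcmk_child_exit lines out of the corosync log. The regular expression is an assumption based only on the line format shown above, and the names `CHILD_EXIT`, `find_signal_exits` and `sample` are made up for illustration. Note that `core=0` in that line would be consistent with the empty cores directory, since SIGKILL (signal 9) does not produce a core dump.

```python
import re

# Assumed format, taken from the pacemakerd line in the log above:
# "... pcmk_child_exit: The pengine process (435475) terminated with signal 9 (core=0)"
CHILD_EXIT = re.compile(
    r"pcmk_child_exit:\s+The\s+(?P<child>\S+)\s+process\s+\((?P<pid>\d+)\)\s+"
    r"terminated with signal\s+(?P<signal>\d+)\s+\(core=(?P<core>\d+)\)"
)


def find_signal_exits(lines):
    """Return one dict per child process reported as terminated by a signal."""
    exits = []
    for line in lines:
        match = CHILD_EXIT.search(line)
        if match:
            exits.append({
                "child": match.group("child"),
                "pid": int(match.group("pid")),
                "signal": int(match.group("signal")),
                # core=0 means no core file was written
                "core_dumped": match.group("core") != "0",
            })
    return exits


# Two lines copied from the DB1 log above
sample = [
    "Dec 17 13:26:08 [435476] DB1 crmd: error: crm_ipc_read: Connection to pengine failed",
    "Dec 17 13:26:09 [435459] DB1 pacemakerd: warning: pcmk_child_exit: "
    "The pengine process (435475) terminated with signal 9 (core=0)",
]

for event in find_signal_exits(sample):
    print(event)
```

In practice the list of lines would come from reading the actual log file instead of `sample`.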
It did recover from this incident:
Dec 17 13:26:09 [126690] DB1 crmd: info: get_cluster_type: Verifying cluster type: 'corosync'
Dec 17 13:26:09 [126690] DB1 crmd: info: get_cluster_type: Assuming an active 'corosync' cluster
Dec 17 13:26:09 [126690] DB1 crmd: info: do_log: Input I_STARTUP received in state S_STARTING from crmd_init
Dec 17 13:26:09 [126690] DB1 crmd: info: do_cib_control: CIB connection established
After that, another node, APP1, became the DC, as I verified in that node's corosync log.
It took the following actions:
Dec 17 13:26:12 [379816] APP1 pengine: notice: LogAction: * Promote Postgresql9:0 ( Slave -> Master DB2 )
Dec 17 13:26:12 [379816] APP1 pengine: info: LogActions: Leave Postgresql9:1 (Stopped)
As far as I understand, the new DC was trying to promote PostgreSQL on node DB2.
But before completing that action, it took the following actions:
Dec 17 13:26:15 [379816] APP1 pengine: notice: LogAction: * Stop Postgresql9:0 ( Master DB1 ) due to node availability
Dec 17 13:26:15 [379816] APP1 pengine: notice: LogAction: * Promote Postgresql9:1 ( Slave -> Master DB2 )
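It asks to stop postgres on DB1 and start (promote) it on DB2.
But according to the corosync log on DB1, DB1 never attempted to stop postgresql when the DC APP1 requested it.
After this, the same content appears in every subsequent LogAction log entry.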
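To compare what the DC planned in each run, a rough sketch like the following can group the LogAction/LogActions lines by timestamp. The pattern is an assumption derived only from the four pengine lines quoted above, and `LOG_ACTION`, `planned_actions` and `sample` are hypothetical names used for illustration.

```python
import re
from collections import defaultdict

# Assumed format, based on the APP1 pengine lines above, e.g.
# "Dec 17 13:26:15 [379816] APP1 pengine: notice: LogAction: * Stop Postgresql9:0 ( Master DB1 ) due to ..."
LOG_ACTION = re.compile(
    r"^(?P<time>\w{3}\s+\d+\s+[\d:]+)\s+\[\d+\]\s+\S+\s+pengine:\s+\w+:\s+"
    r"LogActions?:\s+(?:\*\s+)?(?P<action>\w+)\s+(?P<resource>\S+)\s+"
    r"\(\s*(?P<detail>.*?)\s*\)"
)


def planned_actions(lines):
    """Group (action, resource, detail) tuples by the timestamp of the log line."""
    grouped = defaultdict(list)
    for line in lines:
        match = LOG_ACTION.match(line)
        if match:
            grouped[match.group("time")].append(
                (match.group("action"), match.group("resource"), match.group("detail"))
            )
    return dict(grouped)


# The four pengine lines quoted from the APP1 log above
sample = [
    "Dec 17 13:26:12 [379816] APP1 pengine: notice: LogAction: * Promote Postgresql9:0 ( Slave -> Master DB2 )",
    "Dec 17 13:26:12 [379816] APP1 pengine: info: LogActions: Leave Postgresql9:1 (Stopped)",
    "Dec 17 13:26:15 [379816] APP1 pengine: notice: LogAction: * Stop Postgresql9:0 ( Master DB1 ) due to node availability",
    "Dec 17 13:26:15 [379816] APP1 pengine: notice: LogAction: * Promote Postgresql9:1 ( Slave -> Master DB2 )",
]

for time, actions in planned_actions(sample).items():
    print(time, actions)
```

Grouping by timestamp is only a convenience here; it assumes that lines from one policy engine run share the same second.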
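Unfortunately, I don't have /var/log/messages.
I have three questions:
a) What could cause pengine to be forcibly killed (signal 9)?
b) Even though it was killed, Pacemaker was able to recover, so why did the DC try to stop postgres on DB1?
c) Even after the DC scheduled the action to stop postgres on DB1, DB1 never performed any action. What could be the reason?
Thanks in advance for your help.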