Ubuntu Apache: random outages

Recently I ran into a new problem with Apache. We have a Python (3.5) application running on Flask (1.0.2).

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.2 LTS
Release:    16.04
Codename:   xenial
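We have two servers behind an ELB (AWS Elastic Load Balancer). After running on the same configuration for the past 3 months, they suddenly started failing. I found out through ELB alarms and an external monitoring tool: we suddenly started getting E408 (timeout) and E503 (service unavailable).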

I started digging to find the cause. In the Apache logs I found many messages like the following (apparently just before the failures):

[Mon Jun 25 22:27:04.613967 2018] [wsgi:error] [pid 1275:tid 139684390848256] (70008)Partial results are valid but processing is incomplete: [client 1.2.3.4:2819] mod_wsgi (pid=1275): Unable to get bucket brigade for request., referer: https://xx.xx.xx/
I also checked syslog and saw this:

Jun 25 22:13:25 my_hostname systemd[1]: Created slice User Slice of ubuntu.
Jun 25 22:13:25 my_hostname systemd[1]: Starting User Manager for UID 1000...
Jun 25 22:13:25 my_hostname systemd[1]: Started Session 1424 of user ubuntu.
Jun 25 22:13:25 my_hostname systemd[6239]: Reached target Sockets.
Jun 25 22:13:25 my_hostname systemd[6239]: Reached target Timers.
Jun 25 22:13:25 my_hostname systemd[6239]: Reached target Paths.
Jun 25 22:13:25 my_hostname systemd[6239]: Reached target Basic System.
Jun 25 22:13:25 my_hostname systemd[6239]: Reached target Default.
Jun 25 22:13:25 my_hostname systemd[6239]: Startup finished in 8ms.
Jun 25 22:13:25 my_hostname systemd[1]: Started User Manager for UID 1000.
Jun 25 22:14:47 my_hostname systemd[1]: Stopping LSB: Apache2 web server...
Jun 25 22:14:47 my_hostname apache2[6624]:  * Stopping Apache httpd web server apache2
Jun 25 22:14:59 my_hostname apache2[6624]:  *
Jun 25 22:14:59 my_hostname systemd[1]: Stopped LSB: Apache2 web server.
Jun 25 22:14:59 my_hostname systemd[1]: Starting LSB: Apache2 web server...
Jun 25 22:14:59 my_hostname apache2[6660]:  * Starting Apache httpd web server apache2
Jun 25 22:14:59 my_hostname apache2[6660]: AH00557: apache2: apr_sockaddr_info_get() failed for my_hostname
Jun 25 22:14:59 my_hostname apache2[6660]: AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 127.0.0.1. Set the 'ServerName' directive globally to suppress this message
Jun 25 22:15:00 my_hostname apache2[6660]:  *
Jun 25 22:15:00 my_hostname systemd[1]: Started LSB: Apache2 web server.
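Interestingly, both servers (which are nearly identical) failed at the same time. Because a new version had been deployed, they were restarted at almost the same time, and they probably receive roughly the same amount of traffic, since they sit behind one load balancer.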

I have tried to find similar issues, but with no luck so far.

Another interesting thing: I found messages like these in the logs:

[Mon Jun 25 22:27:04.657763 2018] [wsgi:error] [pid 1274:tid 139684507617024] [remote 172.31.12.149:720] mod_wsgi (pid=1274): Exception occurred processing WSGI script '/home/ubuntu/my_app/app.wsgi'.
[Mon Jun 25 22:27:04.658503 2018] [wsgi:error] [pid 1274:tid 139684482414336] [remote 172.31.12.149:62417] mod_wsgi (pid=1274): Exception occurred processing WSGI script '/home/ubuntu/my_app/app.wsgi'.
[Mon Jun 25 22:27:04.658528 2018] [wsgi:error] [pid 1274:tid 139684532819712] [remote 172.31.12.149:52139] mod_wsgi (pid=1274): Exception occurred processing WSGI script '/home/ubuntu/my_app/app.wsgi'.
[Mon Jun 25 22:27:04.658584 2018] [wsgi:error] [pid 1274:tid 139684482414336] [remote 172.31.12.149:62417] OSError: failed to write data
[Mon Jun 25 22:27:04.658818 2018] [wsgi:error] [pid 1274:tid 139684516017920] [remote 172.31.12.149:208] OSError: failed to write data
[Mon Jun 25 22:27:04.659999 2018] [wsgi:error] [pid 1274:tid 139684532819712] [remote 172.31.12.149:52139] OSError: failed to write data
[Mon Jun 25 22:27:04.660411 2018] [wsgi:error] [pid 1274:tid 139684507617024] [remote 172.31.12.149:720] OSError: failed to write data
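I am not sure whether this is related, but I am certain that we cancel a lot of requests before they complete (for some reason).

Also, we have been running on Ubuntu + Flask for years (most likely with the same setup) and have never had a problem like this.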


Any suggestions would be much appreciated. Thanks.

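What mod_wsgi configuration are you using? Are you using mod_wsgi daemon mode and setting the necessary timeouts, so that it recovers automatically if a lockup occurs? Are you forcing use of the main Python interpreter by setting WSGIApplicationGroup to %{GLOBAL}, to avoid lockups caused by third-party C extensions for Python that do not work in sub-interpreters? We really need to see your configuration.

@GrahamDumpleton We are facing the same problem, by setting WSGIApplicationGroup to %{GLOBAL}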
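
For readers unsure what the first comment is asking about, here is a rough sketch of a daemon-mode setup with timeouts and the main interpreter forced. It is not the asker's actual configuration, which was never posted. The process group name my_app and every numeric value are assumptions for illustration only. The script path is the one that appears in the error log above. Which timeout options are available depends on the installed mod_wsgi version, so check them against its documentation before use.

```apache
# Hypothetical example: the group name "my_app" and all numbers are
# illustrative assumptions, not values taken from the question.
# Run the application in dedicated daemon processes, with timeouts so a
# stuck process is restarted instead of hanging indefinitely.
WSGIDaemonProcess my_app processes=2 threads=15 \
    request-timeout=60 queue-timeout=45 socket-timeout=60 \
    connect-timeout=15 deadlock-timeout=60 graceful-timeout=15

# Route requests to that daemon process group.
WSGIProcessGroup my_app

# Force the main Python interpreter, since some third-party C extensions
# do not work correctly in sub-interpreters.
WSGIApplicationGroup %{GLOBAL}

# Script path as seen in the error log above.
WSGIScriptAlias / /home/ubuntu/my_app/app.wsgi
```

With request-timeout set, mod_wsgi restarts a daemon process whose requests run too long. That is the kind of automatic recovery the comment refers to, but it would not by itself explain why the requests stall.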