Google Pub/Sub message processing slowing down (Python)
Tags: python, google-kubernetes-engine, grpc, google-cloud-pubsub, concurrent.futures

Since switching to the latest thread/callback-based Python libs, things have been slow between our Pub/Sub producers and consumers. We are fairly new to Google Pub/Sub and wonder whether anyone else has run into something similar after the recent library changes, or knows of a setting we may have missed.

From pushing a message to having it consumed by 3 workers (in Python), we are seeing an unexpected slowdown. Our handler takes only a few milliseconds per request, and we also changed the code to call message.ack() before running the handler. We subscribe with, for example, self.sub_client.subscribe(subscription_path, callback=self.message_callback). The messages are not duplicates. When we enqueue them we record the time in milliseconds, so we know how long they sat in the queue:
for pod in fra-worker-staging-deployment-1003989621-2mx0n fra-worker-staging-deployment-1003989621-b6llt fra-worker-staging-deployment-1003989621-lx4gq; do echo == $pod ==; kubectl logs $pod -c fra-worker | grep 'ACK start'; done
== fra-worker-staging-deployment-1003989621-2mx0n ==
[2017-09-25 23:29:03,987] {pubsub.py:147} INFO - ACK start: 22 ms for 1506382143.88 (0.10699987411499023 secs)
[2017-09-25 23:29:04,966] {pubsub.py:147} INFO - ACK start: 3 ms for 1506382144.767 (0.19900012016296387 secs)
[2017-09-25 23:29:14,708] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382144.219 (10.488999843597412 secs)
[2017-09-25 23:29:17,706] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382147.229 (10.476999998092651 secs)
[2017-09-25 23:29:37,767] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382144.782 (32.984999895095825 secs)
[2017-09-25 23:30:00,649] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382146.257 (54.39199995994568 secs)
== fra-worker-staging-deployment-1003989621-b6llt ==
[2017-09-25 23:29:04,083] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382143.957 (0.12599992752075195 secs)
[2017-09-25 23:29:05,261] {pubsub.py:147} INFO - ACK start: 3 ms for 1506382144.916 (0.3450000286102295 secs)
[2017-09-25 23:29:15,703] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382144.336 (11.367000102996826 secs)
[2017-09-25 23:29:25,630] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382143.812 (21.818000078201294 secs)
[2017-09-25 23:29:38,706] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382144.49 (34.21600008010864 secs)
[2017-09-25 23:30:01,752] {pubsub.py:147} INFO - ACK start: 3 ms for 1506382146.696 (55.055999994277954 secs)
== fra-worker-staging-deployment-1003989621-lx4gq ==
[2017-09-25 23:29:03,342] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382142.889 (0.4530000686645508 secs)
[2017-09-25 23:29:04,955] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382143.907 (1.0469999313354492 secs)
[2017-09-25 23:29:14,704] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382143.888 (10.815999984741211 secs)
[2017-09-25 23:29:17,705] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382147.205 (10.5 secs)
[2017-09-25 23:29:37,767] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382144.197 (33.5699999332428 secs)
[2017-09-25 23:29:59,733] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382144.269 (55.46399998664856 secs)
[2017-09-25 23:31:18,870] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382146.924 (131.94599986076355 secs)
At first messages seem to take only a short time from enqueue to read, but then they start arriving later and later, with delays of around 10, 32, and 55 seconds. (These are not duplicates, so this is not retry logic kicking in due to failed ACKs.)
We wrote a small test, which runs fine with a small number of senders and messages, but once we increase the message count to 1500 and the number of senders to 3, we find that the publish call often returns a future whose result is an exception ("Some messages were not successfully published."). The results show around 500 messages per second, but with an error rate of >10% on publish() calls raising this exception:
Done in 2929 ms, 512.12 qps (154 10.3%)
Done in 2901 ms, 517.06 qps (165 11.0%)
Done in 2940 ms, 510.20 qps (217 14.5%)
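For reference, the trailing "(N P%)" pair in each line is the failed-publish count and its share of the 1500 test messages (the publisher script below formats it as f'({bad} {pct:0.1%})'); the numbers are self-consistent:

```python
# Each run published num = 1500 messages; the report line ends with
# the failure count and its percentage, e.g. "(154 10.3%)".
num = 1500
for bad in (154, 165, 217):
    print(f'({bad} {bad / num:0.1%})')  # → (154 10.3%) (165 11.0%) (217 14.5%)
```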
And although our senders finish within 3 seconds (they run in parallel), the workers are receiving messages that were enqueued 20 seconds earlier:
Got message {'tstamp': '1506557436.988', 'msg': 'msg#393@982'} 20.289 sec
Here is the worker/listener:
import time
from google.api.core.exceptions import RetryError as core_RetryError
from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.subscriber.policy import thread
from google.gax.errors import RetryError as gax_RetryError
import grpc
from core.utils import b64json_decode, b64json_encode, Timer
TOPIC = 'pubsub-speed-test'
NOTIFY_PROJECT = '== OUR PROJECT =='
def receive(message):
    decoded = b64json_decode(message.data)
    message.ack()
    took = time.time() - float(decoded.get('tstamp', 0))
    print(f'Got message {decoded} {took:0.3f} sec')

if __name__ == '__main__':
    client = pubsub_v1.SubscriberClient()
    topic_path = client.topic_path(NOTIFY_PROJECT, TOPIC)
    subs_path = client.subscription_path(NOTIFY_PROJECT, 'pubsub-worker')
    try:
        client.create_subscription(subs_path, topic_path)
    except Exception:
        pass  # subscription already exists
    print(f'Subscription: topic={topic_path} subscription={subs_path}')
    timer = Timer()
    client.subscribe(subs_path, callback=receive)
    time.sleep(120)
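With this library version, subscribe() returns immediately and invokes the callback on background threads, which is why the script has to sleep to keep the main thread alive. A stdlib sketch of that shape (the subscribe stand-in here is hypothetical, not the real client):

```python
import queue
import threading

def subscribe(q, callback):
    """Deliver items from q to callback on a daemon thread, mimicking the
    non-blocking, thread-based subscriber client (hypothetical stand-in)."""
    def worker():
        while True:
            item = q.get()
            if item is None:  # sentinel: stop delivering
                break
            callback(item)
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t

received = []
q = queue.Queue()
t = subscribe(q, received.append)  # returns immediately; caller must stay alive
for i in range(3):
    q.put(i)
q.put(None)
t.join()
print(received)  # → [0, 1, 2]
```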
And the sender/publisher:
import os
import time
import concurrent.futures
from google.api.core.exceptions import RetryError as core_RetryError
from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.subscriber.policy import thread
from google.gax.errors import RetryError as gax_RetryError
import grpc
from core.utils import b64json_decode, b64json_encode, Timer
TOPIC = 'pubsub-speed-test'
NOTIFY_PROJECT = '== OUR PROJECT =='
def publish(topic_path, message, client):
    tstamp = f'{time.time():0.3f}'
    data = {'tstamp': tstamp, 'msg': message}
    future = client.publish(topic_path, b64json_encode(data, raw=True))
    future.add_done_callback(lambda x: print(f'Publishing done callback: {data}'))
    return future

if __name__ == '__main__':
    client = pubsub_v1.PublisherClient()
    topic_path = client.topic_path(NOTIFY_PROJECT, TOPIC)
    num = 1500
    pid = os.getpid()
    fs = []
    timer = Timer()
    for i in range(0, num):
        f = publish(topic_path, f'msg#{i}@{pid}', client)
        fs.append(f)
    print(f'Launched {len(fs)} futures in {timer.get_msecs()} ms')
    good = bad = 0
    for future in fs:
        try:
            data = future.result()
            # print(f'result: {data}')
            good += 1
        except Exception as exc:
            print(f'generated an exception: {exc} ({exc!r})')
            bad += 1
    took_ms = timer.get_msecs()
    pct = bad / num
    print(f'Done in {took_ms} ms, {num / took_ms * 1000:0.2f} qps ({bad} {pct:0.1%})')
And here is the Timer class from core.utils:
####################
# Time / Timing
####################
import sys
import time

import arrow  # third-party, used for tz-aware timestamps

def utcnow():
    """Time now with tzinfo, mainly for mocking in unittests"""
    return arrow.utcnow()

def relative_time():
    """Relative time for finding timedeltas depending on your python version"""
    if sys.version_info[0] >= 3:
        return time.perf_counter()
    else:
        return time.time()

class Timer:
    def __init__(self):
        self.reset()

    def reset(self):
        self.start_time = relative_time()

    def get_msecs(self):
        return int((relative_time() - self.start_time) * 1000)

    def get_secs(self):
        return int(relative_time() - self.start_time)
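The b64json_encode/b64json_decode helpers from core.utils are not shown; here is a minimal stand-in consistent with how the scripts use them. The implementation is my assumption: JSON-serialize, then base64-encode, with raw=True returning bytes (Pub/Sub message data must be bytes).

```python
import base64
import json

def b64json_encode(obj, raw=False):
    """JSON-serialize obj and base64-encode it; raw=True returns bytes.
    Hypothetical reconstruction of the core.utils helper."""
    payload = base64.b64encode(json.dumps(obj).encode('utf-8'))
    return payload if raw else payload.decode('ascii')

def b64json_decode(data):
    """Inverse of b64json_encode; accepts bytes or str."""
    return json.loads(base64.b64decode(data).decode('utf-8'))

msg = {'tstamp': '1506557436.988', 'msg': 'msg#393@982'}
assert b64json_decode(b64json_encode(msg, raw=True)) == msg
```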
In addition, in our main code we occasionally see IOErrors from which the threads never seem to recover (besides DeadlineExceeded, which can be ignored). To work around this, we wrapped the policy so we can catch certain exceptions and restart the client as needed (though we are not sure this is working correctly).
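The policy wrapper itself isn't shown; the general shape is a loop that re-creates the client after an unrecoverable error. A stdlib sketch under that assumption (make_client and consume_forever are hypothetical names, not the real client API):

```python
import itertools

def run_with_restarts(make_client, consume_forever, max_restarts=3):
    """Re-create the client and resubscribe when consuming dies with an
    unexpected IOError; ignorable errors (deadline exceeded, etc.) would be
    swallowed inside consume_forever. Sketch only; names are hypothetical."""
    for attempt in itertools.count():
        client = make_client()
        try:
            consume_forever(client)
            return
        except IOError:
            if attempt >= max_restarts:
                raise  # give up after too many restarts

# Simulated demo: fail twice with IOError, then succeed.
calls = []

def make_client():
    return object()

def consume_forever(client):
    calls.append(client)
    if len(calls) < 3:
        raise IOError('stream broke')

run_with_restarts(make_client, consume_forever)
print(len(calls))  # → 3
```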
And our versions:
pip freeze | grep goog
gapic-google-cloud-pubsub-v1==0.15.4
google-auth==1.0.2
google-cloud-core==0.27.1
google-cloud-pubsub==0.28.2
google-gax==0.15.14
googleapis-common-protos==1.5.2
grpc-google-iam-v1==0.11.1
proto-google-cloud-pubsub-v1==0.15.4
Comments:

Q: Which version of grpcio are you using? (Your | grep goog stripped it out of your pip freeze.) Are you able to use 1.6.3, the current latest version?

A: grpc-google-iam-v1==0.11.1 grpcio==1.4.0. Upgrading didn't improve things much, although publishing to Pub/Sub was faster and more consistent than what I currently see in production: `Done in 2503 ms, 399.52 qps (69.9%) Done in 2511 ms, 398.25 qps (119.9%) Done in 3094 ms, 323.21 qps (123.12%)`

Q: As a test, what happens if you have multiple VMs consuming the subscription in parallel? Can several VMs working together keep up with the subscription? (I'm curious whether the problem is the client not pulling data off the pipe fast enough, or Pub/Sub itself falling behind. My guess is the former.)

A: With some deadlines at the moment, we'll try to get to it over the weekend. The code is there if you want to give it a spin.