Google Pub/Sub message processing slowdown (Python)

Since switching to the latest thread/callback-based Python libs, things have been slow between our Pub/Sub producers and consumers. We're fairly new to Google Pub/Sub, and we're wondering whether anyone else has run into similar problems since the recent library change, or knows of a setting we may have missed.

From pushing a message to one of 3 workers (in Python) consuming it, we are seeing unexpected slowdowns. Our handler takes only a few milliseconds per request, and we also changed the code to call
message.ack()
before running the handler. We subscribe with, e.g.,
self.sub_client.subscribe(subscription_path, callback=self.message_callback)
. The messages are not duplicates. We record the enqueue time (in ms) in each message so we can see how long it sat in the queue:

for pod in worker-staging-deployment-1003989621-2mx0n worker-staging-deployment-1003989621-b6llt worker-staging-deployment-1003989621-lx4gq; do echo == $pod ==; kubectl logs $pod -c fra-worker | grep 'ACK start'; done
== fra-worker-staging-deployment-1003989621-2mx0n ==                                        
[2017-09-25 23:29:03,987] {pubsub.py:147} INFO - ACK start: 22 ms for 1506382143.88 (0.10699987411499023 secs)                                                                                                                                                                                                                                                                                                      
[2017-09-25 23:29:04,966] {pubsub.py:147} INFO - ACK start: 3 ms for 1506382144.767 (0.19900012016296387 secs)
[2017-09-25 23:29:14,708] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382144.219 (10.488999843597412 secs)
[2017-09-25 23:29:17,706] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382147.229 (10.476999998092651 secs)
[2017-09-25 23:29:37,767] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382144.782 (32.984999895095825 secs)
[2017-09-25 23:30:00,649] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382146.257 (54.39199995994568 secs)
== fra-worker-staging-deployment-1003989621-b6llt ==
[2017-09-25 23:29:04,083] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382143.957 (0.12599992752075195 secs)
[2017-09-25 23:29:05,261] {pubsub.py:147} INFO - ACK start: 3 ms for 1506382144.916 (0.3450000286102295 secs)
[2017-09-25 23:29:15,703] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382144.336 (11.367000102996826 secs)                                                                                                                                                                                                                                                                                                       
[2017-09-25 23:29:25,630] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382143.812 (21.818000078201294 secs)
[2017-09-25 23:29:38,706] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382144.49 (34.21600008010864 secs)
[2017-09-25 23:30:01,752] {pubsub.py:147} INFO - ACK start: 3 ms for 1506382146.696 (55.055999994277954 secs)                                                                                                                                                                                                                                                                                                       
== fra-worker-staging-deployment-1003989621-lx4gq ==                                                                                                                                                                                                                                                                                                                                                                
[2017-09-25 23:29:03,342] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382142.889 (0.4530000686645508 secs)                                                                                                                                               
[2017-09-25 23:29:04,955] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382143.907 (1.0469999313354492 secs)                                                                                                                                                                                                                   
[2017-09-25 23:29:14,704] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382143.888 (10.815999984741211 secs)                                                                                                                                                                                                                                                                                                       
[2017-09-25 23:29:17,705] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382147.205 (10.5 secs)                                                                                                                                                                                                                                                                                                                     
[2017-09-25 23:29:37,767] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382144.197 (33.5699999332428 secs)                                            
[2017-09-25 23:29:59,733] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382144.269 (55.46399998664856 secs)                                                                                                                                                                                                                                                                                                        
[2017-09-25 23:31:18,870] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382146.924 (131.94599986076355 secs)                                                                                                                                                                                   
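The trailing seconds figure in each line appears to be the wall-clock log time minus the enqueue timestamp embedded in the message; checking the first line of the first pod (a quick sanity check, not part of the original logs):

```python
import datetime

# Log line: [2017-09-25 23:29:03,987] ... ACK start: 22 ms for 1506382143.88
log_time = datetime.datetime(2017, 9, 25, 23, 29, 3, 987000,
                             tzinfo=datetime.timezone.utc)
enqueue_ts = 1506382143.88  # epoch seconds recorded at publish time

delay = log_time.timestamp() - enqueue_ts
print(f'{delay:0.3f} secs')  # matches the logged 0.10699987... secs
```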
At first the messages take only a short time from enqueue to read, but then they start arriving later and later, with delays of 10 seconds, 32 seconds, 55 seconds. (These are not duplicates, so this is not retry logic kicking in after a failed ack.)

We wrote a small test that runs quickly with a small number of senders and messages, but once we increase the message count to 1500 and the sender count to 3, we see the publish calls frequently return a future whose result is an exception ("Some messages were not successfully published."). The results show roughly 500 messages per second, but with >10% of the
publish()
calls raising this exception:
Done in 2929 ms, 512.12 qps (154 10.3%)
Done in 2901 ms, 517.06 qps (165 11.0%)
Done in 2940 ms, 510.20 qps (217 14.5%)
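The bracketed figures line up with the ({bad} {pct:0.1%}) format string in the sender script below, i.e. the failure count out of the 1500 messages sent:

```python
num = 1500
for bad in (154, 165, 217):
    print(f'({bad} {bad / num:0.1%})')
# (154 10.3%)
# (165 11.0%)
# (217 14.5%)
```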
Also, although our senders finish within about 3 seconds (they run in parallel), the workers are still receiving messages that were enqueued 20 seconds earlier:

Got message {'tstamp': '1506557436.988', 'msg': 'msg#393@982'} 20.289 sec
Here is the worker/listener:

import time

from google.api.core.exceptions import RetryError as core_RetryError
from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.subscriber.policy import thread
from google.gax.errors import RetryError as gax_RetryError
import grpc

from core.utils import b64json_decode, b64json_encode, Timer


TOPIC = 'pubsub-speed-test'
NOTIFY_PROJECT = '== OUR PROJECT =='


def receive(message):
    decoded = b64json_decode(message.data)
    message.ack()
    took = time.time() - float(decoded.get('tstamp', 0))
    print(f'Got message {decoded} {took:0.3f} sec')


if __name__ == '__main__':
    client = pubsub_v1.SubscriberClient()
    topic_path = client.topic_path(NOTIFY_PROJECT, TOPIC)
    subs_path = client.subscription_path(NOTIFY_PROJECT, 'pubsub-worker')

    try:
        client.create_subscription(subs_path, topic_path)
    except Exception:
        # subscription may already exist
        pass
    print(f'Subscription: topic={topic_path} subscription={subs_path}')

    timer = Timer()
    client.subscribe(subs_path, callback=receive)
    time.sleep(120)
and the sender/publisher:

import os
import time
import concurrent.futures

from google.api.core.exceptions import RetryError as core_RetryError
from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.subscriber.policy import thread
from google.gax.errors import RetryError as gax_RetryError
import grpc

from core.utils import b64json_decode, b64json_encode, Timer

TOPIC = 'pubsub-speed-test'
NOTIFY_PROJECT = '== OUR PROJECT =='


def publish(topic_path, message, client):
    tstamp = f'{time.time():0.3f}'
    data = {'tstamp': tstamp, 'msg': message}
    future = client.publish(topic_path, b64json_encode(data, raw=True))
    future.add_done_callback(lambda x: print(f'Publishing done callback: {data}'))
    return future


if __name__ == '__main__':
    client = pubsub_v1.PublisherClient()
    topic_path = client.topic_path(NOTIFY_PROJECT, TOPIC)

    num = 1500
    pid = os.getpid()
    fs = []
    timer = Timer()
    for i in range(0, num):
        f = publish(topic_path, f'msg#{i}@{pid}', client)
        fs.append(f)
    print(f'Launched {len(fs)} futures in {timer.get_msecs()} ms')

    good = bad = 0
    for future in fs:
        try:
            data = future.result()
            # print(f'result: {data}')
            good += 1
        except Exception as exc:
            print(f'generated an exception: {exc} ({exc!r})')
            bad += 1
    took_ms = timer.get_msecs()
    pct = bad / num
    print(f'Done in {took_ms} ms, {num / took_ms * 1000:0.2f} qps ({bad} {pct:0.1%})')
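One mitigation worth trying for the publish-side error rate (our own sketch, not something from the scripts above) is to cap the number of in-flight publishes with a semaphore released from each future's done callback. Here a ThreadPoolExecutor stands in for the Pub/Sub client so the pattern is runnable on its own; MAX_IN_FLIGHT and publish_throttled are hypothetical names:

```python
import concurrent.futures
import threading

MAX_IN_FLIGHT = 100  # tune to taste; assumption, not a library setting


def publish_throttled(executor, payloads):
    """Submit work while never allowing more than MAX_IN_FLIGHT pending futures."""
    sem = threading.BoundedSemaphore(MAX_IN_FLIGHT)
    futures = []
    for payload in payloads:
        sem.acquire()  # blocks while too many publishes are outstanding
        f = executor.submit(lambda p: p, payload)  # stand-in for client.publish()
        f.add_done_callback(lambda _f: sem.release())
        futures.append(f)
    return [f.result() for f in futures]


if __name__ == '__main__':
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as ex:
        results = publish_throttled(ex, [f'msg#{i}' for i in range(500)])
    print(len(results))  # 500
```

With the real client, executor.submit(...) would be replaced by client.publish(...), which already returns a future with add_done_callback.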
And here is the Timer class from core.utils:

import sys
import time

import arrow

####################
# Time / Timing
####################


def utcnow():
    """Time now with tzinfo, mainly for mocking in unittests"""
    return arrow.utcnow()


def relative_time():
    """Relative time for finding timedeltas depending on your Python version"""
    if sys.version_info[0] >= 3:
        return time.perf_counter()
    else:
        return time.time()


class Timer:
    def __init__(self):
        self.reset()

    def reset(self):
        self.start_time = relative_time()

    def get_msecs(self):
        return int((relative_time() - self.start_time) * 1000)

    def get_secs(self):
        return int((relative_time() - self.start_time))
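For reference, the Timer semantics restated inline so the check is self-contained (Python 3 only, so perf_counter is used directly):

```python
import time


class Timer:
    """Minimal restatement of the core.utils.Timer above."""
    def __init__(self):
        self.start_time = time.perf_counter()

    def get_msecs(self):
        # elapsed wall time, truncated to whole milliseconds
        return int((time.perf_counter() - self.start_time) * 1000)


timer = Timer()
time.sleep(0.05)
print(timer.get_msecs())  # at least ~50
```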
Additionally, in our main code we occasionally see IOErrors from which the threads never seem to recover (beyond the deadline-exceeded errors, which can be ignored). To work around this we wrapped the policy so that we can catch certain exceptions and restart the client as needed (though we're not sure this is working correctly).
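The general shape of such a wrapper (a sketch with hypothetical helper names; the actual policy subclassing is omitted) is a loop that rebuilds the client whenever a restartable error escapes:

```python
import time

RESTARTABLE = (IOError, OSError)


def run_subscriber_forever(make_client, subscribe, backoff_secs=5.0):
    """Recreate the client and resubscribe whenever a restartable error escapes.

    make_client() and subscribe(client) are injected so this loop can be
    exercised without a live Pub/Sub connection; both names are hypothetical.
    """
    while True:
        client = make_client()
        try:
            subscribe(client)  # blocks until the subscription dies or finishes
            return
        except RESTARTABLE as exc:
            print(f'subscriber died with {exc!r}; restarting in {backoff_secs}s')
            time.sleep(backoff_secs)
```

In the real worker, subscribe(client) would call client.subscribe(...) and then block, and the except clause would also need to cover the gRPC/GAX RetryError types imported at the top of the listener.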

And our versions:

pip freeze | grep goog
gapic-google-cloud-pubsub-v1==0.15.4
google-auth==1.0.2
google-cloud-core==0.27.1
google-cloud-pubsub==0.28.2
google-gax==0.15.14
googleapis-common-protos==1.5.2
grpc-google-iam-v1==0.11.1
proto-google-cloud-pubsub-v1==0.15.4

Comments:

Which version of
grpcio
are you using? Your
| grep goog
strips it out of your
pip freeze
. Are you able to try
1.6.3
(the current latest version)?

grpc-google-iam-v1==0.11.1, grpcio==1.4.0. Not much improvement, although publishing to Pub/Sub was faster and more consistent than what I currently see in production: `Done in 2503 ms, 399.52 qps (69.9%) Done in 2511 ms, 398.25 qps (119.9%) Done in 3094 ms, 323.21 qps (123.12%)`

As a test, what happens if you have multiple VMs consuming the subscription in parallel? Can multiple VMs working together keep up with it? (I'm curious whether the problem is the client not pulling data off the pipe fast enough, or Pub/Sub itself falling behind. My guess is the former.)

With some deadlines at the moment, we'll try to get to it over the weekend. The code is there if you want to give it a spin.