Google Pub/Sub message processing slowing down (Python)
Tags: python, google-kubernetes-engine, grpc, google-cloud-pubsub, concurrent.futures

Since switching to the latest thread/callback-based Python libs, things have been slow between our Pub/Sub producers and consumers. We are fairly new to Google Pub/Sub and wonder whether anyone else has run into something similar after the recent library changes, or knows of a setting we may have missed.

From pushing a message to having it consumed by 3 workers (in Python), we are seeing an unexpected slowdown. Our handler takes only a few milliseconds per request, and we also changed the code to call message.ack() before running the handler. We subscribe with, for example, self.sub_client.subscribe(subscription_path, callback=self.message_callback). The messages are not duplicates. When we enqueue them we record the time in milliseconds, so we know how long they sat in the queue:
for pod in fra-worker-staging-deployment-1003989621-2mx0n fra-worker-staging-deployment-1003989621-b6llt fra-worker-staging-deployment-1003989621-lx4gq; do echo == $pod ==; kubectl logs $pod -c fra-worker | grep 'ACK start'; done
== fra-worker-staging-deployment-1003989621-2mx0n ==
[2017-09-25 23:29:03,987] {pubsub.py:147} INFO - ACK start: 22 ms for 1506382143.88 (0.10699987411499023 secs)
[2017-09-25 23:29:04,966] {pubsub.py:147} INFO - ACK start: 3 ms for 1506382144.767 (0.19900012016296387 secs)
[2017-09-25 23:29:14,708] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382144.219 (10.488999843597412 secs)
[2017-09-25 23:29:17,706] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382147.229 (10.476999998092651 secs)
[2017-09-25 23:29:37,767] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382144.782 (32.984999895095825 secs)
[2017-09-25 23:30:00,649] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382146.257 (54.39199995994568 secs)
== fra-worker-staging-deployment-1003989621-b6llt ==
[2017-09-25 23:29:04,083] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382143.957 (0.12599992752075195 secs)
[2017-09-25 23:29:05,261] {pubsub.py:147} INFO - ACK start: 3 ms for 1506382144.916 (0.3450000286102295 secs)
[2017-09-25 23:29:15,703] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382144.336 (11.367000102996826 secs)
[2017-09-25 23:29:25,630] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382143.812 (21.818000078201294 secs)
[2017-09-25 23:29:38,706] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382144.49 (34.21600008010864 secs)
[2017-09-25 23:30:01,752] {pubsub.py:147} INFO - ACK start: 3 ms for 1506382146.696 (55.055999994277954 secs)
== fra-worker-staging-deployment-1003989621-lx4gq ==
[2017-09-25 23:29:03,342] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382142.889 (0.4530000686645508 secs)
[2017-09-25 23:29:04,955] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382143.907 (1.0469999313354492 secs)
[2017-09-25 23:29:14,704] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382143.888 (10.815999984741211 secs)
[2017-09-25 23:29:17,705] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382147.205 (10.5 secs)
[2017-09-25 23:29:37,767] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382144.197 (33.5699999332428 secs)
[2017-09-25 23:29:59,733] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382144.269 (55.46399998664856 secs)
[2017-09-25 23:31:18,870] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382146.924 (131.94599986076355 secs)
At first messages seem to take only a short time from enqueue to read, but then they start arriving later and later, with delays of around 10, 32, and 55 seconds. (These are not duplicates, so this is not retry logic kicking in due to failed ACKs.)
We wrote a small test, which runs fine with a small number of senders and messages, but once we increase the message count to 1500 and the number of senders to 3, we find that the publish call often returns a future whose result is an exception ("Some messages were not successfully published."). The results show around 500 messages per second, but with an error rate of >10% on publish() calls raising this exception:
Done in 2929 ms, 512.12 qps (154 10.3%)
Done in 2901 ms, 517.06 qps (165 11.0%)
Done in 2940 ms, 510.20 qps (217 14.5%)
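For reference, the trailing "(N P%)" pair in each line is the failed-publish count and its share of the 1500 test messages (the publisher script below formats it as f'({bad} {pct:0.1%})'); the numbers are self-consistent:

```python
# Each run published num = 1500 messages; the report line ends with
# the failure count and its percentage, e.g. "(154 10.3%)".
num = 1500
for bad in (154, 165, 217):
    print(f'({bad} {bad / num:0.1%})')  # → (154 10.3%) (165 11.0%) (217 14.5%)
```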
And although our senders finish within 3 seconds (they run in parallel), the workers are receiving messages that were enqueued 20 seconds earlier:
Got message {'tstamp': '1506557436.988', 'msg': 'msg#393@982'} 20.289 sec
Here is the worker/listener:
import time
from google.api.core.exceptions import RetryError as core_RetryError
from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.subscriber.policy import thread
from google.gax.errors import RetryError as gax_RetryError
import grpc
from core.utils import b64json_decode, b64json_encode, Timer
TOPIC = 'pubsub-speed-test'
NOTIFY_PROJECT = '== OUR PROJECT =='
def receive(message):
    decoded = b64json_decode(message.data)
    message.ack()
    took = time.time() - float(decoded.get('tstamp', 0))
    print(f'Got message {decoded} {took:0.3f} sec')

if __name__ == '__main__':
    client = pubsub_v1.SubscriberClient()
    topic_path = client.topic_path(NOTIFY_PROJECT, TOPIC)
    subs_path = client.subscription_path(NOTIFY_PROJECT, 'pubsub-worker')
    try:
        client.create_subscription(subs_path, topic_path)
    except Exception:
        pass  # subscription already exists
    print(f'Subscription: topic={topic_path} subscription={subs_path}')
    timer = Timer()
    client.subscribe(subs_path, callback=receive)
    time.sleep(120)
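With this library version, subscribe() returns immediately and invokes the callback on background threads, which is why the script has to sleep to keep the main thread alive. A stdlib sketch of that shape (the subscribe stand-in here is hypothetical, not the real client):

```python
import queue
import threading

def subscribe(q, callback):
    """Deliver items from q to callback on a daemon thread, mimicking the
    non-blocking, thread-based subscriber client (hypothetical stand-in)."""
    def worker():
        while True:
            item = q.get()
            if item is None:  # sentinel: stop delivering
                break
            callback(item)
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t

received = []
q = queue.Queue()
t = subscribe(q, received.append)  # returns immediately; caller must stay alive
for i in range(3):
    q.put(i)
q.put(None)
t.join()
print(received)  # → [0, 1, 2]
```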
And the sender/publisher:
import os
import time
import concurrent.futures
from google.api.core.exceptions import RetryError as core_RetryError
from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.subscriber.policy import thread
from google.gax.errors import RetryError as gax_RetryError
import grpc
from core.utils import b64json_decode, b64json_encode, Timer
TOPIC = 'pubsub-speed-test'
NOTIFY_PROJECT = '== OUR PROJECT =='
def publish(topic_path, message, client):
    tstamp = f'{time.time():0.3f}'
    data = {'tstamp': tstamp, 'msg': message}
    future = client.publish(topic_path, b64json_encode(data, raw=True))
    future.add_done_callback(lambda x: print(f'Publishing done callback: {data}'))
    return future

if __name__ == '__main__':
    client = pubsub_v1.PublisherClient()
    topic_path = client.topic_path(NOTIFY_PROJECT, TOPIC)
    num = 1500
    pid = os.getpid()
    fs = []
    timer = Timer()
    for i in range(0, num):
        f = publish(topic_path, f'msg#{i}@{pid}', client)
        fs.append(f)
    print(f'Launched {len(fs)} futures in {timer.get_msecs()} ms')
    good = bad = 0
    for future in fs:
        try:
            data = future.result()
            # print(f'result: {data}')
            good += 1
        except Exception as exc:
            print(f'generated an exception: {exc} ({exc!r})')
            bad += 1
    took_ms = timer.get_msecs()
    pct = bad / num
    print(f'Done in {took_ms} ms, {num / took_ms * 1000:0.2f} qps ({bad} {pct:0.1%})')
And here is the Timer class from core.utils:
####################
# Time / Timing
####################
import sys
import time

import arrow  # third-party, used for tz-aware timestamps

def utcnow():
    """Time now with tzinfo, mainly for mocking in unittests"""
    return arrow.utcnow()

def relative_time():
    """Relative time for finding timedeltas depending on your python version"""
    if sys.version_info[0] >= 3:
        return time.perf_counter()
    else:
        return time.time()

class Timer:
    def __init__(self):
        self.reset()

    def reset(self):
        self.start_time = relative_time()

    def get_msecs(self):
        return int((relative_time() - self.start_time) * 1000)

    def get_secs(self):
        return int(relative_time() - self.start_time)
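The b64json_encode/b64json_decode helpers from core.utils are not shown; here is a minimal stand-in consistent with how the scripts use them. The implementation is my assumption: JSON-serialize, then base64-encode, with raw=True returning bytes (Pub/Sub message data must be bytes).

```python
import base64
import json

def b64json_encode(obj, raw=False):
    """JSON-serialize obj and base64-encode it; raw=True returns bytes.
    Hypothetical reconstruction of the core.utils helper."""
    payload = base64.b64encode(json.dumps(obj).encode('utf-8'))
    return payload if raw else payload.decode('ascii')

def b64json_decode(data):
    """Inverse of b64json_encode; accepts bytes or str."""
    return json.loads(base64.b64decode(data).decode('utf-8'))

msg = {'tstamp': '1506557436.988', 'msg': 'msg#393@982'}
assert b64json_decode(b64json_encode(msg, raw=True)) == msg
```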
In addition, in our main code we occasionally see IOErrors from which the threads never seem to recover (besides DeadlineExceeded, which can be ignored). To work around this, we wrapped the policy so we can catch certain exceptions and restart the client as needed (though we are not sure this is working correctly).
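The policy wrapper itself isn't shown; the general shape is a loop that re-creates the client after an unrecoverable error. A stdlib sketch under that assumption (make_client and consume_forever are hypothetical names, not the real client API):

```python
import itertools

def run_with_restarts(make_client, consume_forever, max_restarts=3):
    """Re-create the client and resubscribe when consuming dies with an
    unexpected IOError; ignorable errors (deadline exceeded, etc.) would be
    swallowed inside consume_forever. Sketch only; names are hypothetical."""
    for attempt in itertools.count():
        client = make_client()
        try:
            consume_forever(client)
            return
        except IOError:
            if attempt >= max_restarts:
                raise  # give up after too many restarts

# Simulated demo: fail twice with IOError, then succeed.
calls = []

def make_client():
    return object()

def consume_forever(client):
    calls.append(client)
    if len(calls) < 3:
        raise IOError('stream broke')

run_with_restarts(make_client, consume_forever)
print(len(calls))  # → 3
```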
And our versions:
pip freeze | grep goog
gapic-google-cloud-pubsub-v1==0.15.4
google-auth==1.0.2
google-cloud-core==0.27.1
google-cloud-pubsub==0.28.2
google-gax==0.15.14
googleapis-common-protos==1.5.2
grpc-google-iam-v1==0.11.1
proto-google-cloud-pubsub-v1==0.15.4
Comments:

Q: Which version of grpcio are you using? (Your | grep goog stripped it out of your pip freeze.) Are you able to use 1.6.3, the current latest version?

A: grpc-google-iam-v1==0.11.1 grpcio==1.4.0. Upgrading didn't improve things much, although publishing to Pub/Sub was faster and more consistent than what I currently see in production: `Done in 2503 ms, 399.52 qps (69.9%) Done in 2511 ms, 398.25 qps (119.9%) Done in 3094 ms, 323.21 qps (123.12%)`

Q: As a test, what happens if you have multiple VMs consuming the subscription in parallel? Can several VMs working together keep up with the subscription? (I'm curious whether the problem is the client not pulling data off the pipe fast enough, or Pub/Sub itself falling behind. My guess is the former.)

A: With some deadlines at the moment, we'll try to get to it over the weekend. The code is there if you want to give it a spin.