Python Google Cloud Dataflow: how to reduce the size of the http_request - Python, Google Cloud Dataflow, Apache Beam


python google-cloud-dataflow


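I have a Dataflow job that appears to be failing because the http_request body sent when trying to create the job is too large. These are the request headers: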

{'content-length': '107245818', 'content-type': 'application/json', 'accept-encoding': 'gzip, deflate', 'accept': 'application/json', 'user-agent': 'x_xxxxxxxx'}

Sending the request returns:

413. That's an error.

Your client issued a request that was too large.

What in the request body is making it so large? What can I do to shrink it, or to get Google's servers to accept the request?

I am using Apache Beam Python SDK version 2.4.0.
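One way to see which part of a job description dominates its size is to write the description to a local JSON file (for example with the --dataflow_job_file option mentioned further down) and list its largest string values. The sketch below makes no assumption about the real job schema: it walks any parsed JSON tree. The names largest_strings, report, and sample_job, and the "payload" key, are hypothetical and exist only for illustration.

```python
import json


def largest_strings(node, path="$", top=5):
    """Return the `top` largest string values as (character_count, path) pairs."""
    found = []

    def walk(value, where):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{where}.{key}")
        elif isinstance(value, list):
            for index, child in enumerate(value):
                walk(child, f"{where}[{index}]")
        elif isinstance(value, str):
            found.append((len(value), where))

    walk(node, path)
    return sorted(found, reverse=True)[:top]


def report(json_path, top=5):
    """Load a JSON file from disk and print its largest string values."""
    with open(json_path) as handle:
        tree = json.load(handle)
    for size, where in largest_strings(tree, top=top):
        print(f"{size:>10}  {where}")


# Hypothetical miniature job description; a real file will look different.
sample_job = {
    "name": "demo",
    "steps": [
        {"name": "read", "properties": {"payload": "x" * 50}},
        {"name": "combine", "properties": {"payload": "x" * 5000}},
    ],
}

for size, where in largest_strings(sample_job, top=3):
    print(f"{size:>10}  {where}")
```

Calling report on the dumped file would then point at the step whose serialized content is largest, which narrows down which transform to look at.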

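I serialized the function definitions to see what was taking up so much space. It appears to be entirely due to the extract_output definition of the CombineFn. Changing it from:

def extract_output(self, accumulator):
    output = zip(self.id_list, accumulator[1], *accumulator[0])
    return output

to

def extract_output(self, accumulator):
    return accumulator

reduced the content-length from '74538844' to '858884', roughly a 100x reduction.

accumulator holds two numpy arrays, of size len(id_list) x len(id_list) and len(id_list). id_list provides an integer label for each row of the arrays (about 3000 in this case) and is known when the pipeline is constructed.

I don't know why this happens, but attaching the IDs in a DoFn after the CombineFn produces a reasonably sized request and roughly the same output.

This means the JSON representation of the job is too large and needs to be optimized (hard to say how without the code). The maximum is 10MB and you seem to be a bit over it. Use --dataflow_job_file= to inspect it.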
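The workaround described in the update (let extract_output return the raw accumulator and attach the IDs in a later step) could be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: attach_ids, ids, and combined are hypothetical names, plain lists stand in for the numpy arrays, and the accumulator is assumed to be a (matrix, vector) pair as in the zip call above.

```python
def attach_ids(accumulator, id_list):
    """Pair each ID with its vector entry and its column of the matrix.

    Mirrors zip(id_list, accumulator[1], *accumulator[0]) from the
    original extract_output, but runs as a separate step.
    """
    matrix, vector = accumulator
    # list() materializes the result, since zip is lazy on Python 3.
    return list(zip(id_list, vector, *matrix))


# Tiny stand-in data: a 2 x 2 matrix and a length-2 vector.
ids = [7, 8]
combined = ([[1, 2], [3, 4]], [10, 20])

rows = attach_ids(combined, ids)
print(rows)  # [(7, 10, 1, 3), (8, 20, 2, 4)]
```

In a pipeline, a function like this could be applied after the combine step, for instance with beam.Map(attach_ids, id_list), since beam.Map passes extra arguments through to the callable. Whether this shrinks the request in a given job would need to be checked against the dumped job file.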