Google Cloud Platform Apache Beam: ReadFromText vs. ReadAllFromText
I am running an Apache Beam pipeline that reads text files from Google Cloud Storage, performs some parsing on those files, and writes the parsed data to BigQuery.

Ignoring the parsing and google_cloud_options here to keep it short, my code looks like the snippet at the end of this post (apache-beam 2.5.0 with the GCP add-ons, using Dataflow as the runner).

This runs fine and successfully appends the relevant data to my BigQuery table for a small number of input files. However, when I increase the number of input files to roughly 800k, I get an error:

"Total size of the BoundedSource objects returned by BoundedSource.split() operation is larger than the allowable limit."

I found suggestions to use ReadAllFromText instead of ReadFromText.
However, when I swap it in, I get the following error:
Traceback (most recent call last):
  File "/Users/richardtbenade/Repos/de_020/main_isolated.py", line 240, in <module>
    xmltobigquery.run_dataflow()
  File "/Users/richardtbenade/Repos/de_020/main_isolated.py", line 220, in run_dataflow
    'parse xml to dict' >> beam.ParDo(XmlToDictFn(), job_spec=self.job_spec) | \
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 831, in __ror__
    return self.transform.__ror__(pvalueish, self.label)
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 488, in __ror__
    result = p.apply(self, pvalueish, label)
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/pipeline.py", line 464, in apply
    return self.apply(transform, pvalueish)
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/pipeline.py", line 500, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish)
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 187, in apply
    return m(transform, input)
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 193, in apply_PTransform
    return transform.expand(input)
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/io/textio.py", line 470, in expand
    return pvalue | 'ReadAllFiles' >> self._read_all_files
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/pvalue.py", line 109, in __or__
    return self.pipeline.apply(ptransform, self)
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/pipeline.py", line 454, in apply
    label or transform.label)
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/pipeline.py", line 464, in apply
    return self.apply(transform, pvalueish)
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/pipeline.py", line 500, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish)
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 187, in apply
    return m(transform, input)
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 193, in apply_PTransform
    return transform.expand(input)
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/io/filebasedsource.py", line 416, in expand
    | 'ReadRange' >> ParDo(_ReadRange(self._source_from_file)))
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/pvalue.py", line 109, in __or__
    return self.pipeline.apply(ptransform, self)
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/pipeline.py", line 454, in apply
    label or transform.label)
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/pipeline.py", line 464, in apply
    return self.apply(transform, pvalueish)
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/pipeline.py", line 500, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish)
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 187, in apply
    return m(transform, input)
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 193, in apply_PTransform
    return transform.expand(input)
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/transforms/util.py", line 568, in expand
    | 'RemoveRandomKeys' >> Map(lambda t: t[1]))
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/pvalue.py", line 109, in __or__
    return self.pipeline.apply(ptransform, self)
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/pipeline.py", line 500, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish)
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 187, in apply
    return m(transform, input)
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 193, in apply_PTransform
    return transform.expand(input)
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/transforms/util.py", line 494, in expand
    windowing_saved = pcoll.windowing
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/pvalue.py", line 130, in windowing
    self.producer.inputs)
  File "/Users/richardtbenade/virtualenvs/de_020/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 443, in get_windowing
    return inputs[0].windowing
  File
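import apache_beam as beam

# `options` (the pipeline options) and `XmlToDictFn` (the parsing DoFn)
# are defined elsewhere and omitted here for brevity.
p = beam.Pipeline(options=options)

lines = (
    p
    | 'read from file' >> beam.io.ReadFromText('some_gcs_bucket_path*')
    | 'parse xml to dict' >> beam.ParDo(XmlToDictFn())
    | 'write to bigquery' >> beam.io.WriteToBigQuery(
        'my_table',
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))

p.run()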