How to filter a PCollection on the first two characters (Python)

google-cloud-platform google-cloud-dataflow

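I'm hoping someone can help.

I'm fairly new to Apache Beam and Cloud Dataflow. What I want to do is:

Read the contents of a GZIP file

The file is fixed-width

Filter those contents into another PCollection based on the first two characters

What I have so far: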

-- ParDo function

class FilterHeader(beam.DoFn):
    def process(self, element):
        if element[:2] == '01':
            yield element
        else:
            # Originally blank (returned nothing); the string was added only
            # to see whether anything came back for non-matching rows
            return 'Header Not found'

My pipeline is as follows:

with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
    # Initial PCollection - full file
    rows = (
        p | 'Read daily Spot File' >> beam.io.ReadFromText(
                file_pattern='gs://<my bucket>/filename.gz',
                compression_type='gzip',
                coder=coders.BytesCoder(),
                skip_header_lines=0))
    # Header collection - filtered on first two characters = 01
    header_collection = (
        rows | 'Filter Record Type 01 to our HEADER COLLECTION' >> beam.ParDo(FilterHeader())
             | 'Output Header Rows' >> beam.io.WriteToText('gs://<destination bucket>/new_fileName.txt'))

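When I remove the filter, I can output all rows, so there is nothing wrong with the file or the initial PCollection. As soon as I add the filter, the rows I am looking for do not show up. And yes, the data is in the file, i.e. there is a row that starts with 01 as its first characters.

Is there something simple I am missing?

Any direction is much appreciated.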

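Hm, this should work.. I just reproduced it on my side. Maybe your file has a space in front of the 01? You could try using if '01' in element?

Yes, I suspect this is some kind of string mismatch. I suggest printing to the logs to confirm whether that is the case.

OK - thanks for the feedback. Much appreciated. I will try printing the first two characters to see what shows up in the logs. I did wonder whether the element being passed is itself a list, so I changed the if statement slightly to read str(element[0])[:2] - I then changed it to str(element[:2]) because it logged a type error saying it thought the element was of type int. Maybe because the file is read as bytes?

That solved it:

    row = str(element.decode('utf-8', 'ignore'))
    if row[:2] == '01':
        yield element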
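For anyone hitting the same thing, here is a minimal sketch of why the original comparison likely never matched and how the decode fixes it. It assumes Python 3 (which the int type error suggests) and that the elements arrive as bytes because of coder=coders.BytesCoder(). The helper names starts_with_record_type and filter_headers, and the sample line, are made up for illustration and are not part of Beam.

```python
def starts_with_record_type(element, record_type="01"):
    """Return True if a line (bytes or str) begins with record_type.

    Hypothetical helper: with coder=BytesCoder() the elements are bytes,
    so they are decoded to str before comparing.
    """
    if isinstance(element, bytes):
        element = element.decode("utf-8", "ignore")
    return element.startswith(record_type)


def filter_headers(elements):
    """Plain-Python stand-in for what FilterHeader.process yields per element."""
    for element in elements:
        if starts_with_record_type(element):
            yield element  # the original (undecoded) element is passed through


# Made-up sample line, as bytes, the way BytesCoder would deliver it.
line = b"01HEADER   20200101"

print(line[:2] == "01")               # False: bytes never compare equal to str
print(line[0])                        # 48: indexing bytes gives an int
print(starts_with_record_type(line))  # True once the line is decoded

# In a pipeline (not run here), the same predicate could be applied with, e.g.:
#   rows | 'Keep 01 rows' >> beam.Filter(starts_with_record_type)
```

Comparing against a bytes literal (element[:2] == b'01') should work as well without decoding. Dropping the coder argument so that ReadFromText yields str is another option, provided the file really is UTF-8 text.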