Python: Reducing a PCollection after GroupByKey (Google Cloud Dataflow, Apache Beam)

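I am trying to generate a simple customer summary from transaction data. For example, for a given target transaction type, how many transactions occurred and what was the total amount?

Example of the raw input: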

custid  desc        amount
111     coffee      3.50
111     grocery     23.00
333     coffee      4.00
222     gas station 32.00
222     gas station 55.50
333     coffee      3.00

Example of the desired output:

custid nbr_coffee amt_coffee nbr_gas_station amt_gas_station
111    1          3.50       0               0.00
222    0          0          2               87.50
333    2          7.00       0               0

My target runner will be Dataflow (though I am currently testing with the DirectRunner).

Here is a snippet of my code:

def categorize_coffee(transaction):

    if transaction['trx_desc'] == 'coffee':
        transaction['coffee'] = True
    else:
        transaction['coffee'] = False

    return transaction

def categorize_gas_station(transaction):

    if transaction['trx_desc'] == 'gas station':
        transaction['gas_station'] = True
    else:
        transaction['gas_station'] = False

    return transaction

def summarize_coffee(grouping):

    key, values = grouping
    values = list(values)

    nbr = 0
    amt = 0

    for d in values:
        if d['coffee'] == True:
            nbr+=1
            amt+=d['amount']

    ret_val = {}
    ret_val['cust'] = d['cust']
    ret_val['nbr_coffee'] = nbr
    ret_val['amt_coffee'] = amt

    return ret_val

def summarize_gas_station(grouping):

    key, values = grouping
    values = list(values)

    nbr = 0
    amt = 0

    for d in values:
        if d['gas_station'] == True:
            nbr += 1
            amt += d['amount']

    ret_val = {}
    ret_val['cust'] = d['cust']
    ret_val['nbr_gas_station'] = nbr
    ret_val['amt_gas_station'] = amt

    return ret_val

def create_dict(row):

    vars = row.split(',')
    return {'cust': vars[0], 'trx_desc': str(vars[1]), 'amount': float(vars[2])}

with beam.Pipeline(options=pipeline_options) as p:

    categorized_trx = (
        p | 'get data' >> beam.io.ReadFromText('./test.csv')
        | beam.Map(create_dict)
        | beam.Map(categorize_coffee)
        | beam.Map(categorize_gas_station)
        | beam.Map(lambda trx: (trx['cust'], trx))
        | beam.GroupByKey()
    )

    coffee_trx = (categorized_trx | beam.Map(summarize_coffee))

    gas_station_trx = (categorized_trx | beam.Map(summarize_gas_station))

    result = (coffee_trx, gas_station_trx) | beam.Flatten()

The actual result at the moment is:

{'amt_coffee': 7.0, 'cust': u'333', 'nbr_coffee': 2}
{'amt_coffee': 0, 'cust': u'222', 'nbr_coffee': 0}
{'amt_coffee': 3.5, 'cust': u'111', 'nbr_coffee': 1}
{'nbr_gas_station': 0, 'cust': u'333', 'amt_gas_station': 0}
{'nbr_gas_station': 2, 'cust': u'222', 'amt_gas_station': 87.5}
{'nbr_gas_station': 0, 'cust': u'111', 'amt_gas_station': 0}

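The results are not flattened or joined the way I expected. I am new to Beam and not sure I understand how to approach this correctly, so I would appreciate some insight.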

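Beam's combine transforms should let you combine the elements of a PCollection. For your use case, it looks like you can use the per-key variant to combine the keyed elements of a PCollection by key. As the combine function, you can supply either a plain function or a combine function implementation.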
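As a minimal sketch of this suggestion, assuming the transform meant is `beam.CombinePerKey` given a plain function as the combine function: each transaction is first turned into a one-row summary with every output column present, and the summaries are then added up per customer. The names `TARGETS`, `to_partial_summary`, `add_summaries`, and `summarize_per_customer` are illustrative and do not come from the original post. Because a plain combine function may also be applied to partial results, its output is kept in the same shape as its inputs.

```python
# Assumed list of transaction types to report on (not from the original post).
TARGETS = ('coffee', 'gas station')


def to_partial_summary(trx):
    """Turn one transaction dict into (cust, summary) with all output columns."""
    summary = {}
    for target in TARGETS:
        col = target.replace(' ', '_')
        hit = trx['trx_desc'] == target
        summary['nbr_' + col] = 1 if hit else 0
        summary['amt_' + col] = trx['amount'] if hit else 0.0
    return trx['cust'], summary


def add_summaries(summaries):
    """Add summaries column by column; the result has the same shape as an input."""
    total = {}
    for summary in summaries:
        for col, value in summary.items():
            total[col] = total.get(col, 0) + value
    return total


def summarize_per_customer(transactions):
    """Wire the helpers into a pipeline; `transactions` is a PCollection of dicts."""
    # Imported here so the two helpers above stay plain Python and are easy to unit-test.
    import apache_beam as beam
    return (
        transactions
        | 'to summary' >> beam.Map(to_partial_summary)
        | 'sum per cust' >> beam.CombinePerKey(add_summaries)
        | 'to row' >> beam.Map(lambda kv: dict(kv[1], cust=kv[0]))
    )
```

In the question's pipeline this would be applied to the output of `beam.Map(create_dict)`, replacing the categorize, `GroupByKey`, and `Flatten` steps. Since one combine produces every column, there is nothing left to join afterwards.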

This should work:

...

def summarize_coffee(grouping):

    ...

    return (d['cust'], ret_val)


def summarize_gas_station(grouping):

    ...

    return (d['cust'], ret_val)

...

def processJoin(row):
    (customer, data) = row
    coffee_trx=data['coffee_trx']
    gas_station_trx=data['gas_station_trx']
    return (customer, coffee_trx, gas_station_trx)

# CoGroupByKey tags must be strings so that processJoin can look them up by name.
result = ({'coffee_trx': coffee_trx, 'gas_station_trx': gas_station_trx}
         | 'Group' >> beam.CoGroupByKey()    
         | 'Reshape' >> beam.Map(processJoin)
         | 'Unwind' >> beam.FlatMap(lambda x: x)
         )
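The `Reshape` step above returns the customer together with the two grouped lists. To get one flat row per customer instead, a merge function along the following lines could be used in place of `processJoin`, in which case the `Unwind` step would no longer be needed. This is only a sketch: `merge_join` is a hypothetical helper, and it assumes each element coming out of `CoGroupByKey` has the shape `(key, {tag: iterable_of_values})` with the string tags `'coffee_trx'` and `'gas_station_trx'`.

```python
def merge_join(row):
    """Merge the per-tag summary dicts for one customer into a single dict."""
    cust, grouped = row
    merged = {'cust': cust}
    for tag in ('coffee_trx', 'gas_station_trx'):
        # Each tag holds an iterable of summary dicts (normally exactly one here).
        for summary in grouped[tag]:
            merged.update(summary)
    return merged


# Example element, shaped like the assumed CoGroupByKey output for customer 222.
example = ('222', {
    'coffee_trx': [{'cust': '222', 'nbr_coffee': 0, 'amt_coffee': 0}],
    'gas_station_trx': [{'cust': '222', 'nbr_gas_station': 2, 'amt_gas_station': 87.5}],
})
joined = merge_join(example)
```

In the pipeline it would be applied as `beam.Map(merge_join)` directly after the `CoGroupByKey` step.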

It is a different question, but the answer on combiners here should help.

@RezaRokni That question showed me how to combine values when they are keyed like ('nbr_coffee', 1). In my case, though, I have ('333', ('nbr_coffee', 1)). How do I combine per key for each customer? Also, I will have multiple combiners, because sometimes I need a count and sometimes an amount. Doesn't that bring me back to the problem of merging two collections?

Flatten merges all the results into a single PCollection. To join them, you can use CoGroupByKey with cust as the key.

Could you provide a concrete example using combine to help me? Combiners were recommended in the comments above, but I can't see how to use them to get my desired output.
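For values shaped like ('333', ('nbr_coffee', 1)), one option is to move the metric name into the key, so that a single `beam.CombinePerKey(sum)` covers both counts and amounts, and then regroup by customer. The sketch below is illustrative only: the helper names are made up for this example, and a customer only gets columns for transaction types that actually occur, so filtering to target types and zero-filling would still be needed to match the desired output exactly.

```python
def to_metric_pairs(trx):
    """Yield ((cust, metric), value) pairs: one count and one amount per transaction."""
    col = trx['trx_desc'].replace(' ', '_')
    yield (trx['cust'], 'nbr_' + col), 1
    yield (trx['cust'], 'amt_' + col), trx['amount']


def rekey_by_customer(kv):
    """Move the metric name out of the key: ((cust, metric), total) -> (cust, (metric, total))."""
    (cust, metric), total = kv
    return cust, (metric, total)


def to_row(kv):
    """Build one dict per customer from its (metric, total) pairs."""
    cust, metrics = kv
    return dict(metrics, cust=cust)


def summarize(transactions):
    """Wire the helpers together; `transactions` is a PCollection of dicts."""
    # Imported here so the helpers above can be tested as plain Python.
    import apache_beam as beam
    return (
        transactions
        | 'to metrics' >> beam.FlatMap(to_metric_pairs)
        | 'sum per metric' >> beam.CombinePerKey(sum)
        | 'rekey' >> beam.Map(rekey_by_customer)
        | 'group per cust' >> beam.GroupByKey()
        | 'to row' >> beam.Map(to_row)
    )
```

The composite key keeps counts and amounts in one collection, so there are no separate collections to merge afterwards.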