Python list to dictionary for Dataflow


python json dictionary google-cloud-platform


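I am trying to convert a JSON file into a dictionary and apply key/value pairs so that I can use GroupByKey() to, essentially, deduplicate the key/value pairs.

This is the original content of the file: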

{"tax_pd": "200003", "ein": "720378282"}
{"tax_pd": "200012", "ein": "274027765"}
{"tax_pd": "200012", "ein": "042746989"}
{"tax_pd": "200012", "ein": "205993971"}

I have formatted it as follows:

(u'201208', u'010620100')
(u'201208', u'860785769')
(u'201208', u'371650138')
(u'201208', u'237253410')

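I want to turn these into key/value pairs so that I can apply GroupByKey in my Dataflow pipeline. I think I need to turn them into a dictionary first.

I am new to Python and Google Cloud applications, so any help would be great.

Edit: code snippet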

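with beam.Pipeline(options=pipeline_options) as p:
    (p
     | 'ReadInputText' >> beam.io.ReadFromText(known_args.input)
     | 'YieldWords' >> beam.ParDo(ExtractWordsFn())
     # | 'GroupByKey' >> beam.GroupByKey()
     | 'WriteInputText' >> beam.io.WriteToText(known_args.output))

class ExtractWordsFn(beam.DoFn):
    def process(self, element):
        # Collect every run of digits in the line, e.g. the tax_pd and ein values
        words = re.findall(r'[0-9]+', element)
        yield tuple(words)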

A quick pure-Python solution would be:

import json

with open('path/to/my/file.json','rb') as fh:
    lines = [json.loads(l) for l in fh.readlines()]

# [{'tax_pd': '200003', 'ein': '720378282'}, {'tax_pd': '200012', 'ein': '274027765'}, {'tax_pd': '200012', 'ein': '042746989'}, {'tax_pd': '200012', 'ein': '205993971'}]

Looking at your data, neither tax_pd nor ein gives you a unique key for a key:value mapping. Assuming collisions will occur, you could do the following:

myresults = {}

for line in lines:
    # I'm assuming we want to use tax_pd as the key, and ein as the value, but this can be extended to other keys

    # This will return None if the tax_pd is not already found
    if not myresults.get(line.get('tax_pd')):
        myresults[line.get('tax_pd')] = [line.get('ein')]
    else:
        myresults[line.get('tax_pd')] = list(set([line.get('ein'), *myresults[line.get('tax_pd')]]))

#results
#{'200003': ['720378282'], '200012': ['205993971', '042746989', '274027765']}

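This way you have unique keys, each with a list of its unique ein values. I'm not entirely sure whether that is what you're after. set automatically removes duplicates from the list, and wrapping it in list converts it back to a list.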

You can then look up values explicitly by tax_pd:

myresults.get('200012')
# ['205993971', '042746989', '274027765']
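As an alternative to the get/else branching above, the same grouping can be sketched with collections.defaultdict and a set per key. This is only an illustration: the helper name group_eins and the extra duplicate record are assumptions added here, not part of the original data.

```python
import json
from collections import defaultdict

# Sample records copied from the question; the last one is a deliberate
# duplicate added for illustration only.
raw_lines = [
    '{"tax_pd": "200003", "ein": "720378282"}',
    '{"tax_pd": "200012", "ein": "274027765"}',
    '{"tax_pd": "200012", "ein": "042746989"}',
    '{"tax_pd": "200012", "ein": "205993971"}',
    '{"tax_pd": "200012", "ein": "205993971"}',
]

def group_eins(lines):
    """Group the unique ein values under each tax_pd key."""
    grouped = defaultdict(set)
    for raw in lines:
        record = json.loads(raw)
        # Adding to a set silently ignores values that are already present
        grouped[record["tax_pd"]].add(record["ein"])
    # Convert each set to a sorted list so the result is deterministic
    return {key: sorted(values) for key, values in grouped.items()}

grouped = group_eins(raw_lines)
print(grouped)
```

Keeping a set per key avoids rebuilding the list on every collision, and sorting at the end makes the output order predictable, which a plain list(set(...)) does not guarantee.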

Edit: to read from Cloud Storage, the code snippet translates to something like this:

with gcs.open(filename) as fh:
    lines = fh.read().split('\n')

You can set up the gcs object by following their API documentation.
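One caveat with the Cloud Storage snippet: splitting the text gives a list of strings, not dictionaries, and a trailing newline leaves an empty string at the end that json.loads would reject. A minimal sketch of a parsing step follows, assuming the content arrives as a str or as UTF-8 bytes; the helper name parse_json_lines is hypothetical.

```python
import json

def parse_json_lines(text):
    """Parse newline-delimited JSON into a list of dicts, skipping blank lines."""
    # Assumption: the file handle may return bytes; decode them as UTF-8
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # e.g. the empty string left by a trailing newline
        records.append(json.loads(line))
    return records

sample = (
    '{"tax_pd": "200003", "ein": "720378282"}\n'
    '{"tax_pd": "200012", "ein": "274027765"}\n'
)
lines = parse_json_lines(sample)
print(lines)
```

The resulting list has the same shape as the lines list built in the local-file version, so the grouping loop above should work on it unchanged.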

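It looks like you already have it in a dictionary. You probably want to use json.load or something similar. How do you get 201208 from 200003? Can you specify the exact format you want?

I will add some code snippets to the post.

The numbers in the example snippets don't match because I didn't give matching values, sorry. Assume the numbers are the same.

One more question: how do I use this approach when the document lives in a Google Cloud bucket?

You can write it into a class, as you did with ExtractWords above. That appears to be part of the beam.Pipeline approach. With that in mind, you can yield the whole dictionary from the class/function, just as you did with yield tuple(words).

I added code for Google Cloud Storage connectivity; the setup for that object is in the docs. The reason I added .split('\n') is that read() returns a string representation of the whole document. Splitting on newlines gives you the list structure you want.

@jmoore255 Please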
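Following the suggestion in the comments to yield from the DoFn, here is a hedged sketch of the per-line logic as a plain generator. It parses each line as JSON instead of using a regex and yields a (tax_pd, ein) pair, which is the two-element shape GroupByKey expects. The function name extract_key_value is hypothetical, and the choice of tax_pd as the key is an assumption carried over from the answer.

```python
import json

def extract_key_value(element):
    """Yield one (tax_pd, ein) tuple per JSON line, skipping malformed input."""
    try:
        record = json.loads(element)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError; drop unparsable lines
        return
    # Only emit a pair when both fields are present
    if isinstance(record, dict) and "tax_pd" in record and "ein" in record:
        yield (record["tax_pd"], record["ein"])

pairs = list(extract_key_value('{"tax_pd": "200012", "ein": "274027765"}'))
print(pairs)
```

The same body could presumably go into the process method of ExtractWordsFn, or the function could be passed to beam.FlatMap in place of the ParDo step. Note that GroupByKey only groups the values under each key; removing duplicate ein values within a group would still need a separate step.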