
PySpark: RDD to JSON

I have a Hive query that returns data in the following format:

ip, category, score
1.2.3.4, X, 5
10.10.10.10, A, 2
1.2.3.4, Y, 2
12.12.12.12, G, 10
1.2.3.4, Z, 9
10.10.10.10, X, 3
In PySpark, I get this result via hive_context.sql(my_query).rdd

Each IP address can have multiple scores (hence multiple rows). I would like to get this data in a JSON/array format like the following:

[
    {
        "ip": "1.2.3.4",
        "scores": [
            {
                "category": "X",
                "score": 5
            },
            {
                "category": "Y",
                "score": 2
            },
            {
                "category": "Z",
                "score": 9
            }
        ]
    },
    {
        "ip": "10.10.10.10",
        "scores": [
            {
                "category": "A",
                "score": 2
            },
            {
                "category": "X",
                "score": 3
            }
        ]
    },
    {
        "ip": "12.12.12.12",
        "scores": [
            {
                "category": "G",
                "score": 10
            }
        ]
    }
]

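Note that the RDD is not necessarily sorted, and it can easily contain hundreds of millions of rows. I'm new to PySpark, so any pointers on how to do this efficiently would be helpful.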

groupBy ip, then transform the grouped RDD into the shape you need:

rdd.groupBy(lambda r: r.ip).map(
  lambda g: {
    'ip': g[0], 
    'scores': [{'category': x['category'], 'score': x['score']} for x in g[1]]}
).collect()

# [{'ip': '1.2.3.4', 'scores': [{'category': 'X', 'score': 5}, {'category': 'Y', 'score': 2}, {'category': 'Z', 'score': 9}]}, {'ip': '12.12.12.12', 'scores': [{'category': 'G', 'score': 10}]}, {'ip': '10.10.10.10', 'scores': [{'category': 'A', 'score': 2}, {'category': 'X', 'score': 3}]}]
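
The snippet above returns Python dicts rather than JSON text. As a rough sketch of the remaining step, the per-group transform can be pulled into a named function that serializes each group with json.dumps. The names to_json_line and group_by_ip are hypothetical, and a namedtuple stands in for pyspark.sql.Row so the example runs without Spark; group_by_ip only imitates what groupBy does locally.

```python
import json
from collections import defaultdict, namedtuple

# Assumption: a namedtuple stands in for pyspark.sql.Row (same attribute access).
Row = namedtuple("Row", ["ip", "category", "score"])

rows = [
    Row("1.2.3.4", "X", 5),
    Row("10.10.10.10", "A", 2),
    Row("1.2.3.4", "Y", 2),
    Row("12.12.12.12", "G", 10),
    Row("1.2.3.4", "Z", 9),
    Row("10.10.10.10", "X", 3),
]


def to_json_line(group):
    """Turn one (ip, rows) pair, as produced by groupBy, into a JSON string."""
    ip, members = group
    record = {
        "ip": ip,
        "scores": [{"category": r.category, "score": r.score} for r in members],
    }
    return json.dumps(record)


def group_by_ip(records):
    """Local imitation of rdd.groupBy(lambda r: r.ip); input need not be sorted."""
    groups = defaultdict(list)
    for r in records:
        groups[r.ip].append(r)
    return list(groups.items())


# One JSON document per IP address.
json_lines = [to_json_line(g) for g in group_by_ip(rows)]
for line in json_lines:
    print(line)
```

With a real RDD the same function could presumably be passed straight to map, as in rdd.groupBy(lambda r: r.ip).map(to_json_line). For hundreds of millions of rows, writing the result out with saveAsTextFile is likely a better fit than collect(), which pulls everything onto the driver.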