Python (PySpark) nested list with reduceByKey: using Python list append to create a nested list

python list pyspark

I have an input RDD in the following format:

[('2002', ['cougar', 1]),
('2002', ['the', 10]),
('2002', ['network', 4]),
('2002', ['is', 1]),
('2002', ['database', 13])]

'2002' is the key. So I have key-value pairs of the form:

 ('year', ['word', count])

count is an integer, and I want to use reduceByKey to get the following result:

[('2002', [['cougar', 1], ['the', 10], ['network', 4], ['is', 1], ['database', 13]])]

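I have struggled a lot to get a nested list like the one above. The main problem is how to build the nested list. Say I have three lists a, b and c: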

a = ['cougar', 1]
b = ['the', 10]
c = ['network', 4]

a.append(b)

returns a as

 ['cougar', 1, ['the', 10]]

whereas collecting a and b (as originally defined) into a new list x gives x as

  [['cougar', 1], ['the', 10]]

However, if I then do

  c.append(x)

c becomes

  ['network', 4, [['cougar', 1], ['the', 10]]]

None of the operations above gives the desired result.

I want to get
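The difference comes down to `append` adding its argument as one element of the list it is called on, while the target is a new outer list whose elements are the pairs. A minimal plain-Python sketch, with no Spark needed (the names `nested`, `nested2`, `merged` and `wrong` are illustrative only):

```python
a = ['cougar', 1]
b = ['the', 10]
c = ['network', 4]

# Build a new outer list whose elements are the pairs themselves.
nested = [a, b, c]
print(nested)

# Growing an existing list of pairs: append adds one pair as a single element.
nested2 = [a, b]
nested2.append(c)

# Concatenation with + merges two lists of pairs into one.
merged = [a] + [b, c]

# By contrast, appending to a pair nests the second pair inside the first.
wrong = list(a)   # copy, so that a itself is left untouched
wrong.append(b)
print(wrong)
```

So a pair has to be wrapped in an outer list before other pairs are appended to it or concatenated with it.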

   [('2002', [[word1, c1], [word2, c2], [word3, c3], ...]),
   ('2003', [[w1, count1], [w2, count2], [w3, count3], ...])]

i.e. the nested list should be:

  [a, b, c]

where a, b and c are themselves two-element lists.

I hope the question is clear. Any suggestions?

I came up with a solution:

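def wagg(a, b):
    # Each argument is either one [word, count] pair or an already merged
    # list of pairs; a pair starts with a string, a merged list with a list.
    a_merged = isinstance(a[0], list)
    b_merged = isinstance(b[0], list)
    if a_merged and b_merged:
        return a + b      # two merged lists: concatenate them
    if a_merged:
        return a + [b]    # merged list followed by one pair
    if b_merged:
        return [a] + b    # one pair followed by a merged list
    return [a, b]         # two single pairs: start a new nested list


rdd2 = rdd1.reduceByKey(wagg)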
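One caveat with a merge function of this shape: `reduceByKey` only calls the function when a key has at least two values, so a key with a single pair would stay a flat `['word', count]`. A common way to avoid both that and the type checks, sketched here under the assumption that every value is a `[word, count]` pair, is to wrap each value in a one-element list first and then concatenate. The sketch simulates this locally with `functools.reduce`; `wrap`, `concat` and `reduce_by_key_local` are hypothetical helper names, and the PySpark line in the final comment is the form it is meant to mirror.

```python
from functools import reduce

pairs = [('2002', ['cougar', 1]), ('2002', ['the', 10]),
         ('2002', ['network', 4]), ('2003', ['is', 1])]

def wrap(value):
    """Turn one [word, count] pair into a one-element list of pairs."""
    return [value]

def concat(a, b):
    """Merge two lists of pairs without mutating either input."""
    return a + b

def reduce_by_key_local(records, func):
    """Local stand-in for reduceByKey: fold the values of each key with func."""
    grouped = {}
    for key, value in records:
        grouped.setdefault(key, []).append(value)
    return [(key, reduce(func, values)) for key, values in grouped.items()]

wrapped = [(key, wrap(value)) for key, value in pairs]
result = reduce_by_key_local(wrapped, concat)
print(result)

# Intended PySpark equivalent (assuming rdd1 holds the same records):
# rdd2 = rdd1.mapValues(lambda v: [v]).reduceByKey(lambda a, b: a + b)
```

Because every value is already a list of pairs before the reduction starts, a key with only one record still ends up as a nested list.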

Does anyone have a better solution?

This problem does not need reduceByKey.

Define the RDD:

rdd = sc.parallelize([('2002', ['cougar', 1]), ('2002', ['the', 10]), ('2002', ['network', 4]), ('2002', ['is', 1]), ('2002', ['database', 13])])

Inspect the RDD contents with `rdd.collect()`:

[('2002', ['cougar', 1]), ('2002', ['the', 10]), ('2002', ['network', 4]), ('2002', ['is', 1]), ('2002', ['database', 13])]

Apply groupByKey and map the grouped values to a list, as in:

rdd_nested = rdd.groupByKey().mapValues(list)

Inspect the grouped values with `rdd_nested.collect()`:

[('2002', [['cougar', 1], ['the', 10], ['network', 4], ['is', 1], ['database', 13]])]
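To check the expected shape without a Spark context, the grouping can be mimicked in plain Python. This is only a local sketch of what `groupByKey().mapValues(list)` produces; `group_by_key_local` is a hypothetical helper, not a PySpark function. Note that Spark itself does not guarantee the order of the values within a group, whereas this local version keeps input order.

```python
from collections import defaultdict

records = [('2002', ['cougar', 1]), ('2002', ['the', 10]),
           ('2002', ['network', 4]), ('2002', ['is', 1]),
           ('2002', ['database', 13])]

def group_by_key_local(pairs):
    """Hypothetical local helper: collect all values of each key into a list."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return list(groups.items())

grouped = group_by_key_local(records)
print(grouped)
```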

Why do you have to use reduceByKey instead of simply rdd_nested = rdd.groupByKey()?

It works great.

@Yu Xiang, great! I have posted an answer; if you think it is correct, you can mark it as accepted.
