Apache Spark: grouping an RDD based on a value in PySpark


apache-spark pyspark


I have created an RDD and print the result with the following:

finalRDD = replacetimestampRDD.map(lambda x: (x[1], x[0:]))
print("Partitions structure: {}".format(finalRDD.glom().collect()))

Output (sample):

I am trying to group the results by key (by key I mean 'a', 'b', 'c'). Expected output:

Partitions structure: [[('a', [['2020-05-22 15:17:10', 'John', '9535175'],['2020-05-22 15:17:10', 'Paul', '9615224']]), 
                        ('b', ['2020-05-22 15:17:10', 'Nick', '7383554',]),
                        ('c', ['2020-05-22 15:17:10', 'George', '8915433'])
                          ]]

I tried using

results = finalRDD.groupByKey().collect()

but it does not seem to work.

Can anyone help me?

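You can call mapValues() after groupByKey() to turn each group into a list of values. On its own, groupByKey() returns every group as a ResultIterable object rather than a list, which is most likely why the collected result did not print the values you expected: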

rdd.groupByKey().mapValues(list).collect()

Output:

[('a',
  [['2020-05-22 15:17:10', 'John', '9535175'],
   ['2020-05-22 15:17:10', 'Paul', '9615224']]),
 ('b', [['2020-05-22 15:17:10', 'Nick', '7383554']]),
 ('c', [['2020-05-22 15:17:10', 'George', '8915433']])]
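To make the grouping semantics concrete without a Spark cluster, here is a minimal plain-Python sketch of what groupByKey().mapValues(list) computes. It is only an illustration: the `pairs` list and the `group_by_key` helper are hypothetical names, the records are assumed to be shaped like the (key, value) pairs in finalRDD, and the final sort is added purely to make the printed order deterministic (Spark itself does not guarantee key order).

```python
from collections import defaultdict

# Hypothetical sample records, shaped like the (key, value) pairs in finalRDD.
pairs = [
    ("a", ["2020-05-22 15:17:10", "John", "9535175"]),
    ("b", ["2020-05-22 15:17:10", "Nick", "7383554"]),
    ("a", ["2020-05-22 15:17:10", "Paul", "9615224"]),
    ("c", ["2020-05-22 15:17:10", "George", "8915433"]),
]


def group_by_key(records):
    """Collect all values that share a key into one list per key."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    # Sorted only for a deterministic result; Spark does not guarantee order.
    return sorted(groups.items())


grouped = group_by_key(pairs)
for key, values in grouped:
    print(key, values)
```

Every key ends up paired with a list of lists, even when the group holds a single record, which matches the shape of the output shown above.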

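Is the structure of your RDD a list of lists? Does the sample output you gave show a single output element, or should it show a list containing 4 elements?

Yes, the structure is as shown in the list. The sample output must be a list containing 4 elements.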
