Python: reduce by matching lowercase keys in a Spark RDD


I have an RDD of (key, value) pairs, where each key is a string and each value is the number of occurrences of that string:

words.take(10)

Out[98]: [('The', 2767),
 ('Project', 83),
 ('the', 3),
 ('of', 14941),
 ('Leo', 4),
 ('is', 3245),
 ('use', 80),
 ('anyone', 191),
 ('Of', 25),
 ('at', 4235)]

I want to match keys on key.lower(), sum their values, and keep the original value of each uppercase/lowercase key.

Also, I want to filter out the keys that have no case-variant duplicates.

So, for the words.take(10) sample above, my output would be:


You can use groupby with collect_list and filter, as follows:

from pyspark.sql import functions as f

data = [
    ('The', 2767),
    ('Project', 83),
    ('the', 3),
    ('of', 14941),
    ('Leo', 4),
    ('is', 3245),
    ('use', 80),
    ('anyone', 191),
    ('Of', 25),
    ('at', 4235)
]

# Build a DataFrame with columns "word" and "count" from the pairs
df = spark.createDataFrame(data).toDF("word", "count")

# Group case-insensitively, collect the original (word, count) structs,
# sum the counts, and keep only words that appear in more than one casing
df.groupby(f.lower("word").alias("word")) \
  .agg(f.collect_list(f.struct("word", "count")).alias("list"), f.sum("count").alias("sum")) \
  .filter(f.size("list") > 1) \
  .select("list", "sum") \
  .show(truncate=False)
Output:

+-----------------------+-----+
|list                   |sum  |
+-----------------------+-----+
|[{The, 2767}, {the, 3}]|2770 |
|[{of, 14941}, {Of, 25}]|14966|
+-----------------------+-----+
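
If you want to stay in the RDD API that the question uses, the same idea also works with groupByKey. Here is a minimal sketch, assuming words is the (key, count) pair RDD from the question (the setup below just rebuilds it from the sample data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
words = spark.sparkContext.parallelize([
    ('The', 2767), ('Project', 83), ('the', 3), ('of', 14941), ('Leo', 4),
    ('is', 3245), ('use', 80), ('anyone', 191), ('Of', 25), ('at', 4235),
])

matched = (
    words.map(lambda kv: (kv[0].lower(), kv))   # re-key each pair by its lowercased word
         .groupByKey()
         .mapValues(list)                       # materialize the original (word, count) pairs
         .filter(lambda kv: len(kv[1]) > 1)     # drop words that occur in only one casing
         .map(lambda kv: (kv[1], sum(c for _, c in kv[1])))  # (original pairs, summed count)
)

matched.collect()
# e.g. [([('The', 2767), ('the', 3)], 2770), ([('of', 14941), ('Of', 25)], 14966)]

On a large vocabulary, reduceByKey or aggregateByKey would shuffle less data than groupByKey, but groupByKey keeps the sketch closest to the requirement of preserving every original (key, count) pair.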

What does "f" refer to in f.lower, f.collect_list, and so on?

It is the alias from the import at the top of the snippet: from pyspark.sql import functions as f