How to sum by key and take the top 10 in PySpark

I have a CSV file containing two fields, a key and a value:

```
{1Y4dZ123eAMGooBmVzBLUWEZ2JfCCUY91},8.530366
{1YdZ123433MGooBmVzBLUWEZ1234CUY91},8.530366
{1YdZ2344AMGooBmVzBLUWE123JfCCUY91},8.530366
{1YdECDNthiMGooBmVzBLUWEZ2JfCCUY91},8.530366
{1YdZDNHqeAMGooBmVzBLUWEZ2JfCCUY91},8.530366
{1YdZDNHqeAMGooBDJTdBLUWEZ2JfCCUY91},8.530366
{1YdZDNHqeAMGooBmVzBLUWEZ2JfCCUY91},8.530366
{1YdZ123qeAMGooBmVzBLUWEZ2JfCCUY91},8.530366
{1YdZDNHqeAMGooBmVzBLUWEZ2JfCCUY91},8.530366
{1YdZDNHqeAMGooBm123LUWEZ2JfCCUY91},8.530366
{17RJgv5ujkFerSd48Akdd2GneUAW47nphQ},20.0
{17RJgv5ujkFerSd48Akdd2GneUAW47nphQ},20.0
{17RJgv5ujkFerSd48Akdd2GneUAW47nphQ},20.0
{13uZ6tSr5oh1ui9Hd1tEqJKo2AHhJ6JdFS},0.03895804
```
What I want to do is sum the second column grouped by the first column, and then output the 10 keys with the highest totals.

Below is the code I tried, but it fails with a "tuple index out of range" error:

```
import pyspark
import re
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.session import SparkSession

sc = pyspark.SparkContext()
spark = SparkSession(sc)

voutFile = sc.textFile("input/voutfiltered.csv")

# Split each line into a (key, value) pair.
features = voutFile.map(lambda l:
    (l.split(',')[0], float(l.split(',')[1])))

top10 = features.takeOrdered(10, key=lambda x: -x[2])
for record in top10:
    print("{}: {};{}".format(record[0], record[1], record[2]))
```




Is there a specific reason you are not using the DataFrame API? It is far more flexible, convenient, and faster than the RDD API:

```
import pyspark.sql.functions as f

df = spark.read.format("csv").option("header", "true").load("/path/to/your/file.csv")

(df.groupBy(f.col("key_col"))
   .agg(f.sum(f.col("value_col")).alias("sum_value_col"))
   .sort(f.col("sum_value_col").desc())
   .limit(10)
   .show())
```
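
One caveat, assuming your file looks like the sample above: it has no header row, so reading it with `header` set to `true` would swallow the first data line, and the value column would arrive as a string. A minimal sketch that names the columns explicitly and lets Spark infer numeric types (`key_col` and `value_col` are placeholder names, not from your file):

```
import pyspark.sql.functions as f

# No header row in the sample data, so assign column names ourselves;
# inferSchema parses the second column as a double instead of a string.
df = (spark.read.format("csv")
      .option("inferSchema", "true")
      .load("/path/to/your/file.csv")   # placeholder path
      .toDF("key_col", "value_col"))

(df.groupBy("key_col")
   .agg(f.sum("value_col").alias("sum_value_col"))
   .sort(f.col("sum_value_col").desc())
   .limit(10)
   .show(truncate=False))
```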

Your RDD elements are two-item tuples, so you most likely need to change x[2] to x[1]:

```
top10 = features.takeOrdered(10, key=lambda x: -x[1])
```

and remove record[2] from the print statement.
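
Note that this index fix only removes the crash; takeOrdered on `features` still ranks individual rows rather than per-key sums. To actually sum by key on the RDD side, a reduceByKey step would aggregate first (a minimal sketch under the same two-tuple assumption):

```
# Sum the values for each key, then take the 10 largest totals.
totals = features.reduceByKey(lambda a, b: a + b)
top10 = totals.takeOrdered(10, key=lambda x: -x[1])
for key, total in top10:
    print("{}: {}".format(key, total))
```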