Get distinct rows from RDD[type] in Scala Spark

Tags: scala, apache-spark, apache-spark-sql

Suppose I have an RDD of type RDD[employee], with sample data as follows:

FName,LName,Department,Salary
dubert,tomasz ,paramedic i/c,91080.00,
edwards,tim p,lieutenant,114846.00,
edwards,tim p,lieutenant,234846.00,
edwards,tim p,lieutenant,354846.00,
elkins,eric j,police,104628.00,
estrada,luis f,police officer,96060.00,
ewing,marie a,clerk,53076.00,
ewing,marie a,clerk,13076.00,
ewing,marie a,clerk,63076.00,
finn,sean p,firefighter,87006.00,
fitch,jordan m,law clerk,14.51
fitch,jordan m,law clerk,14.51
Expected output:

dubert,tomasz ,paramedic i/c,91080.00,
edwards,tim p,lieutenant,354846.00,
elkins,eric j,police,104628.00,
estrada,luis f,police officer,96060.00,
ewing,marie a,clerk,63076.00,
finn,sean p,firefighter,87006.00,
fitch,jordan m,law clerk,14.51

I want one row for each distinct FName.

I think you want to do this:

import org.apache.spark.sql.functions.first

df
  .groupBy("FName")  // the sample header spells the column "FName"
  .agg(
    first("LName"),
    first("Department"),
    first("Salary")
  )
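One caveat, assuming a DataFrame `df` with the columns from the sample: `first` over an unordered group is non-deterministic, while the expected output above keeps the highest Salary per name (354846.00 for edwards, 63076.00 for ewing). A `max` aggregate reproduces that deterministically; this is a sketch, not the original answerer's code:

```scala
import org.apache.spark.sql.functions.max

// Deterministic variant: keep the highest Salary per person, which
// matches every row of the expected output regardless of input order.
df
  .groupBy("FName", "LName", "Department")
  .agg(max("Salary").as("Salary"))
```

This requires a running SparkSession and a `df` loaded from the CSV above.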

Hey, don't you think groupBy is a very heavy operation? Grouping a batch means iterating over all the rows, which is an O(n log n) operation. The grouping alone takes 1-2 seconds.

@Pinnacle groupBy is a widely used operation and, in your case, the expected solution.

Yes @AlexandrosBiratsis, I understand, but I was just wondering whether there is any other workable approach without groupBy, since in my case it adds overhead and processing time. Maybe this is the only way. Thanks for the reply.
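On the follow-up question of avoiding groupBy: at the RDD level the usual alternative is `rdd.keyBy(_.fName).reduceByKey(merge)`, which combines rows pairwise per key (including map-side, before the shuffle) instead of materializing whole groups. Below is a minimal plain-Scala sketch of that merge semantics; the `Employee` case class and the `keepMax` merge function are hypothetical names, not from the original post:

```scala
// Hypothetical record mirroring the sample CSV columns.
case class Employee(fName: String, lName: String, department: String, salary: Double)

// Pairwise merge: keep the row with the higher salary, matching the
// expected output (e.g. 354846.00 for "edwards", 63076.00 for "ewing").
def keepMax(a: Employee, b: Employee): Employee =
  if (a.salary >= b.salary) a else b

// Local stand-in for rdd.keyBy(_.fName).reduceByKey(keepMax): fold the
// rows into a Map keyed by FName, merging collisions pairwise as they arrive.
def distinctByFName(rows: Seq[Employee]): Map[String, Employee] =
  rows.foldLeft(Map.empty[String, Employee]) { (acc, e) =>
    acc.updated(e.fName, acc.get(e.fName).fold(e)(keepMax(_, e)))
  }
```

Because the merge is associative and commutative, `reduceByKey` can apply it before shuffling, typically moving less data than grouping full rows; that said, Catalyst often plans a DataFrame `groupBy` + `agg` just as efficiently, so the overhead the commenter worries about may not materialize in practice.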