Get distinct rows from an RDD[type] in Scala Spark
Suppose I have an RDD of type RDD[Employee], with sample data as shown below:
FName,LName,Department,Salary
dubert,tomasz ,paramedic i/c,91080.00,
edwards,tim p,lieutenant,114846.00,
edwards,tim p,lieutenant,234846.00,
edwards,tim p,lieutenant,354846.00,
elkins,eric j,police,104628.00,
estrada,luis f,police officer,96060.00,
ewing,marie a,clerk,53076.00,
ewing,marie a,clerk,13076.00,
ewing,marie a,clerk,63076.00,
finn,sean p,firefighter,87006.00,
fitch,jordan m,law clerk,14.51
fitch,jordan m,law clerk,14.51
Expected output:
dubert,tomasz ,paramedic i/c,91080.00,
edwards,tim p,lieutenant,354846.00,
elkins,eric j,police,104628.00,
estrada,luis f,police officer,96060.00,
ewing,marie a,clerk,63076.00,
finn,sean p,firefighter,87006.00,
fitch,jordan m,law clerk,14.51
I want one row for each distinct FName.

I think you want to do something like this:
import org.apache.spark.sql.functions.first

df
  .groupBy('FName)
  .agg(
    first('LName),
    first('Department),
    first('Salary)
  )
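One caveat: first() returns an arbitrary value within each group, while the expected output above keeps the row with the highest Salary per FName (e.g. 354846.00 for edwards, 63076.00 for ewing). A common trick for "row with the max value per group" is to take the max of a struct whose first field is the ordering column. A minimal sketch, assuming df has the columns shown in the sample (the val names are illustrative):

import org.apache.spark.sql.functions.{max, struct}

// struct comparison is field-by-field, so putting Salary first
// makes max(...) pick the struct from the highest-paid row.
val result = df
  .groupBy("FName")
  .agg(max(struct("Salary", "LName", "Department")).as("top"))
  .select("FName", "top.LName", "top.Department", "top.Salary")

This stays a single aggregation, so Spark can still do map-side partial aggregation before the shuffle.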
Comment: Hey, don't you think groupBy is a very heavy operation? Grouping a batch means iterating over all the rows, which is a log(n) operation; the grouping alone takes 1-2 seconds.

Comment: @Pinnacle groupBy is a widely used operation and, in your case, the expected solution.

Comment: Yes @AlexandrosBiratsis, I understand, but I was just wondering whether any other approach would work without using groupBy, since in my case it adds overhead and processing time. Maybe this is the only way. Thanks for your reply.
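If you want to stay on the RDD[Employee] directly and avoid a DataFrame groupBy, reduceByKey combines values map-side before the shuffle, so only one record per key per partition crosses the network. A sketch under the assumption of a hypothetical Employee case class matching the sample columns:

// Hypothetical case class mirroring the sample data.
case class Employee(fName: String, lName: String, department: String, salary: Double)

// Keep the highest-salary record per fName in a single pass;
// reduceByKey aggregates locally on each partition first.
val distinctByName = rdd
  .map(e => (e.fName, e))
  .reduceByKey((a, b) => if (a.salary >= b.salary) a else b)
  .values

Note that DataFrame groupBy().agg(...) also performs partial aggregation, so for simple aggregates its shuffle cost is comparable; the RDD version mainly helps when you want to keep the typed Employee objects.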