Dataframe: How to get the value_counts for a Spark row?
I have a Spark DataFrame with 3 columns that store 3 different predictions. I want to know the count of each output value so that I can pick the value that occurs most often as the final output.

In pandas I can do this easily by calling a lambda function on each row to get the value_counts, as shown below. Here I have converted the Spark df to a pandas df, but I need to be able to do something similar directly on the Spark df.
from pyspark.sql import Row

# Assumes an active SparkSession named `spark` (as in the pyspark shell)
r = [Row(run_1=1, run_2=2, run_3=1, name='test run', id=1)]
df1 = spark.createDataFrame(r)
df1.show()

# Convert to pandas and count the values in the first row
df2 = df1.toPandas()
r = df2.iloc[0]
val_counts = r[['run_1', 'run_2', 'run_3']].value_counts()
print(val_counts)
top_val = val_counts.index[0]
top_val_cnt = val_counts.values[0]
print('Majority output = %s, occurred %s out of 3 times' % (top_val, top_val_cnt))

The output tells me that the value 1 occurs most often, twice in this case:

+---+--------+-----+-----+-----+
| id|    name|run_1|run_2|run_3|
+---+--------+-----+-----+-----+
|  1|test run|    1|    2|    1|
+---+--------+-----+-----+-----+

1    2
2    1
Name: 0, dtype: int64
Majority output = 1, occurred 2 out of 3 times
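I am trying to write a UDF that takes each df1 row and returns top_val and top_val_cnt. Is there a way to do this with a Spark df?

The following is Scala, but the Python code should be similar, so maybe it will help you: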
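import org.apache.spark.sql.functions.array
import spark.implicits._  // for toDF, the 'col symbol syntax and the tuple encoder

val df1 = Seq((1, 1, 1, 2), (1, 2, 3, 3), (2, 2, 2, 2)).toDF()
df1.show()
// Pack all columns into one array column, then count how often each value occurs per row
df1.select(array('*)).map(s => {
  val list = s.getList(0)
  (list.toString(), list.toArray.groupBy(i => i).mapValues(_.size).toList.toString())
}).show(false)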
Output:

+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
|  1|  1|  1|  2|
|  1|  2|  3|  3|
|  2|  2|  2|  2|
+---+---+---+---+

+------------+-------------------------+
|_1          |_2                       |
+------------+-------------------------+
|[1, 1, 1, 2]|List((2,1), (1,3))       |
|[1, 2, 3, 3]|List((2,1), (1,1), (3,2))|
|[2, 2, 2, 2]|List((2,4))              |
+------------+-------------------------+
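
Since the question is about PySpark, a rough Python counterpart of the Scala snippet above might look like the sketch below. It is only an illustration under stated assumptions: the names `row_value_counts` and `with_value_counts` are made up here, and the run columns are assumed to hold integers (LongType, which is what `createDataFrame` infers from Python ints).

```python
from collections import Counter


def row_value_counts(*values):
    """Return {value: number of occurrences} for the values of one row."""
    return dict(Counter(values))


def with_value_counts(df, cols):
    """Add a map column 'value_counts' built from the given column names.

    Hypothetical helper: assumes the columns contain integers.
    """
    # Imported lazily so row_value_counts can be tried without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType, LongType, MapType

    counts_udf = udf(row_value_counts, MapType(LongType(), IntegerType()))
    return df.withColumn("value_counts", counts_udf(*cols))


# The counting logic itself is plain Python:
print(row_value_counts(1, 1, 1, 2))  # {1: 3, 2: 1}
```

With a DataFrame like the one in the question, the call would be `with_value_counts(df1, ['run_1', 'run_2', 'run_3'])`.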
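Let's create a test DataFrame similar to yours:

# Named `data` rather than `list` to avoid shadowing the built-in
data = [(1,'test run',1,2,1),(2,'test run',3,2,3),(3,'test run',4,4,4)]
df = spark.createDataFrame(data, ['id', 'name','run_1','run_2','run_3'])
# Split off the run columns, then pick the value that occurs most often among them
newdf = df.rdd.map(lambda x: (x[0], x[1], x[2:])) \
    .map(lambda x: (x[0], x[1], x[2][0], x[2][1], x[2][2], [max(set(x[2]), key=x[2].count)])) \
    .toDF(['id', 'test', 'run_1', 'run_2', 'run_3', 'most_frequent'])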
>>> newdf.show()
+---+--------+-----+-----+-----+-------------+
| id|    test|run_1|run_2|run_3|most_frequent|
+---+--------+-----+-----+-----+-------------+
|  1|test run|    1|    2|    1|          [1]|
|  2|test run|    3|    2|    3|          [3]|
|  3|test run|    4|    4|    4|          [4]|
+---+--------+-----+-----+-----+-------------+
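
The `max(set(values), key=values.count)` idiom used in the lambda above can be tried in plain Python, without Spark. The sketch below wraps it in a helper whose name, `most_frequent_value`, is made up for illustration. It also shows that the idiom always returns one of the inputs, even when no value repeats.

```python
def most_frequent_value(values):
    """Return a value with the highest number of occurrences in `values`."""
    values = tuple(values)
    return max(set(values), key=values.count)


print(most_frequent_value((1, 2, 1)))  # 1
print(most_frequent_value((3, 2, 3)))  # 3
print(most_frequent_value((4, 4, 4)))  # 4

# With no repeated value the idiom still returns one of the inputs,
# so "no majority" cannot be told apart from a real majority.
print(most_frequent_value((1, 2, 3)) in (1, 2, 3))  # True
```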
Alternatively, you may need to handle the case where every item in the list is different, i.e. return null:
from pyspark.sql.functions import udf

# Named `data` rather than `list` to avoid shadowing the built-in
data = [(1,'test run',1,2,1),(2,'test run',3,2,3),(3,'test run',4,4,4),(4,'test run',1,2,3)]
df = spark.createDataFrame(data, ['id', 'name','run_1','run_2','run_3'])

# No returnType is given, so the UDF produces a string column
@udf
def most_frequent(*mylist):
    counter = 1   # a value must occur more than once to qualify
    num = None    # stays None when every value is different
    for i in mylist:
        curr_frequency = mylist.count(i)
        if curr_frequency > counter:
            counter = curr_frequency
            num = i
    return num
The counter is initialized to 1, and a value is returned only if its count is greater than 1.
df.withColumn('most_frequent', most_frequent('run_1', 'run_2', 'run_3')).show()
+---+--------+-----+-----+-----+-------------+
| id|    name|run_1|run_2|run_3|most_frequent|
+---+--------+-----+-----+-----+-------------+
|  1|test run|    1|    2|    1|            1|
|  2|test run|    3|    2|    3|            3|
|  3|test run|    4|    4|    4|            4|
|  4|test run|    1|    2|    3|         null|
+---+--------+-----+-----+-----+-------------+
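
The question asks for both top_val and top_val_cnt, while the UDF above returns only the value. One possible way to get both is a UDF that returns a struct, sketched below. The helper names `top_value_and_count` and `with_majority` are hypothetical, the run columns are assumed to hold integers, and returning a null top_val when nothing repeats simply follows the convention of the answer above.

```python
from collections import Counter


def top_value_and_count(*values):
    """Return (most frequent value, its count); the value is None if nothing repeats."""
    value, count = Counter(values).most_common(1)[0]
    if count == 1:
        return (None, count)
    return (value, count)


def with_majority(df, cols):
    """Add 'top_val' and 'top_val_cnt' columns computed from the given column names.

    Hypothetical helper: assumes the columns contain integers.
    """
    # Imported lazily so top_value_and_count can be tried without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType, LongType, StructField, StructType

    schema = StructType([
        StructField("top_val", LongType()),
        StructField("top_val_cnt", IntegerType()),
    ])
    vote = udf(top_value_and_count, schema)
    # Compute the struct once, then expand its two fields into ordinary columns.
    return (df.withColumn("vote", vote(*cols))
              .select("*", "vote.top_val", "vote.top_val_cnt")
              .drop("vote"))


print(top_value_and_count(1, 2, 1))  # (1, 2)
print(top_value_and_count(1, 2, 3))  # (None, 1)
```

On the test DataFrame above this would be called as `with_majority(df, ['run_1', 'run_2', 'run_3'])`. When two values tie for the highest count, `Counter.most_common` keeps the one seen first.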
Thanks @AndrzejS. I was looking for exactly something like this :-)