Python PySpark: show the column with the lowest value in each row
I have the following dataframe.

I want to get the lowest value across the columns for each row. This is what I have managed so far.

The problem is that Col4 is empty, so the lowest value cannot be calculated. What I am looking for is something like the output below: the minimum of each row regardless of the blank column, and if several columns tie for the minimum, their names concatenated in the lowest_cols_title column:
+-----------------+----------+----+----+----+----+----+
|lowest_cols_title|lowest_col|Col1|Col2|Col3|Col4|Col5|
+-----------------+----------+----+----+----+----+----+
| Col1| 0| 0| 7| 8| | 20|
| Col1;Col2;Col3| 5| 5| 5| 5| | 28|
| Col1;Col2| -1| -1| -1| 13| | 83|
| Col1| -1| -1| 6| 6| | 18|
| Col3| 5| 5| 4| 2| | 84|
| Col1;Col2| 0| 0| 0| 14| 7| 86|
+-----------------+----------+----+----+----+----+----+
There are a couple of ways to deal with your empty column. One is to cast every column to int, which turns the empty strings into nulls:
from pyspark.sql import functions as F

# Casting the string columns to int turns the empty strings in Col4 into nulls
df_old_list = df_old_list.withColumn('Col1', F.col('Col1').cast('int'))
df_old_list = df_old_list.withColumn('Col2', F.col('Col2').cast('int'))
df_old_list = df_old_list.withColumn('Col3', F.col('Col3').cast('int'))
df_old_list = df_old_list.withColumn('Col4', F.col('Col4').cast('int'))
df_old_list = df_old_list.withColumn('Col5', F.col('Col5').cast('int'))
df_old_list.show()
+----+----+----+----+----+
|Col1|Col2|Col3|Col4|Col5|
+----+----+----+----+----+
| 0| 7| 8|null| 20|
| 5| 5| 5|null| 28|
| -1| -1| 13|null| 83|
| -1| 6| 6|null| 18|
| 5| 4| 2|null| 84|
| 0| 0| 14| 7| 86|
+----+----+----+----+----+
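The five withColumn calls can also be collapsed into a single select, if you prefer; a minimal sketch with the same behavior:

# Equivalent one-pass version of the casts above
df_old_list = df_old_list.select(
    [F.col(c).cast('int').alias(c) for c in df_old_list.columns]
)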
With the columns cast, your code now produces the following result:
df1 = df_old_list.selectExpr("*", "array_sort(split(concat_ws(',',*),','))[0] lowest_col")
df1.show()
+----+----+----+----+----+----------+
|Col1|Col2|Col3|Col4|Col5|lowest_col|
+----+----+----+----+----+----------+
| 0| 7| 8|null| 20| 0|
| 5| 5| 5|null| 28| 28|
| -1| -1| 13|null| 83| -1|
| -1| 6| 6|null| 18| -1|
| 5| 4| 2|null| 84| 2|
| 0| 0| 14| 7| 86| 0|
+----+----+----+----+----+----------+
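Note that this is still wrong: concat_ws glues the row back into one string and array_sort compares the pieces lexicographically, which is why the second row returns 28 instead of 5 ("28" sorts before "5" as a string). A more reliable approach is to compute the minimum numerically with least, treating blank cells as infinity, and then collect the names of the columns that match it: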
from pyspark.sql.functions import array, col, concat_ws, least, when

collist = df.columns
# Row-wise minimum: treat blank cells as +infinity so they never win
min_ = least(*[
    when(col(c) == "", float("inf")).otherwise(col(c).cast('int'))
    for c in df.columns
]).alias("lowest_col")
df = df.select("*", min_)
# Emit the name of every column that equals the minimum; concat_ws skips the nulls
df = df.select("*", concat_ws(";", array([
    when(col(c) == col("lowest_col"), c).otherwise(None)
    for c in collist
])).alias("lowest_cols_title"))
df.show(10, False)
Output:
+----+----+----+----+----+----------+-----------------+
|Col1|Col2|Col3|Col4|Col5|lowest_col|lowest_cols_title|
+----+----+----+----+----+----------+-----------------+
|0 |7 |8 | |20 |0.0 |Col1 |
|5 |5 |5 | |28 |5.0 |Col1;Col2;Col3 |
|-1 |-1 |13 | |83 |-1.0 |Col1;Col2 |
|-1 |6 |6 | |18 |-1.0 |Col1 |
|5 |4 |2 | |84 |2.0 |Col3 |
|0 |0 |14 |7 |86 |0.0 |Col1;Col2 |
+----+----+----+----+----+----------+-----------------+
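As an aside, a comment below asks how to keep only the first column name when several columns tie for the minimum. A minimal sketch, assuming the lowest_cols_title column built above: split it on the separator and take the first element.

from pyspark.sql.functions import col, split

# Hypothetical follow-up: keep only the first column holding the minimum
df = df.withColumn("first_lowest_col", split(col("lowest_cols_title"), ";")[0])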
Using @venky__'s answer, I found a solution for you:
from pyspark.sql import functions as F

join_key = df_old_list.columns

# Row-wise minimum, treating nulls and blanks as +infinity
min_ = F.least(
    *[F.when(F.col(c).isNull() | (F.col(c) == ""), float("inf")).otherwise(F.col(c).cast('int'))
      for c in join_key]
).alias("lowest_col")
df_with_lowest_col = df_old_list.select("*", min_.cast('int'))

# Melt the dataframe: one (column name, value) struct per column, exploded into rows
df_exploded = df_old_list.withColumn(
    'vars_and_vals',
    F.explode(F.array(
        *(F.struct(F.lit(c).alias('var'), F.col(c).alias('val')) for c in join_key)
    )))
cols = join_key + [F.col("vars_and_vals")[x].alias(x) for x in ['var', 'val']]
df_exploded = df_exploded.select(*cols)

# Keep only the melted rows whose value equals the row minimum...
df = df_exploded.join(df_with_lowest_col, join_key)
df = df.filter('val = lowest_col')
# ...and collect the matching column names into one ;-separated string
df_with_col_names = df.groupby(*join_key).agg(
    F.array_join(F.collect_list('var'), ';').alias('lowest_cols_title')
)
res_df = df_with_lowest_col.join(df_with_col_names, join_key)
Result:
res_df.show()
+----+----+----+----+----+----------+-----------------+
|Col1|Col2|Col3|Col4|Col5|lowest_col|lowest_cols_title|
+----+----+----+----+----+----------+-----------------+
| 0| 0| 14| 7| 86| 0| Col1;Col2|
| -1| 6| 6| | 18| -1| Col1|
| 5| 5| 5| | 28| 5| Col1;Col2;Col3|
| 0| 7| 8| | 20| 0| Col1|
| -1| -1| 13| | 83| -1| Col1;Col2|
| 5| 4| 2| | 84| 2| Col3|
+----+----+----+----+----+----------+-----------------+
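A side note on the melt step: the same unpivot can also be written with the SQL stack function, which is a common way to melt in Spark. A sketch, assuming df_old_list still has the string columns Col1 through Col5:

# Hypothetical alternative to the explode/struct melt above, using stack()
df_exploded = df_old_list.selectExpr(
    "*",
    "stack(5, 'Col1', Col1, 'Col2', Col2, 'Col3', Col3, 'Col4', Col4, 'Col5', Col5) as (var, val)"
)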
It looks complicated and could certainly be optimized, but I think it works.

Comments:

Thanks for the detailed reply. If there are multiple lowest values, how can I get only the first column with the lowest value?

Thank you, I also need a lowest_cols_title column that shows which column holds the lowest value. Again, the effort is appreciated :)

This is an interesting approach, but it could become a problem if the data volume is large. It can also be done with SQL-like case-when statements (when/otherwise in Spark). See my answer.

@venky__ Yes, our approaches are almost the same, but I didn't think of comparing the values with when statements in a loop instead of creating structs and then melting. Good idea.

@busfighter Thank you, and yes, it looks complicated, but this dataset of mine is limited to around 1000 records, so I'm not too worried :)
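For reference, here is the string-sort attempt once more, this time run on data where the blanks in Col4 have been replaced by 2147483647 (Int.MaxValue); the lexicographic comparison still picks the wrong minimum, e.g. 2147483647 instead of 5 in the second row: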
df1 = df_old_list.selectExpr("*", "array_sort(split(concat_ws(',',*),','))[0] lowest_col")
df1.show()
+----+----+----+----------+----+----------+
|Col1|Col2|Col3| Col4|Col5|lowest_col|
+----+----+----+----------+----+----------+
| 0| 7| 8|2147483647| 20| 0|
| 5| 5| 5|2147483647| 28|2147483647|
| -1| -1| 13|2147483647| 83| -1|
| -1| 6| 6|2147483647| 18| -1|
| 5| 4| 2|2147483647| 84| 2|
| 0| 0| 14| 7| 86| 0|
+----+----+----+----------+----+----------+
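Finally, for anyone who wants to reproduce these examples, the sample dataframe can be built like this: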
from pyspark.sql import Row
from pyspark.sql.functions import col, least, when, array, concat_ws

# Sample data: Col4 is blank in every record except the last
df_old_list = [
    {"Col1": "0",  "Col2": "7",  "Col3": "8",  "Col4": "",  "Col5": "20"},
    {"Col1": "5",  "Col2": "5",  "Col3": "5",  "Col4": "",  "Col5": "28"},
    {"Col1": "-1", "Col2": "-1", "Col3": "13", "Col4": "",  "Col5": "83"},
    {"Col1": "-1", "Col2": "6",  "Col3": "6",  "Col4": "",  "Col5": "18"},
    {"Col1": "5",  "Col2": "4",  "Col3": "2",  "Col4": "",  "Col5": "84"},
    {"Col1": "0",  "Col2": "0",  "Col3": "14", "Col4": "7", "Col5": "86"},
]
df = spark.createDataFrame(Row(**x) for x in df_old_list)