Apache Spark: combining multiple groupBy functions into one


Using this code to find the mode:

import numpy as np
np.random.seed(1)

df2 = sc.parallelize([
    int(x) for x in np.random.randint(50, size=10000)
]).toDF(["x"])

cnts = df2.groupBy("x").count()
mode = cnts.join(
    cnts.agg(max("count").alias("max_")), col("count") == col("max_")
).limit(1).select("x")
mode.first()[0]

returns an error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-53-2a9274e248ac> in <module>()
      8 cnts = df.groupBy("x").count()
      9 mode = cnts.join(
---> 10     cnts.agg(max("count").alias("max_")), col("count") == col("max_")
     11 ).limit(1).select("x")
     12 mode.first()[0]

AttributeError: 'str' object has no attribute 'alias'

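So the modes of c1 and c2 are 2.0 and 3.0, respectively.

Can this be applied to all columns of the dataframe (c1, c2, c3, c4, c5) instead of explicitly selecting each column as I have done?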

It looks like you are using Python's built-in max rather than the SQL function pyspark.sql.functions.max.
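The name clash can be reproduced without Spark. The sketch below is only an illustration in plain Python (the variable names are made up for this example): the built-in max treats the string "count" as a sequence of characters and returns the largest one, which is an ordinary str and therefore has no alias method.

```python
# Plain Python, no Spark needed: the built-in max iterates over the
# characters of "count" and returns the largest one.
largest_char = max("count")
print(repr(largest_char))  # 'u', a plain str rather than a Spark Column

try:
    # This is the same call that fails in the traceback above.
    largest_char.alias("max_")
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

The printed message matches the one in the traceback, which suggests that max was never imported from pyspark.sql.functions in that session.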

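import pyspark.sql.functions as F

cnts.agg(F.max("count").alias("max_"))

To find the mode over multiple columns of the same type, you can reshape the data to long format, using a melt function defined separately: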

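(melt(df, [], df.columns)
    # Count by column and value
    .groupBy("variable", "value")
    .count()
    # Find the mode per column
    .groupBy("variable")
    .agg(F.max(F.struct("count", "value")).alias("mode"))
    .select("variable", "mode.value"))

+--------+-----+
|variable|value|
+--------+-----+
|      c5|  6.0|
|      c1|  2.0|
|      c4|  5.0|
|      c3|  4.0|
|      c2|  3.0|
+--------+-----+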
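The reason F.max over a struct of (count, value) yields the mode is that structs are compared field by field, so the pair with the highest count wins and ties are broken by the larger value. A minimal plain-Python sketch of the same two steps follows. It uses tuples in place of structs, and the sample rows are hypothetical data invented for illustration, not the data from the question.

```python
from collections import Counter

# Hypothetical long-format rows, shaped like melt output: (variable, value).
long_rows = [
    ("c1", 2.0), ("c1", 2.0), ("c1", 1.0),
    ("c2", 3.0), ("c2", 3.0), ("c2", 4.0),
]

# Step 1: count by column and value, like groupBy("variable", "value").count().
counts = Counter(long_rows)

# Step 2: per column, keep the largest (count, value) pair. Tuples compare
# field by field, mirroring F.max(F.struct("count", "value")).
modes = {}
for (variable, value), count in counts.items():
    candidate = (count, value)
    if variable not in modes or candidate > modes[variable]:
        modes[variable] = candidate

# Keep only the value from each winning pair, like select("mode.value").
mode_values = {variable: pair[1] for variable, pair in modes.items()}
print(mode_values)  # {'c1': 2.0, 'c2': 3.0}
```

Putting count first in the pair is what makes the comparison pick the most frequent value. With the fields reversed, the result would be the largest value instead of the mode.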