Aggregation with filter and groupBy in Spark

Tags: sql, apache-spark, pyspark, apache-spark-sql, pyspark-dataframes

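I am aggregating on a groupBy condition and applying some filters to an existing Spark/Scala DataFrame. When I run the code, I get "cannot resolve '`flag`' given input columns:".

Can someone guide me on how to rewrite the code?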

    val someDF = Seq(
      (1, 111, 100, 100, "C", "5th", "Y", 11),
      (1, 111, 100, 100, "C", "5th", "Y", 11),
      (2, 222, 200, 200, "C", "5th", "Y", 22),
      (2, 222, 200, 200, "C", "5th", "Y", 22)
    ).toDF("id", "rollno", "sub1", "sub2", "flag", "class", "status", "sno")

    var df2 = someDF.groupBy("id", "rollno")
      .agg(sum("sub1").alias("sub1"), sum("sub2").alias("sub2"))
      .filter(col("flag") === "C")
      .filter(length(col("rollno")) >= 2)
      .filter(col("class") === ("5th") || col("class") === ("6th"))
      .filter(substring(col("rollno"), 1, 2) === col("sno"))
      .filter(col("status") === "Y")
      .select("id", "rollno", "sub1", "sub2", "flag", "class", "sno", "status")

Error:

    org.apache.spark.sql.AnalysisException: cannot resolve '`flag`' given input columns: [id, rollno, sub1, sub2];;
    'Filter ('flag = C)

Expected result:

    +---+------+----+----+----+-----+------+---+
    | id|rollno|sub1|sub2|flag|class|status|sno|
    +---+------+----+----+----+-----+------+---+
    |  1|   111| 200| 200|   C|  5th|     Y| 11|
    |  2|   222| 400| 400|   C|  5th|     Y| 22|
    +---+------+----+----+----+-----+------+---+

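After the aggregation, the other columns are gone, so you cannot filter on them. You need to filter before grouping. If you want to keep the other columns, you also need to group by them.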

    import org.apache.spark.sql.functions._

    // Filter on the original columns first, then group and aggregate.
    val df2 = someDF
      .filter(col("flag") === "C")
      .filter(length(col("rollno")) >= 2)
      .filter(col("class") === ("5th") || col("class") === ("6th"))
      .filter(substring(col("rollno"), 1, 2) === col("sno"))
      .filter(col("status") === "Y")
      // Every column that should survive must be a grouping key or an aggregate.
      .groupBy("id", "rollno", "flag", "class", "sno", "status")
      .agg(sum("sub1").alias("sub1"), sum("sub2").alias("sub2"))
      .select("id", "rollno", "sub1", "sub2", "flag", "class", "sno", "status")

    df2.show()
    +---+------+----+----+----+-----+---+------+
    | id|rollno|sub1|sub2|flag|class|sno|status|
    +---+------+----+----+----+-----+---+------+
    |  1|   111| 200| 200|   C|  5th| 11|     Y|
    |  2|   222| 400| 400|   C|  5th| 22|     Y|
    +---+------+----+----+----+-----+---+------+
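
To see why the original ordering fails, it can help to inspect which columns are left after the aggregation. The sketch below is only an illustration: it assumes a local `SparkSession` (in `spark-shell` a `spark` value already exists, so that line can be dropped), and the name `aggregated` is made up for this example. It can be run as a Scala script or pasted with `:paste`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// Assumption: a local session for experimenting; not part of the original question.
val spark = SparkSession.builder().master("local[*]").appName("agg-columns").getOrCreate()
import spark.implicits._

val someDF = Seq(
  (1, 111, 100, 100, "C", "5th", "Y", 11),
  (2, 222, 200, 200, "C", "5th", "Y", 22)
).toDF("id", "rollno", "sub1", "sub2", "flag", "class", "status", "sno")

// Same groupBy/agg as in the question, without the filters that follow it.
val aggregated = someDF
  .groupBy("id", "rollno")
  .agg(sum("sub1").alias("sub1"), sum("sub2").alias("sub2"))

// Only the grouping keys and the aggregates remain; "flag", "class",
// "status" and "sno" are no longer available to a later filter.
println(aggregated.columns.mkString(", "))
```

The printed column list matches the `given input columns` part of the `AnalysisException`, which is why a filter on `flag` cannot be resolved at that point.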

Apply the filters before grouping.
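
A possible variation on the same idea, not taken from the answers above: keep `id` and `rollno` as the only grouping keys and carry the remaining columns through the aggregation with `first`. This is a minimal sketch under a few assumptions: a local `SparkSession`, the hypothetical name `df3`, and filtered rows that agree on `flag`, `class`, `sno` and `status` within each group (otherwise `first` picks an arbitrary value). The separate filters are also folded into one predicate, which is a stylistic choice only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, first, length, substring, sum}

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("filter-then-agg").getOrCreate()
import spark.implicits._

val someDF = Seq(
  (1, 111, 100, 100, "C", "5th", "Y", 11),
  (1, 111, 100, 100, "C", "5th", "Y", 11),
  (2, 222, 200, 200, "C", "5th", "Y", 22),
  (2, 222, 200, 200, "C", "5th", "Y", 22)
).toDF("id", "rollno", "sub1", "sub2", "flag", "class", "status", "sno")

val df3 = someDF
  // Filter while all original columns are still present.
  .filter(
    col("flag") === "C" &&
    length(col("rollno")) >= 2 &&
    col("class").isin("5th", "6th") &&
    substring(col("rollno"), 1, 2) === col("sno") &&
    col("status") === "Y")
  // Group by the two keys only; first(...) carries the other columns along.
  .groupBy("id", "rollno")
  .agg(
    sum("sub1").alias("sub1"),
    sum("sub2").alias("sub2"),
    first("flag").alias("flag"),
    first("class").alias("class"),
    first("sno").alias("sno"),
    first("status").alias("status"))
  .orderBy("id")

df3.show()
```

If the extra columns can differ inside a group, grouping by them as in the answer above is the safer choice, because it makes the split explicit instead of silently picking one value.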