Scala Spark: given an input DataFrame, return the columns whose values are all equal to 1


scala dataframe apache-spark filter


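Given a DataFrame, say one with 4 columns and 3 rows, I want to write a function that returns only the columns in which every value equals 1.

This is Scala code. I want to use Spark transformations to transform or filter the input DataFrame, and the filter should be implemented inside a function.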

// Integer (java.lang.Integer) rather than Integral, so that a field can hold null
case class Grade(c1: Integer, c2: Integer, c3: Integer, c4: Integer)

val example = Seq(
    Grade(1, 3, 1, 1),
    Grade(1, 1, null, 1),
    Grade(1, 10, 2, 1)
)

val dfInput = spark.createDataFrame(example)

After calling the function filterColumns

val dfOutput = dfInput.filterColumns()

it should return a DataFrame with 3 rows and 2 columns in which every value is 1.

One option is a reduce on the RDD:
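The snippet for this option is not present on this page. As a rough sketch only, and not the answerer's original code, an RDD reduce could AND together one flag per column across all rows. The names `ReduceSketch` and `columnsAllEqualTo` are hypothetical, and the sketch assumes that every column is IntegerType and that the DataFrame has at least one row.

```scala
import org.apache.spark.sql.DataFrame

object ReduceSketch {
  // Hypothetical helper: names of the columns in which every value equals `target`.
  // Assumes all columns are IntegerType and the DataFrame is non-empty
  // (RDD.reduce throws on an empty RDD).
  def columnsAllEqualTo(df: DataFrame, target: Int): Array[String] = {
    val allMatch: Array[Boolean] = df.rdd
      // One flag per column: true when the cell is non-null and equals target
      .map(row => Array.tabulate(row.length)(i => !row.isNullAt(i) && row.getInt(i) == target))
      // Combine the flags of two rows position by position with a logical AND
      .reduce((a, b) => a.zip(b).map { case (x, y) => x && y })

    // Pair each column name with its flag and keep the names flagged true
    df.columns.zip(allMatch).collect { case (name, true) => name }
  }
}
```

The returned names can then be passed on as varargs, for example `dfInput.selectExpr(ReduceSketch.columnsAllEqualTo(dfInput, 1): _*)`.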

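A more readable approach uses Dataset[Grade]:

grade.dropWhenNotEqualsTo(1) -> returns a new Grade in which every value that does not satisfy the condition is replaced with null. Then iterate over the columns:

tmp.select(column).na.drop() -> drops the rows that contain nulls; e.g. for c2 this returns only the rows whose value is 1

if (rowsCount == withoutNullsCount) colsToRetain += col(column) -> keeps the column only when no row was dropped; a column that contains nulls is discarded

I would try to prepare the dataset so that it can be processed without nulls. If there are not many columns, this simple iterative approach may work well. Don't forget to import the Spark implicits first: import spark.implicits._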

The result is:

+---+---+
| c1| c4|
+---+---+
|  1|  1|
|  1|  1|
|  1|  1|
+---+---+

If nulls are unavoidable, use an untyped Dataset (also known as a DataFrame):

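import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType(Seq(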
    StructField("c1", IntegerType, nullable = true),
    StructField("c2", IntegerType, nullable = true),
    StructField("c3", IntegerType, nullable = true),
    StructField("c4", IntegerType, nullable = true)
))

val example = spark.sparkContext.parallelize(Seq(
    Row(1,3,1,1),
    Row(1,1,null,1),
    Row(1,10,2,1)
))

val dfInput = spark.createDataFrame(example, schema).cache()

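// True only when the column holds exactly one distinct value and that value is 1
def allOnes(colName: String, df: DataFrame): Boolean = {
    val rows = df.select(colName).distinct().collect()
    // isNullAt guards the all-null column case, where getInt cannot read the value
    rows.length == 1 && !rows.head.isNullAt(0) && rows.head.getInt(0) == 1
}

val resultColumns = dfInput.columns.filter(c => allOnes(c, dfInput))
dfInput.selectExpr(resultColumns: _*).show()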
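The per-column distinct() above starts one Spark job per column. A possible alternative, sketched here under assumptions, computes one flag per column in a single aggregation and exposes it as the `filterColumns()` call the question asks for. The names `ColumnFilters` and `AllEqualOps` and the `target` parameter are hypothetical, and the sketch assumes simple column names (no dots or backticks) and integer-comparable columns.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, min, when}

object ColumnFilters {
  // Hypothetical extension that makes dfInput.filterColumns() available
  implicit class AllEqualOps(df: DataFrame) {
    def filterColumns(target: Int = 1): DataFrame =
      if (df.columns.isEmpty) df
      else {
        // Per column: 1 where the cell equals target, otherwise 0 (nulls land in otherwise);
        // the minimum over the column is 1 only if every row matched.
        val flags = df.columns.map(c => min(when(col(c) === target, 1).otherwise(0)).as(c))
        val row = df.agg(flags.head, flags.tail: _*).head()
        // min is null for an empty DataFrame; in this sketch such columns are dropped
        val keep = df.columns.zipWithIndex.collect {
          case (name, i) if !row.isNullAt(i) && row.getInt(i) == 1 => name
        }
        df.select(keep.map(name => col(name)): _*)
      }
  }
}
```

After `import ColumnFilters._`, the call `dfInput.filterColumns()` keeps the input column order because the names are filtered in schema order.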

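Can you explain the last two lines? Thanks.

zip creates a list of tuples in which the ._1 element is the column name and ._2 comes from diff (_1, _2, _3, _4). In the map step I then filter the records and return only the column names. Finally, I pass those column names to select as varargs with : _*

Could you tell me which packages I need to import? colsToRetain += col(column) gives some errors. Can you explain col and +=? It shows an error on my machine.

import org.apache.spark.sql.functions.col, import scala.collection.mutable and import org.apache.spark.sql.Column should be enough.

Thanks. The map function requires an encoder? What does that mean? Could you also change the column order of the output DataFrame to match the input, i.e. c1, c4 instead of c4, c1?

About encoders: they are available as implicit conversions via import sparkSession.implicits._. To preserve the order, use a List instead of a Set.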
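The encoder remark above can be illustrated with a small sketch: map on a typed Dataset needs an implicit Encoder for its result type, and importing the session's implicits supplies one for case classes. The names `EncoderSketch` and `maskSecondColumn` are hypothetical; `Grade` mirrors the case class from the question.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Declared at top level so that Spark can derive an encoder for it
case class Grade(c1: Integer, c2: Integer, c3: Integer, c4: Integer)

object EncoderSketch {
  // Hypothetical helper: blanks out c2, only to demonstrate a typed map
  def maskSecondColumn(spark: SparkSession, grades: Seq[Grade]): Dataset[Grade] = {
    import spark.implicits._ // brings Encoder[Grade] and the toDS() extension into scope
    grades.toDS().map(g => g.copy(c2 = null))
  }
}
```

Without the `import spark.implicits._` line, the compiler cannot find an encoder for the result of `map` and rejects the call.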

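import org.apache.spark.sql.functions.col
import scala.collection.mutable
import org.apache.spark.sql.Column
import spark.implicits._ // provides the Encoder[Grade] that map needs

// dfInput must be viewed as Dataset[Grade] so that map can call Grade's methods
val tmp = dfInput.as[Grade].map(grade => grade.dropWhenNotEqualsTo(1))
val rowsCount = dfInput.count()

// Note: a mutable.Set does not preserve the input column order
val colsToRetain = mutable.Set[Column]()
for (column <- tmp.columns) {
  // Rows left after dropping nulls; equal to rowsCount only if every value was 1
  val withoutNullsCount = tmp.select(column).na.drop().count()
  if (rowsCount == withoutNullsCount) colsToRetain += col(column)
}

dfInput.select(colsToRetain.toArray: _*).show()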

+---+---+
| c4| c1|
+---+---+
|  1|  1|
|  1|  1|
|  1|  1|
+---+---+
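One way to keep the input order, in the spirit of the "use a List instead of a Set" comment, is to filter the column names in schema order rather than collecting them into a Set. This is a minimal sketch: `OrderedColumns` and `columnsWithoutNulls` are hypothetical names, and `masked` stands for a dataset like tmp in which non-matching values were already replaced with null.

```scala
import org.apache.spark.sql.{Column, Dataset}
import org.apache.spark.sql.functions.col

object OrderedColumns {
  // Hypothetical helper: columns of `masked` that contain no nulls, in schema order
  def columnsWithoutNulls(masked: Dataset[_], rowsCount: Long): Seq[Column] =
    masked.columns.toSeq
      // A column survives only if dropping its nulls removes no rows
      .filter(name => masked.select(name).na.drop().count() == rowsCount)
      .map(name => col(name))
}
```

With this helper, `dfInput.select(OrderedColumns.columnsWithoutNulls(tmp, rowsCount): _*)` would list c1 before c4, since `columns` follows the schema order.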

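case class Grade(c1: Integer, c2: Integer, c3: Integer, c4: Integer) {
  // Returns a copy in which every field that differs from n is replaced with null
  def dropWhenNotEqualsTo(n: Integer): Grade = {
    Grade(nullOrValue(c1, n), nullOrValue(c2, n), nullOrValue(c3, n), nullOrValue(c4, n))
  }
  // Keeps c when it equals n, otherwise yields null
  def nullOrValue(c: Integer, n: Integer): Integer = if (c == n) c else null
}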

Contents of tmp (values other than 1 replaced with null):

+---+----+----+---+
| c1|  c2|  c3| c4|
+---+----+----+---+
|  1|null|   1|  1|
|  1|   1|null|  1|
|  1|null|null|  1|
+---+----+----+---+

Result of tmp.select("c2").na.drop():

+---+
| c2|
+---+
|  1|
+---+

val example = spark.sparkContext.parallelize(Seq(
    Grade(1,3,1,1),
    Grade(1,1,0,1),
    Grade(1,10,2,1)
)).toDS().cache()

import org.apache.spark.sql.Dataset

// True only when the column holds exactly one distinct value and that value is 1
def allOnes(colName: String, ds: Dataset[Grade]): Boolean = {
    val rows = ds.select(colName).distinct().collect()
    // This variant assumes the data contains no nulls
    rows.length == 1 && rows.head.getInt(0) == 1
}

val resultColumns = example.columns.filter(c => allOnes(c, example))
example.selectExpr(resultColumns: _*).show()

+---+---+
| c1| c4|
+---+---+
|  1|  1|
|  1|  1|
|  1|  1|
+---+---+
