Scala Spark: given an input DataFrame, return the columns whose values are all equal to 1
Given a DataFrame, say with 4 columns and 3 rows, I want to write a function that returns the columns in which every value equals 1. This is Scala code; I would like to use some Spark transformations to transform or filter the input DataFrame, and the filter should be implemented inside the function:
case class Grade(c1: Integer, c2: Integer, c3: Integer, c4: Integer) // java.lang.Integer (not Scala Int) so a field can hold null
val example = Seq(
Grade(1,3,1,1),
Grade(1,1,null,1),
Grade(1,10,2,1)
)
val dfInput = spark.createDataFrame(example)
After calling a function filterColumns:
val dfOutput = dfInput.filterColumns()
It should return a 3-row, 2-column DataFrame whose values are all 1. One option would be a reduce over the RDD.
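For reference, the "reduce" idea can be sketched in plain Scala without Spark (a hypothetical illustration of the logic only — names like `ReduceSketch` and the `Option[Int]` cell encoding are assumptions, not part of the question): combine the rows element-wise, keeping 1 only where both sides are 1, then keep the column names that survived.

```scala
object ReduceSketch {
  // combine two rows element-wise: a cell stays Some(1) only if both cells are 1
  def combine(a: Seq[Option[Int]], b: Seq[Option[Int]]): Seq[Option[Int]] =
    a.zip(b).map { case (x, y) => if (x.contains(1) && y.contains(1)) Some(1) else None }

  // reduce over all rows, then keep the names of the all-ones columns
  def allOnesColumns(columns: Seq[String], rows: Seq[Seq[Option[Int]]]): Seq[String] = {
    val reduced = rows.reduce(combine)
    columns.zip(reduced).collect { case (name, Some(1)) => name }
  }

  def main(args: Array[String]): Unit = {
    val columns = Seq("c1", "c2", "c3", "c4")
    val rows = Seq(
      Seq[Option[Int]](Some(1), Some(3),  Some(1), Some(1)),
      Seq[Option[Int]](Some(1), Some(1),  None,    Some(1)),
      Seq[Option[Int]](Some(1), Some(10), Some(2), Some(1))
    )
    println(allOnesColumns(columns, rows)) // List(c1, c4)
  }
}
```

The same element-wise combine would translate to an RDD reduce, with nulls modeled here as None.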
A more readable approach uses Dataset[Grade]:
grade.dropWhenNotEqualsTo(1) returns a new Grade in which every value that does not satisfy the condition is replaced with null.
Then iterate over the columns: tmp.select(column).na.drop() drops the rows that contain a null in that column (e.g. for c2 this leaves a single row).
If rowsCount == withoutNullsCount, then colsToRetain += col(column); in other words, a column is discarded as soon as it contains a null.
I would try to prepare the dataset so that it can be processed without nulls. If the number of columns is small, this simple iterative approach can work well. Do not forget to import spark.implicits._ first. The result is:
+---+---+
| c1| c4|
+---+---+
| 1| 1|
| 1| 1|
| 1| 1|
+---+---+
If nulls are unavoidable, use the untyped Dataset (a.k.a. DataFrame):
val schema = StructType(Seq(
StructField("c1", IntegerType, nullable = true),
StructField("c2", IntegerType, nullable = true),
StructField("c3", IntegerType, nullable = true),
StructField("c4", IntegerType, nullable = true)
))
val example = spark.sparkContext.parallelize(Seq(
Row(1,3,1,1),
Row(1,1,null,1),
Row(1,10,2,1)
))
val dfInput = spark.createDataFrame(example, schema).cache()
def allOnes(colName: String, df: DataFrame): Boolean = {
  // collect the distinct values of the column; it qualifies only when
  // the single distinct value is 1 (guard against an all-null column)
  val rows = df.select(colName).distinct().collect()
  rows.length == 1 && !rows.head.isNullAt(0) && rows.head.getInt(0) == 1
}
val resultColumns = dfInput.columns.filter(col => allOnes(col, dfInput))
dfInput.selectExpr(resultColumns: _*).show()
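The distinct-based test in allOnes can be mirrored on plain Scala collections (a sketch only; `DistinctCheckSketch` is a hypothetical name, and a null guard is included because a column whose only distinct value is null should not qualify):

```scala
object DistinctCheckSketch {
  // a column is "all ones" iff its distinct values collapse to
  // exactly one non-null value equal to 1
  def allOnes(values: Seq[Integer]): Boolean = {
    val d = values.distinct
    d.length == 1 && d.head != null && d.head == 1
  }

  def main(args: Array[String]): Unit = {
    println(allOnes(Seq[Integer](1, 1, 1)))    // true
    println(allOnes(Seq[Integer](1, null, 1))) // false
  }
}
```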
Could you explain the last two lines? Thanks. — zip creates a list of tuples, where the ._1 element is the column name and ._2 comes from diff; then in the map step I filter the records and return only the column names. Finally I pass those column names to select as varargs: _*.
Could you tell me which packages I need to import? colsToRetain += col(column) gives some errors; can you explain col and +=? It shows errors on my machine. — import org.apache.spark.sql.functions.col, import scala.collection.mutable and import org.apache.spark.sql.Column should be enough.
Thanks. Does the map function need an encoder? What does that mean? And could you change the order of the output DataFrame to match the input, i.e. c1, c4 instead of c4, c1? — About encoders: they are available as implicit conversions via import sparkSession.implicits._. To preserve the order, use a List instead of a Set.
import org.apache.spark.sql.functions.col
import scala.collection.mutable
import org.apache.spark.sql.Column
import spark.implicits._ // provides the Encoder that map needs

// convert to a typed Dataset[Grade] first, then null out the non-1 values
val tmp = dfInput.as[Grade].map(grade => grade.dropWhenNotEqualsTo(1))
val rowsCount = dfInput.count()
val colsToRetain = mutable.Set[Column]()
for (column <- tmp.columns) {
val withoutNullsCount = tmp.select(column).na.drop().count()
if (rowsCount == withoutNullsCount) colsToRetain += col(column)
}
dfInput.select(colsToRetain.toArray:_*).show()
+---+---+
| c4| c1|
+---+---+
| 1| 1|
| 1| 1|
| 1| 1|
+---+---+
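As the output above shows, mutable.Set does not guarantee iteration order, which is why the columns come out as c4, c1. A minimal plain-Scala sketch of the fix suggested in the comments (the name `OrderSketch` and the stand-in predicate are assumptions): a ListBuffer keeps insertion order, so the retained columns come out in input order.

```scala
import scala.collection.mutable

object OrderSketch {
  // collect the columns that pass the check, preserving input order
  def retainInOrder(columns: Seq[String], keep: String => Boolean): List[String] = {
    val cols = mutable.ListBuffer[String]()
    for (c <- columns if keep(c)) cols += c
    cols.toList
  }

  def main(args: Array[String]): Unit = {
    // Set("c1", "c4") acts as a stand-in predicate for the all-ones check
    println(retainInOrder(Seq("c1", "c2", "c3", "c4"), Set("c1", "c4"))) // List(c1, c4)
  }
}
```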
case class Grade(c1: Integer, c2: Integer, c3: Integer, c4: Integer) {
def dropWhenNotEqualsTo(n: Integer): Grade = {
Grade(nullOrValue(c1, n), nullOrValue(c2, n), nullOrValue(c3, n), nullOrValue(c4, n))
}
def nullOrValue(c: Integer, n: Integer) = if (c == n) c else null
}
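The Grade helper above is plain Scala, so it can be checked without Spark; here is a self-contained copy with a small demo (the `GradeDemo` object is added for illustration):

```scala
case class Grade(c1: Integer, c2: Integer, c3: Integer, c4: Integer) {
  def dropWhenNotEqualsTo(n: Integer): Grade =
    Grade(nullOrValue(c1, n), nullOrValue(c2, n), nullOrValue(c3, n), nullOrValue(c4, n))
  // keep the value when it equals n, otherwise replace it with null
  def nullOrValue(c: Integer, n: Integer): Integer = if (c == n) c else null
}

object GradeDemo {
  def main(args: Array[String]): Unit = {
    println(Grade(1, 3, 1, 1).dropWhenNotEqualsTo(1)) // Grade(1,null,1,1)
  }
}
```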
The intermediate tmp after mapping dropWhenNotEqualsTo(1) over the input:
+---+----+----+---+
| c1| c2| c3| c4|
+---+----+----+---+
| 1|null| 1| 1|
| 1| 1|null| 1|
| 1|null|null| 1|
+---+----+----+---+
For example, tmp.select("c2").na.drop() leaves a single row:
+---+
| c2|
+---+
| 1|
+---+
The Dataset[Grade] variant, with the null replaced by 0 so the typed Dataset stays null-free:
val example = spark.sparkContext.parallelize(Seq(
Grade(1,3,1,1),
Grade(1,1,0,1),
Grade(1,10,2,1)
)).toDS().cache()
def allOnes(colName: String, ds: Dataset[Grade]): Boolean = {
val row = ds.select(colName).distinct().collect()
if (row.length == 1 && row.head.getInt(0) == 1) true
else false
}
val resultColumns = example.columns.filter(col => allOnes(col, example))
example.selectExpr(resultColumns: _*).show()
+---+---+
| c1| c4|
+---+---+
| 1| 1|
| 1| 1|
| 1| 1|
+---+---+