
How to split a dataset into two parts based on a filter in Spark Scala


Is it possible to split a DataFrame into two parts using a single filter operation? For example,

suppose df has the following records:

UID    Col
 1       a
 2       b
 3       c
If I do

val df1 = df.filter($"UID" <=> 2)

then df1 only contains the row where UID is 2. Is there a way to also get the remaining rows as a second DataFrame from the same pass?

If you are only interested in saving the data, you can add an indicator column to the DataFrame:

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("uid", "col")
val dfWithInd = df.withColumn("ind", $"uid" <=> 2)
Partitioning by this column when writing will create two separate directories on disk, ind=false and ind=true (full example below).

In general, however, it is not possible to produce multiple RDDs or DataFrames from a single transformation.
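If you do need both halves as DataFrames rather than as files on disk, a common workaround (not from the original answer, just a minimal sketch assuming the same uid/col schema and that spark.implicits._ is in scope, e.g. in spark-shell) is to cache the source and apply two complementary filters:

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("uid", "col")
df.cache()                                    // avoid recomputing the source for each filter
val matching    = df.filter($"uid" <=> 2)     // rows where uid = 2
val nonMatching = df.filter(!($"uid" <=> 2))  // all remaining rows

Each filter is still its own transformation; caching just keeps the underlying scan from running twice.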

Putting the write-based approach together:

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("uid", "col")
// Null-safe equality: ind is true where uid = 2, false otherwise
val dfWithInd = df.withColumn("ind", $"uid" <=> 2)
// Partitioning by "ind" writes each group into its own directory
dfWithInd.write.partitionBy("ind").parquet(...)
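To read one half back later, you can point the reader at a single partition directory. A sketch, assuming the data was written to a hypothetical path /tmp/split instead of the elided (...):

val out = "/tmp/split"                                // hypothetical output path
dfWithInd.write.partitionBy("ind").parquet(out)

val matched = spark.read.parquet(s"$out/ind=true")    // rows where uid = 2
val others  = spark.read.parquet(s"$out/ind=false")   // remaining rows
// Note: when a leaf partition directory is read directly, the "ind" column
// itself is not included in the result unless the basePath option is set.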