Dataframe Spark: how to filter data based on a subset condition
I have two tables, p_to_v and g_to_v.
scala> val p_to_v = Seq(("p1", "v1"), ("p1", "v2"), ("p2", "v1")).toDF("p", "v")
scala> p_to_v.show
+---+---+
| p| v|
+---+---+
| p1| v1|
| p1| v2|
| p2| v1|
+---+---+
"p1" is mapped to [v1, v2]
"p2" is mapped to [v1]
scala> val g_to_v = Seq(("g1", "v1"), ("g2", "v1"), ("g2", "v2"), ("g3", "v2")).toDF("g", "v")
scala> g_to_v.show
+---+---+
| g| v|
+---+---+
| g1| v1|
| g2| v1|
| g2| v2|
| g3| v2|
+---+---+
"g1" is mapped to [v1]
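"g2" is mapped to [v1, v2]
"g3" is mapped to [v2]
I want to get all combinations of p and g for which the set of v values mapped to p is a subset of the set of v values mapped to g.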
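To make the requirement concrete, here is a minimal sketch in plain Scala collections (no Spark) that computes the pairs expected for the sample data above. The names `pToV`, `gToV`, `toSets` and `pairs` are illustrative only and do not come from the post.

```scala
// Same sample rows as the two DataFrames above, as plain tuples.
val pToV = Seq(("p1", "v1"), ("p1", "v2"), ("p2", "v1"))
val gToV = Seq(("g1", "v1"), ("g2", "v1"), ("g2", "v2"), ("g3", "v2"))

// Collect the v values of each key into a set.
def toSets(rows: Seq[(String, String)]): Map[String, Set[String]] =
  rows.groupBy(_._1).map { case (k, kv) => k -> kv.map(_._2).toSet }

val pSets = toSets(pToV)
val gSets = toSets(gToV)

// Keep every (p, g) pair where p's set is contained in g's set.
val pairs = (for {
  (p, pv) <- pSets.toSeq
  (g, gv) <- gSets.toSeq
  if pv.subsetOf(gv)
} yield (p, g)).sorted

println(pairs) // List((p1,g2), (p2,g1), (p2,g2))
```

g3 does not appear because neither p1's nor p2's values fit inside [v2].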
How can I get this result?
This is simple. You need a groupBy on each table, followed by a simple inner join.
scala> val p_to_v = Seq(("p1", "v1"), ("p1", "v2"), ("p2", "v1")).toDF("p", "v")
19/10/16 22:11:55 WARN metastore: Failed to connect to the MetaStore Server...
p_to_v: org.apache.spark.sql.DataFrame = [p: string, v: string]
scala> val g_to_v = Seq(("g1", "v1"), ("g2", "v1"), ("g2", "v2"), ("g3", "v2")).toDF("g", "v")
g_to_v: org.apache.spark.sql.DataFrame = [g: string, v: string]
Now do the grouping:
scala> val pv = p_to_v.groupBy($"p").agg(collect_list("v").as("pv"))
pv: org.apache.spark.sql.DataFrame = [p: string, pv: array<string>]
scala> val gv = g_to_v.groupBy($"g").agg(collect_list("v").as("gv"))
gv: org.apache.spark.sql.DataFrame = [g: string, gv: array<string>]
scala> pv.show
+---+--------+
| p| pv|
+---+--------+
| p2| [v1]|
| p1|[v1, v2]|
+---+--------+
scala> gv.show
+---+--------+
| g| gv|
+---+--------+
| g2|[v2, v1]|
| g3| [v2]|
| g1| [v1]|
+---+--------+
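The join snippets in this answer call `subLisUDF`, but its definition does not appear in the post. A minimal sketch of what it might look like, assuming it should return true when every element of the first array also occurs in the second (the helper name `isSubset` is hypothetical):

```scala
import org.apache.spark.sql.functions.udf

// Assumed behaviour, not the original author's code:
// true when every element of `sub` also occurs in `sup`, regardless of order.
val isSubset: (Seq[String], Seq[String]) => Boolean =
  (sub, sup) => sub.forall(x => sup.contains(x))

// Wrap the plain function so it can be applied to array<string> columns.
val subLisUDF = udf(isSubset)
```

Because the check ignores element order, it treats [v1, v2] and [v2, v1] as equal sets, which matters here since collect_list does not guarantee an order.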
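Try both approaches and keep whichever performs better.
Can you add how you want the output to look? I can't see "g3" in your example.
In the output, g3 is linked to [v2], and there is no p whose v values are a subset of [v2].
@SarathChandraVema I only need g and p, so a DataFrame with g and p as columns is enough.
Please mark this as the answer if it solved your problem.
The p2 and g2 combination is missing. p2's v mapping is [v1] and g2's v mapping is [v1, v2], so p2's v values are a subset of g2's.
Sorry, I missed some parts; the answer has now been edited with the required update.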
spark.conf.set("spark.sql.crossJoin.enabled", "true")
pv.join(gv).withColumn("newdsa", subLisUDF($"pv", $"gv")).filter($"newdsa").show
+---+--------+---+--------+------+
| p| pv| g| gv|newdsa|
+---+--------+---+--------+------+
| p2| [v1]| g2|[v2, v1]| true|
| p1|[v1, v2]| g2|[v2, v1]| true|
| p2| [v1]| g1| [v1]| true|
+---+--------+---+--------+------+
Or, with the subset check as the join condition:
pv.join(gv, pv("pv") === gv("gv") || subLisUDF($"pv", $"gv")).show
+---+--------+---+--------+
| p| pv| g| gv|
+---+--------+---+--------+
| p2| [v1]| g2|[v1, v2]|
| p1|[v1, v2]| g2|[v1, v2]|
| p2| [v1]| g1| [v1]|
+---+--------+---+--------+
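Since the asker only needs the p and g columns, the result can be trimmed with a final select. The sketch below also swaps the UDF for the built-in `array_except` function, which is a different technique from the one in the answer and assumes Spark 2.4 or later. It is meant for a spark-shell session in which `pv` and `gv` have been built as shown earlier; `result` is an illustrative name.

```scala
import org.apache.spark.sql.functions.{array_except, size}

// array_except(pv, gv) holds the elements of pv that are missing from gv;
// an empty result means pv is a subset of gv.
val result = pv
  .crossJoin(gv)
  .where(size(array_except($"pv", $"gv")) === 0)
  .select("p", "g")

result.show
```

Whether this outperforms the UDF variants depends on the data, so it is worth timing alongside the other two.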