Scala Apache Spark: column with collected info vs row union
scala, apache-spark, apache-spark-sql

I have the following dataframe:
+------+------------------+--------------+-------------+
| name| email| phone| country|
+------+------------------+--------------+-------------+
| Mike| mike@example.com|+91-9999999999| Italy|
| Alex| alex@example.com|+91-9999999998| France|
| John| john@example.com| +1-1111111111|United States|
|Donald|donald@example.com| +1-2222222222|United States|
| Dan| dan@example.com|+91-9999444999| Poland|
| Scott| scott@example.com|+91-9111999998| Spain|
| Rob| rob@example.com|+91-9114444998| Italy|
+------+------------------+--------------+-------------+
After applying the following transformation:

val tags = Map(
  "big" -> "country IN (FROM big_countries)",
  "medium" -> "country IN (FROM medium_countries)",
  // a few thousand other tag keys and conditions, with any SQL allowed in a WHERE clause (users create them in the application UI)
  "sometag" -> "name = 'Donald' AND email = 'donald@example.com' AND phone = '+1-2222222222'")

def buildTagQuery(tag: String, tagCondition: String, table: String): String = {
  f"FROM $table WHERE $tagCondition"
}

val userTags = tags.map {
  case (tag, tagCondition) => {
    spark.sql(buildTagQuery(tag, tagCondition, "users"))
      .withColumn("tag", lit(tag).cast(StringType))
  }
}

val unionDf = userTags.foldLeft(userTags.head) {
  case (acc, df) => acc.union(df)
}
I receive the following dataframe:
+------+------------------+--------------+-------------+-------+
| name| email| phone| country| tag|
+------+------------------+--------------+-------------+-------+
| Mike| mike@example.com|+91-9999999999| Italy| big|
| Alex| alex@example.com|+91-9999999998| France| big|
| John| john@example.com| +1-1111111111|United States| big|
|Donald|donald@example.com| +1-2222222222|United States| big|
| Scott| scott@example.com|+91-9111999998| Spain| big|
| Rob| rob@example.com|+91-9114444998| Italy| big|
| Mike| mike@example.com|+91-9999999999| Italy| big|
| Alex| alex@example.com|+91-9999999998| France| big|
| John| john@example.com| +1-1111111111|United States| big|
|Donald|donald@example.com| +1-2222222222|United States| big|
| Scott| scott@example.com|+91-9111999998| Spain| big|
| Rob| rob@example.com|+91-9114444998| Italy| big|
| Dan| dan@example.com|+91-9999444999| Poland| medium|
| Scott| scott@example.com|+91-9111999998| Spain| medium|
|Donald|donald@example.com| +1-2222222222|United States|sometag|
+------+------------------+--------------+-------------+-------+
It duplicates every record of the original dataframe, adding the extra information in the tag column, but what I need instead is no duplicated records, with a collection of tags in the tag column. Now I don't know how to change my transformation in order to receive such a structure, with a tag column of type ArrayType, without duplicating the original rows.

Here is one possible approach that does not change your logic too much.
First, you have to assign a unique id to the users table, like this:
import org.apache.spark.sql.functions._
val userstable = spark.sql("select * from users")
val userswithId = userstable.withColumn("UniqueID", monotonically_increasing_id())
userswithId.createOrReplaceTempView("users")
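As a side note, a minimal sketch of what monotonically_increasing_id does (assuming a running SparkSession named spark): the generated ids are guaranteed unique and monotonically increasing, but not necessarily consecutive across partitions.

```scala
import org.apache.spark.sql.functions.monotonically_increasing_id

// Hypothetical toy dataframe, just to illustrate the id column
val demo = spark.range(3).toDF("n")
  .withColumn("UniqueID", monotonically_increasing_id())
demo.show()
// With a single partition the ids happen to start at 0 and count up;
// with multiple partitions they jump by large per-partition offsets.
```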
Now, your tags map and the buildTagQuery helper stay the same as above:
val tags = Map(
  "big" -> "country IN (FROM big_countries)",
  "medium" -> "country IN (FROM medium_countries)",
  // a few thousand other tag keys and conditions, with any SQL allowed in a WHERE clause (users create them in the application UI)
  "sometag" -> "name = 'Donald' AND email = 'donald@example.com' AND phone = '+1-2222222222'")

def buildTagQuery(tag: String, tagCondition: String, table: String): String = {
  f"FROM $table WHERE $tagCondition"
}
Here we select only the UniqueID and tag columns:
val userTags = tags.map {
  case (tag, tagCondition) => {
    spark.sql(buildTagQuery(tag, tagCondition, "users"))
      .withColumn("tag", lit(tag).cast(StringType))
      .select("UniqueID", "tag")
  }
}
This part is important. Your original code using foldLeft has a subtle bug: in your case, the head of the list is folded twice. What I do here is extract the head into a separate variable and then drop it from userTags. The fold logic is the same as before, but this way the head element is not unioned twice:
val headhere = userTags.head
val userResults = userTags.drop(1)

val unionDf2 = userResults.foldLeft(headhere) {
  case (acc, df) => acc.union(df)
}
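Since the only purpose of splitting off the head is to avoid folding it twice, an equivalent and arguably clearer formulation (assuming userTags is non-empty, which it is here since the tags map is non-empty) would be:

```scala
// reduce uses the first element as the seed and unions
// every remaining element exactly once
val unionDf2 = userTags.reduce((acc, df) => acc.union(df))
```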
Now we group by the UniqueID column, aggregating the tags into a list of their own:
val unionDf3 = unionDf2.groupBy("UniqueID").agg(collect_list("tag"))
println("Printing the unionDf3 result")
unionDf3.show(25)
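By default the aggregated column is named collect_list(tag), as in the output below. If you prefer a stable column name, you can alias it; a small sketch:

```scala
import org.apache.spark.sql.functions.{col, collect_list}

val unionDf3 = unionDf2
  .groupBy("UniqueID")
  .agg(collect_list(col("tag")).alias("tags")) // column is named "tags"
```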
Finally, we join your users table, with the UniqueID assigned earlier (i.e. userswithId), to the previous dataframe to get the final result:
val finalResult = userswithId.join(unionDf3,"UniqueID")
println("Printing the final result")
finalResult.show(25)
The final result looks like this:
+--------+------+------------------+--------------+-------------+-----------------+
|UniqueID| name| email| phone| country|collect_list(tag)|
+--------+------+------------------+--------------+-------------+-----------------+
| 0| Alex| alex@example.com|+91-9999999998| France| [big]|
| 1| John| john@example.com| +1-1111111111|United States| [big]|
| 2|Donald|donald@example.com| +1-2222222222|United States| [big, sometag]|
| 4| Scott| scott@example.com|+91-9111999998| Spain| [big, medium]|
+--------+------+------------------+--------------+-------------+-----------------+
Could you post the code for buildTagQuery(tag, tagCondition, "taged_users")? Maybe we can create a UDF that generates the tag column, through which we could avoid the union, map, and fold over dataframes. Or add all the columns to the sql query so the union can be avoided.

@PavithranRamachandran please see the updated question, I have added the information. Thanks for your answer! Could you compare your solution, from a performance point of view, with grouping over all columns: val taggedUsers = unionDf.groupBy(unionDf.columns.diff(Seq("tag")).map(col): _*).agg(collect_set("tag").alias("tags"))?

I think you would have to benchmark it. In my solution there is a uniqueId generation, a groupBy on a smaller table over a single numeric column, and a join to obtain the final result. In your version you group by every column except the tag column and then aggregate the tag list. I honestly don't know which solution would be faster in practice.

Thanks, I feel the simple join (your solution) will work faster. Thanks again!

@alexanoid: if the dataframe unionDf3 is small enough, you can make it slightly faster with a broadcast join.
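The broadcast hint mentioned above can be sketched like this (assuming unionDf3 is small enough to fit comfortably in executor memory):

```scala
import org.apache.spark.sql.functions.broadcast

// Hint Spark to ship unionDf3 to every executor,
// replacing the shuffle join with a broadcast hash join
val finalResult = userswithId.join(broadcast(unionDf3), "UniqueID")
```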