Scala Apache Spark: collected information in a column vs union of rows


I have the following DataFrame:

+------+------------------+--------------+-------------+
|  name|             email|         phone|      country|
+------+------------------+--------------+-------------+
|  Mike|  mike@example.com|+91-9999999999|        Italy|
|  Alex|  alex@example.com|+91-9999999998|       France|
|  John|  john@example.com| +1-1111111111|United States|
|Donald|donald@example.com| +1-2222222222|United States|
|   Dan|   dan@example.com|+91-9999444999|       Poland|
| Scott| scott@example.com|+91-9111999998|        Spain|
|   Rob|   rob@example.com|+91-9114444998|        Italy|
+------+------------------+--------------+-------------+
After applying the following transformations:

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

val tags = Map(
  "big" -> "country IN (FROM big_countries)",
  "medium" -> "country IN (FROM medium_countries)",
  // a few thousands of other tag keys and conditions with any possible SQL statements allowed in SQL WHERE clause(users create them on the application UI)
  "sometag" -> "name = 'Donald' AND email = 'donald@example.com' AND phone = '+1-2222222222'")

def buildTagQuery(tag: String, tagCondition: String, table: String): String = {
    f"FROM $table WHERE $tagCondition"
}

val userTags = tags.map {
  case (tag, tagCondition) => {
    spark.sql(buildTagQuery(tag, tagCondition, "users"))
      .withColumn("tag", lit(tag).cast(StringType))
  }
}

val unionDf = userTags.foldLeft(userTags.head) {
  case (acc, df) => acc.union(df)
}
I receive the following DataFrame:

+------+------------------+--------------+-------------+-------+
|  name|             email|         phone|      country|    tag|
+------+------------------+--------------+-------------+-------+
|  Mike|  mike@example.com|+91-9999999999|        Italy|    big|
|  Alex|  alex@example.com|+91-9999999998|       France|    big|
|  John|  john@example.com| +1-1111111111|United States|    big|
|Donald|donald@example.com| +1-2222222222|United States|    big|
| Scott| scott@example.com|+91-9111999998|        Spain|    big|
|   Rob|   rob@example.com|+91-9114444998|        Italy|    big|
|  Mike|  mike@example.com|+91-9999999999|        Italy|    big|
|  Alex|  alex@example.com|+91-9999999998|       France|    big|
|  John|  john@example.com| +1-1111111111|United States|    big|
|Donald|donald@example.com| +1-2222222222|United States|    big|
| Scott| scott@example.com|+91-9111999998|        Spain|    big|
|   Rob|   rob@example.com|+91-9114444998|        Italy|    big|
|   Dan|   dan@example.com|+91-9999444999|       Poland| medium|
| Scott| scott@example.com|+91-9111999998|        Spain| medium|
|Donald|donald@example.com| +1-2222222222|United States|sometag|
+------+------------------+--------------+-------------+-------+
It duplicates each original DataFrame record with the additional information in the tag column, but instead I need the original records without duplication, with the tags collected into the tag column.


Right now I do not know how to change my transformation in order to receive such a structure, with the tag column as, for example, an ArrayType, without duplicating the original rows.

Here is one possible approach that does not change your logic too much.

First, you have to assign a unique id to the users table, as shown below:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

val userstable = spark.sql("select * from users")

val userswithId = userstable.withColumn("UniqueID", monotonically_increasing_id())

userswithId.createOrReplaceTempView("users")
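Note that monotonically_increasing_id guarantees ids that are unique and monotonically increasing, but not consecutive (the actual values depend on partitioning). If consecutive ids were ever required, a window-based row_number could be used instead; a minimal sketch, with the ordering column chosen arbitrarily:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Hypothetical alternative: consecutive ids ordered by name.
// A window without partitionBy pulls all rows into a single partition,
// so this is only reasonable for small tables.
val idWindow = Window.orderBy("name")
val usersWithSeqId = userstable.withColumn("UniqueID", row_number().over(idWindow))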
Now, your tags map and buildTagQuery stay the same as above:

val tags = Map(
  "big" -> "country IN (FROM big_countries)",
  "medium" -> "country IN (FROM medium_countries)",
  // a few thousands of other tag keys and conditions with any possible SQL statements allowed in SQL WHERE clause(users create them on the application UI)
  "sometag" -> "name = 'Donald' AND email = 'donald@example.com' AND phone = '+1-2222222222'")

def buildTagQuery(tag: String, tagCondition: String, table: String): String = {
  f"FROM $table WHERE $tagCondition"
}
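For illustration, the query string generated for the "big" tag looks like this (a small check only; the big_countries temporary view is assumed to exist, as in the question):

// Plain string interpolation; no Spark action is triggered here.
val sampleQuery = buildTagQuery("big", tags("big"), "users")
// sampleQuery == "FROM users WHERE country IN (FROM big_countries)"
println(sampleQuery)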
Here we select only the UniqueID and tag columns:

val userTags = tags.map {
  case (tag, tagCondition) => {
    spark.sql(buildTagQuery(tag, tagCondition, "users"))
      .withColumn("tag", lit(tag).cast(StringType)).select("UniqueID", "tag")
  }
}
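Each element of userTags is now a narrow two-column DataFrame, which keeps the following union and aggregation cheap. A quick sanity check could look like this (just a sketch):

// Every per-tag DataFrame should now carry exactly these two columns.
userTags.foreach(df => require(df.columns.toSeq == Seq("UniqueID", "tag")))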
This part is very important. Your original code using foldLeft has a subtle bug: in your case, the head of the list gets folded twice. What I do here is select the head into a separate variable and then drop it from userTags. The folding logic is the same as before, but now the head element is not folded twice:

val headhere = userTags.head
val userResults  = userTags.drop(1)
val unionDf2 = userResults.foldLeft(headhere) {
  case (acc, df) => acc.union(df)
}
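As a side note, the same union can be written without the head/drop bookkeeping by using reduce, which folds the collection pairwise and never revisits the head (a sketch; reduce throws on an empty collection, so userTags is assumed to be non-empty):

// Equivalent to the foldLeft above, with no head element handled separately.
val unionAlt = userTags.reduce((acc, df) => acc.union(df))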
Now we group by the UniqueID column while aggregating the tags into a list:

val unionDf3 = unionDf2.groupBy("UniqueID").agg(collect_list("tag"))

println("Printing the unionDf3 result")
unionDf3.show(25)
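If the same tag could match a user more than once (overlapping conditions), collect_set would deduplicate the values, and the aggregated column can be given an explicit name; a possible variant:

// Optional variant: deduplicated tags under a "tags" alias.
val unionDf3Dedup = unionDf2
  .groupBy("UniqueID")
  .agg(collect_set("tag").alias("tags"))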
Finally, we join your users table with the previously assigned UniqueID (i.e. the table userswithId) to the DataFrame above, to get the final result:

val finalResult = userswithId.join(unionDf3,"UniqueID")

println("Printing the final result")
finalResult.show(25)
The final result is as follows:

+--------+------+------------------+--------------+-------------+-----------------+
|UniqueID|  name|             email|         phone|      country|collect_list(tag)|
+--------+------+------------------+--------------+-------------+-----------------+
|       0|  Alex|  alex@example.com|+91-9999999998|       France|            [big]|
|       1|  John|  john@example.com| +1-1111111111|United States|            [big]|
|       2|Donald|donald@example.com| +1-2222222222|United States|   [big, sometag]|
|       4| Scott| scott@example.com|+91-9111999998|        Spain|    [big, medium]|
+--------+------+------------------+--------------+-------------+-----------------+
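As mentioned in the comments below, if unionDf3 is small enough you can also hint a broadcast join, which avoids shuffling the larger userswithId side; a sketch:

import org.apache.spark.sql.functions.broadcast

// Broadcast the small aggregated DataFrame to all executors before joining.
val finalResultBroadcast = userswithId.join(broadcast(unionDf3), "UniqueID")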

Could you post the code for buildTagQuery(tag, tagCondition, "taged_users")? Maybe we can create a UDF that generates the tag column, so that the union, map and fold over the DataFrames can be avoided. Or add all the columns in the sql query, so that the union can be avoided.

@PavithranRamachandran please see the updated question, I have added the information.

Thanks for your answer! Could you please compare your solution, from a performance point of view, with grouping over all columns: val taggedUsers = unionDf.groupBy(unionDf.columns.diff(Seq("tag")).map(col): _*).agg(collect_set("tag").alias("tags"))?

I think you would have to benchmark it. In my solution there is a uniqueId generation, a groupBy over a single numeric column on a smaller table, and a join to get the final result. In your case you group by all columns except the tag column and then aggregate the list of tags. I honestly do not know which solution would be faster in practice.

Thanks, I think the simple join (your solution) will work faster. Thanks again!

@alexanoid: if the DataFrame unionDf3 is small enough, you can use a broadcast join to make it slightly faster.
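For reference, the alternative discussed in the comments above, grouping over every column except tag, written out as a self-contained snippet (it assumes the unionDf from the question, which still contains the duplicated rows and a single tag column):

import org.apache.spark.sql.functions.{col, collect_set}

// Group by all original columns and collect the tags, avoiding the extra
// UniqueID and the join, at the cost of grouping on wider rows.
val taggedUsers = unionDf
  .groupBy(unionDf.columns.diff(Seq("tag")).map(col): _*)
  .agg(collect_set("tag").alias("tags"))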