Scala Apache Spark: collected information in a column vs union of rows


I have the following DataFrame:

+------+------------------+--------------+-------------+
|  name|             email|         phone|      country|
+------+------------------+--------------+-------------+
|  Mike|  mike@example.com|+91-9999999999|        Italy|
|  Alex|  alex@example.com|+91-9999999998|       France|
|  John|  john@example.com| +1-1111111111|United States|
|Donald|donald@example.com| +1-2222222222|United States|
|   Dan|   dan@example.com|+91-9999444999|       Poland|
| Scott| scott@example.com|+91-9111999998|        Spain|
|   Rob|   rob@example.com|+91-9114444998|        Italy|
+------+------------------+--------------+-------------+
After applying the following transformations:

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

val tags = Map(
  "big" -> "country IN (FROM big_countries)",
  "medium" -> "country IN (FROM medium_countries)",
  // a few thousands of other tag keys and conditions with any possible SQL statements allowed in SQL WHERE clause(users create them on the application UI)
  "sometag" -> "name = 'Donald' AND email = 'donald@example.com' AND phone = '+1-2222222222'")

def buildTagQuery(tag: String, tagCondition: String, table: String): String = {
    f"FROM $table WHERE $tagCondition"
}

val userTags = tags.map {
  case (tag, tagCondition) => {
    spark.sql(buildTagQuery(tag, tagCondition, "users"))
      .withColumn("tag", lit(tag).cast(StringType))
  }
}

val unionDf = userTags.foldLeft(userTags.head) {
  case (acc, df) => acc.union(df)
}
I receive the following DataFrame:

+------+------------------+--------------+-------------+-------+
|  name|             email|         phone|      country|    tag|
+------+------------------+--------------+-------------+-------+
|  Mike|  mike@example.com|+91-9999999999|        Italy|    big|
|  Alex|  alex@example.com|+91-9999999998|       France|    big|
|  John|  john@example.com| +1-1111111111|United States|    big|
|Donald|donald@example.com| +1-2222222222|United States|    big|
| Scott| scott@example.com|+91-9111999998|        Spain|    big|
|   Rob|   rob@example.com|+91-9114444998|        Italy|    big|
|  Mike|  mike@example.com|+91-9999999999|        Italy|    big|
|  Alex|  alex@example.com|+91-9999999998|       France|    big|
|  John|  john@example.com| +1-1111111111|United States|    big|
|Donald|donald@example.com| +1-2222222222|United States|    big|
| Scott| scott@example.com|+91-9111999998|        Spain|    big|
|   Rob|   rob@example.com|+91-9114444998|        Italy|    big|
|   Dan|   dan@example.com|+91-9999444999|       Poland| medium|
| Scott| scott@example.com|+91-9111999998|        Spain| medium|
|Donald|donald@example.com| +1-2222222222|United States|sometag|
+------+------------------+--------------+-------------+-------+
It duplicates each original DataFrame record with the additional information in the tag column, but instead I need the original records without duplication, with the tags collected into the tag column.


Right now I do not know how to change my transformation in order to receive such a structure, with the tag column as, for example, an ArrayType, without duplicating the original rows.

Here is one possible approach that does not change your logic too much.

First, you have to assign a unique id to the users table, as shown below:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

val userstable = spark.sql("select * from users")

val userswithId = userstable.withColumn("UniqueID", monotonically_increasing_id())

userswithId.createOrReplaceTempView("users")
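Note that monotonically_increasing_id guarantees ids that are unique and monotonically increasing, but not consecutive (the actual values depend on partitioning). If consecutive ids were ever required, a window-based row_number could be used instead; a minimal sketch, with the ordering column chosen arbitrarily:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Hypothetical alternative: consecutive ids ordered by name.
// A window without partitionBy pulls all rows into a single partition,
// so this is only reasonable for small tables.
val idWindow = Window.orderBy("name")
val usersWithSeqId = userstable.withColumn("UniqueID", row_number().over(idWindow))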
Now, your tags map and buildTagQuery stay the same as above:

val tags = Map(
  "big" -> "country IN (FROM big_countries)",
  "medium" -> "country IN (FROM medium_countries)",
  // a few thousands of other tag keys and conditions with any possible SQL statements allowed in SQL WHERE clause(users create them on the application UI)
  "sometag" -> "name = 'Donald' AND email = 'donald@example.com' AND phone = '+1-2222222222'")

def buildTagQuery(tag: String, tagCondition: String, table: String): String = {
  f"FROM $table WHERE $tagCondition"
}
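For illustration, the query string generated for the "big" tag looks like this (a small check only; the big_countries temporary view is assumed to exist, as in the question):

// Plain string interpolation; no Spark action is triggered here.
val sampleQuery = buildTagQuery("big", tags("big"), "users")
// sampleQuery == "FROM users WHERE country IN (FROM big_countries)"
println(sampleQuery)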
Here we select only the UniqueID and tag columns:

val userTags = tags.map {
  case (tag, tagCondition) => {
    spark.sql(buildTagQuery(tag, tagCondition, "users"))
      .withColumn("tag", lit(tag).cast(StringType)).select("UniqueID", "tag")
  }
}
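Each element of userTags is now a narrow two-column DataFrame, which keeps the following union and aggregation cheap. A quick sanity check could look like this (just a sketch):

// Every per-tag DataFrame should now carry exactly these two columns.
userTags.foreach(df => require(df.columns.toSeq == Seq("UniqueID", "tag")))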
This part is very important. Your original code using foldLeft has a subtle bug: in your case, the head of the list gets folded twice. What I do here is select the head into a separate variable and then drop it from userTags. The folding logic is the same as before, but now the head element is not folded twice:

val headhere = userTags.head
val userResults  = userTags.drop(1)
val unionDf2 = userResults.foldLeft(headhere) {
  case (acc, df) => acc.union(df)
}
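As a side note, the same union can be written without the head/drop bookkeeping by using reduce, which folds the collection pairwise and never revisits the head (a sketch; reduce throws on an empty collection, so userTags is assumed to be non-empty):

// Equivalent to the foldLeft above, with no head element handled separately.
val unionAlt = userTags.reduce((acc, df) => acc.union(df))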
Now we group by the UniqueID column while aggregating the tags into a list:

val unionDf3 = unionDf2.groupBy("UniqueID").agg(collect_list("tag"))

println("Printing the unionDf3 result")
unionDf3.show(25)
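If the same tag could match a user more than once (overlapping conditions), collect_set would deduplicate the values, and the aggregated column can be given an explicit name; a possible variant:

// Optional variant: deduplicated tags under a "tags" alias.
val unionDf3Dedup = unionDf2
  .groupBy("UniqueID")
  .agg(collect_set("tag").alias("tags"))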
Finally, we join your users table with the previously assigned UniqueID (i.e. the table userswithId) to the DataFrame above, to get the final result:

val finalResult = userswithId.join(unionDf3,"UniqueID")

println("Printing the final result")
finalResult.show(25)
The final result is as follows:

+--------+------+------------------+--------------+-------------+-----------------+
|UniqueID|  name|             email|         phone|      country|collect_list(tag)|
+--------+------+------------------+--------------+-------------+-----------------+
|       0|  Alex|  alex@example.com|+91-9999999998|       France|            [big]|
|       1|  John|  john@example.com| +1-1111111111|United States|            [big]|
|       2|Donald|donald@example.com| +1-2222222222|United States|   [big, sometag]|
|       4| Scott| scott@example.com|+91-9111999998|        Spain|    [big, medium]|
+--------+------+------------------+--------------+-------------+-----------------+
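As mentioned in the comments below, if unionDf3 is small enough you can also hint a broadcast join, which avoids shuffling the larger userswithId side; a sketch:

import org.apache.spark.sql.functions.broadcast

// Broadcast the small aggregated DataFrame to all executors before joining.
val finalResultBroadcast = userswithId.join(broadcast(unionDf3), "UniqueID")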

Could you post the code for buildTagQuery(tag, tagCondition, "taged_users")? Maybe we can create a UDF that generates the tag column, so that the union, map and fold over the DataFrames can be avoided. Or add all the columns in the sql query, so that the union can be avoided.

@PavithranRamachandran please see the updated question, I have added the information.

Thanks for your answer! Could you please compare your solution, from a performance point of view, with grouping over all columns: val taggedUsers = unionDf.groupBy(unionDf.columns.diff(Seq("tag")).map(col): _*).agg(collect_set("tag").alias("tags"))?

I think you would have to benchmark it. In my solution there is a uniqueId generation, a groupBy over a single numeric column on a smaller table, and a join to get the final result. In your case you group by all columns except the tag column and then aggregate the list of tags. I honestly do not know which solution would be faster in practice.

Thanks, I think the simple join (your solution) will work faster. Thanks again!

@alexanoid: if the DataFrame unionDf3 is small enough, you can use a broadcast join to make it slightly faster.
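For reference, the alternative discussed in the comments above, grouping over every column except tag, written out as a self-contained snippet (it assumes the unionDf from the question, which still contains the duplicated rows and a single tag column):

import org.apache.spark.sql.functions.{col, collect_set}

// Group by all original columns and collect the tags, avoiding the extra
// UniqueID and the join, at the cost of grouping on wider rows.
val taggedUsers = unionDf
  .groupBy(unionDf.columns.diff(Seq("tag")).map(col): _*)
  .agg(collect_set("tag").alias("tags"))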