Replace words in a dataframe using a word list from another dataframe in Spark Scala
I have two dataframes in Spark Scala, say df1 and df2. df1 has two fields, "ID" and "Text", where "Text" holds a description (multiple words). I have already removed all special characters and numeric characters from the "Text" field, leaving only alphabets and spaces.

df1 sample:
+---+-----------------+
| ID|             Text|
+---+-----------------+
|  1| helo how are you|
|  2|       hai haiden|
|  3|     hw are u uma|
+---+-----------------+
I will demonstrate on the first id only, and assume that you cannot perform a collect on df2. First, you need to make sure that the text column on df1 is an array:
+---+--------------------+
| id| text|
+---+--------------------+
| 1|[helo, how, are, ...|
+---+--------------------+
with a schema like this:
|-- id: integer (nullable = true)
|-- text: array (nullable = true)
| |-- element: string (containsNull = true)
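The answer does not show how that array column is obtained; a plain-Scala sketch of the same split step (sample rows reconstructed from the question and the final output table, so treat them as hypothetical):

```scala
// Hypothetical sample rows mirroring df1; in Spark the analogous step would
// split the space-separated Text column into an array column.
val df1Rows = List((1, "helo how are you"), (2, "hai haiden"), (3, "hw are u uma"))
val arrayRows = df1Rows.map { case (id, text) => (id, text.split(" ").toList) }
// arrayRows.head == (1, List("helo", "how", "are", "you"))
```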
After that, you can explode the text column:
res1.withColumn("text", explode(res1("text")))
+---+----+
| id|text|
+---+----+
| 1|helo|
| 1| how|
| 1| are|
| 1| you|
+---+----+
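What explode does here can be sketched with plain Scala collections (sample data taken from the tables above): each (id, array) row becomes one row per word.

```scala
// One (id, words) row per description
val rows = List((1, List("helo", "how", "are", "you")))
// explode: one (id, word) pair per array element
val exploded = rows.flatMap { case (id, words) => words.map(word => (id, word)) }
// exploded == List((1,"helo"), (1,"how"), (1,"are"), (1,"you"))
```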
Assuming your replacement dataframe looks like this:
+----+-------+
|word|replace|
+----+-------+
|helo| hello|
| hai| hi|
+----+-------+
Joining the two dataframes will look like this:
res6.join(res8, res6("text") === res8("word"), "left_outer")
+---+----+----+-------+
| id|text|word|replace|
+---+----+----+-------+
| 1| you|null| null|
| 1| how|null| null|
| 1|helo|helo| hello|
| 1| are|null| null|
+---+----+----+-------+
Then perform a select with coalesce over the null values:
res26.select(res26("id"), coalesce(res26("replace"), res26("text")).as("replaced_text"))
+---+-------------+
| id|replaced_text|
+---+-------------+
| 1| you|
| 1| how|
| 1| hello|
| 1| are|
+---+-------------+
Then group by id and aggregate with the collect_list function:
res33.groupBy("id").agg(collect_list("replaced_text"))
+---+---------------------------+
| id|collect_list(replaced_text)|
+---+---------------------------+
| 1| [you, how, hello,...|
+---+---------------------------+
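The whole join/coalesce/collect_list pipeline can be sketched with plain Scala collections (replacement pairs taken from the df2 sample above; note that, unlike the Spark join shown, plain lists keep the word order):

```scala
val repl = Map("helo" -> "hello", "hai" -> "hi")
val exploded = List((1, "helo"), (1, "how"), (1, "are"), (1, "you"))
// left join + coalesce: take the replacement when the word matches, else the word itself
val replaced = exploded.map { case (id, word) => (id, repl.getOrElse(word, word)) }
// groupBy(id) + collect_list: regroup the replaced words per id
val collected = replaced.groupBy(_._1).map { case (id, ws) => (id, ws.map(_._2)) }
// collected(1) == List("hello", "how", "are", "you")
```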
Keep in mind that you should preserve the initial order of the text elements.

This is easier to achieve if you convert df2 into a Map. Assuming it is not a huge table, you can do the following:
val keyVal = df2.map(r => (r(0).toString, r(1).toString)).collect.toMap
This will give you a Map:
scala.collection.immutable.Map[String,String] = Map(helo -> hello, hai -> hi, hw -> how, u -> you)
Now you can create a UDF, getVal, that uses the keyVal Map to replace the values:
val getVal = udf[String, String](x => x.split(" ").map(w => keyVal.get(w).getOrElse(w)).mkString(" "))
Now you can call the UDF getVal on the dataframe to get the desired result:
df1.withColumn("text" , getVal(df1("text")) ).show
+---+-----------------+
| id| text|
+---+-----------------+
| 1|hello how are you|
| 2| hi haiden|
| 3| how are you uma|
+---+-----------------+
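The UDF body is just a word-by-word Map lookup; a minimal plain-Scala version of the same function (Map entries from the answer above) reproduces the output table:

```scala
val keyVal = Map("helo" -> "hello", "hai" -> "hi", "hw" -> "how", "u" -> "you")
// Same logic as the UDF body: replace each word if it is a key in the Map
def getVal(text: String): String =
  text.split(" ").map(w => keyVal.getOrElse(w, w)).mkString(" ")
// getVal("hw are u uma") == "how are you uma"
```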
I think the code below should solve your problem; I solved it by using RDDs:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, LongType, StringType, StructField, StructType}

// Explode each Text into (id, word) rows, zipping with a global index to preserve word order
val wordRdd = df1.rdd.flatMap { row =>
  row.getAs[String]("Text").split(" ").map(word => Row(row.getAs[Int]("id"), word))
}.zipWithIndex()

val wordDf = sqlContext.createDataFrame(
  wordRdd.map(x => Row.fromSeq(x._1.toSeq ++ Seq(x._2))),
  StructType(List(StructField("id", IntegerType), StructField("word", StringType), StructField("index", LongType))))

// Left join against df2, regroup by id, and rebuild each sentence in its
// original order, taking the replacement word when one exists
val opRdd = wordDf.join(df2, wordDf("word") === df2("word"), "left_outer").drop(df2("word")).rdd
  .groupBy(_.getAs[Int]("id"))
  .map { case (id, rows) =>
    Row(id, rows.toList.sortBy(_.getAs[Long]("index"))
      .map(r => Option(r.getAs[String]("Replace")).getOrElse(r.getAs[String]("word")))
      .mkString(" "))
  }

val opDF = sqlContext.createDataFrame(opRdd, StructType(List(StructField("id", IntegerType), StructField("Text", StringType))))
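The core of this approach is the zipWithIndex tag, which lets each sentence be rebuilt in its original word order after the join and groupBy scramble it. A plain-Scala sketch of that idea (sample words and replacement pairs taken from the examples above):

```scala
val repl = Map("helo" -> "hello", "hai" -> "hi", "hw" -> "how", "u" -> "you")
// (id, word) pairs tagged with a global index, as wordRdd does
val indexed = List((1, "helo"), (1, "how"), (2, "hai"), (2, "haiden")).zipWithIndex
val rebuilt = indexed
  .groupBy { case ((id, _), _) => id }
  .map { case (id, rows) =>
    // sort by the index tag before joining the words back into a sentence
    id -> rows.sortBy(_._2).map { case ((_, word), _) => repl.getOrElse(word, word) }.mkString(" ")
  }
// rebuilt(2) == "hi haiden"
```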
How big is df2?