Replace words in a dataframe using a word list from another dataframe in Spark Scala
I have two dataframes in Spark Scala, say df1 and df2. df1 has two fields, "ID" and "Text", where "Text" holds a description (multiple words). I have already removed all special characters and numeric characters from the "Text" field, leaving only alphabets and spaces.

df1 sample:
+---+-----------------+
| ID|             Text|
+---+-----------------+
|  1| helo how are you|
|  2|       hai haiden|
|  3|     hw are u uma|
+---+-----------------+
I will demonstrate on the first id only, and assume that you cannot perform a collect on df2. First, you need to make sure that the text column on df1 is an array:
+---+--------------------+
| id| text|
+---+--------------------+
| 1|[helo, how, are, ...|
+---+--------------------+
with a schema like this:
|-- id: integer (nullable = true)
|-- text: array (nullable = true)
| |-- element: string (containsNull = true)
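The answer does not show how that array column is obtained; a plain-Scala sketch of the same split step (sample rows reconstructed from the question and the final output table, so treat them as hypothetical):

```scala
// Hypothetical sample rows mirroring df1; in Spark the analogous step would
// split the space-separated Text column into an array column.
val df1Rows = List((1, "helo how are you"), (2, "hai haiden"), (3, "hw are u uma"))
val arrayRows = df1Rows.map { case (id, text) => (id, text.split(" ").toList) }
// arrayRows.head == (1, List("helo", "how", "are", "you"))
```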
After that, you can explode the text column:
res1.withColumn("text", explode(res1("text")))
+---+----+
| id|text|
+---+----+
| 1|helo|
| 1| how|
| 1| are|
| 1| you|
+---+----+
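What explode does here can be sketched with plain Scala collections (sample data taken from the tables above): each (id, array) row becomes one row per word.

```scala
// One (id, words) row per description
val rows = List((1, List("helo", "how", "are", "you")))
// explode: one (id, word) pair per array element
val exploded = rows.flatMap { case (id, words) => words.map(word => (id, word)) }
// exploded == List((1,"helo"), (1,"how"), (1,"are"), (1,"you"))
```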
Assuming your replacement dataframe looks like this:
+----+-------+
|word|replace|
+----+-------+
|helo| hello|
| hai| hi|
+----+-------+
Joining the two dataframes will look like this:
res6.join(res8, res6("text") === res8("word"), "left_outer")
+---+----+----+-------+
| id|text|word|replace|
+---+----+----+-------+
| 1| you|null| null|
| 1| how|null| null|
| 1|helo|helo| hello|
| 1| are|null| null|
+---+----+----+-------+
Then perform a select with coalesce over the null values:
res26.select(res26("id"), coalesce(res26("replace"), res26("text")).as("replaced_text"))
+---+-------------+
| id|replaced_text|
+---+-------------+
| 1| you|
| 1| how|
| 1| hello|
| 1| are|
+---+-------------+
Then group by id and aggregate with the collect_list function:
res33.groupBy("id").agg(collect_list("replaced_text"))
+---+---------------------------+
| id|collect_list(replaced_text)|
+---+---------------------------+
| 1| [you, how, hello,...|
+---+---------------------------+
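The whole join/coalesce/collect_list pipeline can be sketched with plain Scala collections (replacement pairs taken from the df2 sample above; note that, unlike the Spark join shown, plain lists keep the word order):

```scala
val repl = Map("helo" -> "hello", "hai" -> "hi")
val exploded = List((1, "helo"), (1, "how"), (1, "are"), (1, "you"))
// left join + coalesce: take the replacement when the word matches, else the word itself
val replaced = exploded.map { case (id, word) => (id, repl.getOrElse(word, word)) }
// groupBy(id) + collect_list: regroup the replaced words per id
val collected = replaced.groupBy(_._1).map { case (id, ws) => (id, ws.map(_._2)) }
// collected(1) == List("hello", "how", "are", "you")
```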
Keep in mind that you should preserve the initial order of the text elements.

This is easier to achieve if you convert df2 into a Map. Assuming it is not a huge table, you can do the following:
val keyVal = df2.map(r => (r(0).toString, r(1).toString)).collect.toMap
This will give you a Map:
scala.collection.immutable.Map[String,String] = Map(helo -> hello, hai -> hi, hw -> how, u -> you)
Now you can create a UDF, getVal, that uses the keyVal Map to replace the values:
val getVal = udf[String, String](x => x.split(" ").map(w => keyVal.get(w).getOrElse(w)).mkString(" "))
Now you can call the UDF getVal on the dataframe to get the desired result:
df1.withColumn("text" , getVal(df1("text")) ).show
+---+-----------------+
| id| text|
+---+-----------------+
| 1|hello how are you|
| 2| hi haiden|
| 3| how are you uma|
+---+-----------------+
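The UDF body is just a word-by-word Map lookup; a minimal plain-Scala version of the same function (Map entries from the answer above) reproduces the output table:

```scala
val keyVal = Map("helo" -> "hello", "hai" -> "hi", "hw" -> "how", "u" -> "you")
// Same logic as the UDF body: replace each word if it is a key in the Map
def getVal(text: String): String =
  text.split(" ").map(w => keyVal.getOrElse(w, w)).mkString(" ")
// getVal("hw are u uma") == "how are you uma"
```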
I think the code below should solve your problem; I solved it by using RDDs:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, LongType, StringType, StructField, StructType}

// Explode each Text into (id, word) rows, zipping with a global index to preserve word order
val wordRdd = df1.rdd.flatMap { row =>
  row.getAs[String]("Text").split(" ").map(word => Row(row.getAs[Int]("id"), word))
}.zipWithIndex()

val wordDf = sqlContext.createDataFrame(
  wordRdd.map(x => Row.fromSeq(x._1.toSeq ++ Seq(x._2))),
  StructType(List(StructField("id", IntegerType), StructField("word", StringType), StructField("index", LongType))))

// Left join against df2, regroup by id, and rebuild each sentence in its
// original order, taking the replacement word when one exists
val opRdd = wordDf.join(df2, wordDf("word") === df2("word"), "left_outer").drop(df2("word")).rdd
  .groupBy(_.getAs[Int]("id"))
  .map { case (id, rows) =>
    Row(id, rows.toList.sortBy(_.getAs[Long]("index"))
      .map(r => Option(r.getAs[String]("Replace")).getOrElse(r.getAs[String]("word")))
      .mkString(" "))
  }

val opDF = sqlContext.createDataFrame(opRdd, StructType(List(StructField("id", IntegerType), StructField("Text", StringType))))
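The core of this approach is the zipWithIndex tag, which lets each sentence be rebuilt in its original word order after the join and groupBy scramble it. A plain-Scala sketch of that idea (sample words and replacement pairs taken from the examples above):

```scala
val repl = Map("helo" -> "hello", "hai" -> "hi", "hw" -> "how", "u" -> "you")
// (id, word) pairs tagged with a global index, as wordRdd does
val indexed = List((1, "helo"), (1, "how"), (2, "hai"), (2, "haiden")).zipWithIndex
val rebuilt = indexed
  .groupBy { case ((id, _), _) => id }
  .map { case (id, rows) =>
    // sort by the index tag before joining the words back into a sentence
    id -> rows.sortBy(_._2).map { case ((_, word), _) => repl.getOrElse(word, word) }.mkString(" ")
  }
// rebuilt(2) == "hi haiden"
```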
How big is df2?