Replace words in a dataframe using a word list from another dataframe in Spark Scala

Tags: Scala, Apache Spark, Spark Dataframe

I have two dataframes, say df1 and df2, in Spark Scala.

df1 has two fields, "ID" and "Text", where "Text" holds a description (multiple words). I have already removed all special characters and numeric characters from the "Text" field, leaving only letters and spaces.

df1 sample:

+---+-----------------+
| ID|             Text|
+---+-----------------+
|  1| helo how are you|
|  2|       hai haiden|
|  3|     hw are u uma|
+---+-----------------+

--------------------------------------
I will demonstrate on the first id only, and assume you cannot perform a collect operation on df2. First, you need to make sure the schema of your dataframe has an array type for the text column on df1:

+---+--------------------+
| id|                text|
+---+--------------------+
|  1|[helo, how, are, ...|
+---+--------------------+
For a schema like this:

 |-- id: integer (nullable = true)
 |-- text: array (nullable = true)
 |    |-- element: string (containsNull = true)
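An array column like this can be produced from the original string column with Spark's split function; a minimal sketch, assuming the original column is named Text:

```scala
// The Spark side of this step might look like (hedged sketch, not from the answer):
//   import org.apache.spark.sql.functions.{col, split}
//   val res1 = df1.withColumn("text", split(col("Text"), " "))
// Per row, split applies the same logic as plain String.split:
val words: Seq[String] = "helo how are you".split(" ").toSeq
```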
After that, you can explode on the text column:

res1.withColumn("text", explode(res1("text")))

+---+----+
| id|text|
+---+----+
|  1|helo|
|  1| how|
|  1| are|
|  1| you|
+---+----+
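Row-wise, explode behaves like a flatMap that emits one row per array element, repeating the other columns; in plain Scala terms:

```scala
// One (id, words) row becomes one (id, word) row per array element.
val row = (1, Seq("helo", "how", "are", "you"))
val exploded = row._2.map(w => (row._1, w))
```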
Assuming your replacement dataframe looks like this:

+----+-------+
|word|replace|
+----+-------+
|helo|  hello|
| hai|     hi|
+----+-------+
Joining the two dataframes will look like this:

res6.join(res8, res6("text") === res8("word"), "left_outer")

+---+----+----+-------+
| id|text|word|replace|
+---+----+----+-------+
|  1| you|null|   null|
|  1| how|null|   null|
|  1|helo|helo|  hello|
|  1| are|null|   null|
+---+----+----+-------+
Perform a select, coalescing the null values:

res26.select(res26("id"), coalesce(res26("replace"), res26("text")).as("replaced_text"))

+---+-------------+
| id|replaced_text|
+---+-------------+
|  1|          you|
|  1|          how|
|  1|        hello|
|  1|          are|
+---+-------------+
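Per row, coalesce simply returns its first non-null argument, which is why unmatched words fall back to the original text; in plain Scala:

```scala
// coalesce(replace, text): the first non-null argument wins, so words with
// no match in the replacement table (null replace) keep their original text.
def coalesce2(replace: String, text: String): String =
  if (replace != null) replace else text

val kept     = coalesce2(null, "you")
val replaced = coalesce2("hello", "helo")
```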
Then group by id and aggregate with the collect_list function:

res33.groupBy("id").agg(collect_list("replaced_text"))

+---+---------------------------+
| id|collect_list(replaced_text)|
+---+---------------------------+
|  1|       [you, how, hello,...|
+---+---------------------------+

Bear in mind that you should preserve the initial order of the text elements.
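The answer does not show how to preserve that order. One option (my assumption, not part of the original answer) is posexplode, which emits a pos column alongside each element; carrying pos through the join lets you sort before collecting. The driver-side equivalent of that final sort-by-position step:

```scala
// Sort collected (position, word) pairs by position, then reassemble the text.
// In Spark this might look like posexplode(col("text")) plus sorting the
// collected (pos, replaced_text) structs before concatenation (hypothetical names).
val collected = Seq((2, "are"), (0, "hello"), (3, "you"), (1, "how"))
val orderedText = collected.sortBy(_._1).map(_._2).mkString(" ")
```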

This is easier to achieve if you convert df2 to a map. Assuming it is not a huge table, you can do the following:

val keyVal = df2.map( r =>( r(0).toString, r(1).toString ) ).collect.toMap
This will give you a Map to look up against:

scala.collection.immutable.Map[String,String] = Map(helo -> hello, hai -> hi, hw -> how, u -> you)
Now you can create a UDF that uses the keyVal map to replace values:

val getVal = udf[String, String] (x => x.split(" ").map(w => keyVal.getOrElse(w, w)).mkString(" "))
Now you can call the UDF getVal on the dataframe to get the desired result:

df1.withColumn("text" , getVal(df1("text")) ).show


+---+-----------------+
| id|             text|
+---+-----------------+
|  1|hello how are you|
|  2|        hi haiden|
|  3|  how are you uma|
+---+-----------------+
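For reference, the per-row logic of the getVal UDF can be checked in plain Scala with the same map:

```scala
// Same lookup-or-keep logic as the getVal UDF, outside Spark:
val keyVal = Map("helo" -> "hello", "hai" -> "hi", "hw" -> "how", "u" -> "you")
def replaceWords(s: String): String =
  s.split(" ").map(w => keyVal.getOrElse(w, w)).mkString(" ")

val out = replaceWords("hw are u uma")  // -> "how are you uma"
```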

I think the code below should solve your problem.

I solved this by using RDDs:

// Explode each Text into (id, word) rows, keeping a global index for ordering.
val wordRdd = df1.rdd.flatMap { row =>
  val wordList = row.getAs[String]("Text").split(" ").toList
  wordList.map(word => Row.fromTuple((row.getAs[Int]("id"), word)))
}.zipWithIndex()

val wordSchema = StructType(List(StructField("id", IntegerType),
  StructField("word", StringType), StructField("index", LongType)))
val wordDf = sqlContext.createDataFrame(
  wordRdd.map(x => Row.fromSeq(x._1.toSeq ++ Seq(x._2))), wordSchema)

// Left-join against df2, then regroup by id, restore word order via the
// index, and substitute the replacement where one was found.
val opRdd = wordDf.join(df2, wordDf("word") === df2("word"), "left_outer")
  .drop(df2("word")).rdd.groupBy(_.getAs[Int]("id"))
  .map { case (id, rows) =>
    val text = rows.toList.sortBy(_.getAs[Long]("index"))
      .map(r => if (r.getAs[String]("Replace") != null) r.getAs[String]("Replace")
                else r.getAs[String]("word"))
      .mkString(" ")
    Row.fromTuple((id, text))
  }

val opDF = sqlContext.createDataFrame(opRdd,
  StructType(List(StructField("id", IntegerType), StructField("Text", StringType))))

How big is df2?