
Scala: Mapping individual values in one dataframe to values in another dataframe


I have a dataframe (DF1) with two columns:

+-------+------+
|words  |value |
+-------+------+
|ABC    |1.0   |
|XYZ    |2.0   |
|DEF    |3.0   |
|GHI    |4.0   |
+-------+------+

Combining multiple dataframes to create a new column requires a join. Looking at your two dataframes, it seems we can join on df1's words column and df2's string column, but the string column first needs to be exploded and later recombined (which can be done by giving each row a unique id before exploding). monotonically_increasing_id gives each row of df2 a unique id, and the split function turns the string column into an array that can be exploded. Then you can join them. The next step is to merge the exploded rows back into the original rows by doing a groupBy with an aggregation.

Finally, you can use a udf function to turn the collected array column into the desired string column.
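
If you want to try the solution below in a spark-shell, here is a minimal sketch of how the two input dataframes could be created; df1 matches the table above, while df2's contents are an assumption inferred from the expected output shown further down:

import spark.implicits._   // assumes a SparkSession named spark, as in spark-shell

// df1 as shown above: each word with its numeric value
val df1 = Seq(("ABC", 1.0), ("XYZ", 2.0), ("DEF", 3.0), ("GHI", 4.0)).toDF("words", "value")

// Hypothetical df2: one space-separated sentence per row (assumed contents,
// chosen so that the expected output below matches)
val df2 = Seq("ABC DEF GHI", "XYZ ABC DEF").toDF("string")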

Long story short, the following solution should work for you:

import org.apache.spark.sql.functions._

// turns the collected array of doubles into a space-separated string
def arrayToString = udf((array: Seq[Double]) => array.mkString(" "))

df2.withColumn("rowId", monotonically_increasing_id())        // unique id per original row
  .withColumn("string", explode(split(col("string"), " ")))   // one word per row
  .join(df1, col("string") === col("words"))                   // look up each word's value in df1
  .groupBy("rowId")                                            // regroup the exploded rows
  .agg(collect_list("value").as("stringToDouble"))             // collect the matched values per row
  .select(arrayToString(col("stringToDouble")).as("stringToDouble"))
which should give you

+--------------+
|stringToDouble|
+--------------+
|1.0 3.0 4.0   |
|2.0 1.0 3.0   |
+--------------+
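
Two side notes on the solution above. First, collect_list does not guarantee the order of the collected values after the shuffle, so if the word order within each row matters you may want to use posexplode to carry each word's position through the join and sort the group by it before concatenating. Second, if you prefer to avoid a UDF, here is a sketch of a built-in alternative (assuming concat_ws, which accepts array columns once the doubles are cast to strings):

import org.apache.spark.sql.functions._

df2.withColumn("rowId", monotonically_increasing_id())
  .withColumn("string", explode(split(col("string"), " ")))
  .join(df1, col("string") === col("words"))
  .groupBy("rowId")
  .agg(collect_list("value").as("values"))
  // cast array<double> to array<string> so concat_ws can join it with spaces
  .select(concat_ws(" ", col("values").cast("array<string>")).as("stringToDouble"))
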
def createCorpus(conversationCorpus: Dataset[Row], dataDictionary: Dataset[Row]): Unit = {
  import spark.implicits._

  def getIndex(word: String): Double = {
    val idxRow = dataDictionary.selectExpr("index").where('words.like(word))
    val idx = idxRow.toString
    if (!idx.isEmpty) idx.trim.toDouble else 1.0
  }

  conversationCorpus.map { // eclipse doesn't like this map here.. throws an error..
    r =>
      def row = {
        val arr = r.getString(0).toLowerCase.split(" ")
        val arrList = ArrayBuffer[Double]()
        arr.map {
          str =>
            val index = getIndex(str)
        }
        Row.fromSeq(arrList.toSeq)
      }
      row
  }
}
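
The createCorpus attempt above runs into two problems: mapping a Dataset[Row] to Row needs an implicit Encoder (which is likely the error the IDE reports), and dataDictionary, being a Dataset itself, cannot be referenced inside another Dataset's map closure. A minimal sketch of one common workaround, assuming the dictionary has words and index columns and is small enough to collect to the driver:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Sketch of a reworked createCorpus: collect the (small) dictionary to the
// driver, broadcast it, and look words up in a plain Map inside a UDF.
// Assumes a SparkSession named spark is in scope, as in the question.
def createCorpus(conversationCorpus: DataFrame, dataDictionary: DataFrame): DataFrame = {
  // words -> index, collected on the driver (assumes "index" is numeric)
  val dict: Map[String, Double] = dataDictionary
    .select("words", "index")
    .collect()
    .map(r => r.getString(0).toLowerCase -> r.get(1).toString.toDouble)
    .toMap

  val bDict = spark.sparkContext.broadcast(dict)

  // replace every word of the first (text) column with its index,
  // defaulting to 1.0 like the original getIndex did
  val sentenceToIndices = udf { (sentence: String) =>
    sentence.toLowerCase.split(" ").map(w => bDict.value.getOrElse(w, 1.0))
  }

  conversationCorpus.withColumn("indices", sentenceToIndices(col(conversationCorpus.columns(0))))
}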