
Lookup Table in Scala Spark


I have a DataFrame in Spark without an explicitly defined schema, and I want to use it as a lookup table. For example, the DataFrame below:

+------------------------------------------------------------------------+
|lookupcolumn                                                            |
+------------------------------------------------------------------------+
|[val1,val2,val3,val4,val5,val6]                                         |
+------------------------------------------------------------------------+
The schema looks like this:

 |-- lookupcolumn: struct (nullable = true)
 |    |-- key1: string (nullable = true)
 |    |-- key2: string (nullable = true)
 |    |-- key3: string (nullable = true)
 |    |-- key4: string (nullable = true)
 |    |-- key5: string (nullable = true)
 |    |-- key6: string (nullable = true)
I say the schema is "not explicitly defined" because the number of keys is unknown when the data is read, so I leave it to Spark to infer the schema.

Now, suppose I have another DataFrame with a column like this:

+-----------------+
|       datacolumn|
+-----------------+
|         key1    |
|         key3    |
|         key5    |
|         key2    |
|         key4    |
+-----------------+
I want the result to be:

+-----------------+
|     resultcolumn|
+-----------------+
|         val1    |
|         val3    |
|         val5    |
|         val2    |
|         val4    |
+-----------------+
I tried a UDF like this:

val get_val = udf((keyindex: String) => {
    val res = lookupDf.select($"lookupcolumn"(keyindex).alias("result"))
    res.head.toString
})
But it throws a NullPointerException.


Can anyone tell me what is wrong with the UDF? Is there a better/simpler way to do a lookup in Spark?

I assume the lookup table is quite small, in which case it makes more sense to collect it to the driver and convert it into a normal Scala Map, then use that Map inside a UDF. (The NullPointerException comes from referencing lookupDf inside your UDF: a DataFrame exists only on the driver and cannot be used from code that runs on the executors.) This can be done in several ways, for example:

val values = lookupDf.select("lookupcolumn.*").head.toSeq.map(_.toString)
val keys = lookupDf.select("lookupcolumn.*").columns
val lookup_map = keys.zip(values).toMap
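For the sample row shown earlier, the collected pieces would contain the following. This is a minimal sketch that uses plain Scala collections as stand-ins for the results of `.columns` and `.head` on the actual DataFrame:

```scala
// Hypothetical stand-ins for what lookupDf.select("lookupcolumn.*")
// would yield for the example row [val1, ..., val6]
val keys   = Array("key1", "key2", "key3", "key4", "key5", "key6") // from .columns
val values = Seq("val1", "val2", "val3", "val4", "val5", "val6")   // from .head.toSeq.map(_.toString)

// zip pairs column names with row values, giving a driver-side lookup Map
val lookup_map = keys.zip(values).toMap
```

Because the map now lives on the driver as an ordinary Scala value, it is captured in the UDF's closure and shipped to the executors, which is why this approach avoids the NullPointerException.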
With the lookup_map variable above, the UDF is simply:

val lookup = udf((key: String) => lookup_map.get(key))
The final DataFrame can then be obtained with:

val df2 = df.withColumn("resultcolumn", lookup($"datacolumn"))

Does your lookup DataFrame have just one row, or multiple rows?

It has only one row. I thought that if I could explode it into multiple rows, with the keys and values in separate columns, I could then do a join, but I don't know how to do that.

Hmm, no. resultcolumn holds the values and datacolumn holds the keys.

Thanks, this works. But is there a way to make the UDF return null when the key is not in the table? Currently it throws an error.

@PramodKumar: Yes, that's possible. I changed the udf slightly; it should now return null when the key does not exist. You can also return a default value by changing get() to getOrElse().
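The two variants mentioned in the last comment can be sketched as plain Scala closures, i.e. the functions that udf(...) would wrap (the names here are illustrative, not from the original post):

```scala
// A small driver-side lookup map, as built in the answer above
val lookup_map = Map("key1" -> "val1", "key2" -> "val2")

// Option-returning lookup: Spark writes None into the column as null,
// so missing keys no longer throw
val nullSafeLookup: String => Option[String] = key => lookup_map.get(key)

// Default-value variant: getOrElse substitutes a fallback for missing keys
val defaultLookup: String => String = key => lookup_map.getOrElse(key, "default")
```

Either closure can be passed to udf(...) unchanged; the only difference is whether missing keys become null or a chosen default string in the result column.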