How to append a string column to an array-of-strings column in Scala Spark without using a UDF?


I have a table with a column containing an array, like this -

Student_ID | Subject_List        | New_Subject
1          | [Mat, Phy, Eng]     | Chem
I want to append the new subject to the subject list and get the new list.

Creating the DataFrame -

val df = sc.parallelize(Seq((1, Array("Mat", "Phy", "Eng"), "Chem"))).toDF("Student_ID","Subject_List","New_Subject")
I have tried it with a UDF, as follows -

def append_list = (arr: Seq[String], s: String) => {
    arr :+ s
  }

val append_list_UDF = udf(append_list)

val df_new = df.withColumn("New_List", append_list_UDF($"Subject_List",$"New_Subject"))
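(For intuition: per row, this UDF just performs Scala's `:+` append on a sequence. A minimal plain-Scala check, outside Spark, using the same sample values:)

```scala
// The per-row logic wrapped by the UDF: append one element to a Seq with :+.
val subjects = Seq("Mat", "Phy", "Eng")
val withNew  = subjects :+ "Chem"
println(withNew.mkString("[", ", ", "]"))  // [Mat, Phy, Eng, Chem]
```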
Using the UDF, I can get the desired output -

Student_ID | Subject_List        | New_Subject | New_List
1          | [Mat, Phy, Eng]     | Chem        | [Mat, Phy, Eng, Chem]

Can we do this without a UDF? Thanks.

In Spark 2.4 or later, a combination of array and concat should do the trick:

import org.apache.spark.sql.functions.{array, concat}
import org.apache.spark.sql.Column

def append(arr: Column, col: Column) = concat(arr, array(col))

df.withColumn("New_List", append($"Subject_List",$"New_Subject")).show
+----------+---------------+-----------+--------------------+
|Student_ID|   Subject_List|New_Subject|            New_List|
+----------+---------------+-----------+--------------------+
|         1|[Mat, Phy, Eng]|       Chem|[Mat, Phy, Eng, C...|
+----------+---------------+-----------+--------------------+

I wouldn't expect a significant performance gain here, though.

But you are using a udf; my question was whether we can do it without a udf. — No udf is defined here;
append
is just an alias, and you can do without it: df.withColumn("New_List", concat($"Subject_List", array($"New_Subject")))
@sachav I tried your approach, but it gives this error - Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'concat(Subject_List, array(New_Subject))' due to data type mismatch: input to function concat should have been StringType or BinaryType, but it's [array&lt;string&gt;, array&lt;string&gt;]. — @GouherDanish That indicates you are using an outdated Spark version (2.3 or earlier). In that case, a udf is the only option. — Yes, you are right, I am on an older Spark version. Thanks for the solution, we will keep that in mind. I will accept it.
import org.apache.spark.sql.functions.{explode, collect_list}

val df = Seq(
  (1, Array("Mat", "Phy", "Eng"), "Chem"),
  (2, Array("Hindi", "Bio", "Eng"), "IoT"),
  (3, Array("Python", "R", "scala"), "C")
).toDF("Student_ID", "Subject_List", "New_Subject")
df.show(false)

// Explode the existing list into one row per subject, union in the new
// subject, then re-collect per student.
// Note: collect_list does not guarantee element order after a shuffle.
val final_df = df
  .withColumn("exploded", explode($"Subject_List"))
  .select($"Student_ID", $"exploded")
  .union(df.select($"Student_ID", $"New_Subject"))
  .groupBy($"Student_ID")
  .agg(collect_list($"exploded") as "Your_New_List")
final_df.show(false)
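The explode/union/group-by idea above can be sketched with plain Scala collections (no Spark required), which makes the mechanics easy to verify; two of the sample rows are reused here:

```scala
// Sample rows: (student id, subject list, new subject).
val rows = Seq(
  (1, Seq("Mat", "Phy", "Eng"), "Chem"),
  (2, Seq("Hindi", "Bio", "Eng"), "IoT")
)
// "explode": one (id, subject) pair per list element.
val exploded = rows.flatMap { case (id, subs, _) => subs.map(id -> _) }
// "union": append the (id, new subject) pairs.
val unioned = exploded ++ rows.map { case (id, _, n) => id -> n }
// "collect_list": group the subjects back per id.
val regrouped = unioned.groupBy(_._1).map { case (id, ps) => id -> ps.map(_._2) }
println(regrouped(1).mkString("[", ", ", "]"))  // [Mat, Phy, Eng, Chem]
```

Unlike Spark's collect_list, which gives no ordering guarantee after a shuffle, this local version preserves encounter order, so it only illustrates the shape of the transformation.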