Apache Spark: Unable to split a column into multiple columns in a Spark DataFrame


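I am unable to split a column into multiple columns in a Spark DataFrame via an RDD. I tried some other code, but it only works for a fixed number of columns. For example:

The data types are

name: string, city: list(string)

I have a text file with the following input data: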

Name, city

A, (hyd,che,pune)

B, (che,bang,del)

C, (hyd)

The desired output is:

A,hyd 

A,che

A,pune

B,che

B,bang

B,del

C,hyd

After reading the text file and converting it to a DataFrame, the DataFrame looks like this:

scala> data.show
+----------------+
|           value|
+----------------+
|      Name, city|
|A,(hyd,che,pune)|
|B,(che,bang,del)|
|         C,(hyd)|
|  D,(hyd,che,tn)|
+----------------+

You can use the explode function (from org.apache.spark.sql.functions) on the DataFrame:

val explodeDF = inputDF.withColumn("city", explode($"city")) // keep the DataFrame; show() returns Unit
explodeDF.show()
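To make the behaviour of explode concrete, here is a minimal sketch. It assumes a local SparkSession and a small made-up DataFrame whose city column is already an array; the names spark, sampleDF and explodedDF, and the sample rows, are assumptions for illustration and not part of the original post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("explode-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data in which city is already array<string>.
val sampleDF = Seq(
  ("A", Seq("hyd", "che", "pune")),
  ("C", Seq("hyd"))
).toDF("name", "city")

// explode emits one output row per array element, repeating the other columns.
val explodedDF = sampleDF.withColumn("city", explode($"city"))
explodedDF.show()
```

This only works when city really is an array column. If the whole line was loaded as a single string, the string has to be parsed first.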

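Now that I know you are loading each whole line as a string, here is a solution that produces the desired output.

I defined two user-defined functions: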

// Split each input line on the first comma into two elements, which become the two columns (name, city)
val split_to_two_strings: String => Array[String] = _.split(",", 2)
// Strip "(" and ")", then convert the remainder to an array of cities
val custom_conv_to_Array: String => Array[String] = _.stripPrefix("(").stripSuffix(")").split(",")

import org.apache.spark.sql.functions.{explode, trim, udf}
val custom_conv_to_ArrayUDF = udf(custom_conv_to_Array)
val split_to_two_stringsUDF = udf(split_to_two_strings)
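The parsing logic of the two functions above can be checked in plain Scala, without Spark, on a single sample line. This is only a sketch: the names splitToTwoStrings, convToArray, parts, name and cities are hypothetical and chosen for illustration.

```scala
// Same logic as the two functions above, exercised on one sample line.
val splitToTwoStrings: String => Array[String] = _.split(",", 2)
val convToArray: String => Array[String] =
  _.stripPrefix("(").stripSuffix(")").split(",")

val parts = splitToTwoStrings("A, (hyd,che,pune)")
// parts(0) is the name; parts(1) still has a leading space, so it must be trimmed
// before stripPrefix can remove the opening parenthesis.
val name = parts(0)
val cities = convToArray(parts(1).trim)

println(name)                 // prints A
println(cities.mkString("|")) // prints hyd|che|pune
```

The limit of 2 passed to split matters here: it splits on the first comma only, so the commas inside the parentheses stay in the second element.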


val outputDF = inputDF.withColumn("tmp", split_to_two_stringsUDF($"value"))
  .select($"tmp".getItem(0).as("Name"), trim($"tmp".getItem(1)).as("city_list"))
  .withColumn("city_array", custom_conv_to_ArrayUDF($"city_list"))
  .drop($"city_list")
  .withColumn("city", explode($"city_array"))
  .drop($"city_array")

outputDF.show()
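As an alternative sketch, the same result can probably be reached with Spark's built-in functions instead of UDFs. This is not the approach from the answer above; it assumes a local SparkSession and a hypothetical stand-in DataFrame named linesDF with a single value column, and it also filters out the "Name, city" header line, which the pipeline above would otherwise treat as a data row.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, regexp_extract, split, trim}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("split-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the DataFrame read from the text file (single `value` column).
val linesDF = Seq(
  "Name, city",
  "A, (hyd,che,pune)",
  "B, (che,bang,del)",
  "C, (hyd)"
).toDF("value")

val resultDF = linesDF
  // Keep only data lines; the header line contains no parentheses.
  .filter($"value".contains("("))
  .select(
    // Everything before the first comma is the name.
    trim(regexp_extract($"value", "^([^,]+),", 1)).as("Name"),
    // Everything between the parentheses is the comma-separated city list.
    split(regexp_extract($"value", "\\((.*)\\)", 1), ",").as("city_array"))
  // One output row per city.
  .select($"Name", explode($"city_array").as("city"))

resultDF.show()
```

Built-in column functions are generally easier for Spark to optimize than UDFs, which is the main reason to consider this variant.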

Hope this helps.

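Comments:

Possible duplicate. Try this.

No, the dataset here is an array, not a tuple.

Are the column names in your data correct? How did you get a tuple in a Scala DataFrame? You can get a Spark struct or an array. Can you provide the schema of the DataFrame?

I have already tried that and it does not work. The data here is an array. When we read this data into Spark as a text file, data.show displays the following rows: Name, city | A,(hyd,che,pune) | B,(che,bang,del) | C,(hyd) | D,(hyd,che,tn)

Can you run data.printSchema() and provide the result?

scala> data.printSchema
root
 |-- name: string (nullable = true)

OK, that means you are loading each whole line as a string. Your post says "the data types are name: string, city: list(string)", so I assumed city was a separate column of array or list type, as you mentioned. You need to parse the string (in your case, the name column) into two columns, name and city (which should be of Array type), and only then can you use explode.

@user11789810, I have updated my answer, please check.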