Scala: how to update a Spark dataframe column containing an array using a UDF

I have a dataframe:

+--------------------+------+
|people              |person|
+--------------------+------+
|[[jack, jill, hero]]|joker |
+--------------------+------+
Its schema is:

root
 |-- people: struct (nullable = true)
 |    |-- person: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |-- person: string (nullable = true)
Here, root--person is a string, so I can update this field with a UDF:

def updateString = udf((s: String) => {
    "Mr. " + s
})
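Applied, for instance, like this (the dataframe name `df` is an assumption; the question never names it):

```scala
// Apply the UDF to the top-level string column.
df.select(updateString($"person").as("person")).show(false)
```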
Output:

+---------+
|person   |
+---------+
|Mr. joker|
+---------+
I want to do the same for the root--people--person column, which contains an array of person names. How can I achieve this with a UDF?

def updateArray = udf((arr: Seq[Row]) => ???)
Expected:

+------------------------------+
|people                        |
+------------------------------+
|[Mr. hero, Mr. jack, Mr. jill]|
+------------------------------+
Edit: I also want to preserve the schema of root--people--person after updating it.

Expected schema for people:

df.select("people").printSchema()

root
 |-- people: struct (nullable = false)
 |    |-- person: array (nullable = true)
 |    |    |-- element: string (containsNull = true)

Thanks.

Let's create some data for testing:

scala> val data = Seq((List(Array("ja", "ji", "he")), "person")).toDF("people", "person")
data: org.apache.spark.sql.DataFrame = [people: array<array<string>>, person: string]

scala> data.printSchema
root
 |-- people: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- person: string (nullable = true)

You may need a few tweaks (I think hardly any), but this covers most of what you need to solve the problem: you only have to update the function and everything else stays the same. Here is the code snippet.
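Note that `df2` used below is never defined in the answer. One plausible construction (an assumption on my part) that produces the flattened shape shown, starting from the question's original `df`, would be:

```scala
// Flatten the original df: rename the string column to "people"
// and pull the nested array out of the struct into "person".
// (This df2 definition is a guess; the answer never shows it.)
val df2 = df.select($"person".as("people"), $"people.person".as("person"))
```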

scala> df2.show
+------+------------------+
|people|            person|
+------+------------------+
| joker|[jack, jill, hero]|
+------+------------------+
// just the column order is changed
I just updated your function: instead of Row, I am using Seq[String] here.

scala> def updateArray = udf((arr: Seq[String]) => arr.map(x=>"Mr."+x))
scala> df2.withColumn("test",updateArray($"person")).show(false)
+------+------------------+---------------------------+
|people|person            |test                       |
+------+------------------+---------------------------+
|joker |[jack, jill, hero]|[Mr.jack, Mr.jill, Mr.hero]|
+------+------------------+---------------------------+
// kept all the columns for testing purposes; you can drop the ones you don't want

Let me know if you want to know more.

The problem here is that the struct has only one field. In the UDF you need to return a `Tuple1`, and then further cast the UDF's output to keep the field name correct:

def updateArray = udf((r: Row) => Tuple1(r.getAs[Seq[String]](0).map(x=>"Mr."+x)))

val newDF = df
  .withColumn("people",updateArray($"people").cast("struct<person:array<string>>"))

newDF.printSchema()
newDF.show()
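As a side note (my addition, not part of the original answer): on Spark 3.0+ you can avoid the UDF entirely with the built-in `transform` higher-order function, rebuilding the struct so the field name is preserved. A sketch, assuming the same `df` as in the question:

```scala
import org.apache.spark.sql.functions.{concat, lit, struct, transform}

// Map over the nested array without a UDF, then re-wrap the result in a
// single-field struct so the schema keeps the "person" field name.
val newDF2 = df.withColumn(
  "people",
  struct(transform($"people.person", p => concat(lit("Mr. "), p)).as("person"))
)
```

Built-in functions like `transform` are generally preferable to UDFs where available, since they stay inside Catalyst and avoid serialization overhead.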

Thanks @Mahesh, this works like a charm, but I also want to preserve its schema. I have updated the question; please take a look and update the answer.
Your input is [jack, jill, hero] and you want the output to be [Mr. hero, Mr. jack, Mr. jill], right? That is not the right schema (input data).
scala> def arrayConcat(array:Seq[Seq[String]], str: String) = array.map(_.map(str + _))
arrayConcat: (array: Seq[Seq[String]], str: String)Seq[Seq[String]]

scala> val arrayConcatUDF = udf(arrayConcat _)
arrayConcatUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(ArrayType(StringType,true),true),Some(List(ArrayType(ArrayType(StringType,true),true), StringType)))
scala> data.withColumn("dasd", arrayConcatUDF($"people", lit("Mr."))).show(false)
+--------------------------+------+-----------------------------------+
|people                    |person|dasd                               |
+--------------------------+------+-----------------------------------+
|[WrappedArray(ja, ji, he)]|person|[WrappedArray(Mr.ja, Mr.ji, Mr.he)]|
+--------------------------+------+-----------------------------------+
The `Tuple1` + cast approach above preserves the schema, as its output shows:
root
 |-- people: struct (nullable = true)
 |    |-- person: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |-- person: string (nullable = true)


+--------------------+------+
|              people|person|
+--------------------+------+
|[[Mr.jack, Mr.jil...| joker|
+--------------------+------+