Apache Spark: posexplode fails in withColumn


How do I use posexplode in a Spark withColumn statement?

Seq(Array(1,2,3)).toDF.select(col("*"), posexplode(col("value")) as Seq("position", "value")).show

works fine, but:

Seq(Array(1,2,3)).toDF.withColumn("foo", posexplode(col("value"))).show

fails with:

org.apache.spark.sql.AnalysisException: The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 2 aliases but got foo ;

I am not sure this is really what you want, but you could try a select statement instead of withColumn, for example:

df.select('col1', 'col2', F.posexplode('col_to_be_exploded'))

The withColumn function does not seem to work with posexplode. You can use something like the following instead:

df.select($"*", posexplode($"value").as(List("index", "column")))
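To make the alias count in the error message concrete, here is a plain-Python model of what posexplode produces for each input row. It is only an illustration, not Spark code: the helper name posexplode_rows is hypothetical, and the field names pos and col are Spark's default output names for an array input.

```python
# Plain-Python model of posexplode semantics (illustrative only, not Spark's implementation).
def posexplode_rows(rows, column):
    """Yield one output row per array element, adding a `pos` and a `col` field."""
    for row in rows:
        for pos, value in enumerate(row[column]):
            # Every element contributes TWO new fields, so two aliases are needed.
            yield {**row, "pos": pos, "col": value}


rows = [{"value": [1, 2, 3]}]
exploded = list(posexplode_rows(rows, "value"))
for r in exploded:
    print(r)
```

Because every element yields both a position and a value, a single name such as "foo" cannot label the result, which is consistent with the "expected 2 aliases" error above.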

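Instead of using withColumn(), you can select all columns of the data frame and append the result of posexplode(), including aliases for the pos and col fields. Here is an example using PySpark: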

from pyspark.sql import functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [(["a"], ), (["b", "c"], ), (["d", "e", "f"], )],
    ["A"],
)

df.show()
# +---------+
# |        A|
# +---------+
# |      [a]|
# |   [b, c]|
# |[d, e, f]|
# +---------+

df = df.select("*", F.posexplode("A").alias("B", "C"))
df.show()
# +---------+---+---+
# |        A|  B|  C|
# +---------+---+---+
# |      [a]|  0|  a|
# |   [b, c]|  0|  b|
# |   [b, c]|  1|  c|
# |[d, e, f]|  0|  d|
# |[d, e, f]|  1|  e|
# |[d, e, f]|  2|  f|
# +---------+---+---+

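You are missing an alias for the position; you only specified the value.

However, the SQL DSL function signature def posexplode(e: Column) does not allow a second column to be added! Is this a bug in the SQL DSL? Note that I am using Spark 2.2.2.

This has nothing to do with the posexplode signature. withColumn is simply designed to work only with functions that create a single column, which is clearly not the case here.

OK. So if I try to call it in a select with a dynamically constructed column list, such as:

grouped.select(df.filter(!_.equals("")).map(col(_)): _*, posexplode(""))

it also fails, because nothing else is allowed before or after the : _* operator. The types (col) do not match either, so I cannot concatenate the sequences.

This issue is tracked by a JIRA ticket, and the last comment there provides a workaround:

df.selectExpr("*", "posexplode(s) as (p, c)").drop("s")

In fact, I am currently using a similar workaround.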