Explode a column containing an array of arrays - PySpark
I have a column with data like this:
[[[-77.1082606, 38.935738]] ,Point]
I want it split out like this:
column 1 column 2 column 3
-77.1082606 38.935738 Point
How can this be done using PySpark or, alternatively, Scala (Databricks 3.0)? I know how to explode columns, but not how to split up these structs. Thanks.
EDIT: Here is the schema of the column:
 |-- geometry: struct (nullable = true)
 |    |-- coordinates: string (nullable = false)
 |    |-- type: string (nullable = false)
You can use regexp_replace() to strip the square brackets, then split() the resulting string on the comma into separate columns.
from pyspark.sql.functions import regexp_replace, split, col

# Strip the square brackets from the coordinates string, split it on the comma,
# then pull the two pieces out as separate columns next to the geometry type.
df.select(regexp_replace(df.geometry.coordinates, r"[\[\]]", "").alias("coordinates"),
          df.geometry.type.alias("col3")) \
  .withColumn("arr", split(col("coordinates"), ",")) \
  .select(col("arr")[0].alias("col1"),
          col("arr")[1].alias("col2"),
          "col3") \
  .show(truncate=False)
+-----------+----------+-----+
|col1       |col2      |col3 |
+-----------+----------+-----+
|-77.1082606| 38.935738|Point|
+-----------+----------+-----+
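To see what the two steps do to a single value, the same transformation can be sketched in plain Python, without Spark. This is only an illustration: the helper name `split_coordinates` is made up here, and `re.sub` and `str.split` stand in for `regexp_replace` and `split`. It also shows where the leading space in `col2` comes from.

```python
import re

def split_coordinates(raw):
    """Hypothetical helper: mirror the two Spark steps on one string."""
    # Step 1: remove every '[' and ']' character.
    no_brackets = re.sub(r"[\[\]]", "", raw)
    # Step 2: split on the comma; the space after the comma is kept.
    return no_brackets.split(",")

parts = split_coordinates("[[-77.1082606, 38.935738]]")
print(parts)  # ['-77.1082606', ' 38.935738']
```

If the leading space is unwanted, calling `.strip()` on each piece here, or wrapping the column in `trim()` from `pyspark.sql.functions` in the Spark version, should remove it.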
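What is the type of the array? Please post the result of printSchema.
I couldn't remember the syntax, you were faster :D +1, and I suggest @AshleyO also gives +1 and accepts :)
I should have been clearer: the data is all in one struct. I made an edit to show the information more clearly. I'm testing whether this approach helps.
So you have ["[[-77.1082606, 38.935738]]", "Point"]?
Correct. One column.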