Python: Exploding a column with an array of arrays - PySpark

python arrays apache-spark pyspark

I have a column with data like this:

[[[-77.1082606, 38.935738]] ,Point]

I would like it split out like this:

  column 1          column 2        column 3
 -77.1082606      38.935738           Point

How can I do this using PySpark or Scala (Databricks 3.0)? I know how to explode columns, but not how to split up these structs. Thanks.

Edit: here is the schema of the column:

 |-- geometry: struct (nullable = true)
 |    |-- coordinates: string (nullable = false)
 |    |-- type: string (nullable = false)

You can use regexp_replace() to strip the square brackets, and then split() to break the resulting string on the comma into separate columns:

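from pyspark.sql.functions import regexp_replace, split, col

# Strip the square brackets from the coordinates string, split what is left
# on the comma, then pull each array element out into its own column.
# The raw string keeps Python from treating \[ and \] as string escapes.
(df.select(regexp_replace(df.geometry.coordinates, r"[\[\]]", "").alias("coordinates"),
           df.geometry.type.alias("col3"))
   .withColumn("arr", split(col("coordinates"), ","))
   .select(col("arr")[0].alias("col1"),
           col("arr")[1].alias("col2"),
           "col3")
   .show(truncate=False))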
+-----------+----------+-----+
|col1       |col2      |col3 |
+-----------+----------+-----+
|-77.1082606| 38.935738|Point|
+-----------+----------+-----+
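
To check the pattern on a single value without a Spark session, the same two steps can be sketched in plain Python. This is only an illustration: the helper name split_coordinates is made up here, and re.sub / str.split stand in for Spark's regexp_replace() / split().

```python
import re
from typing import List


def split_coordinates(coordinates: str) -> List[str]:
    """Mimic the two steps on one string value (plain Python, no Spark).

    Hypothetical helper, not part of PySpark.
    """
    # Drop every '[' and ']' character, using the same character class
    # as the regexp_replace() call.
    no_brackets = re.sub(r"[\[\]]", "", coordinates)
    # Split on the comma; the second piece keeps its leading space.
    return no_brackets.split(",")


parts = split_coordinates("[[-77.1082606, 38.935738]]")
print(parts)  # ['-77.1082606', ' 38.935738']

# float() ignores surrounding whitespace, so both pieces convert cleanly.
lon, lat = (float(p) for p in parts)
print(lon, lat)  # -77.1082606 38.935738
```

Note that both pieces are still strings after the split, and the second one starts with a space, which is why col2 shows a leading blank in the output table.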

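What is the type of the array? Please post the result of printSchema.

I couldn't remember the syntax, you were faster :D +1. I'd suggest @AshleyO also give a +1 and accept :)

I should have been clearer: the data is all in one struct. I made an edit to show the information more clearly. I'm testing this approach to see whether it helps.

So you have ["[[-77.1082606, 38.935738]]", "Point"]?

Correct. One column.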