Apache Spark: read an array of strings from a CSV as an array in PySpark

Tags: apache-spark, pyspark, pyspark-dataframes

I have a CSV file containing data like the following:

ID|Arr_of_Str
1|["ABC DEF"]
2|["PQR", "ABC DEF"]

I want to read this .csv file, but when I use sqlContext.read.load, it reads Arr_of_Str in as a plain string.
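For reference, a minimal read that reproduces the schema below might look like this (a sketch: the question does not show the exact load call, so the file name, separator, and header options here are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed read: pipe-delimited file with a header row.
# inferSchema types ID as integer, but Arr_of_Str stays a plain string.
df = spark.read.csv("data.csv", sep="|", header=True, inferSchema=True)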

Current:

df.printSchema()

root
 |-- ID: integer (nullable = true)
 |-- Arr_of_Str: string (nullable = true)

Expected:

df.printSchema()

root
 |-- ID: integer (nullable = true)
 |-- Arr_of_Str: array (nullable = true)
 |    |-- element: string (containsNull = true)

How can I cast this string into an array of strings?

Actually, you can simply use from_json to parse the Arr_of_Str column as an array of strings:

from pyspark.sql import functions as F

# Parse the JSON-formatted string column into an array<string> column.
df2 = df.withColumn(
    "Arr_of_Str",
    F.from_json(F.col("Arr_of_Str"), "array<string>")
)

df2.show(truncate=False)

#+---+--------------+
#|ID |Arr_of_Str    |
#+---+--------------+
#|1  |[ABC DEF]     |
#|2  |[PQR, ABC DEF]|
#+---+--------------+
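
Equivalently, if you prefer an explicit schema object over the DDL string, from_json also accepts a DataType. A sketch of the same parse:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# Same result as above, with an explicit schema object
# instead of the "array<string>" DDL string.
df2 = df.withColumn(
    "Arr_of_Str",
    F.from_json(F.col("Arr_of_Str"), ArrayType(StringType()))
)

df2.printSchema()
# root
#  |-- ID: integer (nullable = true)
#  |-- Arr_of_Str: array (nullable = true)
#  |    |-- element: string (containsNull = true)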