Apache Spark: read an array of strings from CSV as an array in PySpark
apache-spark, pyspark, pyspark-dataframes

I have a CSV file with data like this:
ID|Arr_of_Str
1|["ABC DEF"]
2|["PQR", "ABC DEF"]
I want to read this .csv file, but when I use sqlContext.read.load, it reads the column in as a string.
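For reference, a minimal read that reproduces this might look like the following (a sketch; the file name "data.csv" and the read options are assumptions based on the sample above):

# Assumed setup: pipe-delimited file with a header row; "data.csv" is a placeholder path
df = sqlContext.read.load(
    "data.csv",
    format="csv",
    sep="|",
    header=True,
    inferSchema=True
)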
Current:
df.printSchema()
root
 |-- ID: integer (nullable = true)
 |-- Arr_of_Str: string (nullable = true)
Expected:
df.printSchema()
root
 |-- ID: integer (nullable = true)
 |-- Arr_of_Str: array (nullable = true)
 |    |-- element: string (containsNull = true)
How can I cast the string to an array of strings?
Actually, you can simply use from_json to parse the Arr_of_Str column as an array of strings:
from pyspark.sql import functions as F

# Parse the JSON-encoded string into an array<string> column
df2 = df.withColumn(
    "Arr_of_Str",
    F.from_json(F.col("Arr_of_Str"), "array<string>")
)
df2.show(truncate=False)
#+---+--------------+
#|ID |Arr_of_Str    |
#+---+--------------+
#|1  |[ABC DEF]     |
#|2  |[PQR, ABC DEF]|
#+---+--------------+
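Equivalently, from_json also accepts a schema object instead of the DDL string, which can be handy if you already build schemas programmatically (a sketch along the same lines as the answer above, not a second method from it):

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# Same conversion, with ArrayType(StringType()) in place of the "array<string>" DDL string
df2 = df.withColumn(
    "Arr_of_Str",
    F.from_json(F.col("Arr_of_Str"), ArrayType(StringType()))
)

df2.printSchema()
# root
#  |-- ID: integer (nullable = true)
#  |-- Arr_of_Str: array (nullable = true)
#  |    |-- element: string (containsNull = true)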