Python: Casting a string to ArrayType(DoubleType) in a PySpark DataFrame


I have a DataFrame in Spark in which the column activity is a string. Sample content:

{1.33,0.567,1.897,0,0.78}

I need to cast the column activity to ArrayType(DoubleType).

To do this, I ran the following command:

df = df.withColumn("activity",split(col("activity"),",\s*").cast(ArrayType(DoubleType())))

The DataFrame's schema changed accordingly:

StructType(List(StructField(id,StringType,true),
StructField(daily_id,StringType,true),
StructField(activity,ArrayType(DoubleType,true),true)))

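However, the data now looks like this: [NULL, 0.567, 1.897, 0, NULL]

The first and last elements of the string array have been changed to NULL. I don't understand why Spark does this to the DataFrame.

Can you help me figure out what the problem is?

Many thanks.

This happens because the code below does not remove the { and } characters: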

df.withColumn("activity",F.split(F.col("activity"),",\s*")).show(truncate=False)
+-------------------------------+
|activity                       |
+-------------------------------+
|[{1.33, 0.567, 1.897, 0, 0.78}]|
+-------------------------------+

When you try to cast the string values {1.33 and 0.78} to DoubleType, you get null as the output:

df.withColumn("activity",F.split(F.col("activity"),",\s*").cast(ArrayType(DoubleType()))).show(truncate=False)
+----------------------+
|activity              |
+----------------------+
|[, 0.567, 1.897, 0.0,]|
+----------------------+
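
To see why only the first and last elements are lost, the split-then-cast step can be imitated in plain Python. This is only a rough sketch: to_double_or_none is a hypothetical helper that stands in for a cast that yields null when a string cannot be parsed as a double, as the output above shows. It is not Spark's actual implementation.

```python
import re


def to_double_or_none(text):
    """Hypothetical stand-in for a cast to double: None instead of an error."""
    try:
        return float(text)
    except ValueError:
        return None


raw = "{1.33,0.567,1.897,0,0.78}"

# Splitting on commas leaves the braces attached to the outer elements.
parts = re.split(r",\s*", raw)
print(parts)    # ['{1.33', '0.567', '1.897', '0', '0.78}']

# Only the elements that still carry a brace fail to parse.
values = [to_double_or_none(p) for p in parts]
print(values)   # [None, 0.567, 1.897, 0.0, None]
```

The inner elements parse cleanly because they contain nothing but digits and a dot, so the braces have to be removed before the split or the cast.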

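Change this:

df.withColumn("activity",split(col("activity"),",\s*").cast(ArrayType(DoubleType())))

to:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType
from pyspark.sql.types import DoubleType
df.select(F.split(F.regexp_replace(F.col("activity"), "[{}]", ""), ",").cast("array<double>").alias("activity"))

This happens because the first and last characters of the string are the braces themselves, so the elements that contain them are converted to null.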


from pyspark.sql import functions as f, types as t
testdf.withColumn('activity',f.split(f.col('activity').substr(f.lit(2),f.length(f.col('activity'))-2),',').cast(t.ArrayType(t.DoubleType()))).show(2, False)
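
The substr call above keeps everything except the first and last character. Spark's substr takes a 1-based start position and a length, so the arguments 2 and length - 2 correspond to the Python slice [1:-1]. Below is a minimal plain-Python sketch of the same idea. parse_activity is a hypothetical helper name, and the sketch assumes every value is wrapped in exactly one pair of braces.

```python
def parse_activity(text):
    """Hypothetical helper: drop the outer braces, then parse each element."""
    start, length = 2, len(text) - 2             # Spark-style: 1-based start, length
    inner = text[start - 1:start - 1 + length]   # same as text[1:-1]
    return [float(item) for item in inner.split(",")]


print(parse_activity("{1.33,0.567,1.897,0,0.78}"))
# [1.33, 0.567, 1.897, 0.0, 0.78]
```

Because the slice is purely positional, it would also cut off the first and last character of a value that has no braces, so this approach only fits data where the wrapping is guaranteed.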

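Try this (Scala):

// Assumes a SparkSession named spark is in scope.
import org.apache.spark.sql.functions.{regexp_replace, split}
import spark.implicits._

val df = Seq("{1.33,0.567,1.897,0,0.78}").toDF("activity")
df.show(false)
df.printSchema()
/**
 * +-------------------------+
 * |activity                 |
 * +-------------------------+
 * |{1.33,0.567,1.897,0,0.78}|
 * +-------------------------+
 *
 * root
 *  |-- activity: string (nullable = true)
 */
// Remove everything except digits, dots and commas, then split and cast.
val processedDF = df.withColumn("activity",
  split(regexp_replace($"activity", "[^0-9.,]", ""), ",").cast("array<double>"))
processedDF.show(false)
processedDF.printSchema()
/**
 * +-------------------------------+
 * |activity                       |
 * +-------------------------------+
 * |[1.33, 0.567, 1.897, 0.0, 0.78]|
 * +-------------------------------+
 *
 * root
 *  |-- activity: array (nullable = true)
 *  |    |-- element: double (containsNull = true)
 */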
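
One thing to keep in mind with the pattern [^0-9.,] is that it removes every character other than digits, dots and commas, which includes minus signs and exponent markers. The sample data contains neither, so the answer above works for it. The plain-Python sketch below only illustrates the pattern's effect using the re module. The second pattern is an assumption about what one might want to keep, not part of the original answer.

```python
import re

DIGITS_ONLY = r"[^0-9.,]"        # pattern used in the answer above
KEEP_SIGNS = r"[^0-9eE+\-.,]"    # assumed variant: also keeps signs and exponents

sample = "{-1.5,2e3,0.78}"       # made-up input with a sign and an exponent

print(re.sub(DIGITS_ONLY, "", sample))  # 1.5,23,0.78
print(re.sub(KEEP_SIGNS, "", sample))   # -1.5,2e3,0.78
```

With the first pattern, -1.5 silently becomes 1.5 and 2e3 becomes 23, so for data that may contain such values, removing only the braces is the safer choice.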

A simple way using Spark SQL (no regular expressions):

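from pyspark.sql.functions import expr

# Drop the first and last character, split on commas,
# then convert each element to double with transform().
df2 = (df1
     .withColumn('col1', expr("""
     transform(
     split(
     substring(activity, 2, length(activity) - 2), ','),
     x -> DOUBLE(x))
     """))
    )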


