Change array values in PySpark
I have a PySpark DataFrame. Example df:
number | matricule<array> | name<array> |
----------------------------------------------
AA | [] | [7] |
----------------------------------------------
AA | [9] | [] |
----------------------------------------------
AA | [""] | [2] |
----------------------------------------------
AA | [2] | [""] |
But I get an error:
AnalysisException: u"cannot resolve, `matricule` = '[]')' due to data type mismatch: differing types.
Expected result:
number | matricule<array> | name<array> |
----------------------------------------------
AA | [] | [7] |
----------------------------------------------
AA | [9] | [] |
----------------------------------------------
AA | [] | [2] |
----------------------------------------------
AA | [2] | [] |
Can someone please help me?
Thanks.
DataFrame:
+------+---------+----+
|Number|Matricule|Name|
+------+---------+----+
| AA| [""]| [7]|
| AA| [9]| []|
| AA| [""]| [2]|
| AA| [2]|[""]|
+------+---------+----+
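Filter the "" values out of both columns (the higher-order filter function requires Spark 2.4+):
from pyspark.sql import functions as F

# Keep only the elements that are not the literal two-character string ""
df.withColumn("Matricule", F.expr("""filter(Matricule, x -> x != '""')"""))\
  .withColumn("Name", F.expr("""filter(Name, x -> x != '""')""")).show()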
+------+---------+----+
|Number|Matricule|Name|
+------+---------+----+
| AA| []| [7]|
| AA| [9]| []|
| AA| []| [2]|
| AA| [2]| []|
+------+---------+----+
As mentioned in the comments, you can also use array_remove:
df.withColumn("Matricule", F.array_remove("Matricule", '""'))\
.withColumn("Name", F.array_remove("Name", '""')).show()
+------+---------+----+
|Number|Matricule|Name|
+------+---------+----+
| AA| []| [7]|
| AA| [9]| []|
| AA| []| [2]|
| AA| [2]| []|
+------+---------+----+
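Do you want to replace the empty strings with another value, or remove them from the array entirely?
@blackishop Remove them and keep an empty array [].
If you are using Spark 2.4+, you can use array_remove like this: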
from pyspark.sql.functions import array_remove, col

df = df.withColumn("matricule_2", array_remove(col("matricule"), ""))