Apache Spark: Error removing whitespace from a Spark DataFrame in PySpark


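I am reading a CSV file into a Spark DataFrame. Many columns in the CSV contain blank spaces (" "), and I want to remove them. The CSV has 500 columns, so I cannot specify individual columns by hand in the code.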

Sample data:

  ADVANCE_TYPE  CHNG_DT    BU_IN
     A          20190718    1
                20190728    2 
                20190714     
     B          20190705     
                20190724    4 
Code:


But this code does not remove the blank spaces. Please help.

You can use a list comprehension to apply trim to all the required columns.

Example:

from pyspark.sql.functions import col, length, trim

# `spark` is the active SparkSession (predefined in the pyspark shell)
df = spark.createDataFrame([("   ", "12343", "   ", "9  ", "   0")])

# length of each column before trimming
expr = [length(col(col_name)).name('length' + col_name) for col_name in df.columns]

df.select(expr).show()
#+--------+--------+--------+--------+--------+
#|length_1|length_2|length_3|length_4|length_5|
#+--------+--------+--------+--------+--------+
#|       3|       5|       3|       3|       4|
#+--------+--------+--------+--------+--------+

# apply trim to every column of df, keeping the original column names
expr = [trim(col(col_name)).name(col_name) for col_name in df.columns]

df1 = df.select(expr)
df1.show()
#+---+-----+---+---+---+
#| _1|   _2| _3| _4| _5|
#+---+-----+---+---+---+
#|   |12343|   |  9|  0|
#+---+-----+---+---+---+

# length of each column of df1 after trimming
expr = [length(col(col_name)).name('length' + col_name) for col_name in df1.columns]
df1.select(expr).show()
#+--------+--------+--------+--------+--------+
#|length_1|length_2|length_3|length_4|length_5|
#+--------+--------+--------+--------+--------+
#|       0|       5|       0|       1|       1|
#+--------+--------+--------+--------+--------+
