Apache Spark: error when removing blank spaces from a Spark DataFrame in PySpark
I am reading a CSV file into a Spark DataFrame. Many columns in the CSV contain blank spaces (" ") that I want to remove. The CSV has 500 columns, so I cannot manually specify particular columns in the code. Sample data:
ADVANCE_TYPE  CHNG_DT   BU_IN
A             20190718  1
              20190728  2
              20190714
B             20190705
              20190724  4
Code:
But this code does not remove the blank spaces. Please help.

You can apply trim to all the required columns using a list comprehension.

Example:
from pyspark.sql.functions import col, length, trim

df = spark.createDataFrame([("   ", "12343", "   ", "9  ", "   0")])

# find the length of each column before trimming
expr = [length(col(col_name)).name('length' + col_name) for col_name in df.columns]
df.select(expr).show()
#+--------+--------+--------+--------+--------+
#|length_1|length_2|length_3|length_4|length_5|
#+--------+--------+--------+--------+--------+
#|       3|       5|       3|       3|       4|
#+--------+--------+--------+--------+--------+

# apply trim to all the DataFrame columns
expr = [trim(col(col_name)).name(col_name) for col_name in df.columns]
df1 = df.select(expr)
df1.show()
#+---+-----+---+---+---+
#| _1|   _2| _3| _4| _5|
#+---+-----+---+---+---+
#|   |12343|   |  9|  0|
#+---+-----+---+---+---+

# lengths of df1 columns after trimming
expr = [length(col(col_name)).name('length' + col_name) for col_name in df1.columns]
df1.select(expr).show()
#+--------+--------+--------+--------+--------+
#|length_1|length_2|length_3|length_4|length_5|
#+--------+--------+--------+--------+--------+
#|       0|       5|       0|       1|       1|
#+--------+--------+--------+--------+--------+
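Since the real DataFrame has 500 columns, the key point is that `df.columns` drives the comprehension, so no column is ever named by hand. A minimal pure-Python sketch of the same pattern (no Spark required; `str.strip()` stands in for Spark's `trim()`, and the column names and row values are made up for illustration):

```python
# Pure-Python sketch of "transform every column by name" via a comprehension.
# str.strip() plays the role of Spark's trim(); data here is hypothetical.
columns = ["_1", "_2", "_3", "_4", "_5"]
row = ("   ", "12343", "   ", "9  ", "   0")

# lengths before trimming, keyed by column name
before = {c: len(v) for c, v in zip(columns, row)}

# trim every column with one comprehension, regardless of how many there are
trimmed = {c: v.strip() for c, v in zip(columns, row)}

# lengths after trimming
after = {c: len(v) for c, v in trimmed.items()}

print(before)   # lengths of the raw values
print(after)    # lengths once whitespace is stripped
```

Note that, as the `show()` output above illustrates, trimming turns all-blank values into empty strings, not nulls; converting them to null would need an extra step.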