How to clean up data from a CSV file in PySpark
Sample name.csv data:
Name, ,Age, ,Class,
Diwakar,, ,25,, ,12,
, , , , ,
Prabhat, ,27, ,15,
Zyan, ,30, ,17,
Jack, ,35, ,21,
Reading the csv file:
names = spark.read.csv("name.csv", header="true", inferSchema="true")
names.show()
We get this as output, and we lose some of the data:
+-------+----+---+---+-----+----+
| Name| 1|Age| 3|Class| _c5|
+-------+----+---+---+-----+----+
|Diwakar|null| | 25| null| |
| | | | | |null|
|Prabhat| | 27| | 15|null|
| Zyan| | 30| | 17|null|
| Jack| | 35| | 21|null|
+-------+----+---+---+-----+----+
I would like to get output like the following:
+-------+---+---+---+-----+----+
| Name| 1|Age| 3|Class| _c5|
+-------+---+---+---+-----+----+
|Diwakar| | 25| | 12|null|
| | | | | |null|
|Prabhat| | 27| | 15|null|
| Zyan| | 30| | 17|null|
| Jack| | 35| | 21|null|
+-------+---+---+---+-----+----+
We can read all the fields by defining a schema with the same number of columns as the csv file, then passing that schema while reading the CSV; otherwise we lose the data in the Age and Class columns.
Example:
from pyspark.sql.functions import *
from pyspark.sql.types import *
# define a schema with the same number of columns as the csv file
sch=StructType([
StructField("Name", StringType(), True),
StructField("1", StringType(), True),
StructField("Age", StringType(), True),
StructField("3", StringType(), True),
StructField("Class", StringType(), True),
StructField("_c5", StringType(), True),
StructField("_c6", StringType(), True)
])
# read the csv file using the schema
df=spark.read.schema(sch).option("header",True).csv("name.csv")
df = (df
      .withColumn('Age', when(length(trim(col('Age'))) == 0, col('3')).otherwise(col('Age')))
      .withColumn('1', lit(""))
      .withColumn('3', lit(""))
      .withColumn('Class', when((col('Class').isNull()) | (lower(col('Class')) == 'null'), col('_c6'))
                  .when(length(trim(col('Class'))) == 0, lit("null"))
                  .otherwise(col('Class')))
      .withColumn('_c5', lit("null"))
      .drop("_c6"))
df.show()
#+-------+---+---+---+-----+----+
#| Name| 1|Age| 3|Class| _c5|
#+-------+---+---+---+-----+----+
#|Diwakar| | 25| | 12|null|
#| | | | | null|null|
#|Prabhat| | 27| | 15|null|
#| Zyan| | 30| | 17|null|
#| Jack| | 35| | 21|null|
#+-------+---+---+---+-----+----+
In my case there are 150 columns… Is there any other way to handle the stray (,) problem in the data?
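With that many columns, hand-writing a StructType is impractical. One alternative sketch (not from the answer above): pre-clean the raw CSV in plain Python before Spark reads it, dropping the filler columns that are blank in every row. `drop_blank_columns` is a hypothetical helper name, and this assumes the file is small enough for a one-off in-memory cleanup; rows that were shifted by extra commas would still need separate handling.

```python
import csv
import io

def drop_blank_columns(text):
    """Parse CSV text and drop every column that is blank in all rows."""
    rows = list(csv.reader(io.StringIO(text)))
    width = max(len(r) for r in rows)
    # pad ragged rows so every row has the same number of fields
    rows = [r + [""] * (width - len(r)) for r in rows]
    # keep only column indices with at least one non-blank value
    keep = [i for i in range(width) if any(r[i].strip() for r in rows)]
    return [[r[i] for i in keep] for r in rows]

sample = "Name, ,Age, ,Class\nPrabhat, ,27, ,15\nZyan, ,30, ,17\n"
print(drop_blank_columns(sample))
# → [['Name', 'Age', 'Class'], ['Prabhat', '27', '15'], ['Zyan', '30', '17']]
```

The cleaned rows can then be written back out (or passed to `spark.createDataFrame`) so Spark never sees the filler columns, regardless of how many columns the file has.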