Spark CSV: reading files from a directory when some files are missing a column that appears in the header

csv pyspark

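I have 4 files in a directory, and one of them is missing a column together with that column's data.

But when I load the directory into a Spark DataFrame, Spark neither adds the missing second column for that file nor fills it with null.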

                    file1.csv
                    name| first|second|
                    female|   raj| tarun|

                    file2.csv
                    name| first|second|
                    female|   raj| tarun|

                    file3.csv
                    name| first|second|
                    female|   raj| tarun|


                    file4.csv
                    name| second|
                    female|  tarun|





                    from pyspark.sql import SQLContext
                    sqlContext = SQLContext(sc)
                    from pyspark import SparkConf, SparkContext



                    un = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').option("delimiter",",").load('/dir/test/')
                    un.show()
                    un.registerTempTable("un1")


                    queryresult1 = sqlContext.sql("select DISTINCT hashedId from un1   ")


                    queryresult1.show()


                    Output is below. Why is the second column not filled with null, and why was the third column not shifted?

                    +------+------+------+
                    |  name| first|second|
                    +------+------+------+
                    |female|   raj| tarun|
                    |female|   raj| tarun|
                    |female|   raj| tarun|
                    |  name|second|  null|
                    |female| tarun|  null|

I haven't fully tested this, but the code below should help you get started:


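columns = ['name', 'first', 'second']
path = '/dir/test/'  # folder that contains the files

# Read as an RDD, split each line, drop the header rows, pad rows that lack 'first'
df = (sc.textFile(path)
      .map(lambda line: line.split("|"))
      .filter(lambda fields: fields[0] != 'name')
      .map(lambda fields: fields if len(fields) == 3 else [fields[0], None, fields[1]])
      .toDF(schema=columns))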

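Explanation: read the files in as an RDD and split each line on the pipe delimiter. Filter to remove the header row that each file contributes. Then, wherever a column is missing (i.e. the RDD element has length 2), fill it with null. Finally, convert to a DataFrame.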
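The padding step above assumes that a two-field row is always missing `first`. A slightly more general variant, sketched below in plain Python, reads each file's own header and places values by column name instead of by row length. The helper name `align_rows` and the comma separator are assumptions made for illustration; they are not part of the original answer.

```python
import csv
import io


def align_rows(text, columns, sep=","):
    """Parse one file's text and return rows ordered like `columns`.

    Columns that are absent from this file's own header are filled with None.
    (Hypothetical helper, not part of any Spark API.)
    """
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    header = [name.strip() for name in next(reader, [])]
    position = {name: i for i, name in enumerate(header)}
    rows = []
    for fields in reader:
        if not fields:  # skip blank lines
            continue
        rows.append([
            fields[position[c]].strip()
            if c in position and position[c] < len(fields)
            else None
            for c in columns
        ])
    return rows


columns = ["name", "first", "second"]
full = "name,first,second\nfemale,raj,tarun\n"
short = "name,second\nfemale,tarun\n"

print(align_rows(full, columns))   # [['female', 'raj', 'tarun']]
print(align_rows(short, columns))  # [['female', None, 'tarun']]
```

In Spark, one way to apply this would be `sc.wholeTextFiles('/dir/test/').flatMap(lambda kv: align_rows(kv[1], columns))` followed by `toDF` with an explicit schema. Since `wholeTextFiles` loads each file as a single string, this probably suits many small files better than a few very large ones.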

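What does the output look like? One way to solve this is to read the files in as an RDD using textFile, split each line on the comma, map each row to fill the missing column with null, and then convert to a DataFrame. If you are happy to do something a bit roundabout, I can write up an answer for you.

I only added sample files here; my original files are CSV. I would still welcome the best answer you can suggest.

If the answer below helped you, please consider accepting it, or, if you have any questions, let me know and I can amend it.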