Python Pyspark - Merging files with different schemas into one master file


I have nine CSV files that look like this:

trans_1
+------------------+-----------+-------------+----------+-----+-----+--------------------+
|store_location_key|product_key|collector_key|  trans_dt|sales|units|           trans_key|
+------------------+-----------+-------------+----------+-----+-----+--------------------+
|              9807|83215400105|           -1|2015-09-09|42.72|    1|19815980756712015...|
|              9807| 6024538816|           -1|2015-10-28|27.57|    1|21718980756712015...|
+------------------+-----------+-------------+----------+-----+-----+--------------------+
only showing top 2 rows

trans_2
+------------------+-----------+-------------+----------+-----+-----+--------------------+
|store_location_key|product_key|collector_key|  trans_dt|sales|units|           trans_key|
+------------------+-----------+-------------+----------+-----+-----+--------------------+
|              7296|85375900278|           -1|2015-06-26| 4.97|    1|12548729658922015...|
|              7296|81526001001| 139537965459|2015-05-01|44.48|    1|24990729650922015...|
+------------------+-----------+-------------+----------+-----+-----+--------------------+
only showing top 2 rows

trans_3
+------------------+-----------+-------------+----------+-----+-----+--------------------+
|store_location_key|product_key|collector_key|  trans_dt|sales|units|           trans_key|
+------------------+-----------+-------------+----------+-----+-----+--------------------+
|              9807|83215400105|           -1|2015-09-09|42.72|    1|19815980756712015...|
|              9807| 6024538816|           -1|2015-10-28|27.57|    1|21718980756712015...|
+------------------+-----------+-------------+----------+-----+-----+--------------------+
only showing top 2 rows

trans_4
+-------------+----------+------------------+-----------+-----+-----+----------------+
|collector_key|  trans_dt|store_location_key|product_key|sales|units|       trans_key|
+-------------+----------+------------------+-----------+-----+-----+----------------+
|           -1| 6/26/2015|              8142| 4319416816| 9.42|    1|1.6945500000E+25|
|           -1|10/25/2015|              8142| 6210700491| 24.9|    1|3.4001800000E+25|
+-------------+----------+------------------+-----------+-----+-----+----------------+
only showing top 2 rows

trans_5
+-------------+----------+------------------+---------------+-----+-----+--------------------+
|collector_key|  trans_dt|store_location_key|    product_key|sales|units|           trans_key|
+-------------+----------+------------------+---------------+-----+-----+--------------------+
|           -1|2015-10-28|              6973|999999999999513|  0.0|    1|31575569731182201...|
|           -1|2015-07-24|              6973|    77105810883| 8.53|    1|31216969731182201...|
+-------------+----------+------------------+---------------+-----+-----+--------------------+
only showing top 2 rows

trans_6
+-------------+----------+------------------+----------------+-----+-----+----------------+
|collector_key|  trans_dt|store_location_key|     product_key|sales|units|        trans_id|
+-------------+----------+------------------+----------------+-----+-----+----------------+
|           -1|10/28/2015|              6973|1000000000000000|  0.0| null|3.1575600000E+25|
|           -1| 7/24/2015|              6973|     77105810883| 8.53| null|3.1217000000E+25|
+-------------+----------+------------------+----------------+-----+-----+----------------+
only showing top 2 rows

trans_7
+-------------+----------+------------------+-----------+-----+-----+--------------------+
|collector_key|  trans_dt|store_location_key|product_key|sales|units|            trans_id|
+-------------+----------+------------------+-----------+-----+-----+--------------------+
|           -1|2015-09-09|              9807|83215400105|42.72|    1|19815980756712015...|
|           -1|2015-10-28|              9807| 6024538816|27.57|    1|21718980756712015...|
+-------------+----------+------------------+-----------+-----+-----+--------------------+
only showing top 2 rows

trans_8
+----------------+-------------+----------+------------------+-----+-----+----------------+
|     product_key|collector_key|  trans_dt|store_location_key|sales|units|        trans_id|
+----------------+-------------+----------+------------------+-----+-----+----------------+
|1000000000000000|           -1|10/28/2015|              6973| null|    1|3.1575600000E+25|
|     77105810883|           -1| 7/24/2015|              6973| null|    1|3.1217000000E+25|
+----------------+-------------+----------+------------------+-----+-----+----------------+
only showing top 2 rows

trans_9
+-----------+-------------+----------+------------------+-----+-----+--------------------+
|product_key|collector_key|  trans_dt|store_location_key|sales|units|            trans_id|
+-----------+-------------+----------+------------------+-----+-----+--------------------+
| 4319416816|           -1|2015-06-26|              8142| 9.42|    1|16945481425160201...|
| 6210700491|           -1|2015-10-25|              8142| 24.9|    1|34001814221225201...|
+-----------+-------------+----------+------------------+-----+-----+--------------------+
only showing top 2 rows
They all have the same columns, but in different positions. I am using this code to read all the files, but it produces errors:

trans = spark\
    .read\
    .format("csv")\
    .option("inferSchema","true")\
    .option("header","true")\
    .load("/Users/xyz/Downloads/xyz/trans_fact*.csv")

I just want to write PySpark code that reads all of these files and merges them into a single dataframe (CSV) with the correct data in the correct column order.

You can load the CSV files one by one, add any columns that may be missing, sort the columns, and then union them:

import os
import collections
from pyspark.sql import functions as F

def load_single_files(directory):
    # Read every file in the directory into its own dataframe.
    dirpath, _, files = next(os.walk(directory))
    for f in files:
        yield spark \
            .read \
            .format("csv") \
            .option("inferSchema", "true") \
            .option("header", "true") \
            .load(os.path.join(dirpath, f))

def add_missing_cols_in_order(df, unique_cols):
    # Add any column the dataframe lacks as a null literal, then select
    # all columns in sorted name order so every dataframe lines up for union().
    missing_cols = {col: F.lit(None).alias(col) for col in unique_cols if col not in df.columns}
    existing_cols = {col: F.col(col) for col in unique_cols if col in df.columns}
    cols = dict(missing_cols, **existing_cols)
    cols = list(collections.OrderedDict(sorted(cols.items())).values())
    return df.select(cols)

dfs = list(load_single_files("testdata"))
# Collect the union of all column names that appear across the files.
unique_cols = sorted(set([col for cols in [df.columns for df in dfs] for col in cols]))

df = add_missing_cols_in_order(dfs[0], unique_cols)
for next_df in dfs[1:]:
    df = df.union(add_missing_cols_in_order(next_df, unique_cols))
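
If the end result should be a single CSV file on disk, a final write step like the sketch below should work; the output path here is only illustrative, not taken from the question:

# Coalesce to one partition so Spark writes a single output file
# (only reasonable if the merged data fits comfortably on one executor).
df.coalesce(1) \
    .write \
    .option("header", "true") \
    .mode("overwrite") \
    .csv("/Users/xyz/Downloads/xyz/trans_master")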
This approach will be slower than loading all the files at once, because Spark does not read the files in parallel. Depending on the size of your files, this may or may not be a problem.


Edit: the suggestion now includes the logic for adding the missing columns automatically.
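
As an aside, on Spark 3.1 or later, DataFrame.unionByName with allowMissingColumns=True performs this column alignment for you. A minimal sketch, assuming the load_single_files helper defined above:

from functools import reduce

# unionByName matches columns by name instead of position and, with
# allowMissingColumns=True, fills columns absent from one side with nulls.
dfs = list(load_single_files("testdata"))
merged = reduce(
    lambda left, right: left.unionByName(right, allowMissingColumns=True),
    dfs,
)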

I don't know why, but it creates a trans_id column on its own, which gives me an error. Do you know where it comes from?

@SahilNagpal Probably the files do not all have the same columns. If one of the files has an additional column trans_id, you will see an error.
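
Judging from the samples above, some files use trans_key while others use trans_id, so the extra column most likely comes from that naming mismatch. A minimal, hypothetical fix, assuming the two columns actually hold the same values, is to rename trans_id before aligning and unioning:

# Hypothetical normalization step: rename trans_id to trans_key where it exists,
# so every dataframe ends up with the same column name before the union.
def normalize_trans_key(df):
    if "trans_id" in df.columns and "trans_key" not in df.columns:
        return df.withColumnRenamed("trans_id", "trans_key")
    return df

dfs = [normalize_trans_key(df) for df in load_single_files("testdata")]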