Python Pyspark - Merging files with different schemas into one master file


I have nine CSV files that look like this:

trans_1
+------------------+-----------+-------------+----------+-----+-----+--------------------+
|store_location_key|product_key|collector_key|  trans_dt|sales|units|           trans_key|
+------------------+-----------+-------------+----------+-----+-----+--------------------+
|              9807|83215400105|           -1|2015-09-09|42.72|    1|19815980756712015...|
|              9807| 6024538816|           -1|2015-10-28|27.57|    1|21718980756712015...|
+------------------+-----------+-------------+----------+-----+-----+--------------------+
only showing top 2 rows

trans_2
+------------------+-----------+-------------+----------+-----+-----+--------------------+
|store_location_key|product_key|collector_key|  trans_dt|sales|units|           trans_key|
+------------------+-----------+-------------+----------+-----+-----+--------------------+
|              7296|85375900278|           -1|2015-06-26| 4.97|    1|12548729658922015...|
|              7296|81526001001| 139537965459|2015-05-01|44.48|    1|24990729650922015...|
+------------------+-----------+-------------+----------+-----+-----+--------------------+
only showing top 2 rows

trans_3
+------------------+-----------+-------------+----------+-----+-----+--------------------+
|store_location_key|product_key|collector_key|  trans_dt|sales|units|           trans_key|
+------------------+-----------+-------------+----------+-----+-----+--------------------+
|              9807|83215400105|           -1|2015-09-09|42.72|    1|19815980756712015...|
|              9807| 6024538816|           -1|2015-10-28|27.57|    1|21718980756712015...|
+------------------+-----------+-------------+----------+-----+-----+--------------------+
only showing top 2 rows

trans_4
+-------------+----------+------------------+-----------+-----+-----+----------------+
|collector_key|  trans_dt|store_location_key|product_key|sales|units|       trans_key|
+-------------+----------+------------------+-----------+-----+-----+----------------+
|           -1| 6/26/2015|              8142| 4319416816| 9.42|    1|1.6945500000E+25|
|           -1|10/25/2015|              8142| 6210700491| 24.9|    1|3.4001800000E+25|
+-------------+----------+------------------+-----------+-----+-----+----------------+
only showing top 2 rows

trans_5
+-------------+----------+------------------+---------------+-----+-----+--------------------+
|collector_key|  trans_dt|store_location_key|    product_key|sales|units|           trans_key|
+-------------+----------+------------------+---------------+-----+-----+--------------------+
|           -1|2015-10-28|              6973|999999999999513|  0.0|    1|31575569731182201...|
|           -1|2015-07-24|              6973|    77105810883| 8.53|    1|31216969731182201...|
+-------------+----------+------------------+---------------+-----+-----+--------------------+
only showing top 2 rows

trans_6
+-------------+----------+------------------+----------------+-----+-----+----------------+
|collector_key|  trans_dt|store_location_key|     product_key|sales|units|        trans_id|
+-------------+----------+------------------+----------------+-----+-----+----------------+
|           -1|10/28/2015|              6973|1000000000000000|  0.0| null|3.1575600000E+25|
|           -1| 7/24/2015|              6973|     77105810883| 8.53| null|3.1217000000E+25|
+-------------+----------+------------------+----------------+-----+-----+----------------+
only showing top 2 rows

trans_7
+-------------+----------+------------------+-----------+-----+-----+--------------------+
|collector_key|  trans_dt|store_location_key|product_key|sales|units|            trans_id|
+-------------+----------+------------------+-----------+-----+-----+--------------------+
|           -1|2015-09-09|              9807|83215400105|42.72|    1|19815980756712015...|
|           -1|2015-10-28|              9807| 6024538816|27.57|    1|21718980756712015...|
+-------------+----------+------------------+-----------+-----+-----+--------------------+
only showing top 2 rows

trans_8
+----------------+-------------+----------+------------------+-----+-----+----------------+
|     product_key|collector_key|  trans_dt|store_location_key|sales|units|        trans_id|
+----------------+-------------+----------+------------------+-----+-----+----------------+
|1000000000000000|           -1|10/28/2015|              6973| null|    1|3.1575600000E+25|
|     77105810883|           -1| 7/24/2015|              6973| null|    1|3.1217000000E+25|
+----------------+-------------+----------+------------------+-----+-----+----------------+
only showing top 2 rows

trans_9
+-----------+-------------+----------+------------------+-----+-----+--------------------+
|product_key|collector_key|  trans_dt|store_location_key|sales|units|            trans_id|
+-----------+-------------+----------+------------------+-----+-----+--------------------+
| 4319416816|           -1|2015-06-26|              8142| 9.42|    1|16945481425160201...|
| 6210700491|           -1|2015-10-25|              8142| 24.9|    1|34001814221225201...|
+-----------+-------------+----------+------------------+-----+-----+--------------------+
only showing top 2 rows
They all have the same columns, but in different positions. I am using this code to read all the files, but it produces errors:

trans = spark\
    .read\
    .format("csv")\
    .option("inferSchema","true")\
    .option("header","true")\
    .load("/Users/xyz/Downloads/xyz/trans_fact*.csv")

I just want to write PySpark code that reads all of these files and merges them into a single dataframe (CSV) with the correct data in the correct column order.

You can load the CSV files one by one, add any columns that may be missing, sort the columns, and then union them:

import os
import collections
from pyspark.sql import functions as F

def load_single_files(directory):
    # Read every file in the directory into its own dataframe.
    dirpath, _, files = next(os.walk(directory))
    for f in files:
        yield spark \
            .read \
            .format("csv") \
            .option("inferSchema", "true") \
            .option("header", "true") \
            .load(os.path.join(dirpath, f))

def add_missing_cols_in_order(df, unique_cols):
    # Add any column the dataframe lacks as a null literal, then select
    # all columns in sorted name order so every dataframe lines up for union().
    missing_cols = {col: F.lit(None).alias(col) for col in unique_cols if col not in df.columns}
    existing_cols = {col: F.col(col) for col in unique_cols if col in df.columns}
    cols = dict(missing_cols, **existing_cols)
    cols = list(collections.OrderedDict(sorted(cols.items())).values())
    return df.select(cols)

dfs = list(load_single_files("testdata"))
# Collect the union of all column names that appear across the files.
unique_cols = sorted(set([col for cols in [df.columns for df in dfs] for col in cols]))

df = add_missing_cols_in_order(dfs[0], unique_cols)
for next_df in dfs[1:]:
    df = df.union(add_missing_cols_in_order(next_df, unique_cols))
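
If the end result should be a single CSV file on disk, a final write step like the sketch below should work; the output path here is only illustrative, not taken from the question:

# Coalesce to one partition so Spark writes a single output file
# (only reasonable if the merged data fits comfortably on one executor).
df.coalesce(1) \
    .write \
    .option("header", "true") \
    .mode("overwrite") \
    .csv("/Users/xyz/Downloads/xyz/trans_master")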
This approach will be slower than loading all the files at once, because Spark does not read the files in parallel. Depending on the size of your files, this may or may not be a problem.


Edit: the suggestion now includes the logic for adding the missing columns automatically.
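
As an aside, on Spark 3.1 or later, DataFrame.unionByName with allowMissingColumns=True performs this column alignment for you. A minimal sketch, assuming the load_single_files helper defined above:

from functools import reduce

# unionByName matches columns by name instead of position and, with
# allowMissingColumns=True, fills columns absent from one side with nulls.
dfs = list(load_single_files("testdata"))
merged = reduce(
    lambda left, right: left.unionByName(right, allowMissingColumns=True),
    dfs,
)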

I don't know why, but it creates a trans_id column on its own, which gives me an error. Do you know where it comes from?

@SahilNagpal Probably the files do not all have the same columns. If one of the files has an additional column trans_id, you will see an error.
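
Judging from the samples above, some files use trans_key while others use trans_id, so the extra column most likely comes from that naming mismatch. A minimal, hypothetical fix, assuming the two columns actually hold the same values, is to rename trans_id before aligning and unioning:

# Hypothetical normalization step: rename trans_id to trans_key where it exists,
# so every dataframe ends up with the same column name before the union.
def normalize_trans_key(df):
    if "trans_id" in df.columns and "trans_key" not in df.columns:
        return df.withColumnRenamed("trans_id", "trans_key")
    return df

dfs = [normalize_trans_key(df) for df in load_single_files("testdata")]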