Python Pyspark - merging files with different schemas into one master file

I have nine csv files, as shown below:
trans_1
+------------------+-----------+-------------+----------+-----+-----+--------------------+
|store_location_key|product_key|collector_key| trans_dt|sales|units| trans_key|
+------------------+-----------+-------------+----------+-----+-----+--------------------+
| 9807|83215400105| -1|2015-09-09|42.72| 1|19815980756712015...|
| 9807| 6024538816| -1|2015-10-28|27.57| 1|21718980756712015...|
+------------------+-----------+-------------+----------+-----+-----+--------------------+
only showing top 2 rows
trans_2
+------------------+-----------+-------------+----------+-----+-----+--------------------+
|store_location_key|product_key|collector_key| trans_dt|sales|units| trans_key|
+------------------+-----------+-------------+----------+-----+-----+--------------------+
| 7296|85375900278| -1|2015-06-26| 4.97| 1|12548729658922015...|
| 7296|81526001001| 139537965459|2015-05-01|44.48| 1|24990729650922015...|
+------------------+-----------+-------------+----------+-----+-----+--------------------+
only showing top 2 rows
trans_3
+------------------+-----------+-------------+----------+-----+-----+--------------------+
|store_location_key|product_key|collector_key| trans_dt|sales|units| trans_key|
+------------------+-----------+-------------+----------+-----+-----+--------------------+
| 9807|83215400105| -1|2015-09-09|42.72| 1|19815980756712015...|
| 9807| 6024538816| -1|2015-10-28|27.57| 1|21718980756712015...|
+------------------+-----------+-------------+----------+-----+-----+--------------------+
only showing top 2 rows
trans_4
+-------------+----------+------------------+-----------+-----+-----+----------------+
|collector_key| trans_dt|store_location_key|product_key|sales|units| trans_key|
+-------------+----------+------------------+-----------+-----+-----+----------------+
| -1| 6/26/2015| 8142| 4319416816| 9.42| 1|1.6945500000E+25|
| -1|10/25/2015| 8142| 6210700491| 24.9| 1|3.4001800000E+25|
+-------------+----------+------------------+-----------+-----+-----+----------------+
only showing top 2 rows
trans_5
+-------------+----------+------------------+---------------+-----+-----+--------------------+
|collector_key| trans_dt|store_location_key| product_key|sales|units| trans_key|
+-------------+----------+------------------+---------------+-----+-----+--------------------+
| -1|2015-10-28| 6973|999999999999513| 0.0| 1|31575569731182201...|
| -1|2015-07-24| 6973| 77105810883| 8.53| 1|31216969731182201...|
+-------------+----------+------------------+---------------+-----+-----+--------------------+
only showing top 2 rows
trans_6
+-------------+----------+------------------+----------------+-----+-----+----------------+
|collector_key| trans_dt|store_location_key| product_key|sales|units| trans_id|
+-------------+----------+------------------+----------------+-----+-----+----------------+
| -1|10/28/2015| 6973|1000000000000000| 0.0| null|3.1575600000E+25|
| -1| 7/24/2015| 6973| 77105810883| 8.53| null|3.1217000000E+25|
+-------------+----------+------------------+----------------+-----+-----+----------------+
only showing top 2 rows
trans_7
+-------------+----------+------------------+-----------+-----+-----+--------------------+
|collector_key| trans_dt|store_location_key|product_key|sales|units| trans_id|
+-------------+----------+------------------+-----------+-----+-----+--------------------+
| -1|2015-09-09| 9807|83215400105|42.72| 1|19815980756712015...|
| -1|2015-10-28| 9807| 6024538816|27.57| 1|21718980756712015...|
+-------------+----------+------------------+-----------+-----+-----+--------------------+
only showing top 2 rows
trans_8
+----------------+-------------+----------+------------------+-----+-----+----------------+
| product_key|collector_key| trans_dt|store_location_key|sales|units| trans_id|
+----------------+-------------+----------+------------------+-----+-----+----------------+
|1000000000000000| -1|10/28/2015| 6973| null| 1|3.1575600000E+25|
| 77105810883| -1| 7/24/2015| 6973| null| 1|3.1217000000E+25|
+----------------+-------------+----------+------------------+-----+-----+----------------+
only showing top 2 rows
trans_9
+-----------+-------------+----------+------------------+-----+-----+--------------------+
|product_key|collector_key| trans_dt|store_location_key|sales|units| trans_id|
+-----------+-------------+----------+------------------+-----+-----+--------------------+
| 4319416816| -1|2015-06-26| 8142| 9.42| 1|16945481425160201...|
| 6210700491| -1|2015-10-25| 8142| 24.9| 1|34001814221225201...|
+-----------+-------------+----------+------------------+-----+-----+--------------------+
only showing top 2 rows
They all have the same columns, but in different positions. I used this code to read all the files, but I get an error:
trans = spark\
.read\
.format("csv")\
.option("inferSchema","true")\
.option("header","true")\
.load("/Users/xyz/Downloads/xyz/trans_fact*.csv")
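For context: the wildcard load most likely fails (or silently misaligns data) because Spark applies a single schema across all the files, while the files order their columns differently. A stdlib-only sketch of the problem (illustrating the idea, not Spark itself):

```python
import csv
import io

# Two "files" with the same columns in different orders, like trans_1 vs trans_4.
file_a = "store_location_key,product_key\n9807,83215400105\n"
file_b = "product_key,store_location_key\n4319416816,8142\n"

# Forcing file_a's column order onto both files (roughly what one wildcard
# load with a single schema does) puts file_b's values under the wrong names:
schema = ["store_location_key", "product_key"]
forced = []
for text in (file_a, file_b):
    data_rows = list(csv.reader(io.StringIO(text)))[1:]  # skip header row
    forced.extend(dict(zip(schema, row)) for row in data_rows)
assert forced[1]["store_location_key"] == "4319416816"  # wrong: a product key

# Reading each file with its own header keeps the values aligned:
per_file = []
for text in (file_a, file_b):
    per_file.extend(csv.DictReader(io.StringIO(text)))
assert per_file[1]["store_location_key"] == "8142"  # correct
```

This is why the answer below loads the files one at a time, letting each file keep its own header, before merging.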
I just want to write code in pyspark so that I can read all of these files and merge them into a single dataframe (csv), with the right data under the right column order.

You can load the csv files one by one, add any missing columns, sort the columns, and then union them:
import os
import collections
from pyspark.sql import functions as F

def load_single_files(dir):
    dirpath, _, files = next(os.walk(dir))
    for f in files:
        yield spark \
            .read \
            .format("csv") \
            .option("inferSchema", "true") \
            .option("header", "true") \
            .load(os.path.join(dirpath, f))

def add_missing_cols_in_order(df, unique_cols):
    missing_cols = {col: F.lit(None).alias(col) for col in unique_cols if col not in df.columns}
    existing_cols = {col: F.col(col) for col in unique_cols if col in df.columns}
    cols = dict(missing_cols, **existing_cols)
    cols = list(collections.OrderedDict(sorted(cols.items())).values())
    return df.select(cols)
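The alignment step in add_missing_cols_in_order can be sketched without Spark: given the union of all column names, each row is padded with None for the columns its file lacks, and the values are emitted in one sorted column order.

```python
# Stdlib sketch of the column-alignment idea: pad each dict-row with None for
# missing columns and emit values in one fixed, sorted column order.
def align_row(row, unique_cols):
    return [row.get(col) for col in unique_cols]

rows_a = [{"sales": 42.72, "units": 1}]          # columns from one file
rows_b = [{"units": 1, "trans_key": "198159"}]   # different columns and order
unique_cols = sorted({c for rows in (rows_a, rows_b) for r in rows for c in r})
merged = [align_row(r, unique_cols) for r in rows_a + rows_b]
assert unique_cols == ["sales", "trans_key", "units"]
assert merged == [[42.72, None, 1], [None, "198159", 1]]
```

In the Spark version, F.lit(None).alias(col) plays the role of the None padding, and the sorted OrderedDict fixes the column order for select.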
dfs = list(load_single_files("testdata"))
unique_cols = sorted(set([col for cols in [df.columns for df in dfs] for col in cols]))

df = dfs[0]
df = add_missing_cols_in_order(df, unique_cols)
for next_df in dfs[1:]:
    df = df.union(add_missing_cols_in_order(next_df, unique_cols))
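The nested comprehension that builds unique_cols just flattens the per-DataFrame column lists into one sorted, de-duplicated set of names; in plain Python it behaves like this:

```python
# df.columns for each loaded DataFrame, written out as plain lists:
columns_per_file = [
    ["store_location_key", "product_key", "trans_key"],
    ["product_key", "store_location_key", "trans_id"],
]
unique_cols = sorted(set(col for cols in columns_per_file for col in cols))
assert unique_cols == ["product_key", "store_location_key", "trans_id", "trans_key"]
```

Note that if some files use trans_key and others trans_id, both names survive into unique_cols, which is relevant to the error discussed below.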
This approach is slower than loading all the files at once, because Spark cannot read the files in parallel. Depending on the size of your files, this may or may not be a problem.
EDIT: the suggested code includes the logic to automatically add the missing columns.

I don't know why, but it creates a trans_id column by itself, which gives me an error. Do you know where it comes from? @SahilNagpal probably the files don't all have the same columns. If one of the files has an additional trans_id column, you will see an error.
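That is consistent with the printed schemas above: trans_1 through trans_5 have a trans_key column, while trans_6 through trans_9 have trans_id, so the merged frame ends up with both. One way out (a hypothetical sketch, not part of the original answer) is to map both spellings onto one canonical name before collecting unique_cols; in PySpark the per-DataFrame rename itself would be df.withColumnRenamed("trans_id", "trans_key").

```python
# Hypothetical normalization: map both spellings onto one canonical column
# name before collecting unique_cols, so the union has a single trans_key.
RENAMES = {"trans_id": "trans_key"}

def normalize(columns):
    return [RENAMES.get(c, c) for c in columns]

assert normalize(["collector_key", "trans_dt", "trans_id"]) == [
    "collector_key", "trans_dt", "trans_key",
]
```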