PySpark: save a Spark dataframe using a column value as the file name
How can I save a Spark dataframe to files, using a column value as the file name? Is it possible?
+--------------------------+----------+-----------------+-----------------------------------+
|ID |CITY |DATE |name |
+--------------------------+----------+-----------------+-----------------------------------+
|1 | |2011-01-01 |20110101_DATA.snappy.parquet |
|2 | |2011-01-01 |20110101_DATA.snappy.parquet |
|3 | |2011-01-01 |20110101_DATA.snappy.parquet |
|4 |Chicago |2011-01-01 |20110101_DATA.snappy.parquet |
|5 |Mansfield |2011-01-02 |20110102_DATA.snappy.parquet |
|6 |Pittsburgh|2011-01-02 |20110102_DATA.snappy.parquet |
|7 | |2011-01-02 |20110102_DATA.snappy.parquet |
|8 |Clarion |2011-01-03 |20110103_DATA.snappy.parquet |
|9 |Storrs |2011-01-03 |20110103_DATA.snappy.parquet |
|10 | |2011-01-03 |20110103_DATA.snappy.parquet |
+--------------------------+----------+-----------------+-----------------------------------+
Expected output:
Partition by DATE and, when saving the data as parquet, use the name value as the file name. The output would be 3 files:
/DATE=2011-01-01/20110101_DATA.snappy.parquet
/DATE=2011-01-02/20110102_DATA.snappy.parquet
/DATE=2011-01-03/20110103_DATA.snappy.parquet
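For reference, a minimal sketch that reproduces a few rows of this dataframe and derives the name column from DATE (only the column names come from the table above; the sample rows and derivation are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat, date_format, lit

spark = SparkSession.builder.getOrCreate()

# A few rows from the table above; empty strings stand in for missing CITY values
df = spark.createDataFrame(
    [(1, "", "2011-01-01"), (4, "Chicago", "2011-01-01"),
     (5, "Mansfield", "2011-01-02"), (8, "Clarion", "2011-01-03")],
    ["ID", "CITY", "DATE"],
)

# Derive the target file name from DATE, e.g. 20110101_DATA.snappy.parquet
df = df.withColumn(
    "name",
    concat(date_format(col("DATE"), "yyyyMMdd"), lit("_DATA.snappy.parquet")),
)
df.show(truncate=False)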
Spark cannot natively give the output parquet files custom names like this. You can use the following code, but it is not a scalable solution, because it relies on a .collect() action:
# May not work on a large dataframe: collect() pulls the distinct names to the driver
unique_filenames = [row.name for row in df.select('name').distinct().collect()]

for filename in unique_filenames:
    # Rebuild the partition path from the yyyyMMdd prefix of the file name
    output_filename = "/DATE=" + filename[0:4] + "-" + filename[4:6] + "-" + filename[6:8] + "/" + filename
    # Filter before selecting, so the name column is still available for the filter
    df.filter(df['name'] == filename) \
      .select("ID", "CITY", "DATE") \
      .write \
      .parquet(output_filename)
You will get what you want:
/DATE=2011-01-01/20110101_DATA.snappy.parquet
/DATE=2011-01-02/20110102_DATA.snappy.parquet
/DATE=2011-01-03/20110103_DATA.snappy.parquet
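A possible alternative, if one write job per file is too slow, is a single partitioned write followed by renaming each part file through the Hadoop FileSystem API. This is a sketch, not part of the original answer: the /output path, the repartition-by-DATE step, and the assumption of exactly one part file per DATE=... directory are mine.

# One partitioned write; repartition("DATE") aims for a single part file per partition
df.select("ID", "CITY", "DATE") \
  .repartition("DATE") \
  .write \
  .partitionBy("DATE") \
  .parquet("/output")

# Rename the part-*.parquet file in each DATE=yyyy-MM-dd directory to
# yyyyMMdd_DATA.snappy.parquet via the Hadoop FileSystem API
jvm = spark.sparkContext._jvm
Path = jvm.org.apache.hadoop.fs.Path
fs = jvm.org.apache.hadoop.fs.FileSystem.get(
    spark.sparkContext._jsc.hadoopConfiguration()
)
for status in fs.listStatus(Path("/output")):
    if status.isDirectory() and status.getPath().getName().startswith("DATE="):
        date_str = status.getPath().getName().split("=")[1]  # e.g. 2011-01-01
        new_name = date_str.replace("-", "") + "_DATA.snappy.parquet"
        for f in fs.listStatus(status.getPath()):
            if f.getPath().getName().startswith("part-"):
                fs.rename(f.getPath(), Path(status.getPath(), new_name))

Note that .write.parquet(path) always treats the path as a directory, so the loop in the answer above actually creates directories named like files; the rename approach is one way around that.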
Can you give an example?
@ggeop Updated with an example. One more clarification: do you just need empty files, or do you want them to contain data? If so, which data? Will the dataframe be large?
Yes, the files will contain data. They will contain all columns except the name column and the date (since the date will appear in the partition path).