Python PySpark: expand one row into multiple rows by column header
Suppose I have a DataFrame with the following columns:
# id | name | 01-Jan-10 | 01-Feb-10 | ... | 01-Jan-11 | 01-Feb-11
# -----------------------------------------------------------------
# 1 | a001 | 0 | 32 | ... | 14 | 108
# 1 | a002 | 80 | 0 | ... | 0 | 92
I want to expand it into a table like this:
# id | name | Jan | Feb | ... | Year
# -----------------------------------
# 1 | a001 | 0 | 32 | ... | 2010
# 1 | a001 | 14 | 108 | ... | 2011
# 1 | a002 | 80 | 0 | ... | 2010
# 1 | a002 | 0 | 92 | ... | 2011
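I want to split the dates into rows by year and capture the value for each month.
How can this be done in PySpark (Python + Spark)? I have been trying to collect the df data, iterate over it, and extract each field to write out each row, but I wonder whether there is a smarter Spark function that can help with this. (I am new to Spark.)
First, melt the DataFrame into long format, with the original date headers in a variable column and the cell values in a value column, and call the result df_long: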
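The melt helper the answer refers to is not shown here. One possible way to get the same long format, offered as a sketch and not as the answer's own method, is Spark SQL's stack generator used through selectExpr. The helper name stack_expr below is hypothetical; it only builds the expression string, and it assumes the value columns share one type and that the headers contain no quotes or backticks. On Spark 3.4 or later, DataFrame.unpivot (alias melt) should do this directly.

```python
def stack_expr(value_cols, var_name="variable", val_name="value"):
    """Build a Spark SQL stack() expression that unpivots value_cols.

    Hypothetical helper: it returns a string and does not touch Spark itself.
    """
    if not value_cols:
        raise ValueError("value_cols must not be empty")
    pairs = ", ".join(
        # A string literal for the header, then a backtick-quoted column reference.
        "'{0}', `{0}`".format(col) for col in value_cols
    )
    return "stack({n}, {pairs}) as ({var}, {val})".format(
        n=len(value_cols), pairs=pairs, var=var_name, val=val_name
    )


expr = stack_expr(["01-Jan-10", "01-Feb-10"])
print(expr)

# Assumed usage with the question's DataFrame, here called df:
# df_long = df.selectExpr("id", "name", stack_expr(df.columns[2:]))
```

The backticks matter because headers such as 01-Jan-10 contain hyphens, which would otherwise be read as subtraction.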
Next, parse the date and extract the year and month:
from pyspark.sql.functions import to_date, date_format, year
# Parse headers such as "01-Jan-10" that melt placed in the "variable" column
date = to_date("variable", "dd-MMM-yy")
parsed = df_long.select(
    "id", "name", "value",
    year(date).alias("year"), date_format(date, "MMM").alias("month")
)
# +---+----+-----+----+-----+
# | id|name|value|year|month|
# +---+----+-----+----+-----+
# |  1|a001|    0|2010|  Jan|
# |  1|a001|   32|2010|  Feb|
# |  1|a001|   14|2011|  Jan|
# |  1|a001|  108|2011|  Feb|
# |  2| a02|   80|2010|  Jan|
# |  2| a02|    0|2010|  Feb|
# |  2| a02|    0|2011|  Jan|
# |  2| a02|   92|2011|  Feb|
# +---+----+-----+----+-----+
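To check what the dd-MMM-yy pattern should yield for each header without starting Spark, the same split can be sketched in plain Python. This is only an illustration: split_header is a hypothetical name, and %d-%b-%y is the strptime counterpart of the Spark pattern, which is assumed to behave the same way for these headers.

```python
from datetime import datetime


def split_header(header):
    """Return (year, month abbreviation) for a header such as '01-Jan-10'."""
    # %b matches abbreviated month names and relies on an English (or C) locale.
    parsed = datetime.strptime(header, "%d-%b-%y")
    return parsed.year, parsed.strftime("%b")


for header in ["01-Jan-10", "01-Feb-10", "01-Jan-11", "01-Feb-11"]:
    print(header, split_header(header))
```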
Finally, pivot:
# Providing a list of levels is not required but will make the process faster
# months = [
#     "Jan", "Feb", "Mar", "Apr", "May", "Jun",
#     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
# ]
months = ["Jan", "Feb"]
parsed.groupBy("id", "name", "year").pivot("month", months).sum("value")
# Pivoted columns follow the order of the months list; row order is not guaranteed
# +---+----+----+---+---+
# | id|name|year|Jan|Feb|
# +---+----+----+---+---+
# |  2| a02|2011|  0| 92|
# |  1|a001|2010|  0| 32|
# |  1|a001|2011| 14|108|
# |  2| a02|2010| 80|  0|
# +---+----+----+---+---+
Great, thank you! I added .orderBy("id", "year") at the end to make it look exactly the way I need.