Python: merging by group to fill in a time series
I am trying to merge two DataFrames per group in order to fill in the time series for each user. Consider the following PySpark DataFrames,
df = sqlContext.createDataFrame(
    [
        ('2018-03-01 00:00:00', 'A', 5),
        ('2018-03-01 03:00:00', 'A', 7),
        ('2018-03-01 02:00:00', 'B', 3),
        ('2018-03-01 04:00:00', 'B', 2)
    ],
    ('datetime', 'username', 'count')
)
# and
df1 = sqlContext.createDataFrame(
    [
        ('2018-03-01 00:00:00', 1),
        ('2018-03-01 01:00:00', 2),
        ('2018-03-01 02:00:00', 2),
        ('2018-03-01 03:00:00', 3),
        ('2018-03-01 04:00:00', 1),
        ('2018-03-01 05:00:00', 5)
    ],
    ('datetime', 'val')
)
which produce,
+-------------------+--------+-----+
| datetime|username|count|
+-------------------+--------+-----+
|2018-03-01 00:00:00| A| 5|
|2018-03-01 03:00:00| A| 7|
|2018-03-01 02:00:00| B| 3|
|2018-03-01 04:00:00| B| 2|
+-------------------+--------+-----+
# and
+-------------------+---+
| datetime|val|
+-------------------+---+
|2018-03-01 00:00:00| 1|
|2018-03-01 01:00:00| 2|
|2018-03-01 02:00:00| 2|
|2018-03-01 03:00:00| 3|
|2018-03-01 04:00:00| 1|
|2018-03-01 05:00:00| 5|
+-------------------+---+
The val column in df1 is irrelevant to the final result, so we can drop it. The expected result is
+-------------------+--------+-----+
| datetime|username|count|
+-------------------+--------+-----+
|2018-03-01 00:00:00| A| 5|
|2018-03-01 01:00:00| A| 0|
|2018-03-01 02:00:00| A| 0|
|2018-03-01 03:00:00| A| 7|
|2018-03-01 04:00:00| A| 0|
|2018-03-01 05:00:00| A| 0|
|2018-03-01 00:00:00| B| 0|
|2018-03-01 01:00:00| B| 0|
|2018-03-01 02:00:00| B| 3|
|2018-03-01 03:00:00| B| 0|
|2018-03-01 04:00:00| B| 2|
|2018-03-01 05:00:00| B| 0|
+-------------------+--------+-----+
I tried groupBy() and join, without success. I also tried creating a function and registering it as a pandas_udf(), but that did not work either, i.e.

df.groupBy('usernames').join(df1, 'datetime', 'right')

Any suggestions?

Just cross join the distinct timestamps and the distinct usernames, then left-outer join the data onto the result:
from pyspark.sql.functions import broadcast
(broadcast(df1.select("datetime").distinct())
.crossJoin(df.select("username").distinct())
.join(df, ["datetime", "username"], "leftouter")
.na.fill(0))
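The same cross-join / left-outer-join idea can be sketched in plain pandas (a minimal local sketch of the technique, not the Spark code above; the sample frames mirror the data in the question):

```python
import pandas as pd

# Local stand-ins for df and df1 from the question
df = pd.DataFrame(
    [('2018-03-01 00:00:00', 'A', 5),
     ('2018-03-01 03:00:00', 'A', 7),
     ('2018-03-01 02:00:00', 'B', 3),
     ('2018-03-01 04:00:00', 'B', 2)],
    columns=['datetime', 'username', 'count'])
df1 = pd.DataFrame(
    {'datetime': ['2018-03-01 0%d:00:00' % h for h in range(6)],
     'val': [1, 2, 2, 3, 1, 5]})

# Cross join distinct timestamps with distinct usernames ...
grid = df1[['datetime']].drop_duplicates().merge(
    df[['username']].drop_duplicates(), how='cross')

# ... then left-outer join the data onto the grid and fill the gaps with 0
result = (grid.merge(df, on=['datetime', 'username'], how='left')
              .fillna({'count': 0})
              .astype({'count': int})
              .sort_values(['username', 'datetime'])
              .reset_index(drop=True))
print(result)
```

This yields the 12-row frame from the expected output: every (hour, user) pair, with 0 where the user had no row.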
To use a pandas_udf you need a local object as a reference:
from pyspark.sql.functions import PandasUDFType, pandas_udf

def fill_time(df1):
    @pandas_udf('datetime string, username string, count double', PandasUDFType.GROUPED_MAP)
    def _(df):
        df_ = df.merge(df1, on='datetime', how='right')
        df_["username"] = df_["username"].ffill().bfill()
        return df_
    return _

(df.groupBy("username")
    .apply(fill_time(
        df1.select("datetime").distinct().toPandas()
    ))
    .na.fill(0))
but it will be slower than the SQL-only solution.
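To see what the grouped-map function does for a single group, here is a pandas-only sketch of the `_` body above, applied to user A's rows against the distinct timestamps (an illustration of the per-group logic, not the Spark execution path; `times` and `group` are hypothetical local frames standing in for what Spark would pass):

```python
import pandas as pd

# Distinct timestamps, as df1.select("datetime").distinct().toPandas() would return
times = pd.DataFrame({'datetime': ['2018-03-01 0%d:00:00' % h for h in range(6)]})

# One group's rows, as Spark hands them to the GROUPED_MAP udf (user A here)
group = pd.DataFrame(
    [('2018-03-01 00:00:00', 'A', 5), ('2018-03-01 03:00:00', 'A', 7)],
    columns=['datetime', 'username', 'count'])

# The udf body: right-merge onto the full timestamp list, then
# forward/back-fill the username so every new row keeps the group key
df_ = group.merge(times, on='datetime', how='right')
df_['username'] = df_['username'].ffill().bfill()
df_['count'] = df_['count'].fillna(0).astype(int)  # mirrors Spark's .na.fill(0)
print(df_.sort_values('datetime').reset_index(drop=True))
```

Each group ends up with one row per timestamp, which is exactly what Spark concatenates across groups to build the final frame.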