Python: How to concatenate multiple column values into a single column in a pandas DataFrame based on start and end times
I'm new to Python and I'm trying to build something like a database using pandas. Below is a simplified version of my df:
Timestamp A B C
0 2013-02-01 1 0 0
1 2013-02-02 2 10 18
2 2013-02-03 3 0 19
3 2013-02-04 4 12 20
4 2013-02-05 0 13 21
5 2013-02-06 6 14 22
6 2013-02-07 7 15 23
7 2013-02-08 0 0 0
The first thing I did was create a new, empty dataframe to store the data, using the following code:
# Create frequent pattern source database
df_frequent_pattern = pd.DataFrame(columns = ["Start Time", "End Time", "Active Appliances"])
# Create start_time and end_time series using pd.date_range
df_frequent_pattern["Start Time"] = pd.date_range("2013-02-1", "2013-02-08", freq = "D")
df_frequent_pattern["End Time"] = pd.date_range("2013-02-2", "2013-02-09", freq = "D")
Its output looks like this:
Start Time End Time Active Appliances
0 2013-02-01 2013-02-02 NaN
1 2013-02-02 2013-02-03 NaN
2 2013-02-03 2013-02-04 NaN
3 2013-02-04 2013-02-05 NaN
4 2013-02-05 2013-02-06 NaN
5 2013-02-06 2013-02-07 NaN
6 2013-02-07 2013-02-08 NaN
7 2013-02-08 2013-02-09 NaN
Based on two Stack Overflow posts, I wrote the following code to assign the appliances to the correct time resolution:
# Add the data to the correct 'active' period based on interval and merge the active appliances in the "active appliances column"
# Row counter for the loop
rows = 8
for row in range(rows):
    # Check if appliance is active during time resolution
    if df_frequent_pattern["Start Time"] <= df["Timestamp"] | df["Timestamp" <= df_frequent_pattern["End Time"]:
        # Add all the appliance active during the time resolution to the column as a string value (e.g. "A, B, C")
        df_frequent_pattern["Active Appliances"] = df["A", "B", "C"].apply(lambda row: '_'.join(row.values.astype(str)), axis = 1)
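However, according to the second post, the "=" seems to be placed correctly. Any ideas on how to get the expected result from df?
It should look like this: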
Start Time End Time Active Appliances
0 2013-02-01 2013-02-02 "A"
1 2013-02-02 2013-02-03 "A,B,C"
2 2013-02-03 2013-02-04 "A,C"
3 2013-02-04 2013-02-05 "A,B,C"
4 2013-02-05 2013-02-06 "A,B,C"
5 2013-02-06 2013-02-07 "A,B,C"
6 2013-02-07 2013-02-08 "A,B,C"
7 2013-02-08 2013-02-09 ""
Let's do this in a few steps. First, let's make sure your Timestamp column is a datetime:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
Then we can create a new dataframe from the minimum and maximum of the timestamps:
df1 = pd.DataFrame({'start_time' : pd.date_range(df['Timestamp'].min(), df['Timestamp'].max())})
df1['end_time'] = df1['start_time'] + pd.DateOffset(days=1)
start_time end_time
0 2013-02-01 2013-02-02
1 2013-02-02 2013-02-03
2 2013-02-03 2013-02-04
3 2013-02-04 2013-02-05
4 2013-02-05 2013-02-06
5 2013-02-06 2013-02-07
6 2013-02-07 2013-02-08
7 2013-02-08 2013-02-09
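If the time resolution is not always one day, the same construction can be wrapped in a small helper. This is only a sketch: `make_intervals` and its `step` parameter are hypothetical names that do not appear in the answer, and it assumes the intervals should start at the earliest timestamp.

```python
import pandas as pd

def make_intervals(timestamps, step=pd.Timedelta(days=1)):
    """Hypothetical helper: build consecutive [start_time, end_time) rows."""
    # One start per step between the first and last timestamp.
    starts = pd.date_range(timestamps.min(), timestamps.max(), freq=step)
    # Every interval ends exactly one step after it starts.
    return pd.DataFrame({"start_time": starts, "end_time": starts + step})

# Timestamps copied from the question.
ts = pd.Series(pd.to_datetime([
    "2013-02-01", "2013-02-02", "2013-02-03", "2013-02-04",
    "2013-02-05", "2013-02-06", "2013-02-07", "2013-02-08",
]))

daily = make_intervals(ts)                             # same frame as df1 above
half_days = make_intervals(ts, pd.Timedelta(hours=12)) # finer resolution
print(daily)
```

With the default step this should reproduce the eight daily rows shown above; a smaller step simply yields more rows.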
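Now we need to create a dataframe to merge onto your start_time column.
Let's filter out any values less than or equal to 0 and create a list of the active appliances: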
df = df.set_index('Timestamp')
# The remaining columns must be numeric for this to work;
# otherwise, select only the appliance columns first.
# .dropna() keeps the result the same on pandas versions where stack() keeps NaN.
df2 = df.mask(df.le(0)).stack().dropna().reset_index(1).groupby(level=0)\
        .agg(active_appliances=('level_1', list)).reset_index(0)
# If you prefer strings, change
#   .agg(active_appliances=('level_1', list))
# to
#   .agg(active_appliances=('level_1', ','.join))
Timestamp active_appliances
0 2013-02-01 [A]
1 2013-02-02 [A, B, C]
2 2013-02-03 [A, C]
3 2013-02-04 [A, B, C]
4 2013-02-05 [B, C]
5 2013-02-06 [A, B, C]
6 2013-02-07 [A, B, C]
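The same table can also be built without stack(), which may be easier to follow for a beginner. The sketch below is an alternative to the chain above, not the answer's own method: it compares the values with 0 and collects the column names row by row. The sample data is copied from the question, and the variable names `active` and `lists` are made up for illustration.

```python
import pandas as pd

# Sample data from the question, indexed by Timestamp.
df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2013-02-01", "2013-02-02", "2013-02-03", "2013-02-04",
        "2013-02-05", "2013-02-06", "2013-02-07", "2013-02-08",
    ]),
    "A": [1, 2, 3, 4, 0, 6, 7, 0],
    "B": [0, 10, 0, 12, 13, 14, 15, 0],
    "C": [0, 18, 19, 20, 21, 22, 23, 0],
}).set_index("Timestamp")

# True where an appliance has a positive reading.
active = df.gt(0)
cols = active.columns.to_numpy()

# For each row, keep the names of the columns flagged True.
lists = [list(cols[row]) for row in active.to_numpy()]

df2 = pd.DataFrame({"Timestamp": active.index, "active_appliances": lists})
# Drop timestamps with no active appliance, as the stack-based version does.
df2 = df2[df2["active_appliances"].map(len) > 0].reset_index(drop=True)
print(df2)
```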
Then we can merge:
final = pd.merge(df1, df2, left_on='start_time', right_on='Timestamp', how='left').drop(columns='Timestamp')
start_time end_time active_appliances
0 2013-02-01 2013-02-02 [A]
1 2013-02-02 2013-02-03 [A, B, C]
2 2013-02-03 2013-02-04 [A, C]
3 2013-02-04 2013-02-05 [A, B, C]
4 2013-02-05 2013-02-06 [B, C]
5 2013-02-06 2013-02-07 [A, B, C]
6 2013-02-07 2013-02-08 [A, B, C]
7 2013-02-08 2013-02-09 NaN
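Putting the steps together, here is a self-contained sketch that produces comma-separated strings and an empty string for intervals with no active appliance, which is closer to the format asked for in the question. Naming the stacked level "appliance" and filling NaN with "" are my own choices rather than part of the answer above.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Timestamp": ["2013-02-01", "2013-02-02", "2013-02-03", "2013-02-04",
                  "2013-02-05", "2013-02-06", "2013-02-07", "2013-02-08"],
    "A": [1, 2, 3, 4, 0, 6, 7, 0],
    "B": [0, 10, 0, 12, 13, 14, 15, 0],
    "C": [0, 18, 19, 20, 21, 22, 23, 0],
})
df["Timestamp"] = pd.to_datetime(df["Timestamp"])

# Step 1: one row per daily interval.
df1 = pd.DataFrame({"start_time": pd.date_range(df["Timestamp"].min(),
                                                df["Timestamp"].max())})
df1["end_time"] = df1["start_time"] + pd.DateOffset(days=1)

# Step 2: blank out inactive readings, go long, and drop the blanks.
wide = df.set_index("Timestamp")
stacked = wide.mask(wide.le(0)).stack().dropna()
stacked = stacked.rename_axis(["Timestamp", "appliance"])

# Step 3: join the appliance names per timestamp.
df2 = (stacked.reset_index("appliance")
              .groupby(level=0)
              .agg(active_appliances=("appliance", ",".join))
              .reset_index())

# Step 4: left-merge so intervals without active appliances are kept.
final = (df1.merge(df2, left_on="start_time", right_on="Timestamp", how="left")
            .drop(columns="Timestamp"))
final["active_appliances"] = final["active_appliances"].fillna("")
print(final)
```

Note that the row starting 2013-02-05 comes out as "B,C", because A is 0 on that day in the sample data.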
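Solid research for a first post; you almost have a perfect question, you just need to add your sample output. With df[["A","B","C"]], though, I think you need df[["A","B","C"]].astype(str).agg('.'.join,1)
@Manakin I guess the logic in his loop doesn't seem right either, does it? Maybe OP needs to add the expected output.
@ShubhamSharma I'm not sure, since the expected output is missing; from what I can see, any value greater than 1 is an active appliance, but for 3-4
@Manakin An appliance is active if the value is not 0. In my dataset, these values correspond to the total energy consumption of that appliance at a specific point in time. I'm trying to build a database in which I can see which appliances were active (had energy consumption) during a specific time resolution.
@ShubhamSharma Thanks, I wrote a similar stored procedure in SQL to solve this. Working with strings is easier in Python, though.
Oh wow! That's great. I agree that working with strings is much easier in Python.
@Manakin Thank you so much! I spent so much time on this, and you did it effortlessly.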
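To make the difference discussed in the comments concrete: joining the columns with astype(str).agg(..., axis=1) concatenates the values of each row, whereas the rule stated by the OP (active if the value is not 0) calls for the column names. A small sketch on the sample data; the "," separator and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Appliance columns from the question's sample data.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 0, 6, 7, 0],
    "B": [0, 10, 0, 12, 13, 14, 15, 0],
    "C": [0, 18, 19, 20, 21, 22, 23, 0],
})
cols = ["A", "B", "C"]

# Row-wise join of the VALUES, as in the comment's suggestion.
values_joined = df[cols].astype(str).agg(",".join, axis=1)

# Row-wise join of the NAMES of the columns whose value is not 0.
is_active = df[cols].ne(0).to_numpy()
names_joined = pd.Series(
    [",".join(c for c, on in zip(cols, row) if on) for row in is_active],
    index=df.index,
)

print(pd.DataFrame({"values": values_joined, "names": names_joined}))
```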