Python: sum of datetime differences based on column values
I have a DataFrame that looks like:
      field1  field2  field3
time
t1         1       1       1
t2         1       1       0
t3         2       3       1
t4         3       3       0
t5         1       2       0
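The time is in the format yyyy-mm-dd hh:mm:ss and is currently the index of the DataFrame. field1 and field2 identify the item, so the tuple (field1, field2) corresponds to a specific sensor somewhere in the world. field3 is the value of that sensor at a given time, and is either 0 or 1.
I want to group the DataFrame by (field1, field2) and sum up the total time each sensor spent at each value of field3. So if t1 = '2016-07-20 00:00:00' and t2 = '2016-07-20 00:01:00', and the current time is '2016-07-20 00:03:00', I would have a new DataFrame that looks like: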
       field3=0  field3=1
(1,1)  2 min     1 min
(2,3)  ...       ...
(3,3)  ...       ...
(1,2)  ...       ...
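I am assuming that from t1 to t2 the value of field3 was 1, and that from t2 onward it is 0, because (1,1) does not appear again in the DataFrame. The 1 min comes from t2 - t1, and the 2 min comes from current_time - t2. The 2 min and 1 min can be in any format (total minutes/seconds, a timedelta, or something else).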
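To make the arithmetic for sensor (1, 1) concrete, here is a minimal sketch using the three timestamps quoted above. The variable names `time_at_one` and `time_at_zero` are illustrative only and do not come from the original post.

```python
import pandas as pd

# Timestamps taken from the example in the question.
t1 = pd.Timestamp('2016-07-20 00:00:00')
t2 = pd.Timestamp('2016-07-20 00:01:00')
current_time = pd.Timestamp('2016-07-20 00:03:00')

# field3 was 1 from t1 until t2, then 0 from t2 until the current time.
time_at_one = t2 - t1
time_at_zero = current_time - t2

print(time_at_one)   # 0 days 00:01:00
print(time_at_zero)  # 0 days 00:02:00
```

Both results are `pd.Timedelta` objects, so they can be summed per sensor or converted with `.total_seconds()` if plain numbers are preferred.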
I tried the following:
import pandas as pd
from collections import defaultdict, namedtuple

# so I can create a defaultdict(Field3) and save some logic
class Field3(object):
    def __init__(self):
        self.zero = pd.Timedelta('0 days')
        self.one = pd.Timedelta('0 days')

# used to map to field3 in a dictionary
Sensor = namedtuple('Sensor', 'field1 field2')

# the dataframe mentioned above
df = pd.DataFrame(...)

# iterate through each row of the dataframe and map from (field1, field2) to
# field3, adding time based on the value of field3 in the frame and the
# time difference between this row and the next
rows = list(df.iterrows())
sensor_to_field3 = defaultdict(Field3)
for i in range(len(rows) - 1):
    sensor = Sensor(field1=rows[i][1]['field1'], field2=rows[i][1]['field2'])
    if rows[i][1]['field3']:
        sensor_to_field3[sensor].one += rows[i + 1][0] - rows[i][0]
    else:
        sensor_to_field3[sensor].zero += rows[i + 1][0] - rows[i][0]

# wrap each value in a list so it can be used as a one-row column
sensor_to_status = {k: [v] for k, v in sensor_to_field3.items()}
result = pd.DataFrame(sensor_to_status, index=[0])
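This basically gets me there (although at the moment it only works when the whole table contains a single sensor, and I would rather not work around that if there is a better approach).
I feel like there should be a better way to do this. Something like grouping on field1, field2 and then aggregating timedeltas based on field3 and the time index, but I am not sure how to go about that.
Managed to get it, in case anyone else runs into something similar. I am still not sure whether it is optimal, but it feels better than what I was doing.
I changed the original DataFrame to include the time as a column, and just use an integer index.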
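One way to do that reshaping, assuming the index is named `time` as in the printout in the question, is `reset_index`. The frame below is a small stand-in: only the first two timestamps come from the question, and the third is invented for illustration.

```python
import pandas as pd

# Stand-in for the original frame, indexed by a DatetimeIndex named 'time'.
# The third timestamp is an assumption made for this example.
df = pd.DataFrame(
    {'field1': [1, 1, 2], 'field2': [1, 1, 3], 'field3': [1, 0, 1]},
    index=pd.to_datetime(
        ['2016-07-20 00:00:00', '2016-07-20 00:01:00', '2016-07-20 00:02:00']
    ).rename('time'),
)

# Move the 'time' index into an ordinary column; the new index is 0, 1, 2, ...
df = df.reset_index()
print(df.columns.tolist())  # ['time', 'field1', 'field2', 'field3']
```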
from datetime import datetime

import pandas as pd

def create_time_deltas(dataframe):
    # add a timedelta column (on a copy, so the caller's frame is untouched)
    dataframe = dataframe.copy()
    dataframe['timedelta'] = pd.Timedelta(minutes=0)
    # iterate over each row and set the timedelta to the difference of the next one and this one
    labels = dataframe.index
    for this_row, next_row in zip(labels[:-1], labels[1:]):
        dataframe.at[this_row, 'timedelta'] = dataframe.at[next_row, 'time'] - dataframe.at[this_row, 'time']
    # the last row has no successor, so measure it against the current time
    dataframe.at[labels[-1], 'timedelta'] = pd.to_datetime(datetime.now()) - dataframe.at[labels[-1], 'time']
    return dataframe

def group_by_field3_time(dataframe, start=None, stop=None):
    # optionally set time bounds on what to care about
    start = start or pd.Timestamp.min
    stop = stop or pd.to_datetime(datetime.now())
    recent = dataframe.loc[(start < dataframe['time']) & (dataframe['time'] < stop)]
    # group by sensor and add the timedelta column to each group
    by_td = pd.concat([create_time_deltas(group) for _, group in recent.groupby(['field1', 'field2'])])
    # sum the timedeltas for each triple, which can be used later
    by_oc = by_td.groupby(['field1', 'field2', 'field3'])['timedelta'].sum()
    return by_oc
If anyone can come up with a better way to do this, I am all ears, but this definitely feels better than creating dictionaries all over the place.
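For comparison, one alternative that avoids the explicit row loop is to let `groupby(...).shift(-1)` look up each sensor's next reading and then sum the gaps per (field1, field2, field3). This is only a sketch under stated assumptions, not a drop-in replacement for the functions above: the helper name `time_per_value`, its `now` parameter, and the timestamps used for t3 to t5 are all invented for the example.

```python
import pandas as pd


def time_per_value(df, now):
    """Total time each (field1, field2) sensor spent at each field3 value.

    Hypothetical helper: expects the columns time, field1, field2, field3.
    """
    df = df.sort_values('time')
    # Next reading of the same sensor; a sensor's last reading lasts until `now`.
    next_time = df.groupby(['field1', 'field2'])['time'].shift(-1).fillna(now)
    durations = (next_time - df['time']).rename('timedelta')
    # Sum the durations per sensor and per field3 value.
    keys = [df['field1'], df['field2'], df['field3']]
    totals = durations.groupby(keys).sum()
    # One row per sensor, one column per field3 value; missing pairs become 0.
    return totals.unstack('field3').fillna(pd.Timedelta(0))


df = pd.DataFrame({
    'time': pd.to_datetime([
        '2016-07-20 00:00:00',  # t1 (from the question)
        '2016-07-20 00:01:00',  # t2 (from the question)
        '2016-07-20 00:01:30',  # t3 (assumed)
        '2016-07-20 00:02:00',  # t4 (assumed)
        '2016-07-20 00:02:30',  # t5 (assumed)
    ]),
    'field1': [1, 1, 2, 3, 1],
    'field2': [1, 1, 3, 3, 2],
    'field3': [1, 0, 1, 0, 0],
})

result = time_per_value(df, now=pd.Timestamp('2016-07-20 00:03:00'))
print(result)
```

With these inputs, the row for sensor (1, 1) holds 2 minutes under field3 = 0 and 1 minute under field3 = 1, which matches the table in the question. Passing `now` explicitly, rather than calling `datetime.now()` inside the helper, keeps the result reproducible.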