Python Pandas - combining two rows of a DataFrame, with a condition


I have a pandas DataFrame that looks like this:

A       B     C    Stime    Etime    
1220627 a   10.0 18:00:00 18:09:59
1220627 a   12.0 18:15:00 18:26:59
1220683 b   3.0  18:36:00 18:38:59
1220683 a   3.0  18:36:00 18:38:59
1220732 a   59.0 18:00:00 18:58:59
1220760 A   16.0 18:24:00 18:39:59
1220760 a   16.0 18:24:00 18:39:59
1220760 A   19.0 18:40:00 18:58:59
1220760 b   19.0 18:40:00 18:58:59
1220760 a   19.0 18:40:00 18:58:59
1220775 a   3.0  18:03:00 18:05:59
The Stime and Etime columns are of datetime type.

C is the number of minutes between Stime and Etime.

Column A is a household ID and column B is a person ID within the household

(so columns A and B together identify a unique person).

What I need to do is update the table so that, for a given person, whenever one row's Stime comes right after another row's Etime, I unite the two rows and update C.

For example, for person a in household 1220760, the first Etime is 18:39:59 and the second Stime is 18:40:00, which comes right after 18:39:59, so I want to unite these rows and update this person's C to 35 (16 + 19).
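The adjacency condition can be checked directly with pandas timestamps; a minimal sketch using the two times from the example (the variable names are mine):

```python
import pandas as pd

# Two rows should be united when the next row's Stime falls exactly one
# second after the current row's Etime.
etime = pd.to_datetime('18:39:59')
next_stime = pd.to_datetime('18:40:00')
adjacent = (next_stime - etime) == pd.Timedelta(seconds=1)
print(adjacent)  # True
```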


I tried to use groupby, but I couldn't figure out how to add the condition that the Stime must come immediately after the Etime.

If we add one second to each Etime, then we can find the rows to join by grouping by ['A', 'B'] and, within each group, comparing the shifted Etimes against the next Stimes:
df['Etime'] += pd.Timedelta(seconds=1)
df = df.sort_values(by=['A', 'B', 'Stime'])
df['keep'] = df.groupby(['A','B'])['Etime'].shift(1) != df['Stime']
#           A  B     C               Etime               Stime   keep
# 0   1220627  a  10.0 2016-05-29 18:10:00 2016-05-29 18:00:00   True
# 1   1220627  a  12.0 2016-05-29 18:27:00 2016-05-29 18:15:00   True
# 3   1220683  a   3.0 2016-05-29 18:39:00 2016-05-29 18:36:00   True
# 2   1220683  b   3.0 2016-05-29 18:39:00 2016-05-29 18:36:00   True
# 4   1220732  a  59.0 2016-05-29 18:59:00 2016-05-29 18:00:00   True
# 5   1220760  A  16.0 2016-05-29 18:40:00 2016-05-29 18:24:00   True
# 7   1220760  A  19.0 2016-05-29 18:59:00 2016-05-29 18:40:00  False
# 12  1220760  a   0.0 2016-05-29 18:10:00 2016-05-29 18:00:00   True
# 6   1220760  a  16.0 2016-05-29 18:40:00 2016-05-29 18:24:00   True
# 9   1220760  a  19.0 2016-05-29 18:59:00 2016-05-29 18:40:00  False
# 11  1220760  a  11.0 2016-05-29 19:10:00 2016-05-29 18:59:00  False
# 8   1220760  b  19.0 2016-05-29 18:59:00 2016-05-29 18:40:00   True
# 10  1220775  a   3.0 2016-05-29 18:06:00 2016-05-29 18:03:00   True
We want to keep the rows where keep is True and remove the rows where keep is False; beyond that, we also want to update the Etimes as needed.

It would be nice if we could assign a "group number" to each row, so that we could group by ['A', 'B', 'group_number'], and in fact we can. All we need to do is apply cumsum to the keep column:

df['group_number'] = df.groupby(['A','B'])['keep'].cumsum()
#           A  B     C               Etime               Stime   keep  group_number
# 0   1220627  a  10.0 2016-05-29 18:10:00 2016-05-29 18:00:00   True           1.0
# 1   1220627  a  12.0 2016-05-29 18:27:00 2016-05-29 18:15:00   True           2.0
# 3   1220683  a   3.0 2016-05-29 18:39:00 2016-05-29 18:36:00   True           1.0
# 2   1220683  b   3.0 2016-05-29 18:39:00 2016-05-29 18:36:00   True           1.0
# 4   1220732  a  59.0 2016-05-29 18:59:00 2016-05-29 18:00:00   True           1.0
# 5   1220760  A  16.0 2016-05-29 18:40:00 2016-05-29 18:24:00   True           1.0
# 7   1220760  A  19.0 2016-05-29 18:59:00 2016-05-29 18:40:00  False           1.0
# 12  1220760  a   0.0 2016-05-29 18:10:00 2016-05-29 18:00:00   True           1.0
# 6   1220760  a  16.0 2016-05-29 18:40:00 2016-05-29 18:24:00   True           2.0
# 9   1220760  a  19.0 2016-05-29 18:59:00 2016-05-29 18:40:00  False           2.0
# 11  1220760  a  11.0 2016-05-29 19:10:00 2016-05-29 18:59:00  False           2.0
# 8   1220760  b  19.0 2016-05-29 18:59:00 2016-05-29 18:40:00   True           1.0
# 10  1220775  a   3.0 2016-05-29 18:06:00 2016-05-29 18:03:00   True           1.0
The desired result can now be found by grouping by ['A', 'B', 'group_number'] and taking each group's minimum Stime and maximum Etime:
result = df.groupby(['A','B', 'group_number']).agg({'Stime':'min', 'Etime':'max'})

                                     Stime               Etime
A       B group_number                                        
1220627 a 1.0          2016-05-29 18:00:00 2016-05-29 18:10:00
          2.0          2016-05-29 18:15:00 2016-05-29 18:27:00
1220683 a 1.0          2016-05-29 18:36:00 2016-05-29 18:39:00
        b 1.0          2016-05-29 18:36:00 2016-05-29 18:39:00
1220732 a 1.0          2016-05-29 18:00:00 2016-05-29 18:59:00
1220760 A 1.0          2016-05-29 18:24:00 2016-05-29 18:59:00
        a 1.0          2016-05-29 18:00:00 2016-05-29 18:10:00
          2.0          2016-05-29 18:24:00 2016-05-29 19:10:00
        b 1.0          2016-05-29 18:40:00 2016-05-29 18:59:00
1220775 a 1.0          2016-05-29 18:03:00 2016-05-29 18:06:00

Putting it all together:

import pandas as pd

df = pd.DataFrame(
    {'A': [1220627, 1220627, 1220683, 1220683, 1220732, 1220760, 1220760,
           1220760, 1220760, 1220760, 1220775, 1220760, 1220760],
     'B': ['a', 'a', 'b', 'a', 'a', 'A', 'a', 'A', 'b', 'a', 'a', 'a', 'a'], 
     'C': [10.0, 12.0, 3.0, 3.0, 59.0, 16.0, 16.0, 19.0, 19.0, 19.0, 3.0, 11.0, 0], 
     'Stime': ['18:00:00', '18:15:00', '18:36:00', '18:36:00', '18:00:00',
               '18:24:00', '18:24:00', '18:40:00', '18:40:00', '18:40:00', 
               '18:03:00', '18:59:00', '18:00:00'],
     'Etime': ['18:09:59', '18:26:59', '18:38:59', '18:38:59', '18:58:59',
               '18:39:59', '18:39:59', '18:58:59', '18:58:59', '18:58:59', 
               '18:05:59', '19:09:59', '18:09:59'],})
for col in ['Stime', 'Etime']:
    df[col] = pd.to_datetime(df[col])
df['Etime'] += pd.Timedelta(seconds=1)
df = df.sort_values(by=['A', 'B', 'Stime'])
df['keep'] = df.groupby(['A','B'])['Etime'].shift(1) != df['Stime']
df['group_number'] = df.groupby(['A','B'])['keep'].cumsum()
result = df.groupby(['A','B', 'group_number']).agg({'Stime':'min', 'Etime':'max'})
result = result.reset_index()
result['C'] = (result['Etime']-result['Stime']).dt.total_seconds() / 60.0
result = result[['A', 'B', 'C', 'Stime', 'Etime']]
print(result)
This yields:

         A  B     C               Stime               Etime
0  1220627  a  10.0 2016-05-29 18:00:00 2016-05-29 18:10:00
1  1220627  a  12.0 2016-05-29 18:15:00 2016-05-29 18:27:00
2  1220683  a   3.0 2016-05-29 18:36:00 2016-05-29 18:39:00
3  1220683  b   3.0 2016-05-29 18:36:00 2016-05-29 18:39:00
4  1220732  a  59.0 2016-05-29 18:00:00 2016-05-29 18:59:00
5  1220760  A  35.0 2016-05-29 18:24:00 2016-05-29 18:59:00
6  1220760  a  10.0 2016-05-29 18:00:00 2016-05-29 18:10:00
7  1220760  a  46.0 2016-05-29 18:24:00 2016-05-29 19:10:00
8  1220760  b  19.0 2016-05-29 18:40:00 2016-05-29 18:59:00
9  1220775  a   3.0 2016-05-29 18:03:00 2016-05-29 18:06:00

An argument for using half-open intervals [start, end): when two intervals are adjacent, the end of one equals the start of the next.

Another advantage is that the number of minutes in a half-open interval equals end - start. When the interval is fully closed, the formula becomes end - start + 1.

Python's built-in range and list-slicing syntax use half-open intervals; I would suggest using half-open intervals [Stime, Etime) in the DataFrame as well.
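A small sketch of the duration claim, using the interval 18:24:00-18:39:59 from the sample data rewritten as the half-open [18:24:00, 18:40:00):

```python
import pandas as pd

# For a half-open interval [Stime, Etime), minutes = Etime - Stime,
# with no +1 correction.
stime = pd.to_datetime('18:24:00')
etime = pd.to_datetime('18:40:00')  # closed end 18:39:59 plus one second
minutes = (etime - stime).total_seconds() / 60.0
print(minutes)  # 16.0
```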


How about this approach:

In [68]: df.groupby(['A','B', df.Stime - df['Etime'].shift() <= pd.Timedelta('1S')], as_index=False)['C'].sum()
Out[68]:
         A  B     C
0  1220627  a  22.0
1  1220683  a   3.0
2  1220683  b   3.0
3  1220732  a  59.0
4  1220760  A  35.0
5  1220760  a  35.0
6  1220760  b  19.0
7  1220775  a   3.0
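For reference, this boolean-grouper approach can be run self-contained on the question's data; the grouper flags rows whose Stime is at most one second after the previous row's Etime (the DataFrame construction below is my paraphrase of the sample data). Note the shift is global rather than per group, so gaps at person boundaries can also compare True, but grouping by A and B still keeps different persons separate:

```python
import pandas as pd

# Rebuild the question's sample data.
df = pd.DataFrame({
    'A': [1220627, 1220627, 1220683, 1220683, 1220732,
          1220760, 1220760, 1220760, 1220760, 1220760, 1220775],
    'B': ['a', 'a', 'b', 'a', 'a', 'A', 'a', 'A', 'b', 'a', 'a'],
    'C': [10.0, 12.0, 3.0, 3.0, 59.0, 16.0, 16.0, 19.0, 19.0, 19.0, 3.0],
    'Stime': pd.to_datetime(['18:00:00', '18:15:00', '18:36:00', '18:36:00',
                             '18:00:00', '18:24:00', '18:24:00', '18:40:00',
                             '18:40:00', '18:40:00', '18:03:00']),
    'Etime': pd.to_datetime(['18:09:59', '18:26:59', '18:38:59', '18:38:59',
                             '18:58:59', '18:39:59', '18:39:59', '18:58:59',
                             '18:58:59', '18:58:59', '18:05:59']),
})
df = df.sort_values(['A', 'B', 'Stime'])

# True where a row starts no more than one second after the previous row ends.
grouper = df['Stime'] - df['Etime'].shift() <= pd.Timedelta('1s')
sums = df.groupby(['A', 'B', grouper])['C'].sum()
print(sums)
```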


OK, I think I have a solution, but it's very rough and I'm sure someone can improve it.

Assuming df = the data you provided above:

df['Stime'] = pd.to_datetime(df['Stime'], format='%H:%M:%S') # needs to be converted to datetime
df['Etime'] = pd.to_datetime(df['Etime'], format='%H:%M:%S') # needs to be converted to datetime

df = df.sort_values(['A','B','Stime']) # data needs to be sorted by unique person : Stime
df = df.reset_index(drop=True)
df = df.reset_index()

def new_person(row):
    # 'Yes' when this row's (A, B) pair differs from the row above
    if row.name > 0:
        prev = df.loc[row.name - 1]
        if row['A'] != prev['A'] or row['B'] != prev['B']:
            return 'Yes'

def update(row):
    # cumulative C when this row starts 0-1 seconds after the previous row ends
    if row.name > 0:
        prev = df.loc[row.name - 1]
        if row['B'] == prev['B']:
            gap = row['Stime'] - prev['Etime']
            if pd.Timedelta(seconds=0) <= gap < pd.Timedelta(seconds=2):
                return df.groupby(['A','B'])['C'].cumsum().loc[row.name]

def rewrite(row):
    if row['update'] > 0:
        return row['update']
    else:
        return row['C']

df['new_person'] = df.apply(new_person, axis=1) # adds column where value = 'Yes' if person is not the same as row above
df['update'] = df.apply(update, axis=1) # adds a column 'update' to allow for a cumulative sum rewritten to 'C' in rewrite function
print(df)

df['Stime'] = df['Stime'].dt.time # removes date from datetime
df['Etime'] = df['Etime'].dt.time # removes date from datetime
df['C'] = df.apply(rewrite, axis=1) # rewrites values for 'C' column

# hacky way of combining idxmax and indices of rows where the person is 'new'
updated = df.groupby(['A','B'])['C'].agg(pd.Series.idxmax).values
not_updated = df.index[df['new_person'].notnull()].tolist()

combined = [x for x in df.index if (x in updated or x in not_updated)]

df = df.iloc[combined]
df = df.drop(['new_person','update','index'], axis=1)
print(df)


Does the letter case in column B matter? Is A the same as a? - No, they are not the same; case matters.

What if more than 2 rows need to be merged? For example:

1220760 a 16.0 18:24:00 18:39:59
1220760 a 19.0 18:40:00 18:58:59
1220760 a 11.0 18:59:00 19:09:59
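The accepted cumsum-based approach handles chains longer than two rows; a self-contained sketch using the three rows from this comment:

```python
import pandas as pd

# Three back-to-back intervals for the same person.
df = pd.DataFrame({
    'A': [1220760, 1220760, 1220760],
    'B': ['a', 'a', 'a'],
    'C': [16.0, 19.0, 11.0],
    'Stime': pd.to_datetime(['18:24:00', '18:40:00', '18:59:00']),
    'Etime': pd.to_datetime(['18:39:59', '18:58:59', '19:09:59']),
})
df['Etime'] += pd.Timedelta(seconds=1)  # make the intervals half-open
df = df.sort_values(['A', 'B', 'Stime'])
df['keep'] = df.groupby(['A', 'B'])['Etime'].shift(1) != df['Stime']
df['group_number'] = df.groupby(['A', 'B'])['keep'].cumsum()
result = (df.groupby(['A', 'B', 'group_number'])
            .agg({'Stime': 'min', 'Etime': 'max'})
            .reset_index())
result['C'] = (result['Etime'] - result['Stime']).dt.total_seconds() / 60.0
print(result[['A', 'B', 'C', 'Stime', 'Etime']])  # one merged row, C = 46.0
```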