Python: drop duplicates on one column, breaking ties with another column
I have the following DataFrame:
import pandas as pd

x = pd.DataFrame({
    "item": ["a", "a", "a", "b", "c", "c"],
    "vote": [1, 0, 1, 1, 0, 0],
    "timestamp": ["2020-06-07 11:04:26", "2020-06-07 11:03:37", "2020-06-07 11:09:18", "2020-06-07 11:04:40", "2020-06-07 11:09:11", "2020-06-07 11:09:23"]
})
item  vote  timestamp
a     1     2020-06-07 11:04:26
a     0     2020-06-07 11:03:37
a     1     2020-06-07 11:09:18
b     1     2020-06-07 11:04:40
c     0     2020-06-07 11:09:11
c     0     2020-06-07 11:09:23
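How can I drop duplicates in the item column, using the timestamp column as the tie-breaker so that only the most recent row per item is kept?
The final DataFrame should look like this: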
item  vote  timestamp
a     1     2020-06-07 11:09:18
b     1     2020-06-07 11:04:40
c     0     2020-06-07 11:09:23
You can call sort_values on "item" and "timestamp" before dropping duplicates:
x.sort_values(['item', 'timestamp']).drop_duplicates('item', keep='last')
  item  vote            timestamp
2    a     1  2020-06-07 11:09:18
3    b     1  2020-06-07 11:04:40
5    c     0  2020-06-07 11:09:23
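Specifying keep='last' means that, for each item, every row except the last one is dropped. Because the previous step sorted by timestamp within each item, the last row is the most recent one.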
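For reference, here is the same approach as a self-contained sketch. The pd.to_datetime conversion is an addition that is not part of the answer above: the strings in this example happen to sort correctly because they share one ISO-like format, but parsing them avoids relying on that. The variable name latest is only illustrative.

```python
import pandas as pd

x = pd.DataFrame({
    "item": ["a", "a", "a", "b", "c", "c"],
    "vote": [1, 0, 1, 1, 0, 0],
    "timestamp": ["2020-06-07 11:04:26", "2020-06-07 11:03:37",
                  "2020-06-07 11:09:18", "2020-06-07 11:04:40",
                  "2020-06-07 11:09:11", "2020-06-07 11:09:23"],
})

# Assumption (not in the original answer): parse the strings so that
# ordering does not depend on the string format.
x["timestamp"] = pd.to_datetime(x["timestamp"])

# Sort so that, within each item, the newest row comes last,
# then keep only that last row per item.
latest = (
    x.sort_values(["item", "timestamp"])
     .drop_duplicates("item", keep="last")
)
print(latest)
```

The original row labels (2, 3, 5) are preserved; chaining .reset_index(drop=True) would renumber them if that is preferred.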
Another way:
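x['timestamp'] = pd.to_datetime(x['timestamp'])  # coerce timestamp strings to datetime
x = x.set_index('timestamp').sort_index()  # index by timestamp; sort so that 'last' means the latest row
# group by calendar date and item, keeping the last vote in each group
x2 = x.groupby([x.index.date, x['item']])['vote'].agg(vote='last').reset_index()
x2.columns = ['timestamp', 'item', 'vote']  # note: timestamp now holds only the date, not the time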
You need to sort by timestamp, then call drop_duplicates with item as the subset.
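A minimal sketch of what that comment describes, assuming the same sample data as in the question. Sorting by timestamp alone is enough here, since drop_duplicates only compares the columns named in subset. The trailing sort_index, which restores the original row order, and the name result are illustrative additions.

```python
import pandas as pd

x = pd.DataFrame({
    "item": ["a", "a", "a", "b", "c", "c"],
    "vote": [1, 0, 1, 1, 0, 0],
    "timestamp": ["2020-06-07 11:04:26", "2020-06-07 11:03:37",
                  "2020-06-07 11:09:18", "2020-06-07 11:04:40",
                  "2020-06-07 11:09:11", "2020-06-07 11:09:23"],
})

result = (
    x.sort_values("timestamp")                      # oldest first, newest last
     .drop_duplicates(subset="item", keep="last")   # keep the newest row per item
     .sort_index()                                  # optional: back to original row order
)
print(result)
```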