Python: Eliminating duplicate entries from a DataFrame

python pandas dataframe numpy

I have a pandas DataFrame that looks like this:

Date       positions      price                   
2009-03-03       buy   3.156071
2009-12-10       buy   7.015357
2010-02-02       buy   6.995000
2010-03-04      sell   7.525357
2013-09-24       buy  17.467857
2013-10-08       buy  17.176428
2014-01-16       buy  19.794643
2014-01-28       buy  18.089285
2014-04-02      sell  19.376785

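This is only a snippet of the DataFrame. What I want is to have exactly one row with "buy" in the positions column between any two rows that have "sell" in the positions column. In other words, once the first buy signal has occurred, I want to drop the buy signals that repeat before the next sell.

Given the DataFrame above, the expected output is: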

Date       positions      price                 
2009-03-03       buy   3.156071
2010-03-04      sell   7.525357
2013-09-24       buy  17.467857
2014-04-02      sell  19.376785

It isn't clear what you want the value of the grouped buy rows to be. I chose sum, but maybe you want mean.

import pandas as pd

df = pd.DataFrame({'Date': {0: '2009-03-03',
  1: '2009-12-10',
  2: '2010-02-02',
  3: '2010-03-04',
  4: '2013-09-24',
  5: '2013-10-08',
  6: '2014-01-16',
  7: '2014-01-28',
  8: '2014-04-02'},
 'positions': {0: 'buy',
  1: 'buy',
  2: 'buy',
  3: 'sell',
  4: 'buy',
  5: 'buy',
  6: 'buy',
  7: 'buy',
  8: 'sell'},
 'price': {0: 3.156071,
  1: 7.015357000000001,
  2: 6.995,
  3: 7.5253570000000005,
  4: 17.467857000000002,
  5: 17.176428,
  6: 19.794643,
  7: 18.089285,
  8: 19.376785}})


df['g'] = (df['positions']=='sell').cumsum()
df = df.groupby(['g','positions']).sum().reset_index()
df.sort_values(by=['g','positions'], ascending=[True,False], inplace=True)

df[['positions','price']]

Output:

   positions    price
0   buy     17.166428
2   sell    7.525357
1   buy     72.528213
3   sell    19.376785
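
To make the grouping key easier to follow, here is a minimal sketch of what the cumulative sum produces for the sample signals. The standalone Series named positions is only an illustration and is not part of the answer's code.

```python
import pandas as pd

# Same sequence of signals as in the question's sample data.
positions = pd.Series(
    ['buy', 'buy', 'buy', 'sell', 'buy', 'buy', 'buy', 'buy', 'sell']
)

# The counter increases by one on every 'sell' row, so each 'sell'
# shares its label with the run of 'buy' rows that FOLLOWS it.
g = (positions == 'sell').cumsum()
print(g.tolist())
```

Because a sell row carries the same label as the buys that come after it, the answer sorts positions in descending order within each g ('sell' sorts after 'buy' alphabetically), which puts the rows back in chronological order.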

A modification based on @Chris's solution:

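# Keep a row only when its signal differs from the last kept signal.
# DataFrame.append was removed in pandas 2.0, so collect row labels instead.
keep = []
position = 'sell'  # treat the state before the first row as "sold"
for idx, row in df.iterrows():
    if row['positions'] != position:
        keep.append(idx)
        position = row['positions']
df_new = df.loc[keep]
df_new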
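
The loop above keeps a row whenever its signal differs from the previously kept one. A vectorised sketch of the same idea compares each row with the one before it using Series.shift. The name prev and the fill_value='sell' choice (which mimics the loop's initial position='sell') are assumptions made for this illustration.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'Date': ['2009-03-03', '2009-12-10', '2010-02-02', '2010-03-04',
             '2013-09-24', '2013-10-08', '2014-01-16', '2014-01-28',
             '2014-04-02'],
    'positions': ['buy', 'buy', 'buy', 'sell',
                  'buy', 'buy', 'buy', 'buy', 'sell'],
    'price': [3.156071, 7.015357, 6.995000, 7.525357, 17.467857,
              17.176428, 19.794643, 18.089285, 19.376785],
})

# Signal of the previous row; the first row is compared against 'sell',
# matching the loop's starting state.
prev = df['positions'].shift(fill_value='sell')

# Keep only the rows where the signal changes.
df_new = df[df['positions'] != prev]
print(df_new)
```

On the sample data this keeps the rows dated 2009-03-03, 2010-03-04, 2013-09-24 and 2014-04-02. Note that it also collapses consecutive sell rows, which may or may not be what you want.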

Thanks, everyone, for the help. I have checked the output, and the final code I decided to use is:

# Filter out multiple buying signals
df['g'] = (df['positions']=='sell').cumsum()
df = df.groupby(['g','positions']).first().reset_index()
df.sort_values(by=['g','positions'], ascending=[True,False], inplace=True)
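
As a quick check, the same steps can be run end to end on the sample data from the question. This is only a sketch: the variable name out, the use of as_index=False, and the final drop of the helper column g are additions for the illustration, not part of the code above.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'Date': ['2009-03-03', '2009-12-10', '2010-02-02', '2010-03-04',
             '2013-09-24', '2013-10-08', '2014-01-16', '2014-01-28',
             '2014-04-02'],
    'positions': ['buy', 'buy', 'buy', 'sell',
                  'buy', 'buy', 'buy', 'buy', 'sell'],
    'price': [3.156071, 7.015357, 6.995000, 7.525357, 17.467857,
              17.176428, 19.794643, 18.089285, 19.376785],
})

# Label each block of rows; the counter increases on every 'sell'.
df['g'] = (df['positions'] == 'sell').cumsum()

out = (
    df.groupby(['g', 'positions'], as_index=False)
      .first()                                   # first row of each (g, positions) group
      .sort_values(by=['g', 'positions'], ascending=[True, False])
      .drop(columns='g')                         # helper column no longer needed
      .reset_index(drop=True)
)
print(out)
```

With the sample data this yields the four rows listed as the expected output in the question. The ordering step relies on 'sell' sorting after 'buy' alphabetically, so it would need adjusting for other signal names.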

Comments:

I don't understand what you want. Can you show an example (expected output)?

I have just added the expected output to the question. What I want is to eliminate an entry with a buy signal if the previous entry also contains a buy signal. More specifically, between each pair of entries containing a sell signal, I want to keep the first buy signal and drop the rest.

I have a feeling the OP wants .first() rather than .sum(). A second note: don't repeat the input DataFrame definition in your answer, since it makes the answer unnecessarily long.

Thanks Chris, but this isn't the solution I was looking for. I should have included the expected output from the start, but I have only just added it. Thanks for your help.

@Jason: if you follow the comment from the awesome person above, this answer fits your needs... And Chris, really nice answer!

Thank you Gregorz & Adir, you're right! Thanks @Chris, your solution uses a lot of functions that were new to me.

Thanks sudhish, this is great.