Python: excluding values in a DataFrame if they have already appeared in a specific pattern

python pandas dataframe

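I am using Python 3.4 and pandas in PyCharm.

I have arranged my data in a pandas data frame that looks roughly like this:

The problem is that "step" at row [15] and "step" at row [16] are 8 and 1 respectively, and a difference like that is not tolerable for the kind of analysis I am running. So I would like to exclude/drop all the rows between row 15 and the row at which "step" returns to the value it had at row [15], in this case 8, which can be found at row [23]. [Edit after receiving the first answer] Please keep in mind that the rule is that any subsequent value may only differ by +/-1. For example, "step" at row [9] is 4, which is one less than "step" at row [8], which is 5. That difference is allowed; any difference greater than +/-1 is not.

This is only an example. The real data has hundreds of thousands of rows, so I expect this problem to occur more than once in my data frame.

I have been looking at ways to iterate over the rows with something like a for loop, but I have been warned that these approaches are very slow. In any case, I could not come up with a working loop.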

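I also have not managed to find a clever way of doing this without a loop, using only pandas and some kind of logical indexing. I am not even sure this is feasible without iterating. Right now I can successfully find all the rows where the difference between row [i] and row [i+1] is greater than 1 in absolute value and index them logically, but I am stuck at that point.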
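The detection step described above (flagging rows whose successor differs by more than 1) could look roughly like this. It is only a sketch: the "step" values are borrowed from the sample in the self-answer further down, and `delta`, `jump` and `jump_rows` are illustrative names, not something from the original data.

```python
import pandas as pd

# Sample 'step' values taken from the self-answer; the real table is not shown.
df = pd.DataFrame({"step": [1, 2, 2, 3, 4, 4, 4, 5, 5, 4, 5, 6, 5, 6, 7, 8,
                            1, 2, 3, 4, 5, 6, 7, 8]})

# Next row's step minus this row's step; the last row has no successor (NaN).
delta = df["step"].shift(-1) - df["step"]

# True where the next row differs by more than 1 in either direction.
# NaN compares as False, so the last row is never flagged.
jump = delta.abs() > 1

# Row labels at which a forbidden jump starts.
jump_rows = df.index[jump].tolist()
print(jump_rows)
```

On this sample the only flagged row is 15, where "step" goes from 8 to 1. This finds where the jumps start, but not yet where each excluded segment ends.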

In the end I would like a data frame in which rows 16 to 22 are excluded.

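I will delete this if someone posts a shorter solution, but what I came up with is to create a df that finds the first trial for each step, and to remove a row if a higher step has already appeared in an earlier trial: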
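# first trial at which each step value appears
first_apps = temp_df.sort_values(['step', 'trials']).drop_duplicates('step')
# for each step, the first trial at which the next higher step appears
first_apps['next_step'] = first_apps['trials'].shift(-1)
temp_df = temp_df.merge(first_apps.drop('trials', axis=1), how='left')
# drop rows that come after the next higher step has already appeared
temp_df = temp_df[~(temp_df['trials'] > temp_df['next_step'])].drop('next_step', axis=1)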
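To see what this filter does in practice, here is a small run on sample data. It rests on assumptions: the "step" values are the ones from the self-answer below, and the "trials" column is assumed to be the plain row number, since the original table is not shown. `merged`, `filtered` and `kept_trials` are names introduced only for this illustration.

```python
import pandas as pd

steps = [1, 2, 2, 3, 4, 4, 4, 5, 5, 4, 5, 6, 5, 6, 7, 8,
         1, 2, 3, 4, 5, 6, 7, 8]
# Assumption: 'trials' is simply the row number.
temp_df = pd.DataFrame({"trials": range(len(steps)), "step": steps})

# First trial at which each step value appears.
first_apps = temp_df.sort_values(["step", "trials"]).drop_duplicates("step")
# For each step, the first trial of the next higher step (NaN for the highest).
first_apps = first_apps.assign(next_step=first_apps["trials"].shift(-1))

# Attach next_step to every row via the shared 'step' column.
merged = temp_df.merge(first_apps.drop("trials", axis=1), how="left")

# Keep rows that occur no later than the first trial of the next higher step.
filtered = merged[~(merged["trials"] > merged["next_step"])].drop("next_step", axis=1)

kept_trials = filtered["trials"].tolist()
print(kept_trials)
```

Under these assumptions rows 16 to 22 are removed, but so are rows 9 and 12, which are single-step decreases that the +/-1 rule allows. That seems to be the mismatch raised in the comments.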

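After some further research, my own solution is a combination of a loop and data frame manipulation.
I first created two extra columns: one holding the difference in "step" between each row and the next one, temp_df['diff'] = temp_df['step'] - temp_df['step'].shift(-1); and one called jump, a simple logical index that is True wherever the new column is greater than 1, temp_df['jump'] = temp_df['diff'] > 1.

Then I basically build an index of all the "jumps" and run a for loop in which:
1) I extract the index and the value of the first jump in the series ("curr_idx" and "curr_value")
2) I copy the subset of the original data frame from that index onward into a new data frame ("temp_df2")
3) I look in the new data frame for the index of the first occurrence of the first jump's value ("last_value")
4) I drop the rows of the original data frame from the first index to the last index ("curr_idx:last_value")
I also run everything under "try:" because this solution throws an error that I could not resolve. Sorry.
The code is as follows: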
import pandas as pd
import matplotlib.pyplot as plt

data = {'step': [1, 2, 2, 3, 4, 4, 4, 5, 5, 4, 5, 6, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8]}
temp_df = pd.DataFrame(data=data)

# 'diff' is this row's step minus the next row's step; 'jump' flags drops of more than 1
temp_df['diff'] = temp_df['step'] - temp_df['step'].shift(-1)
temp_df['jump'] = temp_df['diff'] > 1
temp_df = temp_df.reset_index(drop=True)

# keep the unfiltered frame for the comparison plot
old_df = temp_df.copy()

all_values = temp_df[temp_df['jump']]['step']

try:
    for i in range(len(all_values)):

        # index and value of the first remaining jump
        all_values = temp_df[temp_df['jump']]['step']
        curr_idx = temp_df[temp_df['jump']].index.values.astype(int)[0]
        curr_value = all_values.iloc[0]

        # first row after the jump where 'step' is back at the pre-jump value
        temp_df2 = temp_df.drop(temp_df.index[0:curr_idx+1])
        last_value = temp_df2[temp_df2['step'] == curr_value].index.values.astype(int)[0]

        # drop rows curr_idx .. last_value-1, then renumber
        temp_df = temp_df.drop(temp_df.index[curr_idx:last_value])
        temp_df = temp_df.reset_index(drop=True)
except IndexError:
    # raised by the [0] lookups when no jump is left or the pre-jump value never reappears
    pass

plt.subplot(121)
plt.plot(old_df['step'])

plt.subplot(122)
plt.plot(temp_df['step'])
plt.show()

Here is the output: [figure: "step" before filtering (left panel) and after filtering (right panel)]
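For comparison, the same idea can be sketched as a single pass over the column, without the try/except. This is not the code from the answer but a variant under stated assumptions. The function name `drop_jump_segments` and the parameter `max_diff` are made up for this sketch. It treats jumps in both directions as forbidden, and it keeps both the row before the jump and the row where the value returns (rows 16 to 22 removed, as the question describes), whereas the loop above drops the pre-jump row instead. If the value never returns, every remaining row is dropped.

```python
import numpy as np
import pandas as pd


def drop_jump_segments(df, col="step", max_diff=1):
    """Keep rows whose `col` moves by at most `max_diff` from the last kept row.

    After a larger jump, rows are skipped until `col` equals the value seen
    just before the jump; that row is kept and normal checking resumes.
    """
    keep = []
    last = None        # value of the most recently kept row
    skipping = False   # True while waiting for `last` to reappear
    for v in df[col].to_numpy():
        if last is None:
            keep.append(True)          # always keep the first row
            last = v
        elif skipping:
            if v == last:
                skipping = False       # value is back: resume normal checking
                keep.append(True)
            else:
                keep.append(False)
        elif abs(v - last) <= max_diff:
            keep.append(True)
            last = v
        else:
            skipping = True            # forbidden jump: start skipping here
            keep.append(False)
    mask = np.array(keep, dtype=bool)
    return df.loc[mask].reset_index(drop=True)


data = {"step": [1, 2, 2, 3, 4, 4, 4, 5, 5, 4, 5, 6, 5, 6, 7, 8,
                 1, 2, 3, 4, 5, 6, 7, 8]}
result = drop_jump_segments(pd.DataFrame(data))
print(result["step"].tolist())
```

The loop is plain Python over a NumPy array and visits each row once, so it should be workable for a few hundred thousand rows, though I have not timed it on data of that size.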
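Thank you very much for the solution. Unfortunately I did not explain my problem well: the difference between trial n and trial n+1 should be +/-1. Differences greater than +/-1 are not allowed. I will make this clearer in the original question.
So row 22 is fine and should not be excluded (its step is 7, and the step at row 15 is 8)?
In this particular case row 22 is not OK. But if that specific rule makes the coding harder, I can drop it.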