Python Pandas: drop consecutive duplicates
In pandas, what is the most efficient way to drop only *consecutive* duplicates, given the functionality drop_duplicates provides? drop_duplicates gives the following:
In [3]: a = pandas.Series([1,2,2,3,2], index=[1,2,3,4,5])
In [4]: a.drop_duplicates()
Out[4]:
1 1
2 2
4 3
dtype: int64
But I want this:
In [4]: a.something()
Out[4]:
1 1
2 2
4 3
5 2
dtype: int64
Use loc with a boolean criterion: we compare the Series against itself shifted by -1 rows to create the mask.
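That mask can be sketched as a minimal runnable example (note that it keeps the *last* element of each run, which is the subtle point the update below corrects):

```python
import pandas as pd

a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])

# True wherever the next element differs, i.e. at the end of each run.
mask = a.shift(-1) != a
result = a.loc[mask]
print(result.tolist())         # [1, 2, 3, 2]
print(result.index.tolist())   # [1, 3, 4, 5] -- keeps the *last* of each run
```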
Another method can be used as well, but it is slower than the original approach when you have a large number of rows.
Update

Thanks to Bjarke Ebert for pointing out a subtle error: I should actually use shift(1), or just shift() (since the default is 1). This returns the first consecutive value:
In [87]:
a.loc[a.shift() != a]
Out[87]:
1 1
2 2
4 3
5 2
dtype: int64
Note the difference in the index values, thanks @BjarkeEbert! Here is an update to make it work with multiple columns. Use ".any(axis=1)" to combine the results from each column:
cols = ["col1","col2","col3"]
de_dup = a[cols].loc[(a[cols].shift() != a[cols]).any(axis=1)]
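A small self-contained illustration of the multi-column version (with hypothetical sample data): a row survives if *any* of the watched columns differs from the previous row.

```python
import pandas as pd

a = pd.DataFrame({
    "col1": [1, 1, 2, 2, 3],
    "col2": [9, 9, 8, 7, 7],
    "col3": [0, 0, 0, 0, 0],
})
cols = ["col1", "col2", "col3"]

# Row 1 repeats row 0 in every watched column, so it is dropped;
# every other row differs in at least one column and is kept.
de_dup = a[cols].loc[(a[cols].shift() != a[cols]).any(axis=1)]
print(de_dup.index.tolist())   # [0, 2, 3, 4]
```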
Since we are after the most efficient way, i.e. performance, let's use array data to leverage NumPy. We will slice and compare in one shot, similar to the shifting method discussed earlier in @EdChum's post. But with NumPy slicing we end up with one element fewer, so we need to concatenate a True element at the start to select the first element. Thus, we would have an implementation like so -
def drop_consecutive_duplicates(a):
    ar = a.values
    return a[np.concatenate(([True], ar[:-1] != ar[1:]))]
Sample run -
In [149]: a
Out[149]:
1 1
2 2
3 2
4 3
5 2
dtype: int64
In [150]: drop_consecutive_duplicates(a)
Out[150]:
1 1
2 2
4 3
5 2
dtype: int64
Timings comparison on big arrays -

So, some improvement there.
Major boost for values only!

If only the values are needed, we could get a major boost by simply indexing into the array data, like so -
def drop_consecutive_duplicates(a):
    ar = a.values
    return ar[np.concatenate(([True], ar[:-1] != ar[1:]))]
Sample run -
In [170]: a = pandas.Series([1,2,2,3,2], index=[1,2,3,4,5])
In [171]: drop_consecutive_duplicates(a)
Out[171]: array([1, 2, 3, 2])
Timings -
In [173]: a = pd.Series(np.random.randint(1,5,(10000000)))
In [174]: %timeit a.loc[a.shift() != a]
10 loops, best of 3: 137 ms per loop
In [175]: %timeit drop_consecutive_duplicates(a)
10 loops, best of 3: 61.3 ms per loop
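As a quick self-contained sanity check (restating the values-only variant above under a hypothetical name), the NumPy path returns a plain ndarray whose values match the pandas mask; only the index is lost:

```python
import numpy as np
import pandas as pd

def drop_consecutive_duplicates_values(a):
    # Values-only variant: index into the underlying array and return
    # a plain ndarray -- faster, but the pandas index is discarded.
    ar = a.values
    return ar[np.concatenate(([True], ar[:-1] != ar[1:]))]

a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])
out = drop_consecutive_duplicates_values(a)
print(out)                           # [1 2 3 2]
print(a.loc[a.shift() != a].values)  # [1 2 3 2] -- same values
```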
For other Stack explorers, building off the johnml1135 answer above: this will remove the next duplicate based on multiple columns, without dropping all columns. When the dataframe is sorted it will keep the first row but drop the second row if the "cols" match, even if there are more columns with non-matching information.
cols = ["col1","col2","col3"]
df = df.loc[(df[cols].shift() != df[cols]).any(axis=1)]
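A small illustration of that caveat, with a hypothetical extra column `other` that is not in `cols`: the second row is dropped even though `other` differs on every row.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 1, 2],
    "col2": [5, 5, 5],
    "col3": [7, 7, 7],
    "other": [10, 11, 12],   # changes every row, but is NOT in cols
})
cols = ["col1", "col2", "col3"]

# Row 1 is dropped: col1/col2/col3 repeat row 0, and `other` is ignored.
result = df.loc[(df[cols].shift() != df[cols]).any(axis=1)]
print(result.index.tolist())   # [0, 2]
```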
Here is a function that handles both pd.Series and pd.DataFrame. You can mask or drop the values, choose the axis, and finally choose whether to drop rows containing 'any' or 'all' NaN. It is not optimized for computation time, but it has the advantage of being robust and pretty clear.
import numpy as np
import pandas as pd
# To mask/drop successive values in pandas
def Mask_Or_Drop_Successive_Identical_Values(df, drop=False,
                                             keep_first=True,
                                             axis=0, how='all'):
    '''
    # Function built with the help of:
    # 1) https://stackoverflow.com/questions/48428173/how-to-change-consecutive-repeating-values-in-pandas-dataframe-series-to-nan-or
    # 2) https://stackoverflow.com/questions/19463985/pandas-drop-consecutive-duplicates
    Input:
        df should be a pandas.DataFrame or a pandas.Series
    Output:
        df or ts with masked or dropped values
    '''
    # Mask keeping the first occurrence
    if keep_first:
        df = df.mask(df.shift(1) == df)
    # Mask including the first occurrence
    else:
        df = df.mask((df.shift(1) == df) | (df.shift(-1) == df))
    # Drop the values (e.g. rows are deleted)
    if drop:
        return df.dropna(axis=axis, how=how)
    # Only mask the values (e.g. they become 'NaN')
    else:
        return df
Here is the test code to include in the script:
if __name__ == "__main__":
    # With time series
    print("With time series:\n")
    ts = pd.Series([1,1,2,2,3,2,6,6,float('nan'), 6,6,float('nan'),float('nan')],
                   index=[0,1,2,3,4,5,6,7,8,9,10,11,12])
    print("#Original ts:")
    print(ts)
    print("\n## 1) Mask keeping the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop=False,
                                                   keep_first=True))
    print("\n## 2) Mask including the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop=False,
                                                   keep_first=False))
    print("\n## 3) Drop keeping the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop=True,
                                                   keep_first=True))
    print("\n## 4) Drop including the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop=True,
                                                   keep_first=False))

    # With dataframes
    print("With dataframe:\n")
    df = pd.DataFrame(np.random.randn(15, 3))
    df.iloc[4:9, 0] = 40
    df.iloc[8:15, 1] = 22
    df.iloc[8:12, 2] = 0.23
    print("#Original df:")
    print(df)
    print("\n## 5) Mask keeping the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=False,
                                                   keep_first=True))
    print("\n## 6) Mask including the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=False,
                                                   keep_first=False))
    print("\n## 7) Drop 'any' keeping the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=True,
                                                   keep_first=True,
                                                   how='any'))
    print("\n## 8) Drop 'all' keeping the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=True,
                                                   keep_first=True,
                                                   how='all'))
    print("\n## 9) Drop 'any' including the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=True,
                                                   keep_first=False,
                                                   how='any'))
    print("\n## 10) Drop 'all' including the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=True,
                                                   keep_first=False,
                                                   how='all'))
Here are the expected results:
With time series:
#Original ts:
0 1.0
1 1.0
2 2.0
3 2.0
4 3.0
5 2.0
6 6.0
7 6.0
8 NaN
9 6.0
10 6.0
11 NaN
12 NaN
dtype: float64
## 1) Mask keeping the first occurence:
0 1.0
1 NaN
2 2.0
3 NaN
4 3.0
5 2.0
6 6.0
7 NaN
8 NaN
9 6.0
10 NaN
11 NaN
12 NaN
dtype: float64
## 2) Mask including the first occurence:
0 NaN
1 NaN
2 NaN
3 NaN
4 3.0
5 2.0
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
dtype: float64
## 3) Drop keeping the first occurence:
0 1.0
2 2.0
4 3.0
5 2.0
6 6.0
9 6.0
dtype: float64
## 4) Drop including the first occurence:
4 3.0
5 2.0
dtype: float64
With dataframe:
#Original df:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
4 40.000000 1.889781 -1.394573
5 40.000000 -0.470958 -0.339213
6 40.000000 1.613524 0.271641
7 40.000000 -1.810958 -1.568372
8 40.000000 22.000000 0.230000
9 -0.296557 22.000000 0.230000
10 -0.921238 22.000000 0.230000
11 -0.170195 22.000000 0.230000
12 1.460457 22.000000 -0.295418
13 0.307825 22.000000 -0.759131
14 0.287392 22.000000 0.378315
## 5) Mask keeping the first occurence:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
4 40.000000 1.889781 -1.394573
5 NaN -0.470958 -0.339213
6 NaN 1.613524 0.271641
7 NaN -1.810958 -1.568372
8 NaN 22.000000 0.230000
9 -0.296557 NaN NaN
10 -0.921238 NaN NaN
11 -0.170195 NaN NaN
12 1.460457 NaN -0.295418
13 0.307825 NaN -0.759131
14 0.287392 NaN 0.378315
## 6) Mask including the first occurence:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
4 NaN 1.889781 -1.394573
5 NaN -0.470958 -0.339213
6 NaN 1.613524 0.271641
7 NaN -1.810958 -1.568372
8 NaN NaN NaN
9 -0.296557 NaN NaN
10 -0.921238 NaN NaN
11 -0.170195 NaN NaN
12 1.460457 NaN -0.295418
13 0.307825 NaN -0.759131
14 0.287392 NaN 0.378315
## 7) Drop 'any' keeping the first occurence:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
4 40.000000 1.889781 -1.394573
## 8) Drop 'all' keeping the first occurence:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
4 40.000000 1.889781 -1.394573
5 NaN -0.470958 -0.339213
6 NaN 1.613524 0.271641
7 NaN -1.810958 -1.568372
8 NaN 22.000000 0.230000
9 -0.296557 NaN NaN
10 -0.921238 NaN NaN
11 -0.170195 NaN NaN
12 1.460457 NaN -0.295418
13 0.307825 NaN -0.759131
14 0.287392 NaN 0.378315
## 9) Drop 'any' including the first occurence:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
## 10) Drop 'all' including the first occurence:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
4 NaN 1.889781 -1.394573
5 NaN -0.470958 -0.339213
6 NaN 1.613524 0.271641
7 NaN -1.810958 -1.568372
9 -0.296557 NaN NaN
10 -0.921238 NaN NaN
11 -0.170195 NaN NaN
12 1.460457 NaN -0.295418
13 0.307825 NaN -0.759131
14 0.287392 NaN 0.378315
Just another way of doing it:
a.loc[a.ne(a.shift())]
The method pandas.Series.ne is the not-equal operator, so a.ne(a.shift()) is equivalent to a != a.shift(). See the documentation. Here is a variant that also treats consecutive NaNs as duplicates:
def drop_consecutive_duplicates(s):
    # By default, `shift` uses NaN as the fill value, which breaks
    # dropping consecutive NaNs. So we use a different sentinel
    # object instead.
    shifted = s.astype(object).shift(-1, fill_value=object())
    return s.loc[
        (shifted != s)
        & ~(shifted.isna() & s.isna())
    ]
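A self-contained check of this variant (restating the function, with hypothetical sample data) against the plain shift mask, which keeps both consecutive NaNs:

```python
import pandas as pd

def drop_consecutive_duplicates(s):
    # `shift` would pad with NaN, which is exactly the value we want to
    # de-duplicate, so a fresh object() serves as the sentinel instead.
    shifted = s.astype(object).shift(-1, fill_value=object())
    return s.loc[(shifted != s) & ~(shifted.isna() & s.isna())]

s = pd.Series([1, 1, float('nan'), float('nan'), 2])

# The plain mask keeps both NaNs, because NaN != NaN is True:
plain = s.loc[s.shift() != s]
print(plain.tolist())            # [1.0, nan, nan, 2.0]

# The variant collapses the NaN run (keeping the end of each run):
dedup = drop_consecutive_duplicates(s)
print(dedup.tolist())            # [1.0, nan, 2.0]
```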
Create a new column:
df['match'] = df.col1.eq(df.col1.shift())
Then:
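The follow-up step is elided in this answer; presumably the flag is then used to filter out the matching rows. A minimal sketch under that assumption:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 2, 3, 2]})

# Flag rows whose col1 equals the previous row's col1 ...
df['match'] = df.col1.eq(df.col1.shift())
# ... then keep only unflagged rows and discard the helper column.
result = df.loc[~df['match']].drop(columns='match')
print(result.col1.tolist())    # [1, 2, 3, 2]
```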
I don't understand why the timings of [147] and [175] are different? Can you explain what change you made? I don't see any change; maybe a typo? @Biarys [175] is from the modified "Major boost for values only" section, hence the timing difference. The original one works on a pandas Series, while the modified one works on arrays, also listed in the post.
Oh I see. It's hard to notice the change from return a[…] to return ar[…]. Does your function work for dataframes? @Biarys For dataframes, if you are looking for duplicate rows, we simply use slicing: ar[:, :-1] != ar[:, 1:], along with an .all reduction.
Thanks. I would try to avoid the explicit check of the value; if keep_first: is sufficient (better style).
What should we do if we want to do a groupby first and then drop consecutive duplicates? E.g. df.groupby(['Col1','Col2']), and save it as a dataframe again?
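One hypothetical way to handle that last groupby question (a sketch, not taken from any of the answers above): groupby(...).shift() shifts within each group, so the usual mask only collapses runs inside a group.

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["a", "a", "a", "b", "b"],
    "Col2": ["x", "x", "x", "y", "y"],
    "val":  [1, 1, 2, 2, 2],
})

# groupby(...).shift() shifts per group, so row 3 survives even though
# its `val` equals row 2's: it starts a new (Col1, Col2) group.
mask = df["val"] != df.groupby(["Col1", "Col2"])["val"].shift()
result = df.loc[mask]
print(result.index.tolist())   # [0, 2, 3]
```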