Python: Fastest way to drop rows / get a subset by difference from a large DataFrame

python pandas dataframe

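I'm looking for the fastest way to drop a set of rows, whose index labels I already have, from a large DataFrame, or equivalently to get the subset of rows whose labels are not in that set (both give the same result).

So far I have two solutions, and both seem relatively slow to me:

df.loc[df.index.difference(indices)]

which takes ~115 s on my dataset, and

df.drop(indices)

which takes ~215 s on my dataset.

Is there a faster way to do this? Preferably in pandas.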

Performance of the proposed solution

~41 s:
```
df[~df.index.isin(indices)]
```
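As a quick sanity check, the sketch below runs the three approaches from the question on a tiny frame and confirms they keep the same rows. The frame, the column name `value` and the labels are made up for illustration; only the three selection expressions come from the thread.

```python
import numpy as np
import pandas as pd

# Small stand-in for the large frame; names and sizes are illustrative only.
df = pd.DataFrame({"value": np.arange(10)})
indices = [1, 4, 7]  # index labels of the rows to drop

by_mask = df[~df.index.isin(indices)]
by_drop = df.drop(indices)
by_difference = df.loc[df.index.difference(indices)]

# All three approaches keep the same rows.
print(by_mask.index.tolist())        # [0, 2, 3, 5, 6, 8, 9]
print(by_drop.index.tolist())        # [0, 2, 3, 5, 6, 8, 9]
print(by_difference.index.tolist())  # [0, 2, 3, 5, 6, 8, 9]
```

Note that `Index.difference` returns a sorted result, so on a frame whose index is not already sorted the row order of the third variant may differ from the other two.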

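I believe you can create a boolean mask, invert it with ~, and filter by boolean indexing:

df1 = df[~df.index.isin(indices)]

As @user3471881 mentioned, if you plan to manipulate the filtered df later, you need to add copy to avoid chained indexing: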

df1 = df[~df.index.isin(indices)].copy()
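To illustrate why the copy matters, here is a minimal sketch on a made-up six-row frame (the column name `value` is an assumption). After `.copy()`, the filtered frame owns its data, so later assignments leave the original untouched.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.arange(6)})
indices = [0, 2]

# .copy() gives df1 its own data, so later assignments cannot touch df
# and pandas has no reason to emit SettingWithCopyWarning.
df1 = df[~df.index.isin(indices)].copy()
df1["value"] = df1["value"] * 10

print(df1["value"].tolist())  # [10, 30, 40, 50]
print(df["value"].tolist())   # [0, 1, 2, 3, 4, 5]
```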

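The speed of this filtering depends on the number of matching indices and on the length of the DataFrame.

So another possible solution is to create an array/list of the indices you want to keep, so that no inversion is needed: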

df1 = df[df.index.isin(need_indices)]
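The following sketch shows that the two masks select the same rows, so the choice between them is purely about which label set is cheaper to test against. The data is made up, and `need_indices` is derived here with `Index.difference` only to keep the example self-contained; in practice it would come from wherever the keep set is already known.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.arange(8)})
indices = np.array([0, 1, 2, 3, 4, 6])       # labels to drop (the larger set)
need_indices = df.index.difference(indices)  # labels to keep (the smaller set)

dropped = df[~df.index.isin(indices)]     # inverted mask over the drop set
kept = df[df.index.isin(need_indices)]    # direct mask over the keep set

print(kept.index.tolist())     # [5, 7]
print(dropped.index.tolist())  # [5, 7]
```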

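Use iloc (or loc, see below) together with Index.drop:

df.iloc[df.index.drop(indices)]

As @jezrael pointed out, you can only use iloc if the index is a default RangeIndex; otherwise you have to use loc. But this is still faster than df[df.index.isin()] (see below for why).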
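The iloc/loc distinction can be wrapped in a small helper. This is only a sketch: `drop_rows` is a hypothetical name, not something from pandas or from the answer, and it assumes every label in `labels` exists in the index (`Index.drop` raises `KeyError` otherwise).

```python
import numpy as np
import pandas as pd


def drop_rows(df, labels):
    """Hypothetical helper: drop rows by label, using iloc only when it is safe."""
    remaining = df.index.drop(labels)
    idx = df.index
    # iloc is positional, so labels can double as positions only for the
    # default RangeIndex (0, 1, 2, ...).
    if isinstance(idx, pd.RangeIndex) and idx.start == 0 and idx.step == 1:
        return df.iloc[remaining.to_numpy()]
    return df.loc[remaining]


default_df = pd.DataFrame({"value": np.arange(5)})
labelled_df = pd.DataFrame({"value": np.arange(5)}, index=list("abcde"))

print(drop_rows(default_df, [1, 3])["value"].tolist())       # [0, 2, 4]
print(drop_rows(labelled_df, ["b", "d"])["value"].tolist())  # [0, 2, 4]
```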

All three options on 10 million rows:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 10000000, 1))
indices = np.arange(0, 10000000, 3)

%timeit -n 10 df[~df.index.isin(indices)]
%timeit -n 10 df.iloc[df.index.drop(indices)]
%timeit -n 10 df.loc[df.index.drop(indices)]

4.98 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
752 ms ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.65 s ± 69.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

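Why does the super slow loc outperform boolean indexing?

The short answer is that it doesn't. df.index.drop(indices) is simply much faster than ~df.index.isin(indices) (on the data above, with 10 million rows):

%timeit -n 10 ~df.index.isin(indices)
%timeit -n 10 df.index.drop(indices)

4.55 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
388 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

With the mask and the index precomputed, we can compare the selection step alone for boolean indexing, iloc and loc: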

boolean_mask = ~df.index.isin(indices)
dropped_index = df.index.drop(indices)

%timeit -n 10 df[boolean_mask]
%timeit -n 10 df.iloc[dropped_index]
%timeit -n 10 df.loc[dropped_index]


489 ms ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
371 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.38 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
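Outside IPython, the same two-stage measurement can be reproduced with the standard `timeit` module. The sketch below is an assumption-laden stand-in for the `%timeit` runs above: `best_time` is a hypothetical helper, the frame is far smaller so it finishes quickly, and the absolute numbers will depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n = 100_000  # much smaller than the 10 million rows above, so it runs quickly
df = pd.DataFrame(np.arange(n))
indices = np.arange(0, n, 3)


def best_time(func, number=5, repeat=3):
    """Return the best per-call time in seconds over several repeats."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


# Stage 1: work out which rows survive.
t_isin = best_time(lambda: ~df.index.isin(indices))
t_drop = best_time(lambda: df.index.drop(indices))

# Stage 2: select rows with a precomputed mask or index.
boolean_mask = ~df.index.isin(indices)
dropped_index = df.index.drop(indices)
t_mask = best_time(lambda: df[boolean_mask])
t_iloc = best_time(lambda: df.iloc[dropped_index.to_numpy()])
t_loc = best_time(lambda: df.loc[dropped_index])

for name, t in [("isin", t_isin), ("Index.drop", t_drop),
                ("df[mask]", t_mask), ("iloc", t_iloc), ("loc", t_loc)]:
    print(f"{name:>10}: {t * 1e3:.3f} ms")
```

Timing the index computation separately from the row selection is what makes it visible where the cost of each one-liner actually goes.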

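If the order of the rows does not matter, you can rearrange them in place:

import numpy as np
import pandas as pd
from numba import njit

n = 10**7
df = pd.DataFrame(np.arange(4 * n).reshape(n, 4))
indices = np.unique(np.random.randint(0, n, size=n // 2))  # sorted, unique

@njit
def _dropfew(values, indices):
    # Walk the dropped positions from highest to lowest and overwrite
    # each one with the current last row.
    k = len(values) - 1
    for ind in indices[::-1]:
        values[ind] = values[k]
        k -= 1

def dropfew(df, indices):
    # Relies on df.values being a writable view of the frame's data
    # (single dtype, no copy-on-write); the row order is not preserved.
    _dropfew(df.values, indices)
    return df.iloc[:len(df) - len(indices)]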

Timings:

In [39]: %time df.iloc[df.index.drop(indices)]
Wall time: 1.07 s

In [40]: %time dropfew(df,indices)
Wall time: 219 ms
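For readers who cannot use numba, the same "fill the holes from the tail" idea can be sketched with plain NumPy. This is not the answer author's code: `compact_rows` is a hypothetical helper that assumes a writable 2-D array and sorted, unique row positions, and it is shown on a tiny array so the result can be checked against an order-preserving reference.

```python
import numpy as np
import pandas as pd


def compact_rows(values, indices):
    """Overwrite dropped rows in the head with kept rows from the tail.

    values: writable 2-D array; indices: sorted, unique row positions to drop.
    Returns a view of the first len(values) - len(indices) rows.
    """
    n = len(values)
    m = n - len(indices)                     # number of rows that survive
    holes = indices[indices < m]             # dropped positions inside the head
    tail_keep = np.ones(n - m, dtype=bool)
    tail_keep[indices[indices >= m] - m] = False
    donors = np.nonzero(tail_keep)[0] + m    # kept positions inside the tail
    values[holes] = values[donors]           # same count on both sides
    return values[:m]


n = 10
values = np.arange(4 * n).reshape(n, 4)
indices = np.array([1, 3, 8])

expected = np.delete(values, indices, axis=0)   # order-preserving reference
result = pd.DataFrame(compact_rows(values.copy(), indices))

# Same set of rows, possibly in a different order.
assert sorted(map(tuple, result.to_numpy())) == sorted(map(tuple, expected))
print(len(result))  # 7
```

As with the numba version, the surviving rows come back in a different order, and a fresh default index no longer identifies the original rows.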

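Comments:

How does df[~df.index.isin(indices)] perform?

uuuh nice :D 41 s.

@jezrael, I agree that your solution is the fastest, but one thing to consider is how many indices you want to drop. For example, if indices2remove is larger than indices2keep, then df[df.index.isin(indices2keep)] will be faster than df[~df.index.isin(indices2remove)].

How much of the data do you want to discard? For the asymptotic cases, optimized algorithms can be found.

@jezrael Nope. Using df.loc[] is 3 times slower.

Note that if you save the result in a new variable and plan to manipulate the filtered df later, you have to copy the filtered df. I don't know how this affects performance, but it is worth noting.

hmmm, iloc probably only works with the default RangeIndex, so I'm not sure it is possible for the OP's solution :)

@user3471881 I will look at your solution later.

Added some extra context explaining why loc appears to outperform boolean indexing in this case.

Upvoted for speed, but the OP asked for a pandas-only solution. As far as I can tell this requires numba, which should at least be mentioned explicitly (not just through the module import :D).

@user3471881: the OP said "preferably" ;). But this does scramble the rows.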