Pandas: undersampling a multi-label imbalanced dataset


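I am writing a roll-your-own undersampling function, because imblearn does not handle multi-label classification well (for example, it only accepts a one-dimensional y).

I want to iterate over X and y and remove one row out of every 2 or 3 that belong to the majority class. The goal is a quick-and-dirty way to reduce the number of rows in the majority class.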

def undersample(X, y):
    counter = 0
    # iterrows() yields (index, row) pairs; itertuples() yields namedtuples
    for index, row in y.iterrows():
        if row['rectangle_here'] == 0:
            counter += 1
            if counter > 3:
                counter = 0
                X.drop(index, inplace=True)
                y.drop(index, inplace=True)
    return X, y

But it crashes my kernel even on a small number of rows (about 30,000).

The data is such that whenever f2 or f3 occurs, f1 occurs as well. So let's count how many times 0 occurs in f1 and then drop one 0 row out of every three:

                  f1      f2       f3
0                  0       0       0
1                  0       0       0
2                  0       0       0
3                  1       0       1
4                  0       0       0
5                  0       0       0
6                  0       0       0
7                  0       0       0
8                  0       0       0
9                  0       0       0

First:

new_index = df[df.f1 == 0][::3].index.append(df[df.f1 == 0][1::3].index).append(df[df.f1 == 1].index)

Then:

df.loc[new_index].sort_index()
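To make the two steps concrete, here is a small runnable sketch that rebuilds the sample frame from the table and applies them. The intermediate names zeros, keep_zeros and result are introduced only for illustration.

```python
import pandas as pd

# Sample frame copied from the table above.
df = pd.DataFrame({
    "f1": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    "f2": [0] * 10,
    "f3": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
})

zeros = df[df.f1 == 0]                                   # majority-class rows
keep_zeros = zeros[::3].index.append(zeros[1::3].index)  # 1st and 2nd of every 3 zero rows
new_index = keep_zeros.append(df[df.f1 == 1].index)      # plus every minority row

result = df.loc[new_index].sort_index()                  # restore the original row order
print(result.index.tolist())
```

On this sample the zero rows at index 2, 6 and 9 are dropped, and the single f1 == 1 row is kept.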

Thanks @JohnE, but this doesn't work for me. I'm not sure why. It doesn't even fail; it just seems to do nothing! Maybe a .drop is needed afterwards...?

You may need df = df.loc?
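One possible explanation for "it seems to do nothing" (an assumption, since the thread does not confirm it): df.loc[...] and sort_index() return new objects and leave df untouched, so the result has to be assigned back. A minimal sketch on a made-up six-row frame:

```python
import pandas as pd

# Made-up frame for illustration only.
df = pd.DataFrame({"f1": [0, 0, 0, 1, 0, 0], "f2": [0] * 6, "f3": [0, 0, 0, 1, 0, 0]})

zeros = df[df.f1 == 0]
new_index = zeros[::3].index.append(zeros[1::3].index).append(df[df.f1 == 1].index)

df.loc[new_index].sort_index()        # result is discarded, df is unchanged
rows_before = len(df)

df = df.loc[new_index].sort_index()   # rebinding df keeps the reduced frame
rows_after = len(df)
print(rows_before, rows_after)
```

If X and y are separate frames that share the same index, the same new_index can be applied to both, for example X.loc[new_index] and y.loc[new_index].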