Python: Drop an equal number of rows based on a column condition

python pandas

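I am trying to reduce the size of a DataFrame and need to keep the number of rows in each class (label) equal. How can I drop the same number of rows based on the column "label"? In other words, I need an equal distribution of class labels in the resulting DataFrame.

I have the following DataFrame: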

    import pandas as pd

    df = pd.DataFrame([
        {'label': 0, 'pixel1': 0, 'pixel2': 0, 'pixel3': 0, 'pixel4': 0},
        {'label': 1, 'pixel1': 0, 'pixel2': 0, 'pixel3': 0, 'pixel4': 0},
        {'label': 2, 'pixel1': 0, 'pixel2': 0, 'pixel3': 0, 'pixel4': 0},
        {'label': 3, 'pixel1': 0, 'pixel2': 0, 'pixel3': 0, 'pixel4': 1},
        {'label': 4, 'pixel1': 0, 'pixel2': 0, 'pixel3': 0, 'pixel4': 0},
        {'label': 5, 'pixel1': 0, 'pixel2': 0, 'pixel3': 0, 'pixel4': 5},
        {'label': 6, 'pixel1': 0, 'pixel2': 0, 'pixel3': 0, 'pixel4': 0},
        {'label': 7, 'pixel1': 0, 'pixel2': 0, 'pixel3': 0, 'pixel4': 0},
        {'label': 8, 'pixel1': 0, 'pixel2': 0, 'pixel3': 0, 'pixel4': 0},
        {'label': 9, 'pixel1': 0, 'pixel2': 0, 'pixel3': 0, 'pixel4': 0},
        {'label': 0, 'pixel1': 0, 'pixel2': 0, 'pixel3': 0, 'pixel4': 0},
        {'label': 1, 'pixel1': 0, 'pixel2': 0, 'pixel3': 0, 'pixel4': 0},
        {'label': 2, 'pixel1': 0, 'pixel2': 0, 'pixel3': 0, 'pixel4': 0},
        {'label': 3, 'pixel1': 0, 'pixel2': 0, 'pixel3': 0, 'pixel4': 1},
        {'label': 4, 'pixel1': 0, 'pixel2': 0, 'pixel3': 0, 'pixel4': 0},
        {'label': 5, 'pixel1': 0, 'pixel2': 0, 'pixel3': 0, 'pixel4': 5},
        {'label': 6, 'pixel1': 0, 'pixel2': 0, 'pixel3': 0, 'pixel4': 0},
        {'label': 7, 'pixel1': 0, 'pixel2': 0, 'pixel3': 0, 'pixel4': 0},
        {'label': 8, 'pixel1': 0, 'pixel2': 0, 'pixel3': 0, 'pixel4': 0},
        {'label': 9, 'pixel1': 0, 'pixel2': 0, 'pixel3': 0, 'pixel4': 0},
    ])

The resulting DataFrame will have 10 rows, each with a unique label. I need the answer to also work for a larger dataset with 1000 rows.

If you want the first record from each 'label' group:

    df.groupby('label').head(1)

Output:

   label  pixel1  pixel2  pixel3  pixel4
0      0       0       0       0       0
1      1       0       0       0       0
2      2       0       0       0       0
3      3       0       0       0       1
4      4       0       0       0       0
5      5       0       0       0       5
6      6       0       0       0       0
7      7       0       0       0       0
8      8       0       0       0       0
9      9       0       0       0       0

Or, you can take a random record from each 'label' group:

    df.groupby('label', as_index=False).apply(lambda x: x.sample(1)).reset_index(drop=True)

Output:

   label  pixel1  pixel2  pixel3  pixel4
0      0       0       0       0       0
1      1       0       0       0       0
2      2       0       0       0       0
3      3       0       0       0       1
4      4       0       0       0       0
5      5       0       0       0       5
6      6       0       0       0       0
7      7       0       0       0       0
8      8       0       0       0       0
9      9       0       0       0       0
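
For the larger dataset mentioned in the question, the same two ideas can be extended to keep n rows per label instead of one. The following is only a minimal sketch under stated assumptions: the data and the value of n are hypothetical, and `DataFrameGroupBy.sample` requires pandas 1.1 or later. Sampling without replacement also assumes every label has at least n rows.

```python
import pandas as pd

# Hypothetical data: 3 labels with 4 rows each (not the asker's dataset).
df = pd.DataFrame({
    "label": [0, 1, 2] * 4,
    "pixel1": range(12),
})

n = 2  # rows to keep per label (illustrative value)

# Deterministic: the first n rows of each label, in original row order.
first_n = df.groupby("label").head(n)

# Random: n rows drawn without replacement from each label.
# random_state is fixed here only to make the draw repeatable.
random_n = df.groupby("label").sample(n=n, random_state=0)

print(first_n["label"].value_counts().sort_index())
print(random_n["label"].value_counts().sort_index())
```

Both results contain exactly n rows for every label, so the class distribution stays equal.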

You can do:

    yourDataFrame.drop_duplicates('label')

After df is created, it has 20 rows, with each label occurring twice. So, to keep each label only once (without repetition), you can use drop_duplicates with subset='label':

    df.drop_duplicates(subset='label', inplace=True); df

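Edit

But if the labels occur a different number of times (that is, the groups of rows sharing a label differ in size), you have to take another approach.

Start by counting how many times each label occurs:

    df.groupby('label').size()

We also want to know the size of the smallest group:

    minGrpCnt = df.groupby('label').size().min()

To avoid losing any group, you can drop minGrpCnt - 1 rows from each group. To find these rows, use the cumcount function, which numbers the rows within each group, starting from 0.

For example, if minGrpCnt = 2, you can use cumcount() == 0 (only the first row in each group). In general, we are interested in rows with cumcount() < minGrpCnt - 1. We have to find the indices of these rows (df[...].index) and drop the rows with those indices.

To sum up, the command that performs the task is:

    df.drop(df[df.groupby('label').cumcount() < minGrpCnt - 1].index, inplace=True)

Can you clarify your question? I don't understand what you mean by "I need to drop the same number of rows for each label (0-9)".

I need an even distribution of class labels in the resulting dataset.

Both solutions work; I prefer the second. Thanks.

@Briainodonell Please make sure to click the checkmark to indicate that this person has answered your question.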
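
To make the uneven-group case from the edit concrete, here is a small sketch on hypothetical data (the frame below is invented for illustration and is not from the thread). It first runs the drop command as given, which removes the same number of rows from every label, and then shows one possible alternative, assuming the goal is an equal row count per label: keep only the first min_grp_cnt rows of each label.

```python
import pandas as pd

# Hypothetical uneven data: label 0 has 4 rows, label 1 has 3, label 2 has 2.
df = pd.DataFrame({
    "label": [0, 0, 0, 0, 1, 1, 1, 2, 2],
    "pixel1": range(9),
})

min_grp_cnt = df.groupby("label").size().min()  # 2 for this data

# The command from the answer (without inplace): drop the first
# (min_grp_cnt - 1) rows of each label.
dropped = df.drop(df[df.groupby("label").cumcount() < min_grp_cnt - 1].index)

# Alternative, assuming equal counts are the goal:
# keep the first min_grp_cnt rows of each label.
balanced = df[df.groupby("label").cumcount() < min_grp_cnt]

print(dropped.groupby("label").size().tolist())   # [3, 2, 1]
print(balanced.groupby("label").size().tolist())  # [2, 2, 2]
```

The first result shrinks every group by the same amount but leaves the groups unequal; the second leaves every label with the size of the smallest group.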