Python: column values still show up after .isin()


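As requested, below is a minimal reproducible example that produces the .isin() problem: values that are not in .isin() are not dropped, their counts are only set to zero: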

import os
import pandas as pd

df_example = pd.DataFrame({'Requesting as': {0: 'Employee', 1: 'Ex-      Employee', 2: 'Employee', 3: 'Employee', 4: 'Ex-Employee', 5: 'Employee', 6: 'Employee', 7: 'Employee', 8: 'Ex-Employee', 9: 'Ex-Employee', 10: 'Employee', 11: 'Employee', 12: 'Ex-Employee', 13: 'Ex-Employee', 14: 'Employee', 15: 'Employee', 16: 'Employee', 17: 'Ex-Employee', 18: 'Employee', 19: 'Employee', 20: 'Ex-Employee', 21: 'Employee', 22: 'Employee', 23: 'Ex-Employee', 24: 'Employee', 25: 'Employee', 26: 'Ex-Employee', 27: 'Employee', 28: 'Employee', 29: 'Ex-Employee', 30: 'Employee', 31: 'Employee', 32: 'Ex-Employee', 33: 'Employee', 34: 'Employee', 35: 'Ex-Employee', 36: 'Employee', 37: 'Employee', 38: 'Ex-Employee', 39: 'Employee', 40: 'Employee'}, 'Years of service': {0: -0.4, 1: -0.3, 2: -0.2, 3: 1.0, 4: 1.0, 5: 1.0, 6: 2.0, 7: 2.0, 8: 2.0, 9: 2.0, 10: 3.0, 11: 3.0, 12: 3.0, 13: 4.0, 14: 4.0, 15: 4.0, 16: 5.0, 17: 5.0, 18: 5.0, 19: 5.0, 20: 6.0, 21: 6.0, 22: 6.0, 23: 11.0, 24: 11.0, 25: 11.0, 26: 16.0, 27: 17.0, 28: 18.0, 29: 21.0, 30: 22.0, 31: 23.0, 32: 26.0, 33: 27.0, 34: 28.0, 35: 31.0, 36: 32.0, 37: 33.0, 38: 35.0, 39: 36.0, 40: 37.0}, 'yos_bins': {0: 0, 1: 0, 2: 0, 3: '0-1', 4: '0-1', 5: '0-1', 6: '1-2', 7: '1-2', 8: '1-2', 9: '1-2', 10: '2-3', 11: '2-3', 12: '2-3', 13: '3-4', 14: '3-4', 15: '3-4', 16: '4-5', 17: '4-5', 18: '4-5', 19: '4-5', 20: '5-6', 21: '5-6', 22: '5-6', 23: '10-15', 24: '10-15', 25: '10-15', 26: '15-20', 27: '15-20', 28: '15-20', 29: '20-40', 30: '20-40', 31: '20-40', 32: '20-40', 33: '20-40', 34: '20-40', 35: '20-40', 36: '20-40', 37: '20-40', 38: '20-40', 39: '20-40', 40: '20-40'}})


cut_labels = ['0-1','1-2', '2-3', '3-4', '4-5', '5-6', '6-10', '10-15', '15-20', '20-40']
cut_bins = (0, 1, 2, 3, 4, 5, 6, 10, 15, 20, 40)
df_example['yos_bins'] = pd.cut(df_example['Years of service'], bins=cut_bins, labels=cut_labels)

print(df_example['yos_bins'].value_counts())
print(len(df_example['yos_bins']))
print(len(df_example))
print(df_example['yos_bins'].value_counts())

test = df_example[df_example['yos_bins'].isin(['0-1', '1-2', '2-3'])]
print('test dataframe:\n',test)
print('\n')
print('test value counts of yos_bins:\n',     test['yos_bins'].value_counts())
print('\n')
dic_test = test.to_dict()
print(dic_test)
print('\n')
print(test.value_counts())
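I created bins for the 'Years of service' column with pd.cut (see cut_bins and cut_labels in the code above).

Then I applied .isin() to the DataFrame column called 'yos_bins' to filter for a selection of its values.

The column I slice on is called 'yos_bins' (the binned years of service). I only want to select 3 ranges (0-1, 1-2 and 2-3 years), but the column obviously contains more ranges than that.

To my surprise, when I apply value_counts() to the result, I still get every yos_bins value from the original df, with a count of 0 for the ones I filtered out.

This is not intended: all bins other than the 3 passed to isin() should have been dropped. The resulting problem is that the zero counts show up in sns.countplot, so I end up with unwanted columns that have a count of zero.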
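The behaviour described above can be reproduced on a much smaller Series. The following is a minimal sketch, not the asker's data: the names `years`, `bins` and `subset` and the four bin labels are made up for illustration. It shows that the filtered column keeps its categorical dtype, so value_counts() still reports every registered category.

```python
import pandas as pd

# Hypothetical miniature of the 'Years of service' column.
years = pd.Series([0.5, 1.5, 2.5, 12.0])
bins = pd.cut(years, bins=(0, 1, 2, 3, 15),
              labels=['0-1', '1-2', '2-3', '3-15'])

# Keep only two of the four bins, as in the question.
subset = bins[bins.isin(['0-1', '1-2'])]

# The rows are gone, but the dtype still lists all four categories.
print(subset.dtype)
print(list(subset.cat.categories))

# value_counts() on a categorical column reports unused categories with 0.
counts = subset.value_counts()
print(counts)
```

The filter itself works: `subset` has only two rows. The zero-count entries come from the category list attached to the dtype, not from leftover rows.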

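When I saved the df with to_excel(), every '10-15' value field showed a "text date with 2-digit year" error. I did not load that DataFrame back into Python, so I am not sure whether this could be causing the problem.


Does anyone know how I can create a test DataFrame that contains only the 3 yos_bins values, instead of one that shows all yos_bins values with some of them at zero?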

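This is an ugly solution, because numpy and pandas are poorly featured when it comes to an element-wise "is in". In my experience, I do the comparison manually with numpy arrays.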

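import numpy as np

df = df_example  # the DataFrame from the question
yos_bins = np.array(df["yos_bins"])  # categorical column as a plain array
yos_bins_sel = np.array(["0-1", "1-2", "2-3"])  # bins to keep
# Compare every row value with every selected bin; keep rows matching any
mask = (yos_bins[:, None] == yos_bins_sel[None, :]).any(1)
df[mask]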
   Requesting as  Years of service yos_bins
3       Employee               1.0      0-1
4    Ex-Employee               1.0      0-1
5       Employee               1.0      0-1
6       Employee               2.0      1-2
7       Employee               2.0      1-2
8    Ex-Employee               2.0      1-2
9    Ex-Employee               2.0      1-2
10      Employee               3.0      2-3
11      Employee               3.0      2-3
12   Ex-Employee               3.0      2-3
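Explanation (using x for yos_bins and y for yos_bins_sel)

(x[:, None] == y[None, :]).any(1) is the main takeaway. x[:, None] reshapes x from (n,) to (n, 1), and y[None, :] reshapes y from (m,) to (1, m). Comparing them with == broadcasts to an element-wise boolean array of shape (n, m). We want an (n,)-shaped array, so we apply .any(1), which collapses the second dimension to True if at least one of its booleans is True (that is, if the element is in the yos_bins_sel array). The result is a boolean array that can be used to mask the original DataFrame. Replace x with the array of values to test and y with the array of values to match against, and you can do this for any dataset.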
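The shapes in the explanation above can be checked on a tiny example. This is a sketch with made-up arrays `x` and `y`, not the question's data. The last line compares the result with `np.isin`, which computes the same membership mask for plain arrays like these.

```python
import numpy as np

x = np.array(['0-1', '5-6', '2-3', '10-15'])  # values to test, shape (4,)
y = np.array(['0-1', '1-2', '2-3'])           # values to match, shape (3,)

# (4, 1) compared with (1, 3) broadcasts to a (4, 3) boolean table:
# row i, column j is True when x[i] == y[j].
pairwise = x[:, None] == y[None, :]

# Collapse the second axis: True if x[i] matched any entry of y.
mask = pairwise.any(axis=1)

print(pairwise.shape)
print(mask)
print(np.isin(x, y))
```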

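Are you sure the test = … line is what creates the subset? Can you create an example that has the same problem?

I added a reproducible example at the end.

Thank you, Mike. However, when I extend the code with df_new = df[mask] and print(df_new.yos_bins.value_counts()), it shows all 10 bins rather than just the three you selected. I don't understand why it shows the other 7 bins with zeros instead of only the 3 selected ones. I want the others to disappear.

That is because yos_bins still keeps the dtype of the original array, so that operations between the two stay smooth, and the original dtype is a categorical dtype that contains all the yos_bins categories. To give yos_bins its own dtype, do df_new["yos_bins"] = df_new["yos_bins"].astype(pd.CategoricalDtype(yos_bins_sel)). Note: this raises a warning, although I don't think it should, since even using .loc does not stop it, but you can suppress it.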
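A possible way to drop the empty bins, sketched on a small hypothetical frame rather than the question's data (the column values, the five labels and the name `wanted` are assumptions for illustration). It uses the pandas accessor `Series.cat.remove_unused_categories()` and, as an alternative, a cast to `pd.CategoricalDtype`. Taking an explicit `.copy()` of the filtered frame is one way to avoid the chained-assignment warning mentioned above.

```python
import pandas as pd

# Hypothetical miniature of the question's frame.
df = pd.DataFrame({'Years of service': [1.0, 2.0, 3.0, 11.0, 21.0]})
labels = ['0-1', '1-2', '2-3', '3-15', '15-40']
df['yos_bins'] = pd.cut(df['Years of service'],
                        bins=(0, 1, 2, 3, 15, 40), labels=labels)

wanted = ['0-1', '1-2', '2-3']
# .copy() makes `test` an independent frame, so assigning to it is safe.
test = df[df['yos_bins'].isin(wanted)].copy()

# Option 1: drop every category that no longer occurs in the column.
test['yos_bins'] = test['yos_bins'].cat.remove_unused_categories()

# Option 2: cast to a categorical dtype that lists only the wanted bins.
recast = test['yos_bins'].astype(pd.CategoricalDtype(wanted, ordered=True))

counts = test['yos_bins'].value_counts()
print(counts)
```

With either option the column only knows the three selected bins, so value_counts() no longer reports zero-count entries.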