Python: drop DataFrame rows duplicated in column A when column B is NaN

python pandas dataframe

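The situation is that I have a DataFrame with columns [A, B, C, D]. When neither A nor B is NaN (A can never be NaN), nothing should be dropped. But when some [A, B] combination exists and there is also another row where A has the same value and B is NaN, that row needs to be dropped. The other case is when there is no such A-B combination and we only have a single row where A is set but B is NaN; that row is not a duplicate and must not be dropped.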

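For example, when the first two rows have a value in both A and B and the third row repeats Tom in column A with NaN in column B, the first and second rows should not be dropped, but the third row should be, because Tom already exists in column A and its value in column B is NaN.

Another case: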

 A    B     C   D
[Jack, Nan, fish, dolphine]

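In the whole DataFrame there is only one row whose value in column A is Jack, so we do not drop this row, whether or not B is NaN.

This is the solution I found: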

is_na = df['B'].isna()  # This transformation (NaN -> True/False) is necessary to count

new_df = df[is_na].filter(['A']).copy()  # copy so the assignment below does not act on a view of df
new_df['B'] = is_na  # new_df now has column A plus a column B that is True on every remaining row
counting_nans = new_df.groupby('A')['B'].count()

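counting_nans holds the number of NaNs in column B, grouped by the value in column A.

In uniques we store how many times each value of A occurs; these are the counts to compare against: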

uniques = df['A'].value_counts()

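Now let's filter. If a value appears in column A exactly as many times as there are NaNs in column B for that value, its rows should not be dropped. A value that appears only once in A can also be left out of the candidates, whether or not df['B'] is NaN on that row.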
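The rule described here can also be sketched without building the intermediate Series by hand. This is only a minimal sketch, assuming "duplicate" means that the same A value has at least one other row with a non-NaN B; the sample data is the Tom/Jack frame from the setup on this page, and the names `non_nan_per_group`, `to_drop` and `result` are illustrative, not from the original answer.

```python
import numpy as np
import pandas as pd

# Tom/Jack sample frame, as in the setup on this page.
df = pd.DataFrame({
    'A': ['Tom', 'Tom', 'Tom', 'Jack'],
    'B': ['Jane', 'Zack', np.nan, np.nan],
    'C': ['Jane', 'Bear', 'Cat', 'Bear'],
    'D': ['Jane', 'Bear', 'Cat', 'Bear'],
})

# Number of non-NaN B values in each A group, broadcast back to every row.
non_nan_per_group = df.groupby('A')['B'].transform('count')

# Assumption: a row is dropped only if its B is NaN and its A group
# contains at least one row with a non-NaN B.
to_drop = df['B'].isna() & (non_nan_per_group > 0)
result = df[~to_drop]
print(result)
```

Under this assumption a group whose B values are all NaN is kept whole, which matches the "counts are equal, do not drop" case.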
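Hope this helps! Let me know if my code is unclear.
Also, I'm pretty sure there is a simpler way; I'm fairly new to programming xD

You can achieve the result you want with a single line of code: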

df = df[df.apply(lambda row: not (pd.isna(row['B']) and len(df[df['A'] == row['A']]) > 1), axis=1)]

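Details

The solution here uses DataFrame.apply combined with a Python lambda function.

Setup

import pandas as pd
import numpy as np

data = {
    'A': ['Tom', 'Tom', 'Tom', 'Jack'],
    'B': ['Jane', 'Zack', np.nan, np.nan],
    'C': ['Jane', 'Bear', 'Cat', 'Bear'],
    'D': ['Jane', 'Bear', 'Cat', 'Bear'],
}

# Create the DataFrame
df = pd.DataFrame(data)

# Set the columns to check for duplicates and for np.nan
dup_col = 'A'
nan_col = 'B'

# Print df before filtering
print(df.head())
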
DataFrame.apply applies a function along an axis of the DataFrame; specifying axis=1 applies the function to each row.
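As a small illustration of what axis=1 does, the sketch below applies a lambda to each row of a toy frame. The frame `demo` and the result name `joined` are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical two-row frame, only for illustrating axis=1.
demo = pd.DataFrame({'A': ['Tom', 'Jack'], 'B': ['Jane', 'Zack']})

# With axis=1 the lambda receives one row at a time as a Series,
# so columns can be looked up by label.
joined = demo.apply(lambda row: row['A'] + '-' + row['B'], axis=1)
print(joined.tolist())  # ['Tom-Jane', 'Jack-Zack']
```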

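The lambda function lets us work with each row through the row variable.

The inner condition is whatever you define as a duplicate, i.e. column 'B' is NaN and the value in column 'A' is duplicated.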
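The same condition can probably be expressed without a row-wise function. The following is a sketch, not the answer's own code: it assumes that "duplicated" means the A value occurs more than once anywhere in the frame, and the names `is_dup`, `is_nan` and `drop_mask` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['Tom', 'Tom', 'Tom', 'Jack'],
    'B': ['Jane', 'Zack', np.nan, np.nan],
})

# True for every row whose A value occurs more than once.
is_dup = df['A'].duplicated(keep=False)
# True for every row whose B value is NaN.
is_nan = df['B'].isna()

# Rows matching both conditions are the ones to drop.
drop_mask = is_nan & is_dup
print(drop_mask.tolist())  # [False, False, True, False]
print(df[~drop_mask])
```

Note that with this definition a value repeated in A with NaN in B on every one of its rows would lose all of those rows, so it is worth checking that this is the behavior you want.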

I split it across multiple lines to make it easier to read, but it can actually be written on one line.

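df = df[
    df.apply(
        lambda row: not (
            pd.isna(row[nan_col])  # B is NaN (pd.isna also catches NaN values that are not the np.nan object)
            and len(df[df[dup_col] == row[dup_col]]) > 1  # the value in A occurs more than once
        ),
        axis=1,
    )
]

# Print df after filtering
print(df.head())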
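A row-wise NaN check can be written either as an identity test against np.nan or with pd.isna. The short sketch below shows why the two may disagree: a NaN that was created some other way is not the same object as np.nan. The variable `other_nan` is hypothetical and only used for this demonstration.

```python
import numpy as np
import pandas as pd

other_nan = float('nan')      # a NaN that is not the np.nan object itself

print(other_nan is np.nan)    # False: a different object
print(other_nan == np.nan)    # False: NaN never compares equal to anything
print(pd.isna(other_nan))     # True
print(pd.isna(np.nan))        # True
```

For that reason pd.isna is usually the safer test when the column may hold NaN values that did not come directly from np.nan.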

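Have you tried to solve this yourself? If so, you should include the code in your post and discuss the difficulties you ran into.
I don't have many ideas right now; I think I could use groupby and apply some function.
Could you post a small data set that shows, for every case described in your question, which rows should be dropped and which should not? Also post the expected result set.
I've just added some cases to illustrate my question. Feel free to ask, and I'll keep updating the question.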
>>> counting_nans
A
Tom    1
Zac    1
Name: B, dtype: int64

>>> uniques
Tom    3
Zac    1
Name: A, dtype: int64

uniques.sort_index(inplace=True)
counting_nans.sort_index(inplace=True)
uniques = uniques[uniques != counting_nans]
uniques = uniques[uniques > 1]
# This is an array with Trues when df['A'] is in values to be evaluated and df['B'] is NaN
condition = df['A'].isin(uniques.index) & df['B'].isna()
# These are the indexes
index_condition = condition.loc[condition == True].index
# This eliminates the rows
df.drop(index_condition, inplace=True)

>>> df
     A      B       C        D
0  Tom   Jane     Cat     Bear
1  Tom  Jenny  Monkey   Tortue
3  Zac    NaN     Dog  Penguin

Output of print(df.head()) before filtering:

      A     B     C     D
0   Tom  Jane  Jane  Jane
1   Tom  Zack  Bear  Bear
2   Tom   NaN   Cat   Cat
3  Jack   NaN  Bear  Bear

Output of print(df.head()) after filtering:

      A     B     C     D
0   Tom  Jane  Jane  Jane
1   Tom  Zack  Bear  Bear
3  Jack   NaN  Bear  Bear