Python: checking for NaN in multiple columns in Pandas
I want to add a binary column to a DataFrame based on whether the given columns contain NaN. I tried the following code:
import pandas as pd
dat = pd.DataFrame({'A': [12,34,56,78, 23,None, None], 'B': [90,80,70,23,None, 78, None], 'C': [90,80,70,23,None, 78, None], 'D': [12,34,56,78, 23,None, None]})
dat['A1'] = dat['A'].isnull()
dat['B1'] = dat['B'].isnull()
dat['C1'] = dat['C'].isnull()
dat['ismissing'] = 1 if dat['A1'] == True and dat['B1'] == True and dat['C1'] == True else 0
dat
But the second-to-last line raises an error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Sample input:
A B C D
10 NaN 40 NaN
NaN NaN 80 90
20 45 NaN 89
NaN NaN NaN 46
Expected output:
A B C D E
10 NaN 40 NaN 0
NaN NaN 80 90 0
20 45 NaN 89 0
NaN NaN NaN 46 1
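I only want to check columns A, B and C for NaN.
Note that the keywords "if" and "and" each need a single boolean value, and a pd.Series is not one. That is why Python complains that it does not know how to convert a pd.Series to a boolean.
Instead, you can (and should) do: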
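One way to write this, as a minimal sketch: combine the per-column masks with the element-wise & operator instead of "and", then cast the result to 0/1. The data is the frame from the question; the intermediate name all_missing is only illustrative.

```python
import pandas as pd

dat = pd.DataFrame({
    'A': [12, 34, 56, 78, 23, None, None],
    'B': [90, 80, 70, 23, None, 78, None],
    'C': [90, 80, 70, 23, None, 78, None],
    'D': [12, 34, 56, 78, 23, None, None],
})

# & combines the boolean Series row by row; `and` would try to collapse
# each whole Series into a single True/False and raise ValueError.
all_missing = dat['A'].isnull() & dat['B'].isnull() & dat['C'].isnull()

# Cast the boolean result to 0/1 for the binary column.
dat['ismissing'] = all_missing.astype(int)
print(dat)
```

Only the last row has NaN in A, B and C at the same time, so it is the only row flagged with 1.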
You want to check whether a row has NaN in all of the columns (A, B, C).
You can do this with the following:
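A short sketch of that check on the sample input from the question: select the three columns, test each cell with isna(), and require every cell in the row to be True with all(axis=1). The column name E follows the expected output; casting with astype(int) is one of several ways to get 0/1.

```python
import numpy as np
import pandas as pd

# Sample input from the question.
df = pd.DataFrame({
    'A': [10, np.nan, 20, np.nan],
    'B': [np.nan, np.nan, 45, np.nan],
    'C': [40, 80, np.nan, np.nan],
    'D': [np.nan, 90, 89, 46],
})

# True only where A, B and C are all NaN in that row; D is ignored.
df['E'] = df[['A', 'B', 'C']].isna().all(axis=1).astype(int)
print(df)
```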
Performance comparison:
Answer by 广亨:
In [1720]: %timeit df['ismissing'] = df[['A','B','C']].isna().all(axis=1)
989 µs ± 70 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Answer by 尤本尤:
In [1719]: %timeit df['New']=~df.index.isin(df.drop('D',1).dropna(thresh=1).index)
2.05 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Answer by anky:
In [1724]: %timeit df['all_nan'] = df[['A','B','C']].count(axis=1).eq(0).view('i1')
1.48 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
My answer:
In [1723]: %timeit dat['E'] = np.where(dat[['A','B','C']].isnull().all(1), 1, 0)
914 µs ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
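As you can see, my answer with np.where was the fastest in these runs.
Let us try something new: dropna(thresh=1) keeps the rows of A, B, C that have at least one non-NaN value, so the rows whose index is not in that result are all NaN:
df['New'] = ~df.index.isin(df.drop(columns='D').dropna(thresh=1).index)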
df
A B C D New
0 10.0 NaN 40.0 NaN False
1 NaN NaN 80.0 90.0 False
2 20.0 45.0 NaN 89.0 False
3 NaN NaN NaN 46.0 True
I create a column of True and False values, then convert True to 1 and False to 0. Only A, B and C are checked, as the question asks:
dat['ismissing'] = dat[['A', 'B', 'C']].isnull().all(axis=1)
dat['ismissing'] = dat['ismissing'].astype(int)
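Thanks @anky. I was trying to find an optimized solution.
@anky I checked further, and isna() and isnull() actually do not differ much in performance. The main difference comes from np.where.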
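Timings like the ones above depend on the machine, the pandas version and the size of the frame, so before comparing speed it can help to confirm that the variants agree. The sketch below runs three of them on the sample input; the names cols, via_all, via_count and via_where are illustrative. isnull() is an alias of isna(), which fits the observation that the two perform alike.

```python
import numpy as np
import pandas as pd

# Sample input from the question.
df = pd.DataFrame({
    'A': [10, np.nan, 20, np.nan],
    'B': [np.nan, np.nan, 45, np.nan],
    'C': [40, 80, np.nan, np.nan],
    'D': [np.nan, 90, 89, 46],
})
cols = ['A', 'B', 'C']

# Variant 1: every selected cell in the row is NaN.
via_all = df[cols].isna().all(axis=1).astype(int)

# Variant 2: the row has zero non-NaN values among the selected columns.
via_count = df[cols].count(axis=1).eq(0).astype(int)

# Variant 3: np.where maps the boolean mask straight to 1/0.
via_where = pd.Series(np.where(df[cols].isnull().all(axis=1), 1, 0),
                      index=df.index)

print(via_all.tolist(), via_count.tolist(), via_where.tolist())
```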