Python: Conditionally filling NaN values in a DataFrame based on the values of non-NaN columns

python pandas

I have a question about conditionally filling NaN values in a DataFrame based on the values of non-NaN columns. To illustrate:

import numpy as np
import pandas as pd
print(pd.__version__)

0.18.1

df = pd.DataFrame({'a': [1, 0, 0, 0, 1],
                   'b': [0, 1, 0, 0, 0],
                   'c': [0, 0, 1, 1, 0],
                   'x': [0.5, 0.2, 0, 0.2, 0],
                   'y': [0, 0, 0, 1, 0],
                   'z': [0.1, 0.1, 0.9, 0, 0.4]})

df.loc[[2, 4], ['x', 'y', 'z']] = np.nan

print(df)

   a  b  c    x    y    z
0  1  0  0  0.5  0.0  0.1
1  0  1  0  0.2  0.0  0.1
2  0  0  1  NaN  NaN  NaN
3  0  0  1  0.2  1.0  0.0
4  1  0  0  NaN  NaN  NaN

Now suppose I have some default values that depend on the first three columns:

default_c = pd.Series([0.5, 0.5, 0.5], index=['x', 'y', 'z'])
default_a = pd.Series([0.2, 0.2, 0.2], index=['x', 'y', 'z'])

In other words, I want to paste default_c into the NaN values in row 2 and default_a into row 4. To do this, I came up with the following somewhat inelegant solution:

nan_x = np.isnan(df['x'])
is_c = df['c']==1
nan_c = nan_x & is_c

print(nan_c)

0    False
1    False
2     True
3    False
4    False
dtype: bool

df.loc[nan_c, default_c.index] = default_c.values

print(df)

   a  b  c    x    y    z
0  1  0  0  0.5  0.0  0.1
1  0  1  0  0.2  0.0  0.1
2  0  0  1  0.5  0.5  0.5
3  0  0  1  0.2  1.0  0.0
4  1  0  0  NaN  NaN  NaN

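Is there a better way to do this using the fillna() function?

For example, the following does not work, I am guessing because I am filling a slice of the DataFrame: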
df.loc[df['a']==1].fillna(default_a, inplace=True)

print(df)

   a  b  c    x    y    z
0  1  0  0  0.5  0.0  0.1
1  0  1  0  0.2  0.0  0.1
2  0  0  1  0.5  0.5  0.5
3  0  0  1  0.2  1.0  0.0
4  1  0  0  NaN  NaN  NaN
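A likely explanation, offered as a hedged sketch rather than a definitive diagnosis: selecting rows with a boolean mask through .loc produces a new DataFrame, so filling that selection does not reach the original. The small frame below is a hypothetical example (it is not the frame from the question) and uses a non-inplace fillna to keep the behavior easy to see.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame, only for illustration.
df = pd.DataFrame({'a': [1, 0, 1], 'x': [0.5, np.nan, np.nan]})

# Boolean-mask selection with .loc builds a new DataFrame.
subset = df.loc[df['a'] == 1]

# Filling the selection gives a filled copy of the selection only.
filled = subset.fillna(0.2)

print(filled['x'].isna().sum())  # NaN count in the filled selection
print(df['x'].isna().sum())      # NaN count in the original frame
```

The original frame keeps its NaN values unless the filled result is assigned back to it.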

But this long line does:
df.loc[df['a']==1] = df.loc[df['a']==1].fillna(default_a)

print(df)

   a  b  c    x    y    z
0  1  0  0  0.5  0.0  0.1
1  0  1  0  0.2  0.0  0.1
2  0  0  1  0.5  0.5  0.5
3  0  0  1  0.2  1.0  0.0
4  1  0  0  0.2  0.2  0.2
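The write-back pattern can be repeated for several flag columns with a short loop. This is a minimal sketch, assuming each flag column maps to one Series of defaults; the names value_cols and defaults are introduced here for illustration and do not appear in the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 0, 0, 0, 1],
                   'b': [0, 1, 0, 0, 0],
                   'c': [0, 0, 1, 1, 0],
                   'x': [0.5, 0.2, np.nan, 0.2, np.nan],
                   'y': [0, 0, np.nan, 1, np.nan],
                   'z': [0.1, 0.1, np.nan, 0, np.nan]})

value_cols = ['x', 'y', 'z']  # columns that may contain NaN

# Assumed mapping: flag column -> defaults to use where that flag equals 1.
defaults = {'a': pd.Series(0.2, index=value_cols),
            'c': pd.Series(0.5, index=value_cols)}

for flag_col, default in defaults.items():
    mask = df[flag_col] == 1
    # Fill NaN only in the selected rows, then assign the result back.
    df.loc[mask, value_cols] = df.loc[mask, value_cols].fillna(default)

print(df)
```

Rows that already hold values are left untouched, because fillna only replaces NaN entries.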

Anyway, I am just wondering how to make this code as simple as possible.
You can set the a, b, c columns as a MultiIndex and use pandas combine_first.

First, you need a frame of defaults. In your setup, it can be:
df0 = pd.concat([default_a, default_c], axis=1).T
df0.index = pd.Index([(1, 0, 0), (0, 0, 1)], names=list("abc"))
df0
Out[148]: 
         x    y    z
a b c               
1 0 0  0.2  0.2  0.2
0 0 1  0.5  0.5  0.5

Then set the MultiIndex to get df1, apply combine_first, and reset the index:
df1 = df.set_index(['a', 'b', 'c'])
>>> df1
Out[151]: 
         x    y    z
a b c               
1 0 0  0.5  0.0  0.1
0 1 0  0.2  0.0  0.1
  0 1  NaN  NaN  NaN
    1  0.2  1.0  0.0
1 0 0  NaN  NaN  NaN

df1.combine_first(df0)
Out[152]: 
         x    y    z
a b c               
0 0 1  0.5  0.5  0.5
    1  0.2  1.0  0.0
  1 0  0.2  0.0  0.1
1 0 0  0.5  0.0  0.1
    0  0.2  0.2  0.2

df1.combine_first(df0).reset_index()
Out[154]: 
   a  b  c    x    y    z
0  0  0  1  0.5  0.5  0.5
1  0  0  1  0.2  1.0  0.0
2  0  1  0  0.2  0.0  0.1
3  1  0  0  0.5  0.0  0.1
4  1  0  0  0.2  0.2  0.2

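Nice! This eliminates the need to loop over columns in the solution.

A side effect is that the output comes back in a different sort order. To preserve the order, we can carry the original index along (if it is monotonic and unique; otherwise use an additional temp column):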
df2 = df.reset_index().set_index(['a', 'b', 'c'])
>>> df2
Out[156]: 
       index    x    y    z
a b c                      
1 0 0      0  0.5  0.0  0.1
0 1 0      1  0.2  0.0  0.1
  0 1      2  NaN  NaN  NaN
    1      3  0.2  1.0  0.0
1 0 0      4  NaN  NaN  NaN

df2.combine_first(df0).reset_index().set_index('index').sort_index()
Out[160]: 
       a  b  c    x    y    z
index                        
0      1  0  0  0.5  0.0  0.1
1      0  1  0  0.2  0.0  0.1
2      0  0  1  0.5  0.5  0.5
3      0  0  1  0.2  1.0  0.0
4      1  0  0  0.2  0.2  0.2
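If keeping the original row order matters, another option is a left merge on the key columns, which avoids the re-sorting step. This is a swapped-in technique, not the combine_first approach above, and only a sketch: the defaults frame and the names keys and vals are assumptions made for illustration, and each key combination is assumed to appear at most once in defaults.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 0, 0, 0, 1],
                   'b': [0, 1, 0, 0, 0],
                   'c': [0, 0, 1, 1, 0],
                   'x': [0.5, 0.2, np.nan, 0.2, np.nan],
                   'y': [0, 0, np.nan, 1, np.nan],
                   'z': [0.1, 0.1, np.nan, 0, np.nan]})

keys = ['a', 'b', 'c']
vals = ['x', 'y', 'z']

# Assumed defaults table: one row per (a, b, c) combination.
defaults = pd.DataFrame({'a': [1, 0], 'b': [0, 0], 'c': [0, 1],
                         'x': [0.2, 0.5], 'y': [0.2, 0.5], 'z': [0.2, 0.5]})

# A left merge keeps the rows of df in their original order;
# key combinations without a default get NaN in the merged columns.
merged = df[keys].merge(defaults, on=keys, how='left')
merged.index = df.index  # realign with the original row labels

filled = df.copy()
# fillna with a DataFrame fills each NaN from the cell with the same label.
filled[vals] = df[vals].fillna(merged[vals])

print(filled)
```

Rows whose keys have no entry in defaults, such as (0, 1, 0) here, are simply left as they are.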