Python: keep the first n non-NaN cells in each row of a DataFrame
I have a pandas DataFrame in which every row has at least 4 non-NaN values, but they sit in different columns:
Index Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8
1991-12-31 100.000 100.000 100.000 89.123 NaN NaN NaN NaN
1992-01-31 98.300 101.530 100.000 NaN 92.342 NaN NaN NaN
1992-02-29 NaN 100.230 98.713 97.602 NaN NaN NaN NaN
1992-03-31 NaN NaN 102.060 93.473 98.123 NaN NaN NaN
1992-04-30 NaN 102.205 107.755 94.529 94.529 NaN NaN NaN
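(Only the first 8 columns are shown.) I want to convert this into a DataFrame with 4 columns per row.
Each row should contain only the first four non-NaN values for that date, reading from left to right.
Edit:
The order within each row matters.
You can use: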
#if necessary
#df = df.set_index('Index')
df = df.apply(lambda x: pd.Series(x.dropna().values), axis=1).iloc[:, :4]
print (df)
0 1 2 3
Index
1991-12-31 100.000 100.000 100.000 89.123
1992-01-31 98.300 101.530 100.000 92.342
1992-02-29 100.230 98.713 97.602 NaN
1992-03-31 102.060 93.473 98.123 NaN
1992-04-30 102.205 107.755 94.529 94.529
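To make the one-liner above easier to try, here is a minimal self-contained sketch. It rebuilds the sample frame from the question (first 8 columns only) and wraps the same `apply` + `dropna` idea in a helper; the name `first_n_valid` is made up for illustration and does not come from the answer.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Sample frame rebuilt from the question (first 8 columns only).
df = pd.DataFrame(
    [[100.000, 100.000, 100.000, 89.123, nan, nan, nan, nan],
     [98.300, 101.530, 100.000, nan, 92.342, nan, nan, nan],
     [nan, 100.230, 98.713, 97.602, nan, nan, nan, nan],
     [nan, nan, 102.060, 93.473, 98.123, nan, nan, nan],
     [nan, 102.205, 107.755, 94.529, 94.529, nan, nan, nan]],
    index=pd.to_datetime(["1991-12-31", "1992-01-31", "1992-02-29",
                          "1992-03-31", "1992-04-30"]),
    columns=[f"Col{i}" for i in range(1, 9)],
)

def first_n_valid(frame, n=4):
    """Keep the first n non-NaN values of each row, in left-to-right order."""
    # dropna() removes the NaNs of one row; wrapping the bare values in a new
    # Series relabels them 0, 1, 2, ... so rows line up by position rather
    # than by their original column names.
    packed = frame.apply(lambda row: pd.Series(row.dropna().values), axis=1)
    return packed.iloc[:, :n]

result = first_n_valid(df)
print(result)
```

Rows with fewer than `n` valid values are padded with NaN on the right, which matches the output shown above. The row-wise `apply` is convenient but slow on large frames.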
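Or, for better performance, use numpy. Note that the reshape only works when every row has the same number of non-NaN values (exactly 4 in the timing data below):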
a = df.values
df = pd.DataFrame(a[~np.isnan(a)].reshape(a.shape[0],-1)[:, :4], index=df.index)
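The boolean-mask-plus-reshape trick depends on each row contributing the same number of values to the flattened array. The sketch below adds an explicit check for that; the function name `left_pack_equal_counts` and the error message are illustrative assumptions, not part of the answer.

```python
import numpy as np

def left_pack_equal_counts(a, n=4):
    """Left-pack the non-NaN values of a 2-D float array, keeping n per row.

    Assumes every row holds the same number of non-NaN values.
    """
    valid = ~np.isnan(a)
    counts = valid.sum(axis=1)
    if not (counts == counts[0]).all():
        raise ValueError(
            f"rows hold different numbers of non-NaN values: {counts.tolist()}"
        )
    # Boolean indexing flattens in row-major order, so the values of each row
    # stay contiguous and in left-to-right order before the reshape.
    return a[valid].reshape(a.shape[0], -1)[:, :n]

a = np.array([[1.0, np.nan, 2.0, 3.0],
              [np.nan, 4.0, 5.0, 6.0]])
print(left_pack_equal_counts(a, n=2))
```

Without the check, unequal counts either raise a reshape error or, when the total happens to divide evenly by the number of rows, silently shift values into the wrong rows.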
Timings:
Index Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8
0 1991-12-31 100.0 100.000 100.000 89.123 NaN NaN NaN NaN
1 1992-01-31 98.3 101.530 100.000 NaN 92.342 NaN NaN NaN
2 1992-02-29 NaN 100.230 98.713 97.602 NaN NaN NaN 1.0
3 1992-03-31 NaN NaN 102.060 93.473 98.123 NaN NaN 1.0
4 1992-04-30 NaN 102.205 107.755 94.529 94.529 NaN NaN NaN
df = df.set_index('Index')
df = pd.concat([df] * 10000, ignore_index=1)
In [260]: %timeit pd.DataFrame(justify(df.values, invalid_val=np.nan, axis=1, side='left')[:,:4])
100 loops, best of 3: 6.78 ms per loop
In [261]: %%timeit a = df.values
...: pd.DataFrame(a[~np.isnan(a)].reshape(a.shape[0],-1)[:, :4], index=df.index)
...:
100 loops, best of 3: 2.11 ms per loop
In [262]: %timeit pd.DataFrame(np.sort(df.values, axis=1)[:, :4], columns=np.arange(1, 5)).add_prefix('Col')
100 loops, best of 3: 5.28 ms per loop
In [263]: %timeit pd.DataFrame(mask_app(df.values)[:,:4])
100 loops, best of 3: 8.68 ms per loop
If order does not matter, you can call np.sort, sorting along axis 1 (within each row); NaNs end up at the end of each row:
df = df.set_index('Index') # ignore if `Index` already is the index
pd.DataFrame(np.sort(df.values, axis=1)[:, :4],
columns=np.arange(1, 5)).add_prefix('Col')
Col1 Col2 Col3 Col4
0 89.123 100.000 100.000 100.000
1 92.342 98.300 100.000 101.530
2 97.602 98.713 100.230 NaN
3 93.473 98.123 102.060 NaN
4 94.529 94.529 102.205 107.755
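The reason `np.sort` works here is that NumPy places NaN after all other values when sorting floats. A small sketch with made-up numbers shows this, and also why the approach is only suitable when the original order is irrelevant:

```python
import numpy as np

a = np.array([[np.nan, 3.0, 1.0, np.nan, 2.0]])

# NaNs are moved to the end of each row, but the valid values are reordered.
sorted_rows = np.sort(a, axis=1)
print(sorted_rows)          # [[ 1.  2.  3. nan nan]]

# Keeping the first two columns yields the two smallest values (1.0, 2.0),
# not the first two valid values in original order (3.0, 1.0).
print(sorted_rows[:, :2])
```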
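This is much faster than my second solution, so if it is an option, definitely consider it.
If order matters, call sorted + apply, and take the first 4 columns of the result: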
df.apply(sorted, key=np.isnan, axis=1).iloc[:, :4]
Col1 Col2 Col3 Col4
Index
1991-12-31 100.000 100.000 100.000 89.123
1992-01-31 98.300 101.530 100.000 92.342
1992-02-29 100.230 98.713 97.602 NaN
1992-03-31 102.060 93.473 98.123 NaN
1992-04-30 102.205 107.755 94.529 94.529
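The `sorted` call keeps the original order because Python's sort is stable and the key is just a NaN flag: all non-NaN values (key `False`) come before NaNs (key `True`) without being reordered among themselves. Here is the same idea on a plain list, using `math.isnan` instead of `np.isnan` so the sketch needs no third-party imports:

```python
import math

row = [float("nan"), 100.23, 98.713, 97.602, float("nan")]

# The key is False for real numbers and True for NaN; False sorts first, and
# the stable sort leaves ties (all the real numbers) in their original order.
packed = sorted(row, key=math.isnan)
print(packed[:3])   # [100.23, 98.713, 97.602]
```

One caveat, stated as an assumption about the reader's environment: in newer pandas versions, `DataFrame.apply` with a function that returns a list may produce a Series of lists unless `result_type` is passed, so the one-liner above may need adjusting.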
Timings
Here are the timings for my answers -
df = pd.concat([df] * 10000, ignore_index=1)
%timeit df.apply(sorted, key=np.isnan, axis=1).iloc[:, :4]
1 loop, best of 3: 8.45 s per loop
%timeit pd.DataFrame(np.sort(df.values, axis=1)[:, :4], columns=np.arange(1, 5)).add_prefix('Col')
100 loops, best of 3: 4.76 ms per loop
Approach #1: Here is one using justify -
Sample run -
In [211]: df
Out[211]:
Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8
Index
1991-12-31 100.0 100.000 100.000 89.123 NaN NaN NaN NaN
1992-01-31 98.3 101.530 100.000 NaN 92.342 NaN NaN NaN
1992-02-29 NaN 100.230 98.713 97.602 NaN NaN NaN NaN
1992-03-31 NaN NaN 102.060 93.473 98.123 NaN NaN NaN
1992-04-30 NaN 102.205 107.755 94.529 94.529 NaN NaN NaN
In [212]: pd.DataFrame(justify(df.values, invalid_val=np.nan, axis=1, side='left')[:,:4])
Out[212]:
0 1 2 3
0 100.000 100.000 100.000 89.123
1 98.300 101.530 100.000 92.342
2 100.230 98.713 97.602 NaN
3 102.060 93.473 98.123 NaN
4 102.205 107.755 94.529 94.529
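The `justify` helper is called above but its definition is not included in this thread. The following is a minimal stand-in sketch covering only the `axis=1, side='left'` case used here; the name `justify_left` and its exact signature are assumptions, not the original function.

```python
import numpy as np

def justify_left(a, invalid_val=np.nan):
    """Push the valid entries of each row of a 2-D array to the left."""
    if isinstance(invalid_val, float) and np.isnan(invalid_val):
        mask = ~np.isnan(a)          # NaN never equals itself, so test explicitly
    else:
        mask = a != invalid_val
    # Sorting a boolean row puts False first; flipping moves the True slots
    # (one per valid value) to the left-hand side of each row.
    justified_mask = np.flip(np.sort(mask, axis=1), axis=1)
    out = np.full(a.shape, invalid_val, dtype=a.dtype)
    # Both masks select the same number of cells per row, in row-major order,
    # so each row's values land in that row's leftmost slots.
    out[justified_mask] = a[mask]
    return out

a = np.array([[np.nan, 1.0, np.nan, 2.0],
              [3.0, np.nan, np.nan, np.nan]])
packed = justify_left(a)
print(packed)
```

Slicing the result with `[:, :4]` then keeps the first four valid values per row, as in the sample run.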
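Approach #2: a custom function built on masks -
def app2(df, N=4):
    a = df.values
    # Uninitialized buffer: slots past a row's non-NaN count hold arbitrary values,
    # so this relies on every row having at least N non-NaN entries.
    out = np.empty_like(a)
    mask = df.isnull().values          # True where the value is NaN
    mask_sorted = np.sort(mask, 1)     # per row: False (valid slots) first
    out[~mask_sorted] = a[~mask]       # copy valid values into the leftmost slots
    return pd.DataFrame(out[:, :N])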
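Because `app2` writes into a buffer created with `np.empty_like`, any slot that receives no value is left uninitialized. That is harmless when every row has at least `N` valid values, as the question guarantees. If that guarantee does not hold, one possible variant pre-fills the buffer with NaN; the name `app2_nan_filled` is hypothetical, and this is a sketch rather than the answer's own code.

```python
import numpy as np
import pandas as pd

def app2_nan_filled(df, N=4):
    """Mask-based left packing that pads short rows with NaN."""
    a = df.to_numpy(dtype=float)
    mask = np.isnan(a)                     # True where the value is NaN
    mask_sorted = np.sort(mask, axis=1)    # per row: False (valid slots) first
    out = np.full_like(a, np.nan)          # unused slots stay NaN
    out[~mask_sorted] = a[~mask]           # valid values go to the leftmost slots
    return pd.DataFrame(out[:, :N], index=df.index)

df_small = pd.DataFrame([[np.nan, 1.0, np.nan, 2.0, 3.0],
                         [4.0, np.nan, np.nan, np.nan, np.nan]])
res = app2_nan_filled(df_small, N=3)
print(res)
```

The extra fill costs a little time compared with `np.empty_like`, so the timings reported in this thread would not carry over exactly.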
Runtime test on the working solutions that preserve order -
# Using df from posted question to recreate a bigger one :
df = df.set_index('Index')
df = pd.concat([df] * 10000, ignore_index=1)
In [298]: %timeit app2(df)
100 loops, best of 3: 4.06 ms per loop
In [299]: %timeit pd.DataFrame(justify(df.values, invalid_val=np.nan, axis=1, side='left')[:,:4])
100 loops, best of 3: 4.78 ms per loop
In [300]: %timeit df.apply(sorted, key=np.isnan, axis=1).iloc[:, :4]
1 loop, best of 3: 4.05 s per loop
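Comments:
Does the order within each row matter? If not, a high-performance solution may be possible.
It does matter (the rest of this comment is here to reach the character limit).
I assume OP wants to keep only 4 columns even if more than 4 are non-null.
This gives ValueError: cannot reshape array of size 18 into shape (5,newaxis)... could you check again?
@cᴏʟᴅsᴘᴇᴇᴅ - I added some other values, because OP guarantees at least 4 non-NaN values.
I don't know whether you can assume that, since I don't see OP mention it anywhere... am I wrong?
Yes, check the first sentence of the question.
OK, great. Let me add timings to my answer.
@Divakar your solution here would be faster too.
I get TypeError: ufunc 'isnan' not supported for the input types on the np.isnan(a) line. Any ideas?
@colspeed I think you can try pd.isnull (just a drop-in replacement, it should work).
Ah, never mind. Not setting the index was my mistake. By the way, this is a few milliseconds slower than your previous solution.
Sorting is too old now :)
OK, sure, give me a few seconds.
Timings for the other two functions from @Divakar's answer.
@Bharath feel free to edit Divakar's answer or mine (a bit busy atm).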