Python: get the top n columns while ignoring NaN
I am trying to get the top n columns (n=3 in my case) for each row while ignoring NaNs. My dataset:
import numpy as np
import pandas as pd
x = {'ID': ['1', '2', '3', '4', '5'],
     'productA': [0.47, 0.65, 0.48, 0.58, 0.67],
     'productB': [0.65, 0.47, 0.55, np.nan, np.nan],
     'productC': [0.78, np.nan, np.nan, np.nan, np.nan],
     'productD': [np.nan, np.nan, 0.25, np.nan, np.nan],
     'productE': [0.12, np.nan, 0.47, 0.12, np.nan]}
df = pd.DataFrame(x)
My expected result:
ID   top3
A1   productC-productB-productA
A2   productA-productB
A3   productB-productA-productE
A4   productA-productE
A5   productA
Try using:
df.set_index("ID").apply(
lambda x: pd.Series(x.nlargest(3).index).tolist(), axis=1
)
You can filter out the NaNs with a boolean mask (np.isnan in the code below). That will do it:
arr = df.iloc[:, 1:].to_numpy() # Leaving out `ID` col
idx = arr.argsort(axis=1)
m = np.isnan(arr)
m = m[np.arange(arr.shape[0])[:,None], idx]
out = df.columns[1:].to_numpy()[idx]
out = [v[~c][-3:] for v, c in zip(out, m)]
pd.Series(out, index= df['ID'])
ID
1 [productA, productB, productC]
2 [productB, productA]
3 [productE, productA, productB]
4 [productE, productA]
5 [productA]
dtype: object
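Note that the lists above are in ascending order, while the expected result puts the largest product first. One possible adjustment, sketched here rather than taken from the answer, is to argsort the negated array (NaN stays NaN and still sorts last) and join the surviving names. The names order, picked and valid are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": ["1", "2", "3", "4", "5"],
    "productA": [0.47, 0.65, 0.48, 0.58, 0.67],
    "productB": [0.65, 0.47, 0.55, np.nan, np.nan],
    "productC": [0.78, np.nan, np.nan, np.nan, np.nan],
    "productD": [np.nan, np.nan, 0.25, np.nan, np.nan],
    "productE": [0.12, np.nan, 0.47, 0.12, np.nan],
})

cols = df.columns[1:].to_numpy()   # product names, leaving out `ID`
arr = df.iloc[:, 1:].to_numpy()    # float array with NaN for missing values

# Sorting the negated values gives descending order; NaN is still placed last.
order = np.argsort(-arr, axis=1)[:, :3]

# Look up the values at the chosen positions and mask out any NaN among them.
picked = np.take_along_axis(arr, order, axis=1)
valid = ~np.isnan(picked)

top3 = pd.Series(
    ["-".join(names[ok]) for names, ok in zip(cols[order], valid)],
    index=df["ID"],
)
print(top3)
```

Only the final join runs in a Python loop; the sorting and masking stay vectorized.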
The apply-based approach can be slow. But you can leverage (vectorized) NumPy functions to gain some efficiency:
In [152]: %%timeit
...: df.set_index('ID').apply(lambda x: pd.Series(x.nlargest(3).index).tolist(), axis=1)
...:
...:
2.04 ms ± 19.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [153]: %%timeit
...: arr = df.iloc[:, 1:].to_numpy() # Leaving out `ID` col
...: idx = arr.argsort(axis=1)
...: m = np.isnan(arr)
...: m = m[np.arange(arr.shape[0])[:,None], idx]
...: out = df.columns[1:].to_numpy()[idx]
...: out = [v[~c][-3:] for v, c in zip(out, m)]
...:
...:
144 µs ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
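Almost a 14x gain in performance.
This is a nice solution; if performance is not an issue (unlikely), it looks like a winner.
@Matt: use the swifter library. It will make the apply operation faster.
Using the swifter library with apply will give better performance. :)
Could you also add a performance comparison with pandas apply + swifter?
I would suggest using numpy directly. Depending on your experience, you may find it a bit confusing (I certainly do).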
import numpy as np
# your data
d = {
    'productA': [0.47, 0.65, 0.48, 0.58, 0.67],
    'productB': [0.65, 0.47, 0.55, np.nan, np.nan],
    'productC': [0.78, np.nan, np.nan, np.nan, np.nan],
    'productD': [np.nan, np.nan, 0.25, np.nan, np.nan],
    'productE': [0.12, np.nan, 0.47, 0.12, np.nan]
}
# replace your NaNs with -inf, as otherwise argsort ranks them as the highest values
for k, v in d.items():
    d[k] = [-np.inf if np.isnan(i) else i for i in v]
# store as a matrix (rows are products, columns are IDs)
matrix = np.array(list(d.values()))
# your ids are 1 to 5
for i in range(1, 6):
    print(f"ID: {i}")
    # argsort with axis=0 orders the products within each column (one column per ID)
    # you then want to select the (i-1)th column [:, i - 1]
    # and reverse the order [::-1] so the largest comes first
    print(np.argsort(matrix, axis=0)[:, i - 1][::-1])
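The loop above prints positional indices for all five products, including those whose value was NaN. A minimal sketch of one way to map those indices back to product names, drop the missing entries and keep at most three per ID follows. The names raw, names, order, present and top3 are illustrative, and np.where replaces the list comprehension purely for brevity:

```python
import numpy as np

d = {
    "productA": [0.47, 0.65, 0.48, 0.58, 0.67],
    "productB": [0.65, 0.47, 0.55, np.nan, np.nan],
    "productC": [0.78, np.nan, np.nan, np.nan, np.nan],
    "productD": [np.nan, np.nan, 0.25, np.nan, np.nan],
    "productE": [0.12, np.nan, 0.47, 0.12, np.nan],
}

names = np.array(list(d))            # product names, in row order
raw = np.array(list(d.values()))     # rows are products, columns are IDs

# Replace NaN with -inf so that missing values sort lowest.
matrix = np.where(np.isnan(raw), -np.inf, raw)

# Ascending argsort per column, then reversed: largest value first.
order = np.argsort(matrix, axis=0)[::-1]

top3 = {}
for col in range(matrix.shape[1]):
    ranked = order[:, col]
    # Entries that were NaN are now -inf; keep only the finite ones.
    present = np.isfinite(matrix[ranked, col])
    top3[col + 1] = "-".join(names[ranked][present][:3])

for id_, label in top3.items():
    print(f"ID: {id_} -> {label}")
```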