Python: how to reduce the number of entries per row of a symmetric matrix by keeping the k largest values
I have a symmetric similarity matrix, and I want to keep only the k largest values in each row. Here is some code that does exactly what I want, but I'm wondering whether there is a better way. In particular, the flatten/reshape is a bit clunky. Thanks in advance. Note that nrows (below) will have to scale to the tens of thousands.
import numpy as np
from scipy.spatial.distance import pdist, squareform

np.random.seed(1)
nrows = 4
a = np.random.rand(nrows, nrows)
# Generate a symmetric similarity matrix
s = 1 - squareform(pdist(a, 'cosine'))
print("Start with:\n", s)
# Generate the sorted column indices (descending within each row),
# then offset them by row so they index into the flattened array
ss = np.argsort(s, axis=1)[:, ::-1]
s2 = ss + (np.arange(ss.shape[0]) * ss.shape[1])[:, None]
# Zero out everything after the k largest entries in each row
k = 3  # Number of top values to keep, per row
s = s.flatten()
s[s2[:, k:].flatten()] = 0
print("Desired output:\n", s.reshape(nrows, nrows))
This gives:
Start with:
[[ 1. 0.61103296 0.82177072 0.92487807]
[ 0.61103296 1. 0.94246304 0.7212526 ]
[ 0.82177072 0.94246304 1. 0.87247418]
[ 0.92487807 0.7212526 0.87247418 1. ]]
Desired output:
[[ 1. 0. 0.82177072 0.92487807]
[ 0. 1. 0.94246304 0.7212526 ]
[ 0. 0.94246304 1. 0.87247418]
[ 0.92487807 0. 0.87247418 1. ]]
This isn't much of an improvement, but to avoid the flatten and reshape you can use np.put:
# s is the 2-D similarity matrix from the question (before it is flattened)
# Generate the sorted indices
ss = np.argsort(s.view(np.ndarray), axis=1)[:,::-1]
ss += (np.arange(ss.shape[0])*ss.shape[1])[:,None] # Add in place, probably a trivial improvement
k = 3
np.put(s, ss[:,k:], 0) # or s.flat[ss[:,k:]] = 0
print(s)
[[ 1. 0. 0.82177072 0.92487807]
[ 0. 1. 0.94246304 0.7212526 ]
[ 0. 0.94246304 1. 0.87247418]
[ 0.92487807 0. 0.87247418 1. ]]
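To make the index arithmetic above easier to follow, here is a minimal sketch of the same np.put approach wrapped in a function. The function name zero_all_but_top_k and the 3x3 example matrix are made up for illustration and are not part of the original answer. It assumes a 2-D C-contiguous float array, so that row i starts at flat position i * n_cols.

```python
import numpy as np

def zero_all_but_top_k(s, k):
    """Return a copy of s in which only the k largest entries of each row are kept.

    Hypothetical helper for illustration; assumes s is a 2-D array.
    """
    out = s.copy()  # the copy is C-contiguous, so flat index = row * n_cols + col
    n_rows, n_cols = out.shape
    # Column indices of each row, ordered from largest to smallest value
    order = np.argsort(out, axis=1)[:, ::-1]
    # Add each row's starting offset to turn column indices into flat indices
    flat = order + (np.arange(n_rows) * n_cols)[:, None]
    # np.put writes through flat (1-D) indices, so no flatten/reshape is needed
    np.put(out, flat[:, k:], 0)
    return out

# Small symmetric example (illustrative values only)
s = np.array([[1.0, 0.2, 0.7],
              [0.2, 1.0, 0.5],
              [0.7, 0.5, 1.0]])
result = zero_all_but_top_k(s, 2)
print(result)
```

Working on a copy leaves the input untouched; dropping the copy would reproduce the in-place behaviour of the snippet above.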
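If you find yourself generating long lists of indices into an array, there is a good chance the problem can be solved more elegantly with a boolean matrix. In your case: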
import numpy as np

a = np.random.rand(5, 5)
a = a + a.T # make it symmetrical
# argsort of argsort gives the rank of each entry within its row (0 = smallest)
sort_idx = np.argsort(np.argsort(a, axis=1), axis=1)
k = 3 # values to keep
# if you want a copy of the original
mask = (sort_idx >= a.shape[1] - k) # positions we want to keep
b = np.zeros_like(a)
b[mask] = a[mask]
# if you wanted to do the operation in-place
# mask = (sort_idx < a.shape[1] - k) # positions we want to zero
# a[mask] = 0
>>> a
array([[ 1.87816548, 0.86562424, 1.94171234, 0.96565312, 0.53451029],
[ 0.86562424, 1.13762348, 1.48565754, 0.78031763, 0.51448499],
[ 1.94171234, 1.48565754, 1.39960519, 0.57456214, 1.32608456],
[ 0.96565312, 0.78031763, 0.57456214, 1.56469221, 0.74632264],
[ 0.53451029, 0.51448499, 1.32608456, 0.74632264, 0.55378676]])
>>> b
array([[ 1.87816548, 0. , 1.94171234, 0.96565312, 0. ],
[ 0.86562424, 1.13762348, 1.48565754, 0. , 0. ],
[ 1.94171234, 1.48565754, 1.39960519, 0. , 0. ],
[ 0.96565312, 0.78031763, 0. , 1.56469221, 0. ],
[ 0. , 0. , 1.32608456, 0.74632264, 0.55378676]])
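The double argsort is the step that tends to confuse readers, so here is a small sketch on a single row. The values and the variable names (row, order, ranks, keep, top_k) are chosen for illustration only. The first argsort returns the indices that would sort the row; the second turns that ordering into the rank of each element, which is what the mask compares against.

```python
import numpy as np

row = np.array([0.9, 0.1, 0.5, 0.7])

# Indices that would sort the row in ascending order
order = np.argsort(row)       # [1, 2, 3, 0]
# Rank of each element within the row (0 = smallest)
ranks = np.argsort(order)     # [3, 0, 1, 2]

k = 2
# An element is among the k largest when its rank is in the top k ranks
keep = ranks >= row.size - k  # [True, False, False, True]
top_k = np.where(keep, row, 0.0)
print(top_k)                  # [0.9 0.  0.  0.7]
```

Because every element gets a distinct rank, exactly k entries per row survive even when values are tied; which of the tied entries is kept depends on the sort order.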
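Exactly what I wanted. Thanks, Ophion!
Your first row hides the largest value (1.909…). It should keep the three (or "k") largest values in each row, right?
@zbinsd Yes, you are absolutely right. It is easily fixed by running argsort a second time, see my edit. I also changed the indexing condition so that it keeps the largest values rather than the smallest.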
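Since the question mentions that nrows has to scale to the tens of thousands, one option that neither answer uses is np.argpartition, which selects the k largest entries of each row without fully sorting it. The following is only a sketch of that alternative, not something proposed in the thread; the function name keep_top_k_per_row is hypothetical, and it assumes 1 <= k <= number of columns.

```python
import numpy as np

def keep_top_k_per_row(a, k):
    """Return a copy of a with all but the k largest entries of each row zeroed.

    Hypothetical helper for illustration; assumes a is 2-D and 1 <= k <= a.shape[1].
    """
    n_rows, n_cols = a.shape
    # After partitioning at position n_cols - k, the last k columns hold the
    # indices of the k largest values of each row (in no particular order)
    top_idx = np.argpartition(a, n_cols - k, axis=1)[:, n_cols - k:]
    out = np.zeros_like(a)
    rows = np.arange(n_rows)[:, None]  # broadcasts against top_idx
    out[rows, top_idx] = a[rows, top_idx]
    return out

# Small symmetric example (illustrative values only)
s = np.array([[1.0, 0.2, 0.7],
              [0.2, 1.0, 0.5],
              [0.7, 0.5, 1.0]])
pruned = keep_top_k_per_row(s, 2)
print(pruned)
```

Whether this is actually faster than the argsort versions for a given matrix size is worth measuring rather than assuming.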