Creating a dictionary of collections for some statistical objects with numpy

numpy dictionary collections statistics

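I want to use numpy to build a dictionary of collections for some statistical objects. A simplified version of the situation is as follows.

There is an array of scalars, labelled

a = np.array([n1, n2, n3, ...])

and a 2D array

b = np.array([[q1_1, q1_2], [q2_1, q2_2], [q3_1, q3_2], ...])

For each element ni in a, I want to pick out every row qi ([qi_1, qi_2]) of b that contains ni, and collect those rows under the key ni.

For this purpose I wrote a clumsy method (assuming a and b are already defined) in the following code: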
import numpy as np

a = np.array([i+1 for i in range(100)])
b = np.array([[2*i+1,2*(i+1)] for i in range(50)])
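collected = {}  # named "collected" so the built-in dict is not shadowed
for i in a:
    # keep every row of b that contains the value i
    collected[i] = [j for j in b if i in j]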

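There is no doubt that this will be very slow when a and b are large.
Is there a more efficient way to replace the approach above?
Thank you for your help.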
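Numpy arrays allow element-wise comparison:
# np.newaxis adds an axis so that every entry of b is broadcast against all of a
equal = b[:, :, np.newaxis] == a
# a row of b is included if either of its two entries equals the element
index = np.logical_or(equal[:, 0], equal[:, 1])
# index b with the boolean column for each element; the key is the value a[i], as the question asks
dictionary = {a[i]: b[index[:, i]] for i in range(len(a))}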
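To make the shapes in the broadcasting step easier to follow, here is a minimal sketch on toy data. The three-element a and the three-row b are made up for illustration and are not the arrays from the question.

```python
import numpy as np

# Toy inputs, chosen only for illustration
a = np.array([1, 2, 3])
b = np.array([[1, 2], [3, 4], [2, 3]])

# equal[r, c, k] is True when b[r, c] == a[k]
equal = b[:, :, np.newaxis] == a
# index[r, k] is True when row r of b contains a[k]
index = np.logical_or(equal[:, 0], equal[:, 1])

print(equal.shape)  # (3, 2, 3)
print(index.shape)  # (3, 3)

# Rows of b that contain a[1], i.e. the value 2
rows_with_2 = b[index[:, 1]]
print(rows_with_2)
```

Each column of index is a boolean mask over the rows of b, which is why b[index[:, k]] selects exactly the rows that contain a[k].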

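A final remark: are you sure you want a dictionary? You lose a lot of numpy's advantages that way.
Edit, in answer to your comment:
With a and b that large, equal has 10^10 elements. The result of a comparison is a boolean array that takes one byte per element, so equal alone needs about 10^10 bytes (roughly 9.3 GiB), and index adds another 5*10^9 bytes. That is why you get this error with 16 GB of RAM.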
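The size of such a temporary array can be estimated from its shape and dtype without allocating it. The following is a rough sketch, assuming the lengths mentioned in the comment (100000 for a, 50000 for b) and that the comparison yields numpy's boolean dtype. The variable names are only illustrative.

```python
import numpy as np

# Lengths taken from the comment; adjust to the real data
len_a, len_b = 100_000, 50_000

# equal has shape (len_b, 2, len_a); index has shape (len_b, len_a)
n_elements = len_b * 2 * len_a
itemsize = np.dtype(bool).itemsize  # bytes per boolean element

bytes_equal = n_elements * itemsize
bytes_index = len_b * len_a * itemsize

print(n_elements)
print(bytes_equal / 2**30, "GiB for equal")
print(bytes_index / 2**30, "GiB for index")
```

Running the same arithmetic on a smaller slice of b shows how many chunks are needed for the temporaries to fit in the available RAM.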
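The main question you should ask yourself is: do I really need arrays this large? And if so, are you sure the dictionary itself will not be just as large?
The problem can be solved by not computing everything at once but in n passes, where n should be at least the ratio of the memory required to the memory available. A somewhat larger n may even speed up the process: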
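n = 10  # number of parts; an assumed value, pick it so that one part fits in memory
dictionary = {key: [] for key in a}
for part in np.array_split(b, n):
    # compare one part of b at a time to keep the temporary arrays small
    equal = part[:, :, np.newaxis] == a
    index = np.logical_or(equal[:, 0], equal[:, 1])
    for i, key in enumerate(a):
        dictionary[key].append(part[index[:, i]])
# join the pieces collected for each key
dictionary = {key: np.concatenate(rows) for key, rows in dictionary.items()}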

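Thanks for your idea. It solves my problem completely. Your core concept is to compare a and b and get boolean arrays as the result, so building the dictionary by indexing b with those boolean arrays is much faster. Following this idea, I rewrote your code in my own way:
collected = {}  # named "collected" so the built-in dict is not shadowed
for item in a:
    # boolean masks over the rows of b, one per column
    index_left, index_right = (b[:, 0] == item), (b[:, 1] == item)
    index = np.logical_or(index_left, index_right)
    # select the matching rows from b
    collected[item] = b[index]

This code is still not faster than yours, but it avoids the "memory error" even for larger a and b (e.g. a of length 100000 and b of length 200000).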
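For comparison, here is a sketch of a different technique that is not taken from either post above: a single pass over b with collections.defaultdict, which visits each row once instead of scanning b for every element of a. The names wanted, row_ids and grouped are hypothetical, and a and b are the small arrays from the question.

```python
from collections import defaultdict

import numpy as np

a = np.array([i + 1 for i in range(100)])
b = np.array([[2 * i + 1, 2 * (i + 1)] for i in range(50)])

wanted = set(a.tolist())      # values we want keys for
row_ids = defaultdict(list)   # value -> indices of the rows of b containing it

for row_id, row in enumerate(b.tolist()):
    # set(row) so that a row such as [5, 5] is recorded only once
    for value in set(row):
        if value in wanted:
            row_ids[value].append(row_id)

# Turn the collected row indices into sub-arrays of b, one per element of a
grouped = {
    key: b[np.array(row_ids.get(key, []), dtype=np.intp)]
    for key in a.tolist()
}

print(grouped[1])
print(grouped[100])
```

The extra memory here grows with len(a) + len(b) rather than with their product, at the cost of a Python-level loop over the rows of b. Whether that trade is worthwhile depends on the actual sizes, so it is worth timing on real data.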
Thank you for your suggestion. Your code is much faster than mine, but when a and b are larger (e.g. a of length 100000 and b of length 50000, with 16 GB of RAM) it still raises a "memory error". My own code, however, no longer works in this case. How can your code be improved to prevent the error?
@zgfu1985 I modified the code, which should solve the problem. Do consider whether you really need arrays this large.