Python: optimizing/removing a loop (Python, NumPy, NetworkX)

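I have the following piece of code, which I would like to optimize using numpy, preferably by removing the loop. I can't see how to approach it, so any suggestions would be helpful.

indices is an (N, 2) numpy array of integers, and N can be a few million. The code finds the indices that are repeated in the first column. For each of those, it builds all combinations of two of the corresponding indices in the second column, and then collects them together with the index from the first column.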

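from itertools import combinations
import numpy as np

index_sets = []
# Keys (first-column values) that occur more than once
uniques, counts = np.unique(indices[:,0], return_counts=True)
potentials = uniques[counts > 1]
for p in potentials:
    # Second-column values that share the key p
    correspondents = indices[(indices[:,0] == p),1]
    # All pairwise combinations of those values, with p prepended as first column
    combs = np.vstack(list(combinations(correspondents, 2)))
    combs = np.hstack((np.tile(p, (combs.shape[0], 1)), combs))
    index_sets.append(combs)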
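
To make the intended result concrete, here is a minimal sketch of the same loop run on a small input. The six-row `indices` array is a made-up example, not data from the question, and `np.column_stack` / `np.full` are used in place of `np.hstack` / `np.tile` purely for brevity.

```python
from itertools import combinations

import numpy as np

# Hypothetical toy input: column 0 holds the keys, column 1 the values.
indices = np.array([[0, 10], [0, 11], [0, 12], [1, 20], [2, 30], [2, 31]])

index_sets = []
uniques, counts = np.unique(indices[:, 0], return_counts=True)
for p in uniques[counts > 1]:          # key 1 occurs once, so it is skipped
    correspondents = indices[indices[:, 0] == p, 1]
    combs = np.array(list(combinations(correspondents, 2)))
    index_sets.append(np.column_stack((np.full(len(combs), p), combs)))

for block in index_sets:
    print(block.tolist())
# [[0, 10, 11], [0, 10, 12], [0, 11, 12]]
# [[2, 30, 31]]
```

Key 0 has three values and therefore yields 3 pairs, key 2 has two values and yields 1 pair.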

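A few improvements can be suggested:

- Initialize the output array up front. We can precompute the number of rows needed to store the combinations for each group: for `N` elements, the total number of pairwise combinations is `N*(N-1)/2`, which gives the combination length for each group. The total number of rows in the output array is then the sum of all those interval lengths.
- Precompute as much as possible in a vectorized manner before entering the loop.
- Use a loop only to get the combinations, since that ragged pattern cannot be vectorized. Use `np.repeat` to simulate the tiling, and do it before the loop, to give each group its first element and thereby fill the first column of the output array.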
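
As a rough illustration of the bookkeeping these points describe, the sketch below computes the per-group row counts, the preallocated output and its first column for a small array. The input and the names `group_sizes`, `rows_per_group` and `offsets` are chosen here for illustration only, and the input is assumed to be already sorted by its first column.

```python
import numpy as np

# Hypothetical toy input, already sorted by the first column.
indices = np.array([[0, 10], [0, 11], [0, 12], [2, 30], [2, 31]])

keys, group_sizes = np.unique(indices[:, 0], return_counts=True)

# n*(n-1)/2 pairs per group; integer division keeps the counts integral.
rows_per_group = group_sizes * (group_sizes - 1) // 2
total_rows = int(rows_per_group.sum())

# Preallocate the output and fill its first column without any tiling in a loop.
out = np.zeros((total_rows, 3), dtype=int)
out[:, 0] = np.repeat(keys, rows_per_group)

# Start/stop rows of each group inside `out`.
offsets = np.concatenate(([0], np.cumsum(rows_per_group)))
```

Only columns 1 and 2 of `out` remain to be filled, one slice `out[offsets[i]:offsets[i+1], 1:]` per group.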

With all of these improvements in place, the implementation would look like this:

from itertools import combinations
import numpy as np

# Sort by the first column so that equal keys form contiguous groups
indices = indices[np.argsort(indices[:,0], kind='stable')]

# Remove rows with counts == 1
_,idx, counts = np.unique(indices[:,0], return_index=True, return_counts=True)
indices = np.delete(indices,idx[counts==1],axis=0)

# Decide the starting indices corresponding to the start of new groups,
# characterized by new elements along the sorted first column
start_idx = np.unique(indices[:,0], return_index=True)[1]
all_idx = np.append(start_idx,indices.shape[0])

# Get interval lengths that are required to store pairwise combinations
# of each group for unique ID from column-0 (integer division keeps them ints)
group_sizes = np.diff(all_idx)
interval_lens = group_sizes*(group_sizes-1)//2

# Setup output array and set the first column as a repeated array
out = np.zeros((interval_lens.sum(),3),dtype=int)
out[:,0] = np.repeat(indices[start_idx,0],interval_lens)

# Decide the start-stop indices for storing into output array
ssidx = np.append(0,np.cumsum(interval_lens))

# Finally run a loop to store all the combinations into the initialized output array
# (one iteration per remaining group, hence start_idx.size rather than idx.size)
for i in range(start_idx.size):
    out[ssidx[i]:ssidx[i+1],1:] = \
    np.vstack(list(combinations(indices[all_idx[i]:all_idx[i+1],1],2)))

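Note that the output is a single large array of shape (M, 3) and is not split into a list of arrays as in the original code. If the list is still needed, np.split can be used to produce it.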
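
A minimal sketch of that split, assuming the rows of the (M, 3) array are grouped by key. The `out` array below is a hypothetical stand-in for the real result, and the split points are derived from changes in the first column; the interior cumulative offsets (`ssidx[1:-1]` in the code above) should serve equally well.

```python
import numpy as np

# Hypothetical (M, 3) result: three rows for key 0, one row for key 2.
out = np.array([[0, 10, 11], [0, 10, 12], [0, 11, 12], [2, 30, 31]])

# A new block starts wherever the key in column 0 changes.
boundaries = np.flatnonzero(np.diff(out[:, 0])) + 1
index_sets = np.split(out, boundaries)
```

`index_sets` is then a list with one array per key, matching the layout produced by the original loop.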

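Also, quick runtime tests suggest that the proposed code brings little improvement, so most of the runtime is probably spent on generating the combinations. An alternative approach that is specifically suited to this kind of connection-based problem may therefore be a better fit.

Here is a solution that is vectorized over N. Note that it still contains a for loop, but it is a loop over each "key multiplicity group", which is guaranteed to be a much smaller number (typically a few dozen at most).

For N = 1,000,000, the runtime is on the order of a second on my PC.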

import numpy as np
import numpy_indexed as npi
N = 1000000
indices = np.random.randint(0, N // 10, size=(N, 2))

def combinations(x):
    """vectorized computation of combinations for an array of sequences of equal length

    Parameters
    ----------
    x : ndarray, [..., n_items]

    Returns
    -------
    ndarray, [..., n_items * (n_items - 1) / 2, 2]
    """
    return np.rollaxis(x[..., np.triu_indices(x.shape[-1], 1)], -2, x.ndim+1)

def process(indices):
    """process a subgroup of indices, all having equal multiplicity

    Parameters
    ----------
    indices : ndarray, [n, 2]

    Returns
    -------
    ndarray, [m, 3]
    """
    keys, vals = npi.group_by(indices[:, 0], indices[:, 1])
    combs = combinations(vals)
    keys = np.repeat(keys, combs.shape[1])
    return np.concatenate([keys[:, None], combs.reshape(-1, 2)], axis=1)

index_groups = npi.group_by(npi.multiplicity(indices[:, 0])).split(indices)
result = np.concatenate([process(ind) for ind in index_groups])
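
For readers without numpy_indexed installed, the same idea can be approximated in plain NumPy. This is only a sketch under stated assumptions, not the package's implementation: the function name `pairs_by_key` and the toy input are invented here, and the grouping is done with a stable sort plus `np.unique` instead of `npi.group_by`.

```python
import numpy as np

def pairs_by_key(indices):
    """Return (M, 3) rows [key, value_i, value_j] for every key with 2+ values."""
    # Stable sort so equal keys are contiguous and keep their original order.
    indices = indices[np.argsort(indices[:, 0], kind="stable")]
    keys, starts, counts = np.unique(
        indices[:, 0], return_index=True, return_counts=True
    )

    blocks = []
    # Loop over the distinct multiplicities (few), not over the keys (many).
    for m in np.unique(counts[counts > 1]):
        sel = counts == m
        # Row r holds the m values of the r-th key that has multiplicity m.
        vals = indices[starts[sel][:, None] + np.arange(m), 1]
        i, j = np.triu_indices(m, 1)                     # all position pairs i < j
        combs = np.stack((vals[:, i], vals[:, j]), axis=-1)  # (n_keys, n_pairs, 2)
        key_col = np.repeat(keys[sel], i.size)
        blocks.append(np.column_stack((key_col, combs.reshape(-1, 2))))

    if not blocks:
        return np.empty((0, 3), dtype=indices.dtype)
    return np.concatenate(blocks)

# Hypothetical toy input, deliberately unsorted.
indices = np.array([[2, 30], [0, 10], [0, 11], [1, 20], [0, 12], [2, 31]])
result = pairs_by_key(indices)
```

The rows of `result` come out ordered by multiplicity rather than by key, so sort them afterwards if a key-ordered result is required.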

Disclaimer: I am the author of the numpy_indexed package.

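This sounds like a network problem, so it may be worth looking at the networkx module.

I tried benchmarking it against my answer, but it gave a memory error for N = 1,000,000 :)

Have you tried it? I would love to know how it works out in practice.

Sorry, I have not been able to try it yet. I had to switch to another task with higher priority, but I will definitely use this code. I will try to test it by the end of the day and report back.

I finally ran some tests with the code you posted. It does produce the same results, and it is about 26 times faster than my original code! Thank you very much for your answer and for the numpy_indexed module. I think it is a very useful module and I will probably use it again.

I found the problem you posed interesting.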