Python equivalent of MATLAB's "ismember" function


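After many attempts to optimize my code, the last resort seems to be running the code below on multiple cores. I don't know how to convert or restructure it so that it runs faster using multiple cores, and I would appreciate guidance toward the end goal: running this code as fast as possible for arrays A and B, where each array holds about 700,000 elements. Below is the code using small arrays. The 700k-element arrays are commented out.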

import numpy as np

def ismember(a,b):
    for i in a:
        index = np.where(b==i)[0]
        if index.size == 0:
            yield 0
        else:
            yield index[0]  # lowest matching index in b


def f(A, gen_obj):
    my_array = np.arange(len(A))
    for i in my_array:
        my_array[i] = next(gen_obj)
    return my_array


#A = np.arange(700000)
#B = np.arange(700000)
A = np.array([3,4,4,3,6])
B = np.array([2,5,2,6,3])

gen_obj = ismember(A,B)

f(A, gen_obj)

print('done')
# if we print f(A, gen_obj) the output will be: [4 0 0 4 3]
# notice that the output array needs to be kept the same size as array A.

What I would like to do is mimic the MATLAB function ismember, in the form [Lia, Locb] = ismember(A, B). I only need the Locb part.

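From the MATLAB documentation: Locb contains the lowest index in B for each value in A that is a member of B. Wherever a value of A is not a member of B, the output array Locb contains 0.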
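
To make the target behavior concrete, here is a minimal reference sketch of the semantics described above, applied to the small arrays from the question. The function name `matlab_style_ismember` is hypothetical, and the sketch is deliberately slow; it only illustrates what MATLAB would return (1-based indices, 0 for "not found").

```python
import numpy as np

def matlab_style_ismember(a, b):
    """Reference semantics only: 1-based Locb, 0 where the value is not in b."""
    b_list = np.asarray(b).tolist()
    lia, locb = [], []
    for value in np.asarray(a).tolist():
        if value in b_list:
            lia.append(True)
            # list.index returns the first (lowest) position; +1 makes it 1-based
            locb.append(b_list.index(value) + 1)
        else:
            lia.append(False)
            locb.append(0)
    return np.array(lia), np.array(locb)

A = np.array([3, 4, 4, 3, 6])
B = np.array([2, 5, 2, 6, 3])
Lia, Locb = matlab_style_ismember(A, B)
print(Lia)   # [ True False False  True  True]
print(Locb)  # [5 0 0 5 4]
```

Note that the question's own code reports 0-based positions, so the same inputs give [4 0 0 4 3] there, and a 0 then becomes ambiguous between "not found" and "found at position 0".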

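One of the main problems is that I need to perform this operation as efficiently as possible. For testing I have two arrays of 700k elements. Creating a generator and stepping through its values does not seem to get the job done quickly.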

Try using a list comprehension:

In [1]: import numpy as np

In [2]: A = np.array([3,4,4,3,6])

In [3]: B = np.array([2,5,2,6,3])

In [4]: [x for x in A if not x in B]
Out[4]: [4, 4]

In general, list comprehensions are faster than equivalent explicit for loops.

To get a list of equal length:

In [19]: list(map(lambda x: x if x not in B else False, A))
Out[19]: [False, 4, 4, False, False]

This is reasonably fast for small datasets:

In [20]: C = np.arange(10000)

In [21]: D = np.arange(15000, 25000)

In [22]: %timeit list(map(lambda x: x if x not in D else False, C))
1 loops, best of 3: 756 ms per loop

For large datasets, you could try multiprocessing.Pool.map() to speed up the operation.

Before worrying about multiple cores, I would eliminate the linear scan in your ismember function by using a dictionary:

def ismember(a, b):
    bind = {}
    for i, elt in enumerate(b):
        if elt not in bind:
            bind[elt] = i
    return [bind.get(itm, None) for itm in a]  # None can be replaced by any other "not in b" value

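Your original implementation requires a full scan of the elements in B for each element in A, making it O(len(A)*len(B)). The code above requires one full scan of B to build the dict bind. With a dict, each lookup of an element of A takes effectively constant time, which makes the whole operation O(len(A)+len(B)). If this is still too slow, then worry about making the above function run on multiple cores.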
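
The difference between the two approaches can be checked on moderately sized random data. The following is a rough sketch, not a benchmark: the helper names `locb_scan` and `locb_dict`, the array sizes, and the `-1` "not found" value are all assumptions made for illustration.

```python
import timeit
import numpy as np

def locb_scan(a, b, missing=-1):
    """Quadratic version: one full scan of b for every element of a."""
    out = np.full(len(a), missing, dtype=int)
    for k, value in enumerate(a):
        hits = np.flatnonzero(b == value)
        if hits.size:
            out[k] = hits[0]  # lowest matching index
    return out

def locb_dict(a, b, missing=-1):
    """Linear version: build the first-index table once, then look up."""
    first_index = {}
    for i, value in enumerate(b.tolist()):
        first_index.setdefault(value, i)  # keeps the first occurrence only
    return np.array([first_index.get(v, missing) for v in a.tolist()], dtype=int)

rng = np.random.default_rng(0)
a = rng.integers(0, 50, size=2000)
b = rng.integers(0, 50, size=2000)

# Both versions should agree element for element.
same = np.array_equal(locb_scan(a, b), locb_dict(a, b))
print("results agree:", same)

# Timings depend on the machine, so no expected numbers are given here.
print("scan:", timeit.timeit(lambda: locb_scan(a, b), number=1))
print("dict:", timeit.timeit(lambda: locb_dict(a, b), number=1))
```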

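Edit: I've also modified the indexing slightly. MATLAB uses 0 for "not found" because all of its arrays start at index 1. Python/numpy arrays start at 0, so if your data set looks like this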

A = [2378, 2378, 2378, 2378]
B = [2378, 2379]

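and you return 0 for "no element", then your results would exclude all elements of A. The routine above returns None for "no index" instead of 0. Returning -1 is an option, but Python will interpret that as the last element of the array. None will raise an exception if it is used as an index into a Python list (a NumPy array instead treats None as np.newaxis). If you would like different behavior, change the second argument of the bind.get(itm, None) expression to the value you want returned.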
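
The following small sketch illustrates how each candidate "not found" value behaves when it is later used as an index, and one way to use a `-1` sentinel safely by masking first. The variable names are illustrative only, and the `locb` values are written out by hand for the example arrays from the question.

```python
import numpy as np

B = np.array([2, 5, 2, 6, 3])
b_list = B.tolist()

# -1 is a valid index: it silently selects the last element.
last_via_minus_one = B[-1]

# None raises TypeError when used to index a plain Python list ...
try:
    b_list[None]
    list_raised = False
except TypeError:
    list_raised = True

# ... but a NumPy array treats None as np.newaxis and adds a dimension.
shape_with_none = B[None].shape

# One safer pattern: keep -1 as the sentinel and mask before indexing.
locb = np.array([4, -1, -1, 4, 3])  # 0-based positions, -1 means "not in B"
found = locb >= 0
matched_values = B[locb[found]]

print(last_via_minus_one, list_raised, shape_with_none, matched_values)
```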

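sfstewman's excellent answer most likely solved the issue for you.

I'd just like to add how you can achieve the same thing exclusively in numpy.

I make use of numpy's unique and isin functions (isin supersedes the older in1d).

B_unique_sorted, B_idx = np.unique(B, return_index=True)
B_in_A_bool = np.isin(B_unique_sorted, A, assume_unique=True)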

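B_unique_sorted contains the unique values of B, sorted.
B_idx holds, for these values, their indices in the original B.
B_in_A_bool is a boolean array the size of B_unique_sorted that stores whether each value of B_unique_sorted is in A.

Note: I need to look for (the unique values from B) in A, because I need the output to be returned with respect to B_idx.
Note: I assume that A is already unique.

Now you can use B_in_A_bool to get the common values

B_unique_sorted[B_in_A_bool]

and their respective indices in the original B

B_idx[B_in_A_bool]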
Finally, I assume that this is significantly faster than a pure Python for loop, although I have not tested it.
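
If an output with one entry per element of A is needed, as the question asks, a related numpy-only approach is to combine np.unique with np.searchsorted. This is a sketch of a swapped-in technique, not the method used above; the function name `ismember_locb` and the `missing=-1` default are assumptions.

```python
import numpy as np

def ismember_locb(a, b, missing=-1):
    """Return (found, locb): locb[i] is the lowest 0-based index of a[i] in b,
    or `missing` where a[i] does not occur in b."""
    a = np.asarray(a)
    b = np.asarray(b)
    # b_first[k] is the index of the first occurrence of b_unique[k] in b
    b_unique, b_first = np.unique(b, return_index=True)
    if b_unique.size == 0:
        return np.zeros(a.shape, dtype=bool), np.full(a.shape, missing, dtype=int)
    # Candidate position of each a[i] in the sorted unique values
    pos = np.searchsorted(b_unique, a)
    # Values larger than everything in b land one past the end; clip them
    pos = np.minimum(pos, b_unique.size - 1)
    found = b_unique[pos] == a
    locb = np.where(found, b_first[pos], missing)
    return found, locb

A = np.array([3, 4, 4, 3, 6])
B = np.array([2, 5, 2, 6, 3])
found, locb = ismember_locb(A, B)
print(found)  # [ True False False  True  True]
print(locb)   # [ 4 -1 -1  4  3]
```

The output keeps the length of A, and A does not need to be unique here because each element is looked up independently.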
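Here is an exact MATLAB equivalent that returns both output arguments [Lia, Locb], matching MATLAB, except that in Python 0 is also a valid index. The function therefore does not return 0s for missing values; it essentially returns Locb(Locb>0). Performance is also comparable to MATLAB.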
def ismember(a_vec, b_vec):
    """ MATLAB equivalent ismember function """

    bool_ind = np.isin(a_vec,b_vec)
    common = a_vec[bool_ind]
    common_unique, common_inv  = np.unique(common, return_inverse=True)     # common = common_unique[common_inv]
    b_unique, b_ind = np.unique(b_vec, return_index=True)  # b_unique = b_vec[b_ind]
    common_ind = b_ind[np.isin(b_unique, common_unique, assume_unique=True)]
    return bool_ind, common_ind[common_inv]
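
Because the function above returns only the indices for the matching elements, its second output is shorter than A. A minimal sketch of expanding it back to the length of A follows; the helper name `expand_locb` and the `-1` fill value are assumptions, and the inputs are written out by hand for the example arrays A = [3, 4, 4, 3, 6] and B = [2, 5, 2, 6, 3].

```python
import numpy as np

def expand_locb(lia, locb_compact, missing=-1):
    """Scatter the compact index array into a full-length array shaped like lia."""
    lia = np.asarray(lia, dtype=bool)
    full = np.full(lia.shape, missing, dtype=int)
    full[lia] = locb_compact  # fill only the positions where a match exists
    return full

lia = np.array([True, False, False, True, True])
locb_compact = np.array([4, 4, 3])
full_locb = expand_locb(lia, locb_compact)
print(full_locb)  # [ 4 -1 -1  4  3]
```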

An alternative implementation that is somewhat slower (about 5x) but does not use the unique function is given below:
def ismember(a_vec, b_vec):
    ''' MATLAB equivalent ismember function. Slower than above implementation'''
    # Iterate in reverse so that the first (lowest) index of each value wins
    b_dict = {b_vec[i]: i for i in range(len(b_vec) - 1, -1, -1)}
    indices = [b_dict[x] for x in a_vec if x in b_dict]
    booleans = np.isin(a_vec, b_vec)
    return booleans, np.array(indices, dtype=int)

Try the ismember library:
pip install ismember

A simple example:
# Import library
from ismember import ismember
import numpy as np

# data
A = np.array([3,4,4,3,6])
B = np.array([2,5,2,6,3])

# Lookup
Iloc,idx = ismember(A, B)
 
# Iloc is boolean defining existence of d in d_unique
print(Iloc)
# [ True False False  True  True]

# indexes of d_unique that exists in d
print(idx)
# [4 4 3]

print(B[idx])
# [3 3 6]

print(A[Iloc])
# [3 3 6]

# These vectors will match
A[Iloc]==B[idx]

Speed check:
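import numpy as np
from ismember import ismember
from datetime import datetime

# Note: ismember_stack is defined at the end of this snippet; run that cell first.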

t1=[]
t2=[]
# Create some random vectors
ns = np.random.randint(10,10000,1000)

for n in ns:
    a_vec = np.random.randint(0,100,n)
    b_vec = np.random.randint(0,100,n)

    # Run stack version
    start = datetime.now()
    out1=ismember_stack(a_vec, b_vec)
    end = datetime.now()
    t1.append(end - start)

    # Run ismember
    start = datetime.now()
    out2=ismember(a_vec, b_vec)
    end = datetime.now()
    t2.append(end - start)


print(np.sum(t1))
# 0:00:07.778331

print(np.sum(t2))
# 0:00:04.609801

# %%
def ismember_stack(a, b):
    bind = {}
    for i, elt in enumerate(b):
        if elt not in bind:
            bind[elt] = i
    return [bind.get(itm, None) for itm in a]  # None can be replaced by any other "not in b" value

The ismember function from PyPI is almost 2x faster.

Large vectors, e.g. 700,000 elements:
from ismember import ismember
from datetime import datetime

A = np.random.randint(0,100,700000)
B = np.random.randint(0,100,700000)

# Lookup
start = datetime.now()
Iloc,idx = ismember(A, B)
end = datetime.now()

# Print time
print(end-start)
# 0:00:01.194801

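The output array needs to be kept the same size.

@z5151: See the enhanced answer. If needed, you can change the lambda expression to return 0 instead of False, but that would mask real 0s in the result.

This is useful for arrays with a small number of elements. Thanks for highlighting that list comprehensions are much faster than loops.

Your answer returns the elements, not the indices of the elements in B.

Wow, this is fast! You have no idea how much I appreciate your solution. Thank you very much! Do you use a specific tool to output the performance profile?

@z5151 No, it's simple algorithm analysis. np.where has to perform a linear scan over B, which takes O(len(B)) operations. You then wrap it in an outer loop that takes O(len(A)) iterations, making the original algorithm roughly O(len(A)*len(B)) operations. Generating bind takes len(B) operations. Dictionaries are implemented as hash tables, which have constant O(1) lookup, so scanning A is O(len(A)); the overall complexity is O(len(A)+len(B)).

Got it. Thanks for the Wikipedia reference.

@EOL No, you broke the code. The returned elements are now the last occurrence in the list, not the first. There is a reason I did not use a dictionary comprehension in the original code.

@EOL As far as I know, you can use a dictionary comprehension by iterating over the reversed range: {B[i]: i for i in xrange(le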