Python: optimizing nested loops

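I have a pandas DataFrame with columns A, B, C, D (each 0 or 1) and columns AB, AC, AD, BC, BD, CD holding their pairwise interactions (also 0 or 1).

Based on the interactions, I want to determine the presence of the "triplets" ABC, ABD, ACD, BCD, as shown in the following MWE: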

import numpy as np
import pandas as pd
df = pd.DataFrame()

np.random.seed(1)

df["A"] = np.random.randint(2, size=10)
df["B"] = np.random.randint(2, size=10)
df["C"] = np.random.randint(2, size=10)
df["D"] = np.random.randint(2, size=10)

df["AB"] = np.random.randint(2, size=10)
df["AC"] = np.random.randint(2, size=10)
df["AD"] = np.random.randint(2, size=10)
df["BC"] = np.random.randint(2, size=10)
df["BD"] = np.random.randint(2, size=10)
df["CD"] = np.random.randint(2, size=10)

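ls = ["A", "B", "C", "D"]
# Visit every triplet (a, b, c) of distinct names, in list order.
for i, a in enumerate(ls):
    for j in range(i + 1, len(ls)):
        b = ls[j]
        for k in range(j + 1, len(ls)):
            c = ls[k]
            idx = a+b+c  # name of the triplet column, e.g. "ABC"

            # Rows where all three single columns are active.
            idx_abc = (df[a]>0) & (df[b]>0) & (df[c]>0)
            # For those rows, count how many of the three pair columns are active.
            sum_abc = df[idx_abc][a+b] + df[idx_abc][b+c] + df[idx_abc][a+c]

            # Default to 0, then flag the rows with at least two active pairs.
            df[idx]=0
            df.loc[sum_abc.index[sum_abc>=2], idx] = 999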

This gives the following output:

   A  B  C  D  AB  AC  AD  BC  BD  CD  ABC  ABD  ACD  BCD
0  1  0  0  0   1   0   0   1   1   0    0    0    0    0
1  1  1  1  0   1   1   1   1   0   0  999    0    0    0
2  0  0  0  1   1   0   1   0   0   1    0    0    0    0
3  0  1  0  1   1   0   0   0   1   1    0    0    0    0
4  1  1  1  1   1   1   1   0   1   1  999  999  999  999
5  1  0  0  1   1   1   1   0   0   0    0    0    0    0
6  1  0  0  1   0   1   1   1   1   1    0    0    0    0
7  1  1  0  0   1   0   1   1   1   1    0    0    0    0
8  1  0  1  0   1   1   0   1   0   0    0    0    0    0
9  0  0  0  0   0   0   0   0   1   1    0    0    0    0

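The logic behind the code is as follows: the triplet ABC is active (marked 999 in the code above) if at least two of the columns AB, AC, BC are active (=1) and the individual columns A, B, C are all active (=1).

I always start by looking at the individual columns (for ABC, these are A, B and C). We only "keep" the rows where A, B and C are all non-zero. Then, looking at the interactions AB, AC and BC, we only "enable" the triplet ABC if at least two of AB, AC and BC are 1, which is the case only for rows 1 and 4. So ABC = 999 for rows 1 and 4, and ABC = 0 for all other rows. I do this for every possible triplet (4 in this example).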
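To make the rule concrete, here is a minimal single-row sketch. The helper name `triplet_active` is purely illustrative and does not appear in the question; the input values are read from rows 1, 4 and 8 of the output table above.

```python
def triplet_active(a, b, c, ab, ac, bc):
    """Apply the triplet rule to one row of 0/1 values."""
    singles_on = a > 0 and b > 0 and c > 0   # all three single columns active
    pairs_on = (ab + ac + bc) >= 2           # at least two pair columns active
    return singles_on and pairs_on


# Row 1: A=B=C=1 and AB=AC=BC=1, so ABC is flagged.
row1 = triplet_active(1, 1, 1, 1, 1, 1)
# Row 4: A=B=C=1, AB=AC=1 but BC=0; two active pairs are still enough.
row4 = triplet_active(1, 1, 1, 1, 1, 0)
# Row 8: B=0, so ABC stays 0 even though AB, AC and BC are all 1.
row8 = triplet_active(1, 0, 1, 1, 1, 1)
```

Here `row1` and `row4` evaluate to `True` and `row8` to `False`, matching the 999 / 0 entries in the ABC column.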

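Since the DataFrame is small, the code above runs quickly. In my actual code, however, the DataFrame has more than a million rows and hundreds of interactions, and there it runs very slowly.

Is there a way to optimize the code above, for example by multithreading?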

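Here is an approach that is roughly 10x faster than the reference code. It doesn't do anything particularly clever, just pedestrian optimization.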

import numpy as np
import pandas as pd
df = pd.DataFrame()

np.random.seed(1)

df["A"] = np.random.randint(2, size=10)
df["B"] = np.random.randint(2, size=10)
df["C"] = np.random.randint(2, size=10)
df["D"] = np.random.randint(2, size=10)

df["AB"] = np.random.randint(2, size=10)
df["AC"] = np.random.randint(2, size=10)
df["AD"] = np.random.randint(2, size=10)
df["BC"] = np.random.randint(2, size=10)
df["BD"] = np.random.randint(2, size=10)
df["CD"] = np.random.randint(2, size=10)

ls = ["A", "B", "C", "D"]

def op():
    out = df.copy()
    for i, a in enumerate(ls):
        for j in range(i + 1, len(ls)):
            b = ls[j]
            for k in range(j + 1, len(ls)):
                c = ls[k]
                idx = a+b+c

                idx_abc = (out[a]>0) & (out[b]>0) & (out[c]>0)
                sum_abc = out[idx_abc][a+b] + out[idx_abc][b+c] + out[idx_abc][a+c]

                out[a+b+c]=0
                out.loc[sum_abc.index[sum_abc>=2], a+b+c] = 99
    return out

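import scipy.spatial.distance as ssd
from itertools import combinations

def pp():
    data = df.values
    n = len(ls)
    # d1: the n single columns; d2: the n*(n-1)/2 pair columns (AB, AC, ...).
    d1,d2 = np.split(data, [n], axis=1)
    # Keep a pair only where both of its single columns are active.
    i,j = np.triu_indices(n,1)
    d2 = d2 & d1[:,i] & d1[:,j]
    # Enumerate all index triplets k < i < j.
    k,i,j = np.ogrid[:n,:n,:n]
    k,i,j = np.where((k<i)&(i<j))
    # lu[p,q] is the position of pair (p,q) among the pair columns.
    lu = ssd.squareform(np.arange(n*(n-1)//2))
    # A triplet is active where at least two of its three masked pairs are.
    d3 = ((d2[:,lu[k,i]]+d2[:,lu[i,j]]+d2[:,lu[k,j]])>=2).view(np.uint8)*99
    *triplets, = map("".join, combinations(ls,3))
    out = df.copy()
    out[triplets] = pd.DataFrame(d3, columns=triplets)
    return out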

from string import ascii_uppercase
from itertools import combinations, chain

def make(nl=8, nr=1000000, seed=1):
    np.random.seed(seed)
    letters = np.fromiter(ascii_uppercase, 'U1', nl)
    df = pd.DataFrame()
    for l in chain(letters, map("".join,combinations(letters,2))):
        df[l] = np.random.randint(0,2,nr,dtype=np.uint8)
    return letters, df

df1 = op()
df2 = pp()
assert (df1==df2).all().all()

ls, df = make(8,1000)

df1 = op()
df2 = pp()
assert (df1==df2).all().all()

from timeit import timeit

print(timeit(op,number=10))
print(timeit(pp,number=10))

ls, df = make(26,250000)
import time

t0 = time.perf_counter()
df2 = pp()
t1 = time.perf_counter()
print(t1-t0)
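For readers who find the index arithmetic in `pp` dense, the same idea can be sketched with `itertools.combinations` and plain boolean arrays. This is a readability-oriented variant written under assumptions, not the answer's method: the names `add_triplets`, `singles` and `flag` are illustrative, and it has not been benchmarked against `pp`.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def add_triplets(frame, singles, flag=99):
    """Return a copy of `frame` with one flag column per triplet of `singles`."""
    # Pull each needed column out once as a boolean NumPy array.
    on = {s: frame[s].to_numpy() > 0 for s in singles}
    pair = {
        (a, b): frame[a + b].to_numpy() > 0
        for a, b in combinations(singles, 2)
    }
    new_cols = {}
    for a, b, c in combinations(singles, 3):
        all_singles = on[a] & on[b] & on[c]
        # Count how many of the three pair columns are active in each row.
        n_pairs = (pair[a, b].astype(np.uint8)
                   + pair[a, c].astype(np.uint8)
                   + pair[b, c].astype(np.uint8))
        new_cols[a + b + c] = np.where(all_singles & (n_pairs >= 2), flag, 0)
    # Attach all new columns at once instead of inserting them one by one.
    return pd.concat([frame, pd.DataFrame(new_cols, index=frame.index)], axis=1)


# Small demo on made-up random data (not the question's seed).
rng = np.random.default_rng(0)
names = ["A", "B", "C", "D"]
cols = names + ["".join(p) for p in combinations(names, 2)]
demo = pd.DataFrame(rng.integers(0, 2, size=(10, len(cols))), columns=cols)
result = add_triplets(demo, names)
```

The loop over triplets remains in Python, but each iteration works on whole columns, so no row-level indexing through pandas is needed.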

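Can you explain the logic a bit more? This example is not explicitly reproducible because you use random values as input. It would help to see a static input, the output generated from it, and the logic of what exactly a "triplet" is in this context.

This won't help performance, but the loops themselves could be tidied up.

@G.Anderson I have added a seed and explained the logic in more detail. Please let me know if anything is still unclear.

You don't need multithreading; you need to learn how to apply masks. In this example, if "ABC" is the same as "A" & "B" & "C", you can use

df['ABC'] = 0; df['ABC'][(df['A'] == 1) & (df['B'] == 1) & (df['C'] == 1)] = 999

@hilberts_drinking_problem I have nearly 3000 triplets (corresponding to nearly 30 "individual" columns A, B, C, etc.) and 500,000 rows.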
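The one-liner in the comment above assigns through chained indexing (`df['ABC'][mask] = ...`), a pattern the pandas documentation warns against. A minimal sketch of the same mask written with `.loc`, on a made-up three-row frame, might look like this. Like the comment, it only covers the simplified case where the triplet depends on the single columns alone.

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({"A": [1, 1, 0], "B": [1, 0, 1], "C": [1, 1, 1]})

# Boolean mask: rows where all three single columns are 1.
mask = (df["A"] == 1) & (df["B"] == 1) & (df["C"] == 1)

df["ABC"] = 0
df.loc[mask, "ABC"] = 999  # a single indexing step, no chained assignment
```

Only the first row has A, B and C all equal to 1, so only that row receives 999.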

Sample run of the timing script (seconds):

3.2022583668585867 # op 8 symbols, 1000 rows, 10 repeats
0.2772211490664631 # pp 8 symbols, 1000 rows, 10 repeats
12.412292044842616 # pp 26 symbols, 250,000 rows, single run
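To put these problem sizes in perspective, the number of triplets grows as "n choose 3". The rough sketch below counts the triplets and estimates the size of the `uint8` result block that `pp` builds. The helper `triplet_block_bytes` is a hypothetical name, and the estimate ignores the input columns and any temporary copies.

```python
from math import comb


def triplet_block_bytes(n_symbols, n_rows, itemsize=1):
    """Return (number of triplets, bytes for an n_rows x n_triplets block)."""
    n_triplets = comb(n_symbols, 3)
    # itemsize=1 assumes one byte per entry, as with a uint8 array.
    return n_triplets, n_triplets * n_rows * itemsize


small = triplet_block_bytes(8, 1000)        # (56, 56000)
large = triplet_block_bytes(26, 250_000)    # (2600, 650000000)
```

So the 26-symbol run produces 2600 triplet columns, or about 650 MB for the result block alone at one byte per entry.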