Python: How to get rid of nested loops?

python python-3.x pandas

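I have two for loops, one nested inside the other, and I would like to get rid of them somehow to speed up the code. My dataframe in pandas looks like the following (headers represent different companies, rows represent different users, 1 means the user accessed that company, 0 otherwise):

I want to compare every pair of companies in the dataset, and for that I created a list that contains all the company IDs. The code walks through the list: it first takes a company (the base company) and then pairs it with every other company (the peers), which is where the second for loop comes from. My code is below: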

def calculate_scores():
    # create_the_matrix, df and list_of_companies are defined elsewhere
    df_matrix = create_the_matrix(df)
    print(df_matrix)
    for base in list_of_companies:
        for peer in list_of_companies:
            if base == peer:
                continue  # a company is never paired with itself

            # Calculate the denominator first: slice the big matrix
            # down to the users who have accessed the base firm
            denominator_df = df_matrix.loc[df_matrix[base] == 1]
            denominator = denominator_df.sum(axis=1).values.tolist()
            denominator = sum(denominator) - len(denominator)

            # Calculate the numerator afterwards: filter the slice above
            # down to records accessed by both the base and the peer firm
            numerator_df = denominator_df.loc[(denominator_df[base] == 1) & (denominator_df[peer] == 1)]
            numerator = len(numerator_df.index)
            annual_search_fraction = numerator / denominator
            print("Base: {} and Peer: {} ==> {}".format(base, peer, annual_search_fraction))

Edit 1 (added an explanation of the code):

The metric is as follows:

Base: 100 and Peer: 200 ==> 0.5
Base: 100 and Peer: 300 ==> 0.25
Base: 100 and Peer: 400 ==> 0.25
Base: 200 and Peer: 100 ==> 0.5
Base: 200 and Peer: 300 ==> 0.25
Base: 200 and Peer: 400 ==> 0.25
Base: 300 and Peer: 100 ==> 0.5
Base: 300 and Peer: 200 ==> 0.5
Base: 300 and Peer: 400 ==> 0.0
Base: 400 and Peer: 100 ==> 0.5
Base: 400 and Peer: 200 ==> 0.5
Base: 400 and Peer: 300 ==> 0.0

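1) The metric I am trying to calculate tells me how often two companies are searched together, relative to all other searches.

2) The code first selects all users who accessed the base company (the line denominator_df = df_matrix.loc[(df_matrix[base] == 1)]). It then calculates the denominator, which is the number of unique combinations between the base company and any other company searched by those users. Since I can count the number of companies each user accessed, I subtract 1 to get the number of unique links between the base company and the other companies.

3) Next, the code filters the previous denominator_df to keep only the rows that accessed both the base company and the peer company. Because I need the number of users who accessed both, I use numerator = len(numerator_df.index) to count the rows, which gives the numerator.

The expected output for the dataframe at the top is as follows: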

Base: 100 and Peer: 200 ==> 0.5
Base: 100 and Peer: 300 ==> 0.25
Base: 100 and Peer: 400 ==> 0.25
Base: 200 and Peer: 100 ==> 0.5
Base: 200 and Peer: 300 ==> 0.25
Base: 200 and Peer: 400 ==> 0.25
Base: 300 and Peer: 100 ==> 0.5
Base: 300 and Peer: 200 ==> 0.5
Base: 300 and Peer: 400 ==> 0.0
Base: 400 and Peer: 100 ==> 0.5
Base: 400 and Peer: 200 ==> 0.5
Base: 400 and Peer: 300 ==> 0.0

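4) A check that the code gives the correct solution: the metrics between one base company and all of its peer companies must sum to 1. They do in the code I posted.

Any suggestions or tips would be much appreciated.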
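To make steps 1) to 4) concrete, here is a minimal pure-Python sketch of the metric as described. The `visits` data and the `search_fraction` helper are hypothetical (the original dataframe is not reproduced here); the two users were chosen so that the results match the expected output listed above.

```python
# Hypothetical visit data: each set holds the companies one user accessed.
visits = [
    {100, 200, 300},
    {100, 200, 400},
]
companies = [100, 200, 300, 400]


def search_fraction(base, peer, visits):
    # Step 2: keep only the users who accessed the base company
    base_users = [v for v in visits if base in v]
    # Denominator: links between the base and any other company,
    # i.e. (companies accessed - 1) summed over those users
    denominator = sum(len(v) - 1 for v in base_users)
    # Step 3: users who accessed both the base and the peer
    numerator = sum(1 for v in base_users if peer in v)
    return numerator / denominator


for base in companies:
    row = {peer: search_fraction(base, peer, visits)
           for peer in companies if peer != base}
    # Step 4: the fractions for one base company must sum to 1
    assert abs(sum(row.values()) - 1.0) < 1e-9
    print(base, row)
```

This sketch assumes every base company has at least one user who also accessed another company; otherwise the denominator is zero and the division fails.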

You are probably looking for itertools.product(). Here is an example similar to what you seem to want to do:

import itertools

a = [ 'one', 'two', 'three' ]

for b in itertools.product( a, a ):
    print( b )

The output of the snippet above is:

('one', 'one')
('one', 'two')
('one', 'three')
('two', 'one')
('two', 'two')
('two', 'three')
('three', 'one')
('three', 'two')
('three', 'three')

Or you can unpack each pair:

for u, v in itertools.product( a, a ):
    print( "%s %s" % (u, v) )

The output is then:

one one
one two
one three
two one
two two
two three
three one
three two
three three

If you want a list, you can do the following:

alist = list( itertools.product( a, a ) )

print( alist )

The output is:

[('one', 'one'), ('one', 'two'), ('one', 'three'), ('two', 'one'), ('two', 'two'), ('two', 'three'), ('three', 'one'), ('three', 'two'), ('three', 'three')]
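Note that itertools.product( a, a ) also yields pairs such as ('one', 'one'), which the question skips with its base == peer check. As a small sketch (the company IDs below are hypothetical), itertools.permutations with a length of 2 yields only the ordered pairs of distinct items:

```python
import itertools

companies = [100, 200, 300, 400]  # hypothetical company IDs

# All ordered (base, peer) pairs where base and peer are different items
pairs = list(itertools.permutations(companies, 2))

for base, peer in pairs:
    print("Base: {} and Peer: {}".format(base, peer))
```

This flattens the two loops into one, but the number of pairs visited is still n * (n - 1), so by itself it is unlikely to change the running time much.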

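Please explain the logic behind it, and the expected output.

It sounds like you need a Cartesian product of the DataFrame. If performance is not a concern, I favor the simple syntax of assigning a dummy key and merging on it.

There may be several ways to remove the nested for loop, but based on what you want to do with it, I don't think you can improve its time complexity. If I understand correctly, you will always be stuck with O(n^2) time complexity. My best suggestion is to move as many objects as possible outside the for loops so that there are fewer operations on each pass.

Regarding the goal of removing nested loops, it would help to know why that is the desired goal, since that can be used to design an answer. If your goal is performance, consider that you are performing redundant logic on "base" (for example, calculating the denominator). If you move that out of the second for loop you will see some time savings, because you are currently repeating the same calculation len(list_of_companies) times.

@Yatu, I have included more details about the code. @ALollz, I will look into the Cartesian product, thanks for the suggestion. Unfortunately, performance is the issue here. @Sam, I will definitely move more things out of the main loop.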
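Following the comments about computing the denominator once per base company, one possible sketch goes a step further and computes every numerator and denominator at once with matrix products. This is not code from the thread: the `pairwise_scores` function and the sample `df_matrix` are hypothetical, and it assumes the matrix contains only 0/1 values.

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-company matrix (rows: users, columns: company IDs)
df_matrix = pd.DataFrame(
    [[1, 1, 1, 0],
     [1, 1, 0, 1]],
    columns=[100, 200, 300, 400],
)


def pairwise_scores(df_matrix):
    m = df_matrix.to_numpy(dtype=float)
    # co_visits[base, peer] = number of users who accessed both companies
    co_visits = m.T @ m
    # For each user: how many companies they accessed besides any given one
    links_per_user = m.sum(axis=1) - 1
    # denominator[base] = links summed over the users who accessed base
    denominator = m.T @ links_per_user
    # A company is never paired with itself
    np.fill_diagonal(co_visits, np.nan)
    # A base with a zero denominator yields NaN instead of raising
    with np.errstate(divide="ignore", invalid="ignore"):
        scores = co_visits / denominator[:, None]
    return pd.DataFrame(scores, index=df_matrix.columns, columns=df_matrix.columns)


scores = pairwise_scores(df_matrix)
print(scores)
```

Rows of the result are base companies and columns are peers. The amount of work is still quadratic in the number of companies, but it runs inside NumPy rather than in a Python-level loop, which is usually where the time goes.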