Python pandas closure table

I want to build a closure table with pandas. Suppose you have hierarchical data like the following, where each unit is assigned an ID:

import pandas as pd

df = pd.DataFrame(
    {
        'unit_0': ['A','A','A','A','A','A','A','A'],
        'unit_1': ['B','C','C','C','D','D','E','E'],
        'unit_2': ['F','G','G','H','I','I','J','J']
    }
)

units = [col for col in df]

closure = (df[units].melt(var_name='depth')
                    .drop_duplicates()
                    .rename(columns={'value': 'unit_name'}))

closure['unit_name_id'] = range(0, len(closure))

Now I want to give the table a parent_unit_id column, as follows:

depth   unit_name   unit_name_id    parent_unit_id                  
unit_0  A           0               
unit_1  B           1               0
unit_1  C           2               0
unit_1  D           3               0
unit_1  E           4               0
unit_2  F           5               1
unit_2  G           6               2
unit_2  H           7               2
unit_2  I           8               3
unit_2  J           9               4

In this example every child has exactly one parent, but what if the frame looked slightly different, with the last J in unit_2 changed to I?
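A sketch of that modified frame, assuming only the last entry of unit_2 changes from 'J' to 'I'. The name df_multi is made up here for illustration and does not appear in the original post:

```python
import pandas as pd

# Same data as before, except the last entry of unit_2 is 'I' instead of 'J'.
# df_multi is a hypothetical name used only in this sketch.
df_multi = pd.DataFrame(
    {
        'unit_0': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'],
        'unit_1': ['B', 'C', 'C', 'C', 'D', 'D', 'E', 'E'],
        'unit_2': ['F', 'G', 'G', 'H', 'I', 'I', 'J', 'I'],
    }
)

# Collect the distinct unit_1 values that sit directly above 'I'.
parents_of_I = sorted(df_multi.loc[df_multi['unit_2'] == 'I', 'unit_1'].unique())
print(parents_of_I)
```

With this frame, 'I' sits under both 'D' and 'E', so a single parent ID is no longer enough.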

In that case my parent_unit_id for I would have to be a list, [3, 4].

The following should achieve this:

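# Lookup from unit name to its ID in the closure table.
unit_name_to_id = {
    unit_name: unit_id
    for unit_name, unit_id
    in closure[["unit_name", "unit_name_id"]].values
}

def get_parents(df, unit_name_to_id, depth, unit_name):
    # depth looks like "unit_2"; the parent level is the column one number lower.
    unit_number = int(depth.split("_")[1])
    parent_unit_number = unit_number - 1
    parent_unit_column = f"unit_{parent_unit_number}"
    if parent_unit_column not in df:
        # Top level: no parent column exists.
        return []
    parents = df[df[depth] == unit_name][parent_unit_column]
    return parents.map(unit_name_to_id).unique().tolist()

closure["parent_unit_id"] = closure\
    .apply(lambda row: get_parents(df, unit_name_to_id, row["depth"], row["unit_name"]), axis=1)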

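Note that this uses pd.DataFrame.apply() with axis=1, which iterates over all rows internally and is therefore slow. If you need a faster solution, let me know.

As a side note, merge and groupby could also be used to speed this up.

Just build the (predecessor) graph and map it onto the unit_name_id column: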

import pandas as pd
from collections import defaultdict

df = pd.DataFrame(
    {
        'unit_0': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'],
        'unit_1': ['B', 'C', 'C', 'C', 'D', 'D', 'E', 'E'],
        'unit_2': ['F', 'G', 'G', 'H', 'I', 'I', 'J', 'J']
    }
)

units = [col for col in df]
closure = (df[units].melt(var_name='depth')
           .drop_duplicates()
           .rename(columns={'value': 'unit_name'}))
closure['unit_name_id'] = range(0, len(closure))


def parents(frame, close):
    predecessors = defaultdict(set)
    lookup = {k: v for k, v in close[['unit_name', 'unit_name_id']].values}
    for row in frame.values:
        for i, node in enumerate(row[1:], 1):
            predecessors[lookup[node]].add(lookup[row[i - 1]])
    # sorted() gives a deterministic order; an empty set yields [].
    return {k: sorted(predecessors[k]) for k in close['unit_name_id']}


closure['parent_unit_id'] = closure['unit_name_id'].map(parents(df, closure))

print(closure)

Output

     depth unit_name  unit_name_id parent_unit_id
0   unit_0         A             0             []
8   unit_1         B             1            [0]
9   unit_1         C             2            [0]
12  unit_1         D             3            [0]
14  unit_1         E             4            [0]
16  unit_2         F             5            [1]
17  unit_2         G             6            [2]
19  unit_2         H             7            [2]
20  unit_2         I             8            [3]
22  unit_2         J             9            [4]

Changing the last J in unit_2 to I, as in the question, yields:

     depth unit_name  unit_name_id parent_unit_id
0   unit_0         A             0             []
8   unit_1         B             1            [0]
9   unit_1         C             2            [0]
12  unit_1         D             3            [0]
14  unit_1         E             4            [0]
16  unit_2         F             5            [1]
17  unit_2         G             6            [2]
19  unit_2         H             7            [2]
20  unit_2         I             8         [3, 4]
22  unit_2         J             9            [4]
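For comparison, the merge and groupby route mentioned in the note above might look roughly like the sketch below. It is not taken from either answer. The names edges, parent_lookup, parent_lists and result are made up for illustration, and the frame is the variant with the last J changed to I.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'unit_0': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'],
        'unit_1': ['B', 'C', 'C', 'C', 'D', 'D', 'E', 'E'],
        'unit_2': ['F', 'G', 'G', 'H', 'I', 'I', 'J', 'I'],
    }
)
units = list(df.columns)
closure = (df[units].melt(var_name='depth')
           .drop_duplicates()
           .rename(columns={'value': 'unit_name'}))
closure['unit_name_id'] = range(len(closure))

# One row per distinct (parent, child) pair for each pair of adjacent columns.
edges = pd.concat(
    [
        df[[parent_col, child_col]]
        .drop_duplicates()
        .rename(columns={parent_col: 'parent_name', child_col: 'unit_name'})
        .assign(parent_depth=parent_col, depth=child_col)
        for parent_col, child_col in zip(units[:-1], units[1:])
    ],
    ignore_index=True,
)

# Translate parent names into IDs by merging against the closure table.
parent_lookup = closure.rename(columns={
    'depth': 'parent_depth',
    'unit_name': 'parent_name',
    'unit_name_id': 'parent_unit_id',
})
edges = edges.merge(parent_lookup, on=['parent_depth', 'parent_name'])

# Collect all parent IDs of each child into a list.
parent_lists = (edges.groupby(['depth', 'unit_name'])['parent_unit_id']
                .agg(list)
                .reset_index())

# Attach the lists; units without a parent get NaN from the left merge,
# which is replaced by an empty list.
result = closure.merge(parent_lists, on=['depth', 'unit_name'], how='left')
result['parent_unit_id'] = result['parent_unit_id'].apply(
    lambda v: v if isinstance(v, list) else []
)
print(result)
```

Merging on both depth and name, rather than on name alone, keeps the lookup unambiguous if the same name were to appear at two different levels.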

Comparing the two solutions gives the following timings:

%timeit solution_bstadlbauer(df, closure)
9.29 ms ± 498 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit solution_danimesejo(df, closure)
1.28 ms ± 86.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

For larger DataFrames the difference will probably grow. The code for comparing the solutions was linked from the answer.

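Great, thanks! A suggestion: build the lookup with

unit_name_to_id = pd.Series(closure.unit_name_id.values, index=closure.unit_name).to_dict()

and use a more general approach in which the function does not hard-code unit_ in parent_unit_column.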
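One possible reading of the suggestion not to hard-code unit_ is to take the parent level from the column order instead of from the column name. The sketch below assumes the columns are ordered from root to leaf. The column names company, division and team are invented for illustration, and get_parents here is a variant, not the function from the answer.

```python
import pandas as pd


def get_parents(df, unit_name_to_id, depth, unit_name):
    """Return the IDs of all parents of unit_name, using column order only."""
    columns = list(df.columns)
    position = columns.index(depth)
    if position == 0:
        # The first column is the root level and has no parent.
        return []
    parent_column = columns[position - 1]
    parents = df.loc[df[depth] == unit_name, parent_column]
    return parents.map(unit_name_to_id).unique().tolist()


# Hypothetical column names; the data is the frame with the last J changed to I.
df = pd.DataFrame(
    {
        'company': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'],
        'division': ['B', 'C', 'C', 'C', 'D', 'D', 'E', 'E'],
        'team': ['F', 'G', 'G', 'H', 'I', 'I', 'J', 'I'],
    }
)
closure = (df.melt(var_name='depth')
           .drop_duplicates()
           .rename(columns={'value': 'unit_name'}))
closure['unit_name_id'] = range(len(closure))

# Name-to-ID lookup built from a Series, as suggested in the comment.
unit_name_to_id = pd.Series(
    closure['unit_name_id'].values, index=closure['unit_name']
).to_dict()

# Build the column as a Series of lists aligned to closure's index.
closure['parent_unit_id'] = pd.Series(
    [get_parents(df, unit_name_to_id, depth, name)
     for depth, name in zip(closure['depth'], closure['unit_name'])],
    index=closure.index,
)
print(closure)
```

Like the original lookup, this assumes unit names are unique across levels, since the dictionary is keyed by name alone.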