Python: Find the longest parent-child chain in a DataFrame


python pandas algorithm dataframe


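Scenario

I have a DataFrame. Each row contains an item that may, but need not, be linked to a parent item or a child item, like a doubly linked list. The rows are unsorted, but a parent item's id is always smaller than its child's id.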

import pandas as pd
import numpy as np

df = pd.DataFrame(columns=['Item Id', 'Parent Id', 'Child Id'],
                  data=[[1006, np.nan, np.nan],
                        [1001, np.nan, 1005],
                        [1004, 1003, 1007],
                        [1003, 1002, 1004],
                        [1005, 1001, np.nan],
                        [1002, np.nan, 1003],
                        [1007, 1004, np.nan]
                        ])
print(df)
#    Item Id  Parent Id  Child Id
# 0     1006        NaN       NaN
# 1     1001        NaN    1005.0
# 2     1004     1003.0    1007.0
# 3     1003     1002.0    1004.0
# 4     1005     1001.0       NaN
# 5     1002        NaN    1003.0
# 6     1007     1004.0       NaN

So the DataFrame contains 3 chains:

1001=>1005
1002=>1003=>1004=>1007
1006

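Question

How do I find the length of the longest chain in this DataFrame? (i.e. 3 in the given DataFrame)

I would take all the items that have `np.nan` as their Parent Id and follow each one recursively until the longest chain is found. You could also do it the other way round: look for the items with `np.nan` as their Child Id, which are the last link of a chain, and walk back recursively until there is no parent left.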
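The second variant (starting from the chain ends and walking back up) could look roughly like the sketch below. It is only an illustration, not code from the thread: it assumes the rows have already been pulled out of the DataFrame as plain `(item, parent, child)` tuples with `None` for a missing link, it uses a loop instead of recursion, and the names `rows` and `longest_chain` are made up for the example.

```python
# Rows as (item_id, parent_id, child_id); None marks a missing link (assumed layout).
rows = [
    (1006, None, None),
    (1001, None, 1005),
    (1004, 1003, 1007),
    (1003, 1002, 1004),
    (1005, 1001, None),
    (1002, None, 1003),
    (1007, 1004, None),
]


def longest_chain(rows):
    """Walk from every chain end back to its root and return the longest chain."""
    parent_of = {item: parent for item, parent, _child in rows}
    # Items without a child are the last link of their chain.
    leaves = [item for item, _parent, child in rows if child is None]

    best = []
    for leaf in leaves:
        chain = [leaf]
        # Follow the parent links until an item without a parent is reached.
        while parent_of[chain[-1]] is not None:
            chain.append(parent_of[chain[-1]])
        if len(chain) > len(best):
            best = chain[::-1]  # store the chain in root-to-leaf order
    return best


chain = longest_chain(rows)
print(chain)           # [1002, 1003, 1004, 1007]
print(len(chain) - 1)  # 3 links between the 4 items
```

The chain has 4 items joined by 3 links, so the expected answer of 3 corresponds to the number of links.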

Here is one way to do it. It is not optimized at all, but it gets you what you want without recursion:

data = [[1006, None, None],
        [1001, None, 1005],
        [1004, 1003, 1007],
        [1003, 1002, 1004],
        [1005, 1001, None],
        [1002, None, 1003],
        [1007, 1004, None]
    ]


class Node:
    def __init__(self, value, parent=None, child=None):
        self.value = value
        self.parent = parent
        self.child = child


nodes = {}
parent_ids = []

for entry in data:
    (itm, parent, child) = entry
    nodes[itm] = Node(itm, parent, child)
    if parent is None:
        parent_ids.append(itm)

for parent_id in parent_ids:
    chain = [str(parent_id)]
    node = nodes[parent_id]
    while node.child is not None:
        chain.append(str(node.child))
        node = nodes[node.child]
    print(" -> ".join(chain))

Output:

1006
1001 -> 1005
1002 -> 1003 -> 1004 -> 1007

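Well, neither pandas nor the underlying numpy is good at solving graph problems.

But you can represent each chain as a list, build a list of all the chains, and then sort it. I would use an auxiliary dict to link each item to its chain: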

chains = []
seen = {}

for _, row in df.sort_values("Item Id").iterrows():
    itemId = row['Item Id']
    childId = row['Child Id']
    if itemId in seen:
        chain = seen[itemId]
    else:                                     # this is a new chain
        chain = seen[itemId] = [itemId]
        chains.append(chain)
    if not np.isnan(childId):                 # add the child to the end of the chain
        seen[childId] = chain
        chain.append(childId)
chains.sort(key=lambda x: len(x))             # and sort the list of chains

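(This algorithm relies on the property that a parent item's id is always smaller than its child's id.)

With your input DataFrame, it gives: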

>>> print(chains)
[[1006.0], [1001.0, 1005.0], [1002.0, 1003.0, 1004.0, 1007.0]]
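If only the length is needed, the same ordering property allows a shorter route that never stores the chains. The sketch below is an assumption-laden illustration rather than part of the answer: it works on plain `(item, parent, child)` tuples with `None` for a missing link, and `longest_chain_size` is a hypothetical helper name. Processing items in ascending id order guarantees that a parent's chain size is known before its child is visited.

```python
# Rows as (item_id, parent_id, child_id); None marks a missing link (assumed layout).
rows = [
    (1006, None, None),
    (1001, None, 1005),
    (1004, 1003, 1007),
    (1003, 1002, 1004),
    (1005, 1001, None),
    (1002, None, 1003),
    (1007, 1004, None),
]


def longest_chain_size(rows):
    """Return the number of items in the longest chain."""
    size = {}  # item id -> number of items in the chain ending at that item
    # Parents have smaller ids than their children, so sorting by item id
    # visits every parent before its child.
    for item, parent, _child in sorted(rows, key=lambda r: r[0]):
        # An item without a (known) parent starts a new chain of size 1.
        size[item] = size.get(parent, 0) + 1
    return max(size.values(), default=0)


items = longest_chain_size(rows)
print(items)      # 4 items in 1002 => 1003 => 1004 => 1007
print(items - 1)  # 3 links
```

Note that this counts items; subtract one to get the number of links, which is what the question's expected answer of 3 refers to.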

Following @bli's suggestion, I used networkx to convert the DataFrame into a directed graph and got the answer with dag_longest_path() and dag_longest_path_length():

import networkx as nx

G = nx.from_pandas_edgelist(df[~df['Child Id'].isna()], 'Item Id', 'Child Id',
                            edge_attr=True, create_using=nx.DiGraph())

Output:

>>> print(nx.dag_longest_path(G))
[1002, 1003, 1004, 1007.0]
>>> print(nx.dag_longest_path_length(G))
3
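For readers who cannot install networkx, the idea behind a longest-path search in a directed acyclic graph can be sketched with the standard library alone. This is not networkx's implementation, only a rough stand-in built on `graphlib.TopologicalSorter` (Python 3.9+). The function name `longest_path` and the hand-written edge list (one `(item, child)` pair per row that has a child) are assumptions made for the example.

```python
from graphlib import TopologicalSorter

# One (item, child) edge per row that has a child; 1006 has no edge at all.
edges = [(1001, 1005), (1004, 1007), (1003, 1004), (1002, 1003)]


def longest_path(edges):
    """Return the longest path in a DAG as a list of nodes."""
    # Map every node to the set of its predecessors.
    preds = {}
    for parent, child in edges:
        preds.setdefault(parent, set())
        preds.setdefault(child, set()).add(parent)

    # best[node] = (edge count of the longest path ending at node, previous node)
    best = {}
    for node in TopologicalSorter(preds).static_order():
        # All predecessors were handled earlier thanks to the topological order.
        best[node] = max(((best[p][0] + 1, p) for p in preds[node]),
                         default=(0, None))
    if not best:
        return []

    # Start at the node with the longest path and follow the back-pointers.
    node = max(best, key=lambda n: best[n][0])
    path = []
    while node is not None:
        path.append(node)
        node = best[node][1]
    return path[::-1]


path = longest_path(edges)
print(path)           # [1002, 1003, 1004, 1007]
print(len(path) - 1)  # 3 edges
```

Unlike the list-based approach, this sketch does not depend on parent ids being smaller than child ids, since the topological sort supplies the processing order.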
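I would suggest building a graph with networkx from the information in the DataFrame; it probably already implements an algorithm for this kind of problem.

Thanks @bli, I worked out a solution based on your suggestion. I did not even know networkx had the from_* methods.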