Pandas: every x rows depend on the value of another row


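I have a dataset in which each row is either a parent or a child. The ID format differs slightly between parents and children, so I should be able to tell them apart with a regex.

The structure looks like this: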

Parent ID | Other data
Child ID | Other data
Child ID | Other data
Child ID | Other data
Parent ID | Other data
Child ID | Other data
Parent ID | Other data
Child ID | Other data
Child ID | Other data
Child ID | Other data


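The number of children varies. The one thing that is certain is that a parent comes first, followed by its children, then the next parent and its children, and so on.

I'm not sure how to go about this. Ideally, I would iterate through the rows and tag every child with its parent ID in a separate (new) column.

It's not a great structure, but it's what the data source provides.

I want output like this: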


Parent ID | Other data
Child ID | Other data | Parent ID
Child ID | Other data | Parent ID
Child ID | Other data | Parent ID
Parent ID | Other data | 
Child ID | Other data | Parent ID
Parent ID | Other data |
Child ID | Other data | Parent ID
Child ID | Other data | Parent ID
Child ID | Other data | Parent ID

So the whole file, thousands of rows, follows this format: a parent is listed first, then all of its children, then the next parent.

You can certainly do this with ffill and some masking:

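# Identify all parent rows.
# Replace the pattern with your own regex.
patt = r'(Parent)'
is_parent = df['ID'].str.extract(patt, expand=False).notna()

# Parent IDs: keep the ID on parent rows, forward-fill it down to the
# children, then blank it out on the parent rows themselves.
df['ParentID'] = df['ID'].where(is_parent).ffill().mask(is_parent)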

Output:

          ID        data   ParentID
0  Parent ID  Other data        NaN
1   Child ID  Other data  Parent ID
2   Child ID  Other data  Parent ID
3   Child ID  Other data  Parent ID
4  Parent ID  Other data        NaN
5   Child ID  Other data  Parent ID
6  Parent ID  Other data        NaN
7   Child ID  Other data  Parent ID
8   Child ID  Other data  Parent ID
9   Child ID  Other data  Parent ID
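
For anyone who wants to run the idea end to end, here is a minimal self-contained sketch that splits the one-liner into its three steps. The toy frame and the column names ID and data are assumptions read off the output above, not the asker's real file, and the pattern "Parent" is a placeholder for whatever regex matches real parent IDs.

```python
import pandas as pd

# Toy data mirroring the layout in the question (assumed column names).
df = pd.DataFrame({
    "ID": ["Parent ID", "Child ID", "Child ID", "Child ID",
           "Parent ID", "Child ID",
           "Parent ID", "Child ID", "Child ID", "Child ID"],
    "data": ["Other data"] * 10,
})

# Boolean mask: True on parent rows. Swap in your own regex here.
is_parent = df["ID"].str.contains(r"Parent", regex=True)

# Step 1: keep the ID on parent rows only; child rows become NaN.
parents_only = df["ID"].where(is_parent)
# Step 2: forward-fill so each child inherits the closest parent above it.
filled = parents_only.ffill()
# Step 3: blank out the parent rows themselves.
df["ParentID"] = filled.mask(is_parent)

print(df)
```

Because ffill only looks upward, this relies on the ordering guarantee from the question: every child row sits somewhere below its own parent and above the next one.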


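Please post your expected output. It helps to fully understand your problem.

Thanks for the heads-up, I edited the post.

This could be an issue. Please use some numbers or distinct values for the different parent IDs.

Parent ID: 9843112356, child: 7744321, so the formats differ and I should be able to identify them with a regex or some other logic. The main problem is the logic for iterating through the dataframe and tagging the rows correctly. That is, if I wrote a for loop, I would start at row 1 as the parent and tag every child row until I reached a new parent row. Then, when a new parent shows up, I would do the same, repeating until done.

No, I don't think you need to go row by row. You can use pandas' vectorized methods. Just change a few rows of your data (fake numbers are fine, rather than a generic parent ID and child ID), and I or someone else should be able to offer a possible solution.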
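
To make the loop-versus-vectorized point concrete, here is a hedged sketch using made-up numeric IDs. It assumes parents are exactly 10 digits and children are shorter, which is only a guess based on the single example pair given in the comments (9843112356 and 7744321). The real rule may differ. The explicit loop is the row-by-row logic described above, included only to check that both approaches agree.

```python
import pandas as pd

# Made-up IDs: 10-digit parents, 7-digit children (an assumption).
# IDs are kept as strings; if they load as integers, use .astype(str) first.
df = pd.DataFrame({
    "ID": ["9843112356", "7744321", "7744322",
           "9843112357", "7744323",
           "9843112358", "7744324", "7744325"],
    "data": list("abcdefgh"),
})

# Vectorized: a row is a parent if its whole ID is exactly 10 digits.
is_parent = df["ID"].str.fullmatch(r"\d{10}")
df["ParentID"] = df["ID"].where(is_parent).ffill().mask(is_parent)

# Row-by-row version of the same logic, kept only as a cross-check.
current, labels = None, []
for value, parent in zip(df["ID"], is_parent):
    if parent:
        current = value        # remember the most recent parent
        labels.append(None)    # parent rows get no ParentID
    else:
        labels.append(current)
df["ParentID_loop"] = labels

print(df)
```

Both columns should come out the same. The vectorized version avoids a Python-level loop over thousands of rows, which is why it is usually the preferred form in pandas.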