Python: combining rows in a DataFrame

I have a DataFrame that holds the results of an NER classifier, for example:

df =

s        token        pred       tokenID
17     hakawati       B-Loc         3
17     theatre        L-Loc         3
17     jerusalem      U-Loc         7
56     university     B-Org         5
56     of             I-Org         5
56     texas          I-Org         5
56     here           L-Org         6
...
5402   dwight         B-Peop        1    
5402   d.             I-Peop        1
5402   eisenhower     L-Peop        1

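The DataFrame has many other columns that are not relevant here. I now want to group the tokens by their sentence ID (= s) and their predicted tag, so that they are combined into one entity per row: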

df2 =


s        token                        pred               
17     hakawati  theatre           Location
17     jerusalem                   Location
56     university of texas here    Organisation
...
5402   dwight d. eisenhower        People

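Normally I would do this by simply using

data_map = df.groupby(["s"], as_index=False, sort=False).agg(" ".join)

and then applying a rename function. However, because the data contains different kinds of tag strings (B-, I-, L-Loc/Org, ...), I don't know exactly how to do it.

Any ideas or suggestions are welcome.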

You can group by s and tokenID, and aggregate as follows:

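import pandas as pd

def aggregate(group):
    # Join all tokens of the group into one space-separated string.
    token = " ".join(group.token)
    # Take the first tag and drop its positional prefix: "B-Loc" -> "Loc".
    pred = group.iloc[0].pred.split("-", 1)[1]
    return pd.Series({"token": token, "pred": pred})

df.groupby(["s", "tokenID"]).apply(aggregate)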

# Output
                             token  pred
s    tokenID                            
17   3            hakawati theatre   Loc
     7                   jerusalem   Loc
56   5         university of texas   Org
     6                        here   Org
5402 1        dwight d. eisenhower  Peop

Another solution uses a helper column:

df['pred_cat'] = df['pred'].str.split('-').str[-1]

res = df.groupby(['s', 'pred_cat'])['token']\
        .apply(' '.join).reset_index()

print(res)

      s pred_cat                       token
0    17      Loc  hakawati theatre jerusalem
1    56      Org    university of texas here
2  5402     Peop        dwight d. eisenhower

Note that this does not exactly match your desired output; getting there seems to involve some data-specific processing.
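The desired output in the question shows full label names (Location, Organisation, People) rather than the tag suffixes. A minimal sketch of that renaming step, assuming a hand-written lookup dictionary (the name label_names is hypothetical, and the small res frame is rebuilt here only so the example runs on its own):

```python
import pandas as pd

# Result shaped like the helper-column output shown above.
res = pd.DataFrame({
    "s": [17, 56, 5402],
    "pred_cat": ["Loc", "Org", "Peop"],
    "token": [
        "hakawati theatre jerusalem",
        "university of texas here",
        "dwight d. eisenhower",
    ],
})

# Hypothetical lookup from tag suffix to the full name used in the question.
label_names = {"Loc": "Location", "Org": "Organisation", "Peop": "People"}

# Series.map replaces each suffix by its full name; a suffix missing
# from the dictionary would become NaN.
res["pred"] = res["pred_cat"].map(label_names)
print(res[["s", "token", "pred"]])
```

Any suffix not listed in the dictionary ends up as NaN, so the mapping would need an entry for every category the classifier can emit.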

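Why does your desired result split "jerusalem" onto a separate row, while "here" is part of "university of texas here"?

If the classifier were perfect, I would do it that way. However, the results are sometimes inaccurate, as in this case, where "university of texas" is tagged as a separate entity. So I cannot group by tokenID.

@ThelMi, sure, updated. But with the other approach you get "hakawati theatre jerusalem", which you probably don't want either.

Yes, I tried both versions before asking, but neither gives the desired output. Thanks for the answer, though. I think I will have to filter out the misclassified tags and try a different approach.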
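One possible alternative, not taken from the answers above, is to use the positional prefix of each tag instead of tokenID. This is only a sketch and it assumes the tags follow the B/I/L/U scheme visible in the sample, where B- and U- mark the first token of an entity. The column names prefix, pred_cat and entity are introduced here for illustration.

```python
import pandas as pd

# Sample rows copied from the question (tokenID is not needed here).
df = pd.DataFrame({
    "s": [17, 17, 17, 56, 56, 56, 56, 5402, 5402, 5402],
    "token": ["hakawati", "theatre", "jerusalem", "university", "of",
              "texas", "here", "dwight", "d.", "eisenhower"],
    "pred": ["B-Loc", "L-Loc", "U-Loc", "B-Org", "I-Org",
             "I-Org", "L-Org", "B-Peop", "I-Peop", "L-Peop"],
})

# Split "B-Loc" into the positional prefix ("B") and the category ("Loc").
parts = df["pred"].str.split("-", n=1, expand=True)
df["prefix"] = parts[0]
df["pred_cat"] = parts[1]

# Assumption: every entity starts with a B- (begin) or U- (unit) tag.
# The running count of those starts numbers the entities.
df["entity"] = df["prefix"].isin(["B", "U"]).cumsum()

# Join the tokens of each entity, keeping the original row order.
entities = (
    df.groupby(["s", "entity", "pred_cat"], sort=False)["token"]
      .agg(" ".join)
      .reset_index()
      .drop(columns="entity")
)
print(entities)
```

On the sample rows this yields one row each for "hakawati theatre", "jerusalem", "university of texas here" and "dwight d. eisenhower". It still depends on the classifier emitting a start tag for every entity: a sequence that lacks its B- or U- tag would be merged into the preceding entity when the sentence and category are the same, so misclassified tags would need separate handling.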