Python: comparing multiple rows of two DataFrames

python pandas

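I have to match all rows of one sentence, held in DataFrame 1, against DataFrame 2 (which contains the tokens of all sentences) and return the matching rows from DataFrame 2.

I tried a groupby operation, but it returns a match for every individually matching row. I want all tokens in df1 to match as a whole, with their order preserved.

The following df contains the tokens of a single sentence only: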

import pandas as pd

pdt1 = pd.DataFrame({'Word': ['Obesity', 'in', 'Low-', 'and', 'Middle-Income', 'Countries'],
                     'tag': ['O', 'O', 'O', 'O', 'O', 'O']})

print(pdt1)

            Word tag
0        Obesity   O
1             in   O
2           Low-   O
3            and   O
4  Middle-Income   O
5      Countries   O

The other DataFrame contains the tokens of all sentences:

pdt2 = pd.DataFrame([[1, 1, 1, 'Obesity', 'O'],
       [2, 1, 1, 'in', 'O'],
       [3, 1, 1, 'Low-', 'O'],
       [4, 1, 1, 'and', 'O'],
       [5, 1, 1, 'Middle-Income', 'O'],
       [6, 1, 1, 'Countries', 'O'],
       [7, 1, 2, 'We', 'O'],
       [8, 1, 2, 'have', 'O'],
       [9, 1, 2, 'reviewed', 'O'],
       [10, 1, 2, 'the', 'O'],
       [11, 1, 2, 'distinctive', 'O'],
       [12, 1, 2, 'features', 'O'],
       [13, 1, 2, 'of', 'O'],
       [14, 1, 2, 'excess', 'O'],
       [15, 1, 2, 'weight', 'O'],
       [16, 1, 2, ',', 'O'],
       [17, 1, 2, 'its', 'O'],
       [18, 1, 2, 'causes', 'O'],
       [19, 1, 2, ',', 'O'],
       [20, 1, 2, 'and', 'O'],
       [21, 1, 2, 'related', 'O'],
       [22, 1, 2, 'prevention', 'O'],
       [23, 1, 2, 'and', 'O'],
       [24, 1, 2, 'management', 'O'],
       [25, 1, 2, 'efforts', 'O']])

pdt2.columns = ['id','Doc_ID','Sent_ID','Word','tag']
print(pdt2)


    id  Doc_ID  Sent_ID           Word tag
0    1       1        1        Obesity   O
1    2       1        1             in   O
2    3       1        1           Low-   O
3    4       1        1            and   O
4    5       1        1  Middle-Income   O
5    6       1        1      Countries   O
6    7       1        2             We   O
7    8       1        2           have   O
8    9       1        2       reviewed   O
9   10       1        2            the   O
10  11       1        2    distinctive   O
11  12       1        2       features   O
12  13       1        2             of   O
13  14       1        2         excess   O
14  15       1        2         weight   O
15  16       1        2              ,   O
16  17       1        2            its   O
17  18       1        2         causes   O
18  19       1        2              ,   O
19  20       1        2            and   O
20  21       1        2        related   O
21  22       1        2     prevention   O
22  23       1        2            and   O
23  24       1        2     management   O
24  25       1        2        efforts   O

The expected output looks like this:

    id  Doc_ID  Sent_ID           Word tag
0    1       1        1        Obesity   O
1    2       1        1             in   O
2    3       1        1           Low-   O
3    4       1        1            and   O
4    5       1        1  Middle-Income   O
5    6       1        1      Countries   O

Do you mean:

print(pdt2[pdt2['Sent_ID'] == 1])

Output:

    id  Doc_ID  Sent_ID           Word tag
0    1       1        1        Obesity   O
1    2       1        1             in   O
2    3       1        1           Low-   O
3    4       1        1            and   O
4    5       1        1  Middle-Income   O
5    6       1        1      Countries   O

Edit:

print(pdt1.merge(pdt2[pdt2['Sent_ID'] == 1], on=['Word', 'tag']))

Output:

            Word tag  id  Doc_ID  Sent_ID
0        Obesity   O   1       1        1
1             in   O   2       1        1
2           Low-   O   3       1        1
3            and   O   4       1        1
4  Middle-Income   O   5       1        1
5      Countries   O   6       1        1
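
To see why the Sent_ID filter matters in this merge, here is a minimal sketch on a shortened, made-up token table. The names `query` and `tokens` are hypothetical stand-ins for pdt1 and pdt2, not the original data. Without the filter, a word such as "and" that also occurs in another sentence produces extra rows.

```python
import pandas as pd

# Hypothetical, shortened stand-ins for pdt1 and pdt2 (not the original data).
query = pd.DataFrame({"Word": ["Low-", "and", "Middle-Income"],
                      "tag": ["O", "O", "O"]})
tokens = pd.DataFrame(
    [[1, 1, 1, "Low-", "O"],
     [2, 1, 1, "and", "O"],
     [3, 1, 1, "Middle-Income", "O"],
     [4, 1, 2, "causes", "O"],
     [5, 1, 2, "and", "O"],
     [6, 1, 2, "related", "O"]],
    columns=["id", "Doc_ID", "Sent_ID", "Word", "tag"],
)

# Merging against the whole table matches "and" in both sentences.
unfiltered = query.merge(tokens, on=["Word", "tag"])

# Restricting to one sentence first gives one row per query token.
filtered = query.merge(tokens[tokens["Sent_ID"] == 1], on=["Word", "tag"])

print(len(unfiltered), len(filtered))
```

The filtered merge only helps when the Sent_ID of the sentence is already known.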

This should work:

pdt2[pdt2[['Word', 'tag']].isin(pdt1).all(axis=1)]

    id  Doc_ID  Sent_ID           Word tag
0    1       1        1        Obesity   O
1    2       1        1             in   O
2    3       1        1           Low-   O
3    4       1        1            and   O
4    5       1        1  Middle-Income   O
5    6       1        1      Countries   O
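
One caveat worth checking before relying on this expression: when `DataFrame.isin` is given another DataFrame, pandas compares values label by label, so both the index and the column labels have to match. The example works because the sentence sits at rows 0 to 5 in both frames. The sketch below uses small, made-up frames (`all_tokens` and `query` are hypothetical names) in which the sentence starts at a different row, and the mask then selects nothing.

```python
import pandas as pd

# Hypothetical data: the queried sentence sits at rows 2-3 of the full table.
all_tokens = pd.DataFrame({
    "Sent_ID": [1, 1, 2, 2],
    "Word": ["We", "have", "Obesity", "in"],
    "tag": ["O", "O", "O", "O"],
})
query = pd.DataFrame({"Word": ["Obesity", "in"], "tag": ["O", "O"]})

# query has index 0-1, so it is compared against rows 0-1 only.
mask = all_tokens[["Word", "tag"]].isin(query).all(axis=1)

# Relabelling query so its index lines up with rows 2-3 makes the match succeed.
aligned = query.copy()
aligned.index = [2, 3]
mask_aligned = all_tokens[["Word", "tag"]].isin(aligned).all(axis=1)

print(mask.tolist())
print(mask_aligned.tolist())
```

So the one-liner is only reliable when the row labels of the two frames already line up.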

The Sent_ID values are in df2, so this solution cannot work. I only have the Word and tag attributes in DataFrame 1.

@joel Why did you downvote before Sent_ID was explicitly mentioned in the edited response?

There are about 200,000 sentences in df2, and there is no prior information about the Sent_ID, so the answer is not useful.
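
For the case raised in these comments, where the Sent_ID is not known in advance, one possible approach is to compare each sentence's ordered token sequence with the query sentence. This is a sketch under assumptions and is not taken from the answers above: the helper `find_sentence` and the sample frames `tokens` and `query` are hypothetical, and sentences are assumed to be identified by the Doc_ID and Sent_ID columns.

```python
import pandas as pd

def find_sentence(tokens, query, keys=("Doc_ID", "Sent_ID"), cols=("Word", "tag")):
    """Return the rows of every sentence in `tokens` whose (Word, tag)
    sequence equals that of `query`, in the same order."""
    cols = list(cols)
    target = list(query[cols].itertuples(index=False, name=None))
    matches = [
        group
        for _, group in tokens.groupby(list(keys), sort=False)
        if list(group[cols].itertuples(index=False, name=None)) == target
    ]
    # Empty frame with the same columns when nothing matches.
    return pd.concat(matches) if matches else tokens.iloc[0:0]

# Hypothetical data: sentence 3 has the same tokens as the query, but reversed.
tokens = pd.DataFrame(
    [[1, 1, 1, "We", "O"],
     [2, 1, 1, "have", "O"],
     [3, 1, 2, "Obesity", "O"],
     [4, 1, 2, "in", "O"],
     [5, 1, 3, "in", "O"],
     [6, 1, 3, "Obesity", "O"]],
    columns=["id", "Doc_ID", "Sent_ID", "Word", "tag"],
)
query = pd.DataFrame({"Word": ["Obesity", "in"], "tag": ["O", "O"]})

result = find_sentence(tokens, query)
print(result)
```

Because whole sequences are compared, token order is respected and a repeated word cannot match on its own. The loop over groups runs in Python, so on a table with around 200,000 sentences it may be slow, and its speed would need to be measured.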