Python: merging the first words in a list of word pairs, based on the second word of each pair

I have a program (NLTK-NER) that gives me the following list:

[
    ('Barak', 'PERSON'),
    ('Obama', 'PERSON'),
    ('is', 'O'),
    ('the', 'O'),
    ('president', 'O'),
    ('of', 'O'),
    ('United', 'LOCATION'),
    ('States', 'LOCATION'),
    ('of', 'LOCATION'),
    ('America', 'LOCATION')
]

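As you can see, 'Barak' and 'Obama' are both tagged 'PERSON'. I want to merge them (and likewise the words tagged 'LOCATION') into single entries, like this: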

['Barak Obama', 'is', 'the', 'president', 'of', 'United States of America']

How can I solve this?

This is the first thing that came to mind. I'm pretty sure it can be optimized, but it's a good start:

    classified_text = [('Barak', 'PERSON'), ('Obama', 'PERSON'), ('is', 'O'), ('the', 'O'), ('president', 'O'), ('of', 'O'), ('United', 'LOCATION'), ('States', 'LOCATION'), ('of', 'LOCATION'), ('America', 'LOCATION')]

    # Reverse the list so it pops the first element
    classified_text.reverse()
    # Create an aux list to store the result and add the first item
    new_text = [classified_text.pop(), ]
    # Iterate over the text
    while classified_text:
        old_word = new_text[-1]
        new_word = classified_text.pop()

        # If previous word has same type, merge. 
        # Avoid merging 'O' types
        if old_word[1] == new_word[1] and new_word[1] != 'O':
            new_text[-1] = (
                ' '.join((old_word[0], new_word[0])),
                new_word[1],
            )

        # If not just add the tuple
        else:
            new_text.append(new_word)

    # Remove the types from the list and you have your result
    new_text = [x[0] for x in new_text]
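
The same idea can be sketched as a function that walks the list once and leaves the input untouched (the loop above empties classified_text as it goes). The names merge_adjacent and ignore_tag are made up for this illustration and do not come from NLTK:

```python
def merge_adjacent(tagged_text, ignore_tag='O'):
    """Merge neighbouring words that share a tag other than ignore_tag.

    Hypothetical helper for illustration; it does not modify tagged_text.
    """
    merged = []
    for word, tag in tagged_text:
        # Merge into the previous entry only when both tags match
        # and the tag is not the one we want to leave alone.
        if merged and merged[-1][1] == tag and tag != ignore_tag:
            merged[-1] = (merged[-1][0] + ' ' + word, tag)
        else:
            merged.append((word, tag))
    # Drop the tags and keep only the (possibly merged) words.
    return [word for word, _ in merged]


classified_text = [
    ('Barak', 'PERSON'), ('Obama', 'PERSON'), ('is', 'O'), ('the', 'O'),
    ('president', 'O'), ('of', 'O'), ('United', 'LOCATION'),
    ('States', 'LOCATION'), ('of', 'LOCATION'), ('America', 'LOCATION'),
]

print(merge_adjacent(classified_text))
# ['Barak Obama', 'is', 'the', 'president', 'of', 'United States of America']
```

Note that this version merges any repeated tag except 'O', not just 'PERSON' and 'LOCATION', which matches the behaviour of the loop above.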

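What we want to do here is, essentially, group some of the items of classified_text together, so it stands to reason that itertools.groupby() can help. First of all, we need a key function that treats items tagged 'PERSON' or 'LOCATION' as similar, and all other items as distinct.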

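This is slightly complicated by the fact that we need a way to tell apart adjacent items that share the same tag (other than 'PERSON' or 'LOCATION'), e.g. ('is', 'O'), ('the', 'O') and so on. We can use enumerate() for that: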

>>> list(enumerate(classified_text))
[..., (2, ('is', 'O')), (3, ('the', 'O')), (4, ('president', 'O')), ...]
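
To see why the tag alone is not enough as a key, here is a small sketch that groups on the tag only. Since groupby() puts consecutive items with equal keys into one group, the run of 'O' words collapses into a single string:

```python
from itertools import groupby

classified_text = [
    ('Barak', 'PERSON'), ('Obama', 'PERSON'), ('is', 'O'), ('the', 'O'),
    ('president', 'O'), ('of', 'O'), ('United', 'LOCATION'),
    ('States', 'LOCATION'), ('of', 'LOCATION'), ('America', 'LOCATION'),
]

# Naive key: the tag itself. Consecutive 'O' items end up in the same group.
naive = [
    ' '.join(word for word, tag in group)
    for key, group in groupby(classified_text, key=lambda item: item[1])
]

print(naive)
# ['Barak Obama', 'is the president of', 'United States of America']
```

Pairing each item with its index gives every 'O' item something unique to use as its key instead.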

Now that we know what we're going to supply as input to groupby(), we can write our key function:

def person_or_location(item):
    index, (word, tag) = item
    if tag in {'PERSON', 'LOCATION'}:
        return tag
    else:
        return index

Note that the structure of index, (word, tag) in the assignment matches the structure of each item in the enumerated list.
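
As a quick check, this sketch prints the key that person_or_location() produces for every enumerated item (the function is repeated here only so the snippet runs on its own). Tagged entities share their tag as the key, while every other item gets its own index:

```python
def person_or_location(item):
    index, (word, tag) = item
    if tag in {'PERSON', 'LOCATION'}:
        return tag
    else:
        return index


classified_text = [
    ('Barak', 'PERSON'), ('Obama', 'PERSON'), ('is', 'O'), ('the', 'O'),
    ('president', 'O'), ('of', 'O'), ('United', 'LOCATION'),
    ('States', 'LOCATION'), ('of', 'LOCATION'), ('America', 'LOCATION'),
]

# One key per item; equal neighbouring keys will fall into the same group.
keys = [person_or_location(item) for item in enumerate(classified_text)]

print(keys)
# ['PERSON', 'PERSON', 2, 3, 4, 5, 'LOCATION', 'LOCATION', 'LOCATION', 'LOCATION']
```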

Once we have that, we can write another function to do the actual merging:

from itertools import groupby

def merge(tagged_text):
    enumerated_text = enumerate(tagged_text)
    grouped_text = groupby(enumerated_text, person_or_location)
    return [
        ' '.join(word for index, (word, tag) in group)
        for key, group in grouped_text
    ]

Here it is in action:

>>> merge(classified_text)
['Barak Obama', 'is', 'the', 'president', 'of', 'United States of America']
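
If the tag is still needed after merging, one possible variation is to return (phrase, tag) pairs instead of bare strings. This is only a sketch: merge_with_tags and its entity_tags parameter are names invented for this example.

```python
from itertools import groupby


def merge_with_tags(tagged_text, entity_tags=('PERSON', 'LOCATION')):
    """Hypothetical variant of merge() that keeps the tag of each group."""

    def key(item):
        index, (word, tag) = item
        # Entities share their tag as the key; everything else stays unique.
        return tag if tag in entity_tags else index

    result = []
    for _, group in groupby(enumerate(tagged_text), key):
        pairs = [pair for _, pair in group]
        phrase = ' '.join(word for word, _ in pairs)
        # All pairs in a group carry the same tag, so take the first one.
        result.append((phrase, pairs[0][1]))
    return result


classified_text = [
    ('Barak', 'PERSON'), ('Obama', 'PERSON'), ('is', 'O'), ('the', 'O'),
    ('president', 'O'), ('of', 'O'), ('United', 'LOCATION'),
    ('States', 'LOCATION'), ('of', 'LOCATION'), ('America', 'LOCATION'),
]

print(merge_with_tags(classified_text))
# [('Barak Obama', 'PERSON'), ('is', 'O'), ('the', 'O'), ('president', 'O'),
#  ('of', 'O'), ('United States of America', 'LOCATION')]
```

One thing to keep in mind with any of these approaches: two different entities of the same type that happen to be directly adjacent would be merged into one phrase, because only the tag is compared.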