How to merge multi-word NER tags? (Python 3.x, NER, Natural Language Processing, AllenNLP)

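I am currently using AllenNLP for NER tagging.

Code:

Is there any parser that can merge the output below so that it returns "Top Gun" tagged as "WORK_OF_ART"?

You can change the model path and try it with your own model: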

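from allennlp.predictors.predictor import Predictor

# Load a pretrained NER model; change the path to use your own model.
predictor = Predictor.from_path("https://s3-us-west-2.amazonaws.com/allennlp/models/ner-model-2018.12.18.tar.gz")
sentence = "Did Uriah honestly think he could beat The Legend of Zelda in under three hours?"
result = predictor.predict(sentence)

lang = {}          # maps a token or merged entity text to its tag
completeWord = ""  # accumulates the tokens of a multi-word entity

for word, tag in zip(result["words"], result["tags"]):
    if tag.startswith("B"):
        # First token of a multi-word entity
        completeWord = completeWord + " " + word
    elif tag.startswith("I"):
        # Inner token: keep accumulating
        completeWord = completeWord + " " + word
    elif tag.startswith("L"):
        # Last token: store the merged text under the bare entity type
        completeWord = completeWord + " " + word
        lang[completeWord] = tag.split("-")[1]
        completeWord = ""
    else:
        # "O" and single-token "U-" tags are stored unchanged
        lang[word] = tag

print(lang)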

>>>{' The Legend of Zelda': 'MISC',
 '?': 'O',
 'Did': 'O',
 'Uriah': 'U-PER',
 'beat': 'O',
 'could': 'O',
 'he': 'O',
 'honestly': 'O',
 'hours': 'O',
 'in': 'O',
 'think': 'O',
 'three': 'O',
 'under': 'O'}
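
The loop above stores everything in a dict keyed by text, so a repeated entity is overwritten and single-token `U-` entities keep their raw tag. A minimal sketch of the same idea as a reusable function is shown here. `merge_bilou` is a hypothetical helper name, and the `words`/`tags` lists are hand-written to mimic the output above rather than produced by a model.

```python
def merge_bilou(words, tags):
    """Collapse BILOU-tagged tokens into (text, label) pairs, in sentence order."""
    merged = []
    current = []  # tokens of the entity currently being built
    for word, tag in zip(words, tags):
        if tag == "O":
            current = []  # outside any entity; drop an unfinished span
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "U":
            merged.append((word, label))
            current = []
        elif prefix == "B":
            current = [word]
        elif prefix == "I":
            if current:
                current.append(word)
        elif prefix == "L":
            if current:
                current.append(word)
                merged.append((" ".join(current), label))
            current = []
    return merged


# Hand-written tags mimicking the predictor output above (an assumption).
words = ["Did", "Uriah", "honestly", "think", "he", "could", "beat",
         "The", "Legend", "of", "Zelda", "in", "under", "three", "hours", "?"]
tags = ["O", "U-PER", "O", "O", "O", "O", "O",
        "B-MISC", "I-MISC", "I-MISC", "L-MISC", "O", "O", "O", "O", "O"]
entities = merge_bilou(words, tags)
print(entities)
```

Returning a list instead of a dict keeps duplicates and sentence order, and joining the tokens avoids the leading space seen in `' The Legend of Zelda'`.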

If this helps, please mark it as accepted.

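This repository contains the download paths for all AllenNLP models; you can download whatever you need at any time.

Download the pretrained AllenNLP NER model from the path below.

Install AllenNLP and AllenNLP models

pip install allennlp

pip install allennlp-models

Import the required AllenNLP modules

import allennlp

from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/bert-base-srl-2020.09.03.tar.gz")

The predict call invokes AllenNLP's Predictor.predict, which takes a passage of text, identifies the named entities in that unstructured text, and classifies them into predefined categories such as person names, locations, and landmarks. The result is a dictionary whose keys include words, tags, mask, and logits.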
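
To make the shape of that result concrete, here is a small sketch that uses a hand-written dictionary in place of a real prediction. The `result` values are assumptions chosen for illustration, and only the `words` and `tags` keys named above are used.

```python
# Hand-written stand-in for a predictor.predict(...) result (an assumption,
# not real model output); the "mask" and "logits" keys are omitted.
result = {
    "words": ["Top", "Gun", "is", "a", "film", "by", "Tony", "Scott"],
    "tags": ["B-WORK_OF_ART", "L-WORK_OF_ART", "O", "O", "O", "O",
             "B-PERSON", "L-PERSON"],
}

# The two lists are aligned token by token, so zip pairs each word with its tag.
tagged = [(word, tag)
          for word, tag in zip(result["words"], result["tags"])
          if tag != "O"]
for word, tag in tagged:
    print(f"{word}\t{tag}")
```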

BILOU method/scheme (I expect AllenNLP uses the BILOU scheme)

Input

Import the required packages

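    import allennlp
    from allennlp.predictors.predictor import Predictor

    # Note: the archive name below suggests an SRL model; to get the BILOU NER
    # tags shown in the output, point this at an NER model archive instead.
    predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/bert-base-srl-2020.09.03.tar.gz")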

    document = """The U.S. is a country of 50 states covering a vast swath of North America, with Alaska in the northwest and Hawaii extending the nation’s presence into the Pacific Ocean. Major Atlantic Coast cities are New York, a global finance and culture center, and capital Washington, DC. Midwestern metropolis Chicago is known for influential architecture and on the west coast, Los Angeles' Hollywood is famed for filmmaking"""


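    ####### Convert Entities ##########
    def convert_results(allen_results):
        """Collect (text, type) pairs from BILOU-tagged tokens."""
        ents = set()
        for word, tag in zip(allen_results["words"], allen_results["tags"]):
            if tag != "O":
                ent_position, ent_type = tag.split("-", 1)
                if ent_position == "U":
                    # Single-token entity
                    ents.add((word, ent_type))
                else:
                    if ent_position == "B":
                        w = word          # start a new multi-word entity
                    elif ent_position in ("I", "L"):
                        w += " " + word   # extend the current entity
                    # Note: this adds every prefix of a multi-word entity
                    # (e.g. "Los" and "Los Angeles"), not only the full span.
                    ents.add((w, ent_type))
        return ents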
    

    def allennlp_ner(document):
        return convert_results(predictor.predict(sentence=document))

    results = predictor.predict(sentence=document)
    
    [tuple(i) for i in zip(results["words"],results["tags"])]

    ##Output##
    [('The', 'O'),
    ('U.S.', 'U-LOC'),
    ('is', 'O'),
    ('a', 'O'),
    ('country', 'O'),
    ('of', 'O'),
    ('50', 'O'),
    ('states', 'O'),
    ('covering', 'O'),
    ('a', 'O'),
    ('vast', 'O'),
    ('swath', 'O'),
    ('of', 'O'),
    ('North', 'B-LOC'),
    ('America', 'L-LOC'),
    (',', 'O'),
    ('with', 'O'),
    ('Alaska', 'U-LOC'),
    ('in', 'O'),
    ('the', 'O'),
    ('northwest', 'O'),
    ('and', 'O'),
    ('Hawaii', 'U-LOC'),
    ('extending', 'O'),
    ('the', 'O'),
    ('nation', 'O'),
    ('’s', 'O'),
    ('presence', 'O'),
    ('into', 'O'),
    ('the', 'O'),
    ('Pacific', 'B-LOC'),
    ('Ocean', 'L-LOC'),
    ('.', 'O'),
    ('Major', 'B-LOC'),
    ('Atlantic', 'I-LOC'),
    ('Coast', 'L-LOC'),
    ('cities', 'O'),
    ('are', 'O'),
    ('New', 'B-LOC'),
    ('York', 'L-LOC'),
    (',', 'O'),
    ('a', 'O'),
    ('global', 'O'),
    ('finance', 'O'),
    ('and', 'O'),
    ('culture', 'O'),
    ('center', 'O'),
    (',', 'O'),
    ('and', 'O'),
    ('capital', 'O'),
    ('Washington', 'U-LOC'),
    (',', 'O'),
    ('DC', 'U-LOC'),
    ('.', 'O'),
    ('Midwestern', 'U-MISC'),
    ('metropolis', 'O'),
    ('Chicago', 'U-LOC'),
    ('is', 'O'),
    ('known', 'O'),
    ('for', 'O'),
    ('influential', 'O'),
    ('architecture', 'O'),
    ('and', 'O'),
    ('on', 'O'),
    ('the', 'O'),
    ('west', 'O'),
    ('coast', 'O'),
    (',', 'O'),
    ('Los', 'B-LOC'),
    ('Angeles', 'L-LOC'),
    ("'", 'O'),
    ('Hollywood', 'U-LOC'),
    ('is', 'O'),
    ('famed', 'O'),
    ('for', 'O'),
    ('filmmaking', 'O')]

    # Merging Multiword NER Tags using convert_results
    allennlp_ner(document)
    
    # The output looks like this:

    {('Alaska', 'LOC'),
    ('Chicago', 'LOC'),
    ('DC', 'LOC'),
    ('Hawaii', 'LOC'),
    ('Hollywood', 'LOC'),
    ('Los', 'LOC'),
    ('Los Angeles', 'LOC'),
    ('Major', 'LOC'),
    ('Major Atlantic', 'LOC'),
    ('Major Atlantic Coast', 'LOC'),
    ('Midwestern', 'MISC'),
    ('New', 'LOC'),
    ('New York', 'LOC'),
    ('North', 'LOC'),
    ('North America', 'LOC'),
    ('Pacific', 'LOC'),
    ('Pacific Ocean', 'LOC'),
    ('U.S.', 'LOC'),
    ('Washington', 'LOC')}
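
The set above also contains partial spans such as `('Los', 'LOC')` and `('Major Atlantic', 'LOC')`, because an entry is added for every `B-`, `I-`, and `L-` token. If only complete entities are wanted, one option is to first compute token spans and then slice the words. This is a minimal sketch under that assumption: `bilou_spans` is a hypothetical helper, the sample `words`/`tags` are hand-written, and label consistency within a span is not checked.

```python
def bilou_spans(tags):
    """Return (start, end, label) token spans; end is exclusive."""
    spans = []
    start = None  # index of the currently open B- token, if any
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "U":
            spans.append((i, i + 1, label))
            start = None
        elif prefix == "B":
            start = i
        elif prefix == "L" and start is not None:
            spans.append((start, i + 1, label))
            start = None
        # "I-" tokens need no action: the span stays open until "L-"
    return spans


# Hand-written sample in the style of the tagged output above (an assumption).
words = ["Major", "Atlantic", "Coast", "cities", "are", "New", "York", "and", "Chicago"]
tags = ["B-LOC", "I-LOC", "L-LOC", "O", "O", "B-LOC", "L-LOC", "O", "U-LOC"]

spans = bilou_spans(tags)
entities = {(" ".join(words[s:e]), label) for s, e, label in spans}
print(spans)
print(entities)
```

Keeping the token indices also makes it possible to tell apart two mentions of the same entity, which a set of strings cannot do.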

I have given the solution above; please check it and let me know. It merges multi-word NER tags using convert_results.

The BILOU scheme marks each token as follows:

| Tag   | Description                              |
| ------|------------------------------------------|
| BEGIN | The first token of a multi-token entity  |
| IN    | An inner token of a multi-token entity   |
| LAST  | The final token of a multi-token entity  |
| UNIT  | A single-token entity                    |
| OUT   | A non-entity token                       |
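
The table implies a few ordering rules: a `B-` token must eventually be closed by an `L-` token of the same type, and `I-`/`L-` tokens cannot appear without an open `B-`. A small sketch of a checker for those rules follows. `is_valid_bilou` is a hypothetical helper written for illustration, not part of AllenNLP.

```python
def is_valid_bilou(tags):
    """Check that every B- is closed by an L- carrying the same label."""
    open_label = None  # label of the entity currently being built
    for tag in tags:
        prefix, _, label = tag.partition("-")
        if prefix in ("O", "U", "B"):
            if open_label is not None:  # previous entity never reached L-
                return False
            if prefix == "B":
                open_label = label
        elif prefix in ("I", "L"):
            if open_label != label:     # no open entity, or label mismatch
                return False
            if prefix == "L":
                open_label = None
        else:
            return False                # unknown prefix
    return open_label is None


print(is_valid_bilou(["B-LOC", "I-LOC", "L-LOC", "O", "U-PER"]))
print(is_valid_bilou(["B-LOC", "O"]))
```

Running a check like this before merging can help explain why a merge step silently drops or truncates an entity.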
