Converting text to one word per line + named-entity tags in Python

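I'm building a named-entity recognizer, and I'm struggling to get my data into the right format with Python. What I have is a string, plus a list of the named entities in that text together with their tags. For example: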

text = "Hidden Figures is a 2016 American biographical drama film directed by Theodore Melfi and written by Melfi and Allison Schroeder."
This string could also be "[[Hidden Figures]] is a 2016 [[American]] biographical drama film directed by [[Theodore Melfi]] and written by [[Melfi]] and [[Allison Schroeder]]." if that makes things easier.

listOfNEsAndTags = ['Hidden Figures PRO', 'American LOC', 'Theodore Melfi PER', 'Melfi PER', 'Allison Schroeder PER']
The output I want is:

Hidden PRO
Figures PRO
is O
a O
2016 O
American LOC
biographical O
drama O
film O
directed O
by O
Theodore PER
Melfi PER
and O
written O
by O
Melfi PER
and O 
Allison PER
Schroeder PER 
. O
So far I only have the following function:

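import re

def wordPerLine(text, neplustags):
    # Put spaces around runs of punctuation so they become separate tokens.
    text = re.sub(r"([?!,.]+)", r" \1 ", text)
    wpl = text.split()
    output = []
    for line in wpl:
        # Every token gets the default tag O; neplustags is not used yet.
        output.append(line + " O")
    return output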

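It gives every line the default tag O (the tag for tokens that are not named entities). How can I make the named entities in the text get their correct tags?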

This might work. Replace the print calls with whatever you need; the regex needs refining, but it's a good start:

import re

text = "[[Hidden test Figures]] is, a 2016 [[American]] biographical drama film directed by [[Theodore Melfi]] and written by [[Melfi]] and [[Allison Schroeder]]."

tags = {"Hidden test Figures": "PRO", "American": "LOC", 'Theodore Melfi': "PER", 'Melfi': "PER", 'Allison Schroeder': "PER"}

# Insert a space before punctuation so it splits off as its own token.
text = re.sub(r"([?!,.]+)", r" \1", text)

search = ""      # words of the entity collected so far
inTag = False    # True while we are between "[[" and "]]"

for w in text.split(" "):
    outTag = False

    rest = w

    # Strip the markers and remember whether an entity opens or closes here.
    if rest[:2] == "[[":
        rest = rest[2:]
        inTag = True
    if rest[-2:] == "]]":
        rest = rest[:-2]
        outTag = True

    if inTag:
        search += rest
        if outTag:
            # Entity complete: look up its tag and emit one line per word.
            val = tags[search]
            for word in search.split():
                print(word + ": " + val)
            inTag = False
            search = ""
        else:
            search += " "
    else:
        print(rest + ": O")
Input:

[[Hidden test Figures]] is, a 2016 [[American]] biographical drama film directed by [[Theodore Melfi]] and written by [[Melfi]] and [[Allison Schroeder]].
Output:

Hidden: PRO
test: PRO
Figures: PRO
is: O
,: O
a: O
2016: O
American: LOC
biographical: O
drama: O
film: O
directed: O
by: O
Theodore: PER
Melfi: PER
and: O
written: O
by: O
Melfi: PER
and: O
Allison: PER
Schroeder: PER
.: O
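
The loop above depends on the [[...]] markup and prints its result. For the plain-text form of the input from the question, one possible approach is a greedy longest-match over tokens. The following is only a sketch under assumptions: `tag_words` is a hypothetical helper name, the entity-to-tag mapping is passed as a dictionary, and every punctuation character becomes its own token (unlike the regex above, which keeps a run of punctuation together).

```python
import re

TOKEN_RE = r"\w+|[^\w\s]"  # a word, or a single punctuation character


def tag_words(text, tags):
    """Return 'word TAG' strings; `tags` maps entity text to its tag."""
    tokens = re.findall(TOKEN_RE, text)
    # Tokenize each entity; try longer entities first so that
    # "Theodore Melfi" is matched before the shorter "Melfi".
    entities = sorted(
        ((re.findall(TOKEN_RE, name), tag) for name, tag in tags.items()),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    output = []
    i = 0
    while i < len(tokens):
        for ent_tokens, tag in entities:
            n = len(ent_tokens)
            if n and tokens[i:i + n] == ent_tokens:
                output.extend(tok + " " + tag for tok in ent_tokens)
                i += n
                break
        else:
            # No entity starts at this position: default tag O.
            output.append(tokens[i] + " O")
            i += 1
    return output


if __name__ == "__main__":
    sample = "Melfi and Allison Schroeder."
    print("\n".join(tag_words(sample, {"Melfi": "PER", "Allison Schroeder": "PER"})))
```

This tags every occurrence of an entity string, so a word that happens to match an entity name is tagged even where it is not meant as one. Hyphenated words are also split into several tokens by this tokenizer.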

Is "listOfNEsAndTags" how you get the text in the first place? It's not clear from the way your question is written.
I added some information about how I get the text.
I think you're missing the parentheses for the regex group.
Thanks, fixed it. I made a mistake when copy-pasting my code.
Did you format the list of NEs and tags yourself? It might be easier to use a dictionary (hash table) that maps each named entity to its type.
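
The dictionary suggested in the comments can be built from the question's list. A minimal sketch, assuming every entry has the form "entity text TAG" with the tag as the last space-separated word; `parse_tagged_entities` is a hypothetical helper name, not something from the question.

```python
def parse_tagged_entities(entries):
    """Turn ['Hidden Figures PRO', ...] into {'Hidden Figures': 'PRO', ...}."""
    tags = {}
    for entry in entries:
        # Split once from the right: everything before the last space is
        # the entity, the last word is its tag.
        name, tag = entry.rsplit(" ", 1)
        tags[name] = tag
    return tags


listOfNEsAndTags = ['Hidden Figures PRO', 'American LOC', 'Theodore Melfi PER',
                    'Melfi PER', 'Allison Schroeder PER']
tags = parse_tagged_entities(listOfNEsAndTags)
```

An entry without any space raises a ValueError here, which may be preferable to silently accepting a malformed list.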