Python: Is it possible to get confidence scores for spaCy named entity recognition?


I need confidence scores for the predictions made by spaCy NER.

CSV file

Text,Amount & Nature,Percent of Class
"T. Rowe Price Associates, Inc.","28,223,360 (1)",8.7% (1)
100 E. Pratt Street,Not Listed,Not Listed
"Baltimore, MD 21202",Not Listed,Not Listed
"BlackRock, Inc.","21,871,854 (2)",6.8% (2)
55 East 52nd Street,Not Listed,Not Listed
"New York, NY 10022",Not Listed,Not Listed
The Vanguard Group,"21,380,085 (3)",6.64% (3)
100 Vanguard Blvd.,Not Listed,Not Listed
"Malvern, PA 19355",Not Listed,Not Listed
FMR LLC,"20,784,414 (4)",6.459% (4)
245 Summer Street,Not Listed,Not Listed
"Boston, MA 02210",Not Listed,Not Listed

Code

import csv

import pandas as pd
import spacy

# Load the model once, not once per row.
nlp = spacy.load('en_core_web_sm')

data1 = []
with open('/path/table.csv', newline='') as csvfile:
    reader1 = csv.DictReader(csvfile)
    for row in reader1:
        AmountNature = row["Amount & Nature"]
        doc1 = nlp(row["Text"])
        # Keep the label of the last entity found in the row, or an empty
        # string when spaCy finds no entity at all.
        label1 = doc1.ents[-1].label_ if doc1.ents else ""
        data1.append([str(doc1), AmountNature, label1])

my_df1 = pd.DataFrame(data1, columns=["Text", "Amount & Nature", "Prediction"])
my_df1.to_csv('/path/output.csv', index=False)

Output CSV

Text,Amount & Nature,Prediction
"T. Rowe Price Associates, Inc.","28,223,360 (1)",ORG
100 E. Pratt Street,Not Listed,FAC
"Baltimore, MD 21202",Not Listed,CARDINAL
"BlackRock, Inc.","21,871,854 (2)",ORG
55 East 52nd Street,Not Listed,LOC
"New York, NY 10022",Not Listed,DATE
The Vanguard Group,"21,380,085 (3)",ORG
100 Vanguard Blvd.,Not Listed,FAC
"Malvern, PA 19355",Not Listed,DATE
FMR LLC,"20,784,414 (4)",ORG
245 Summer Street,Not Listed,CARDINAL
"Boston, MA 02210",Not Listed,GPE
In the output above, is it possible to get a confidence score for each spaCy NER prediction? If so, how can I do that?
Can anyone help me with this?

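Get a fully annotated dataset, or annotate one manually yourself (given that you have a CSV file, this is probably your preferred option). That way you can tell the ground truth apart from your predictions. Based on that, you can compute evaluation metrics. I would suggest using the F1 score as a measure of confidence.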
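
As a minimal sketch of that evaluation, assuming you have hand-annotated one gold label per row: the `gold` list below is hypothetical and only shows the expected format (it is not a real annotation of the CSV above), and `per_label_f1` is a helper name made up for this example. Note that these figures describe how reliable the model is for each label overall, not how confident it is in any single prediction.

```python
from collections import Counter


def per_label_f1(gold, predicted):
    """Return {label: (precision, recall, f1)} for two equal-length label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted label p was wrong
            fn[g] += 1  # the gold label g was missed
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Hypothetical gold labels, for illustration only.
gold = ["ORG", "FAC", "GPE", "ORG", "FAC", "GPE"]
# Predictions taken from the first six rows of the output CSV.
predicted = ["ORG", "FAC", "CARDINAL", "ORG", "LOC", "DATE"]

scores = per_label_f1(gold, predicted)
for label, (p, r, f1) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```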


Various publicly available datasets and annotation methods (including CRF) are discussed here.

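No, unfortunately it is not possible to get the model's confidence scores in spaCy. As described in this issue, you can get scores if you use get_beam_parses, although that approach seems to come with its own set of problems, as described in the same issue.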


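While the F1 score is useful for an overall evaluation, I would prefer spaCy to provide individual confidence scores for its predictions, which it does not currently do.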

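There is no straightforward explanation for this. First, spaCy implements two different objectives for named entity parsing:

  • Greedy imitation learning objective. This objective asks: "Which of the available actions, if performed from this state, would not introduce a new error?"

  • Global beam-search objective. Instead of optimizing individual transition decisions, the global model asks whether the final parse is correct. To optimize this objective, we build the set of the top-k most likely incorrect parses and the top-k most likely correct parses.

  • The full explanation and the inspiration for the code come from the linked discussion.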

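    Note: tested on spaCy v2.0.13.

    import spacy
    import sys
    from collections import defaultdict

    nlp = spacy.load('en')
    text = u'Hello! Hope you are doing well. Greetings from India.'

    # Run the pipeline without the NER component; the beam parse below
    # replaces it.
    with nlp.disable_pipes('ner'):
        doc = nlp(text)

    threshold = 0.2
    # Number of alternate analyses to consider. More is slower, and not
    # necessarily better. You need to experiment on your problem.
    beam_width = 16
    # This clips solutions at each step. We multiply the score of the
    # top-ranked action by this value, and use the result as a threshold.
    # This prevents the parser from exploring options that look very
    # unlikely, saving a bit of efficiency. Accuracy may also improve,
    # because we have trained on the greedy objective.
    beam_density = 0.0001
    beams, _ = nlp.entity.beam_parse([doc], beam_width, beam_density)

    # Sum the score of every candidate parse that contains a given entity.
    entity_scores = defaultdict(float)
    for beam in beams:
        for score, ents in nlp.entity.moves.get_beam_parses(beam):
            for start, end, label in ents:
                entity_scores[(start, end, label)] += score

    for key in entity_scores:
        start, end, label = key
        score = entity_scores[key]
        if score > threshold:
            print('Label: {}, Text: {}, Score: {}'.format(label, doc[start:end], score))

    Output: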

    Label: GPE, Text: India, Score: 0.9999509961251819
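
To make the aggregation step of the snippet above easier to follow, here is a minimal sketch that does not need spaCy. It assumes, as the snippet does, that each beam yields `(score, entities)` pairs, and additionally assumes that the scores behave like probabilities that sum to roughly 1 across the candidate parses. The `toy_parses` data and the `entity_confidences` helper are made up for illustration and are not real spaCy output or API. Under those assumptions, the confidence of an entity is the total score of the candidate parses that contain it.

```python
from collections import defaultdict


def entity_confidences(parses, threshold=0.0):
    """Sum parse scores per (start, end, label); keep totals above threshold."""
    totals = defaultdict(float)
    for score, ents in parses:
        for start, end, label in ents:
            totals[(start, end, label)] += score
    return {key: total for key, total in totals.items() if total > threshold}


# Made-up candidate parses: (score of the parse, entities it contains).
toy_parses = [
    (0.70, [(8, 9, "GPE")]),
    (0.25, [(8, 9, "GPE"), (0, 1, "PERSON")]),
    (0.05, []),
]

confidences = entity_confidences(toy_parses, threshold=0.2)
for (start, end, label), score in confidences.items():
    print(f"{label} at tokens [{start}:{end}]: {score:.2f}")
```

An entity that appears in almost every candidate parse ends up with a total near 1, while one that appears only in low-scoring alternatives stays small and can be filtered out by raising the threshold.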
    

Hi, has there been any progress on getting the confidence scores?