Python: Converting data from the old spaCy v2 format to the new spaCy v3 format with spaCy 3.0

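I have a variable trainData with the following simplified format:

[
('Paragraph A', {"entities": [(15, 26, 'DiseaseClass'), (443, 449, 'DiseaseClass'), (483, 496, 'DiseaseClass')]}),
('Paragraph B', {"entities": [(969, 975, 'DiseaseClass'), (1257, 1271, 'SpecificDisease')]}),
('Paragraph C', {"entities": [(0, 27, 'SpecificDisease')]})
]

I am trying to convert trainData to .spacy by first converting it into a doc and then into a DocBin. The full trainData file has been made available as well.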

I tried to replicate what is described in a tutorial, but without success.


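I tried the following:

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en") # load a new spacy model
db = DocBin() # create a DocBin object

for text, annot in trainData: # data in previous format
    doc = nlp.make_doc(text) # create doc object from text
    ents = []
    for start, end, label in annot["entities"]: # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        ents.append(span)
    doc.ents = span # label the text with the ents
    db.add(doc)

db.to_disk("./train.spacy") # save the docbin object
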
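However, I have misunderstood something in my code about how to convert the data from spaCy v2 to spaCy v3. With the snippet above I get this traceback:
TypeError: 'spacy.tokens.token.Token' object is not iterable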

You have a small mistake. Look for the line marked XXX to see what changed.

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en") # load a new spacy model
db = DocBin() # create a DocBin object

for text, annot in trainData: # data in previous format
    doc = nlp.make_doc(text) # create doc object from text
    ents = []
    for start, end, label in annot["entities"]: # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        ents.append(span)
    #XXX FOLLOWING LINE CHANGED
    doc.ents = ents # label the text with the ents
    db.add(doc)

db.to_disk("./train.spacy") # save the docbin object

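I found problems with the entities in the following abstract:

[Machado-Joseph disease, Machado-Joseph disease, MJD, MJD, MJD, Huntington disease, HD, HD, MJD, Machado-Joseph disease, MJD, MJD, MJD, MJD, MJD, Huntington disease, HD, HD, MJD]

The abstract is as follows: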

8528200|t|Evidence for inter-generational instability in the CAG repeat in the MJD1 gene and for conserved haplotypes at flanking markers amongst Japanese and Caucasian subjects with Machado-Joseph disease.
8528200|a|The size of the (CAG)n repeat array in the 3' end of the MJD1 gene and the haplotype at a series of microsatellite markers surrounding the MJD1 gene were examined in a large cohort of Japanese and Caucasian subjects affected with Machado-Joseph disease (MJD). Our data provide five novel observations. First, MJD is associated with expansion fo the array from the normal range of 14-37 repeats to 68-84 repeats in most Japanese and Caucasian subjects, but no subjects were observed with expansions intermediate in size between those of the normal and MJD affected groups. Second, the expanded allele associated with MJD displays inter-generational instability, particularly in male meioses, and this instability was associated with the clinical phenomenon of anticipation. Third, the size of the expanded allele is not only inversely correlated with the age-of-onset of MJD (r = -0.738, p < 0.001), but is also correlated with the frequency of other clinical features [e.g. pseudoexophthalmos and pyramidal signs were more frequent in subjects with large repeats (p < 0.001 and p < 0.05 respectively)]. Fourth, the disease phenotype is significantly more severe and had an early age of onset (16 years) in a subject homozygous for the expanded allele, which contrasts with Huntington disease and suggests that the expanded allele in the MJD1 gene could exert its effect either by a dominant negative effect (putatively excluded in HD) or by a gain of function effect as proposed for HD. Finally, Japanese and Caucasian subjects affected with MJD share haplotypes at several markers surrounding the MJD1 gene, which are uncommon in the normal Japanese and Caucasian population, and which suggests the existence either of common founders in these populations or of chromosomes susceptible to pathologic expansion of the CAG repeat in the MJD1 gene.
8528200 173 195 Machado-Joseph disease  SpecificDisease D017827
8528200 427 449 Machado-Joseph disease  SpecificDisease D017827
8528200 451 454 MJD SpecificDisease D017827
8528200 506 509 MJD SpecificDisease D017827
8528200 748 751 MJD Modifier    D017827
8528200 813 816 MJD SpecificDisease D017827
8528200 1067    1070    MJD SpecificDisease D017827
8528200 1470    1488    Huntington disease  SpecificDisease D006816
8528200 1628    1630    HD  SpecificDisease D006816
8528200 1680    1682    HD  SpecificDisease D006816
8528200 1739    1742    MJD SpecificDisease D017827
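
For checking offsets like the ones above before handing them to spaCy, a record in this layout can be turned into the (text, {"entities": [...]}) tuples used by trainData. The following is only a rough sketch: parse_pubtator is a hypothetical helper, it assumes the annotation columns are tab-separated in the raw file (they appear space-padded here), and it assumes offsets are counted over the title, one separator character, and then the abstract. That last assumption is consistent with the record above, where the first mention inside the abstract starts at 427. The sample input is made up for illustration.

```python
import re

# Title and abstract lines look like "<id>|t|<title>" and "<id>|a|<abstract>"
HEADER = re.compile(r"^(\d+)\|([ta])\|(.*)$")


def parse_pubtator(record):
    """Turn one PubTator-style record into (text, {"entities": [...]}).

    Also returns the annotations whose offsets do not reproduce the
    annotated mention, so they can be inspected by hand.
    """
    parts = {"t": "", "a": ""}
    rows = []
    for line in record.strip().splitlines():
        header = HEADER.match(line)
        if header:
            parts[header.group(2)] = header.group(3)
        elif line.strip():
            # Assumed columns: id, start, end, mention, type, concept id
            _, start, end, mention, label = line.split("\t")[:5]
            rows.append((int(start), int(end), label, mention))
    # Assumption: offsets count over title + one separator char + abstract
    text = parts["t"] + " " + parts["a"]
    entities, mismatches = [], []
    for start, end, label, mention in rows:
        if text[start:end] == mention:
            entities.append((start, end, label))
        else:
            mismatches.append((start, end, label, mention))
    return (text, {"entities": entities}), mismatches


# Made-up miniature record, not taken from the corpus
sample = (
    "1|t|Study of MJD.\n"
    "1|a|MJD is rare.\n"
    "1\t9\t12\tMJD\tSpecificDisease\tD017827\n"
    "1\t14\t17\tMJD\tSpecificDisease\tD017827\n"
)
example, mismatches = parse_pubtator(sample)
print(example)
print(mismatches)
```

Any entry that ends up in mismatches points at an offset problem in the source data rather than in the spaCy conversion itself.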

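The function converter() works, but only because I ignored the entities listed above. I still don't know how to handle cases like this so that spaCy does not treat them as duplicates, instead of just ignoring them.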
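
One possible way to look into duplicates like these before conversion is to check the raw (start, end, label) tuples for overlapping character ranges, since spaCy refuses entity spans that share a token. The sketch below is plain Python and does not call spaCy. find_conflicts and drop_overlaps are hypothetical helper names, the sample offsets are invented, and the "keep the longest span" rule is just one policy, not necessarily the right one for this corpus.

```python
def find_conflicts(entities):
    """Return pairs of (start, end, label) tuples whose ranges overlap."""
    ordered = sorted(entities, key=lambda e: (e[0], e[1]))
    conflicts = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b[0] >= a[1]:
                break  # sorted by start, so nothing later can overlap a
            conflicts.append((a, b))
    return conflicts


def drop_overlaps(entities):
    """Keep a non-overlapping subset, preferring longer spans."""
    kept = []
    # Longest first; start offset and label only break ties deterministically
    candidates = sorted(set(entities), key=lambda e: (-(e[1] - e[0]), e[0], e[2]))
    for ent in candidates:
        if all(ent[1] <= k[0] or ent[0] >= k[1] for k in kept):
            kept.append(ent)
    return sorted(kept)


# Invented offsets: one exact duplicate and one nested span
entities = [
    (0, 22, "SpecificDisease"),
    (0, 22, "SpecificDisease"),
    (15, 22, "DiseaseClass"),
    (30, 33, "SpecificDisease"),
]
conflicts = find_conflicts(entities)
cleaned = drop_overlaps(entities)
print(len(conflicts), cleaned)
```

Printing the conflicts first, rather than silently dropping them, makes it easier to decide whether an annotation is a true duplicate or an offset error.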

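Not sure if it's your only problem, but the doc.ents line needs fixing. That assumes your annotations are fine.

What kind of problems could there be? Thanks a lot, but I just tested the code and got this traceback:
ValueError: [E1010] Unable to set entity information for token 27 which is included in more than one span in entities, blocked, missing or outside.

Yes, that is a problem with your entity annotations. It is like saying that in "I li[ke che]ese" the part in brackets is a person. If you need help with that, you can ask a question with sample data. Ah, actually, it looks like you have two annotations on the same token, or something like that? Either way, it is still an annotation problem and cannot be fixed without seeing the annotations.

I have made the annotations available. I changed to alignment_mode="strict" and kept everything else the same as in your code. I get this traceback:
TypeError: object of type 'NoneType' has no len()
at doc.ents = ents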
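
Both tracebacks in this thread can plausibly be traced to what ends up in the ents list. doc.char_span returns None when the character offsets cannot be mapped to tokens under the chosen alignment_mode, and spaCy raises E1010 when entity spans share a token. A minimal sketch of a more defensive loop, assuming it is acceptable to skip unalignable spans and to let spacy.util.filter_spans keep the longest span among overlapping ones, could look like this. build_docbin and the sample sentence are made up for illustration.

```python
import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans


def build_docbin(train_data, nlp):
    """Convert v2-style tuples to a DocBin, collecting spans that were skipped."""
    db = DocBin()
    skipped = []
    for text, annot in train_data:
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in annot["entities"]:
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is None:
                # Offsets do not cover any whole token: record and skip
                skipped.append((text[start:end], start, end, label))
            else:
                spans.append(span)
        # filter_spans removes overlaps, preferring longer spans
        doc.ents = filter_spans(spans)
        db.add(doc)
    return db, skipped


nlp = spacy.blank("en")
# Invented sample: the second annotation cuts through two tokens ("li[ke che]ese")
sample = [
    (
        "I like cheese and Machado-Joseph disease research.",
        {"entities": [(18, 40, "SpecificDisease"), (4, 10, "Person")]},
    ),
]
db, skipped = build_docbin(sample, nlp)
docs = list(db.get_docs(nlp.vocab))
print([(ent.text, ent.label_) for ent in docs[0].ents])
print(skipped)
```

The resulting DocBin can then be saved with db.to_disk("./train.spacy") as in the answer. Reviewing the skipped list shows which annotations were lost, which matters if simply ignoring them is not acceptable.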