Python: Is there a way to generalize the orths argument in spaCy's retokenizer.split?


I am trying to fix Spanish words that were wrongly merged in a text file. I am using spaCy's retokenizer.split, but I would like to generalize the orths argument passed to retokenizer.split. I have the following code:

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")  # model assumed, matching the answer's example below

doc = nlp("the wordsare wronly merged and weneed split them")  # example
words = ["wordsare"] # Example: words to be split
matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp.make_doc(text) for text in words]
matcher.add("Terminology", None, *patterns)
matches = matcher(doc)
with doc.retokenize() as retokenizer:
    for match_id, start, end in matches:
        heads = [(doc[start],1), doc[start]]
        attrs = {"POS": ["PROPN", "PROPN"], "DEP": ["pobj", "compound"]}
        orths= [str(doc[start]),str(doc[end])]
    retokenizer.split(doc[start], orths=orths, heads=heads, attrs=attrs)
token_split=[token.text for token in doc]
print(token_split) 

But when I set the orths like this:

orths = [str(doc[start]), str(doc[end])]

instead of

["words", "are"]

I get this error:

ValueError: [E117] The newly split tokens must match the text of the original token. New orths: ... Old text: wordsare.


I would appreciate some help with this, because I want the code to fix not only the word wordsare but also weneed, and whatever other merged words the file may contain.
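From the error message it seems that retokenizer.split only accepts orths whose concatenation reproduces the original token's text exactly. A minimal sketch of the working case (assuming the en_core_web_sm model, as in the code above):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("the wordsare wronly merged")
token = doc[1]  # "wordsare"

with doc.retokenize() as retokenizer:
    # "words" + "are" == "wordsare", so this split is accepted;
    # any orths that do not join back to "wordsare" raise E117.
    retokenizer.split(token, ["words", "are"], heads=[(token, 1), token])

print([t.text for t in doc])  # ['the', 'words', 'are', 'wronly', 'merged']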

In your example, the things I would change are:

  • words = ["wordsare"]
    to
    words = ["wordsare", "weneed"]
    This is the list of misspelled (merged) words.

  • Add a rule for how to split each word in that first list, as a mapping:
    splits = {"wordsare": ["words", "are"], "weneed": ["we", "need"]}

  • orths = [str(doc[start]), str(doc[end])]
    to
    orths = splits[doc[start:end].text]
    This is the list of splits that replaces the found match. Your original
    [str(doc[start]), str(doc[end])]
    does not make much sense.

  • Move retokenizer.split into the loop.

  • Consider what values you actually want for attrs (see the note after the example).

  • Once you have all that in place, you have a working and generalized example:

    import spacy
    from spacy.matcher import PhraseMatcher
    nlp = spacy.load("en_core_web_sm")
    
    doc= nlp("the wordsare wronly merged and weneed split them") #example
    words = ["wordsare","weneed"] # Example: words to be split
    splits = {"wordsare":["words","are"], "weneed":["we","need"]}
    matcher = PhraseMatcher(nlp.vocab)
    patterns = [nlp.make_doc(text) for text in words]
    matcher.add("Terminology", None, *patterns)
    matches = matcher(doc)
    
    with doc.retokenize() as retokenizer:
        for match_id, start, end in matches:
            heads = [(doc[start],1), doc[start]]
            attrs = {"POS": ["PROPN", "PROPN"], "DEP": ["pobj", "compound"]}
            orths= splits[doc[start:end].text]           
            retokenizer.split(doc[start], orths=orths, heads=heads, attrs=attrs)
    token_split=[token.text for token in doc]
    print(token_split) 
    ['the', 'words', 'are', 'wronly', 'merged', 'and', 'we', 'need', 'split', 'them']
    
    Note that if you only care about tokenization, there is a simpler and possibly faster way to achieve this:

    [t for tok in doc for t in (splits[tok.text] if tok.text in words else [tok.text])]
    ['the', 'words', 'are', 'wronly', 'merged', 'and', 'we', 'need', 'split', 'them']
    
    Also note that in the first example the attrs are fixed, and therefore wrongly assigned in some cases. You could work around that with yet another dictionary, but a stricter and cleaner way to get a fully functional pipeline is to redefine the tokenizer and let spacy do the rest for you:

    from spacy.tokens import Doc
    nlp.make_doc = lambda txt: Doc(nlp.vocab, [i for l in [splits[tok.text] if tok.text in words else [tok.text] for tok in nlp.tokenizer(txt)] for i in l])
    doc2 = nlp("the wordsare wronly merged and weneed split them")
    for tok in doc2:
        print(f"{tok.text:<10}", f"{tok.pos_:<10}", f"{tok.dep_:<10}")
    the        DET        det       
    words      NOUN       nsubjpass 
    are        AUX        auxpass   
    wronly     ADV        advmod    
    merged     VERB       ROOT      
    and        CCONJ      cc        
    we         PRON       nsubj     
    need       VERB       aux       
    split      VERB       conj      
    them       PRON       dobj 
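    If the one-line lambda is hard to read, the same make_doc override can be written as a named function. This is just a readability sketch of the same idea, reusing the splits dictionary defined above (make_doc_with_splits is an illustrative name):

    import spacy
    from spacy.tokens import Doc

    nlp = spacy.load("en_core_web_sm")
    splits = {"wordsare": ["words", "are"], "weneed": ["we", "need"]}

    def make_doc_with_splits(text):
        # Run only the tokenizer, expand any merged word found in `splits`,
        # and build a fresh Doc from the resulting word list.
        pieces = []
        for tok in nlp.tokenizer(text):
            pieces.extend(splits.get(tok.text, [tok.text]))
        return Doc(nlp.vocab, words=pieces)

    nlp.make_doc = make_doc_with_splits
    doc2 = nlp("the wordsare wronly merged and weneed split them")
    print([(tok.text, tok.pos_, tok.dep_) for tok in doc2])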
    
Thank you so much Sergey, I will check it out, but I still have a doubt about the generalized splits = {"wordsare": ["words", "are"], "weneed": ["we", "need"]}.

Since it is manual, you need to provide the rules for the split; there is no way around it.

@IlianaVargas did it answer your question? Was it helpful? Please consider accepting it.
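One way to soften that last point: if a list of valid words is available, candidate split rules can be generated rather than typed by hand, by trying every split point and keeping the ones where both halves are known words. A minimal sketch, assuming a hypothetical known_words vocabulary set that you would have to supply:

def build_splits(merged_words, known_words):
    # For each merged word, try every split point and keep the first one
    # where both halves appear in the supplied vocabulary.
    splits = {}
    for word in merged_words:
        for i in range(1, len(word)):
            left, right = word[:i], word[i:]
            if left in known_words and right in known_words:
                splits[word] = [left, right]
                break
    return splits

known_words = {"words", "are", "we", "need"}  # assumed vocabulary, for illustration only
print(build_splits(["wordsare", "weneed"], known_words))
# {'wordsare': ['words', 'are'], 'weneed': ['we', 'need']}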