How do I logically split a sentence using spaCy?


I'm new to spaCy and I'm trying to split a sentence logically so that I can process each part separately. For example:

"If the country selected is 'US', then the zip code should be numeric"
This needs to be split into:

If the country selected is 'US',
then the zip code should be numeric
Another sentence that contains commas should not be split:

The allowed states are NY, NJ and CT

Any thoughts or ideas on how to do this in spaCy?

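I'm not sure this can be done reliably without training a model on custom data. However, spaCy does let you add your own rules for tokenization and sentence segmentation.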
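To make the idea of a hand-written rule concrete, here is a minimal sketch in plain Python that swaps in a regular expression for the spaCy pipeline. It assumes a clause boundary is a comma followed by a marker word such as "then". The names `split_clauses` and `CLAUSE_MARKERS` are hypothetical and are not part of spaCy, and the marker list is only a guess at what the requirement sentences look like.

```python
import re

# Hypothetical marker list (an assumption): words that open a new clause
# when they directly follow a comma.
CLAUSE_MARKERS = ("then",)


def split_clauses(sentence, markers=CLAUSE_MARKERS):
    """Split `sentence` after each comma that is followed by a marker word."""
    alternatives = "|".join(re.escape(m) for m in markers)
    # Match the whitespace that has a comma behind it and a marker word ahead.
    boundary = re.compile(
        r"(?<=,)\s+(?=(?:" + alternatives + r")\b)", re.IGNORECASE
    )
    return [part.strip() for part in boundary.split(sentence) if part.strip()]


if __name__ == "__main__":
    examples = [
        "If the country selected is 'US', then the zip code should be numeric",
        "The allowed states are NY, NJ and CT",
    ]
    for text in examples:
        # The first sentence yields two parts; the second stays whole
        # because none of its commas is followed by a marker word.
        print(split_clauses(text))
```

A rule like this only covers the comma-plus-marker pattern, so other sentence shapes would need their own markers or a token-based rule.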

The following code may be useful for this particular case; you can change the rules to suit your needs.

# Import spaCy and the Matcher used to merge matched patterns (spaCy 2.x API)
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en')

# Define the pattern: any alphabetic token wrapped in single quotes is merged into one token
matcher = Matcher(nlp.vocab)
pattern = [{'ORTH': "'"},
           {'IS_ALPHA': True},
           {'ORTH': "'"}]


# Add the pattern to the matcher
matcher.add('special_merger', None, pattern)


# Pipeline component that merges each matched span into a single token
def special_merger(doc):
    matched_spans = []
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        matched_spans.append(span)
    for span in matched_spans:
        span.merge()
    return doc

# Pipeline component that marks the tokens where a new sentence should start
def should_sentence_start(doc):
    for token in doc:
        if should_be_sentence_start(token):
            token.is_sent_start = True
    return doc

# Rule: if the previous token is "," and the token before that is "'US'",
# then the current token should start a new sentence.
def should_be_sentence_start(token):
    return (token.i >= 2
            and token.nbor(-1).text == ","
            and token.nbor(-2).text == "'US'")

# Add the merger and the sentence-start rule to the nlp pipeline
nlp.add_pipe(special_merger, first=True)
nlp.add_pipe(should_sentence_start, before='parser')

# Run the pipeline on the required text
sent_texts = "If the country selected is 'US', then the zip code should be numeric"
doc = nlp(sent_texts)
for sent in doc.sents:
    print(sent)
Output: