How to logically split a sentence using spaCy?
I am new to spaCy, and I am trying to split a sentence logically so that I can process each part separately. For example:
"If the country selected is 'US', then the zip code should be numeric"
This needs to be split into:
If the country selected is 'US',
then the zip code should be numeric
Another sentence that contains commas should not be split:
The allowed states are NY, NJ and CT
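Any thoughts or ideas on how to do this in spaCy?

I am not sure this can be done in general without training a model on custom data. However, spaCy allows you to add rules for tokenization and sentence segmentation. The following code may be useful for this particular case; you can change the rules as needed.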
# Note: this code uses the spaCy 2.x API (spacy.load('en'),
# matcher.add(name, None, pattern), span.merge(), nlp.add_pipe(function)).

# Import spaCy and the Matcher used to merge matched patterns
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')

# Define the pattern: any alphabetic text surrounded by '' is merged into a single token
matcher = Matcher(nlp.vocab)
pattern = [{'ORTH': "'"},
           {'IS_ALPHA': True},
           {'ORTH': "'"}]

# Add the pattern to the matcher
matcher.add('special_merger', None, pattern)

# Pipeline component that merges the matched spans
def special_merger(doc):
    matched_spans = []
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        matched_spans.append(span)
    for span in matched_spans:
        span.merge()
    return doc

# Pipeline component that marks the tokens which start a new sentence
def should_sentence_start(doc):
    for token in doc:
        if should_be_sentence_start(token):
            token.is_sent_start = True
    return doc

# Rule: if the previous token is "," and the token before that is "'US'",
# then the current token starts a new sentence.
def should_be_sentence_start(token):
    return (token.i >= 2
            and token.nbor(-1).text == ","
            and token.nbor(-2).text == "'US'")

# Add the merger and the sentence-start rule to the nlp pipeline
nlp.add_pipe(special_merger, first=True)
nlp.add_pipe(should_sentence_start, before='parser')

# Apply the pipeline to the required text
sent_texts = "If the country selected is 'US', then the zip code should be numeric"
doc = nlp(sent_texts)
for sent in doc.sents:
    print(sent)
Output:
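
On spaCy 3.x, `span.merge()` is gone and `nlp.add_pipe` expects the name of a registered component rather than a bare function, so the code above would need adapting. The following is a rough sketch of the same rule under those assumptions, not the answerer's original code. It uses a blank English pipeline (`spacy.blank("en")`) so no model download is needed; because there is no parser in that pipeline, the rule component sets the boundary flag on every token explicitly. The component names `quoted_merger` and `rule_sentencizer` and the helper `split_sentences` are made up for illustration.

```python
import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.util import filter_spans

# Assumption: a blank English pipeline (tokenizer only, no parser).
nlp = spacy.blank("en")

# Same pattern as above: alphabetic text wrapped in single quotes.
matcher = Matcher(nlp.vocab)
matcher.add("QUOTED", [[{"ORTH": "'"}, {"IS_ALPHA": True}, {"ORTH": "'"}]])


@Language.component("quoted_merger")  # hypothetical component name
def quoted_merger(doc):
    # Collect the matched spans, drop overlaps, then merge each into one token.
    spans = filter_spans([doc[start:end] for _, start, end in matcher(doc)])
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return doc


def starts_clause(token):
    # Same rule as above: previous token is "," and the one before it is "'US'".
    return (token.i >= 2
            and token.nbor(-1).text == ","
            and token.nbor(-2).text == "'US'")


@Language.component("rule_sentencizer")  # hypothetical component name
def rule_sentencizer(doc):
    # With no parser present, mark every token explicitly as a sentence
    # start (True) or not (False) so that doc.sents is defined.
    for token in doc:
        token.is_sent_start = bool(token.i == 0 or starts_clause(token))
    return doc


nlp.add_pipe("quoted_merger")
nlp.add_pipe("rule_sentencizer")


def split_sentences(text):
    # Hypothetical helper: return the text of each rule-based segment.
    return [sent.text for sent in nlp(text).sents]


for part in split_sentences(
        "If the country selected is 'US', then the zip code should be numeric"):
    print(part)
```

As in the original answer, the rule is hard-coded to the token `'US'`, so a sentence such as "The allowed states are NY, NJ and CT" is left in one piece. In a pipeline that does include a parser, the rule component would need to run before it and would typically set only the `True` cases, as the original code does.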