Python: using a regular expression as a tokenizer?


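I am trying to tokenize my corpus into sentences. I tried spaCy and NLTK, but they did not work well because my text is a bit tricky. Below is an artificial sample I made that covers all the edge cases I know of: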

It is relevant to point that Case No. 778 - Martin H. v. The Woods, it was mentioned that death
 to one cannot be generalised. However, the High Court while enhancing the same from life to 
death, in our view,has not assigned adequate and acceptable reasons. In our opinion, it is not a 
rarest of rare case where extreme penalty of death is called for instead sentence of 
imprisonment for life as ordered by the trial Court would be appropriate.15) In the light of the 
above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302. IPC, 
award of extreme penalty of death by the High Court is set aside and we restore the sentence of
 life imprisonment as directed by the trial Court.

How I would like the sentences to be tokenized:

1) It is relevant to point that Case No. 778 - Martin H. v. The Woods, it was mentioned that death to one cannot be generalised.
2) However, the High Court while enhancing the same from life to death, in our view,has not assigned adequate and acceptable reasons.
3) In our opinion, it is not a rarest of rare case where extreme penalty of death is called for instead sentence of imprisonment for life as ordered by the trial Court would be appropriate.
4)15. In the light of the above discussion, while
 maintaining the conviction of the appellant-accused for the offence under Section 302. IPC, 
award of extreme penalty of death by the High Court is set aside and we restore the sentence of
 life imprisonment as directed by the trial Court.

Here is the regular expression I am using:

sent = re.split(r'(?<!\w\.\w.)(?<![A-Z]\.)(?<![1-9]\.)(?<![1-9]\.)(?<![v]\.)(?<![vs]\.)(?<=\.|\?) ', j)
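
To see where this pattern goes wrong, it can help to run it on two short strings. The snippet below is only a diagnostic sketch: the two test strings are made up for illustration, and the pattern is the one from the question, written as a raw string.

```python
import re

# The pattern from the question, unchanged apart from the raw-string prefix.
PATTERN = (r'(?<!\w\.\w.)(?<![A-Z]\.)(?<![1-9]\.)(?<![1-9]\.)'
           r'(?<![v]\.)(?<![vs]\.)(?<=\.|\?) ')

# Made-up text: "No." is a lowercase-ending abbreviation that none of the
# negative lookbehinds cover, so the pattern splits right after it.
over_split = re.split(PATTERN, "Case No. 778 was heard. It was dismissed.")
print(over_split)
# ['Case No.', '778 was heard.', 'It was dismissed.']

# Made-up text: (?<![vs]\.) blocks a split after ANY word ending in "s" or "v",
# so a real sentence boundary after "reasons." is missed.
under_split = re.split(PATTERN, "He gave no reasons. We disagree.")
print(under_split)
# ['He gave no reasons. We disagree.']
```

Both failures come from the same place: each lookbehind inspects only one or two characters, so it cannot tell an abbreviation from an ordinary word with the same ending.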

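In general, you cannot rely on a single, infallible regex. You have to write a function that uses several regexes (both positive and negative), plus a dictionary of abbreviations and some basic language parsing that knows, for example, that "I", "USA", "FCC", and "TARP" are capitalized in English.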
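
As a much smaller illustration of that advice, here is a minimal sketch that combines an abbreviation list with two masking regexes and one splitting regex. The names `ABBREVIATIONS` and `split_with_abbreviations` and the contents of the list are assumptions made for this example; a real corpus would need a longer list.

```python
import re

# Hypothetical abbreviation list for illustration; extend it for your corpus.
ABBREVIATIONS = {"No", "Ex", "v", "Mr", "Dr"}

def split_with_abbreviations(text):
    """Split text into sentences without breaking on known abbreviations."""
    # Longest alternatives first, so a longer abbreviation wins over a prefix.
    alternatives = "|".join(
        re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True)
    )
    # Mask the period of every known abbreviation with a placeholder.
    masked = re.sub(r"\b(" + alternatives + r")\.", r"\1<prd>", text)
    # Mask the period after a single capital initial such as "H.".
    masked = re.sub(r"\b([A-Z])\.", r"\1<prd>", masked)
    # Split on whitespace that follows sentence-final punctuation.
    parts = re.split(r"(?<=[.?!])\s+", masked)
    # Restore the masked periods and drop empty fragments.
    return [p.replace("<prd>", ".").strip() for p in parts if p.strip()]

demo = ("Case No. 778 - Martin H. v. The Woods was cited. "
        "However, it was distinguished.")
for sentence in split_with_abbreviations(demo):
    print(sentence)
```

The placeholder keeps the splitting regex trivial: every exception is handled in a masking step, so a new case means one more masking rule, not a more complicated lookbehind.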

Following that guideline, the function below uses several regexes to parse your sentences.

Code

import re

def split_into_sentences(text):
    # Regex building blocks (raw strings, so backslashes reach the regex engine unchanged)
    alphabets = r"([A-Za-z])"
    prefixes = r"(Mr|St|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|Mt)[.]"
    suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
    starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
    acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
    # website regex from https://www.geeksforgeeks.org/python-check-url-string/
    websites = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
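    digits = r"([0-9])"
    # "Section 302. IPC": the period after a section number, when a word follows
    section = r"(Section \d+)([.])(?= \w)"
    # "No. 5", "No.-5", "Ex.-73": a two-character token, a period, then a number
    item_number = r"(^|\s\w{2})([.])(?=[-+ ]?\d+)"
    # "Ex. P": a token of one or two characters (optionally plus "s") and a period
    abbreviations = r"(^|[\s\(\[]\w{1,2}s?)([.])(?=[\s\)\]]|$)"
    # Text enclosed in (), [] or {}; periods inside are protected later
    parenthesized = r"\((.*?)\)"
    bracketed = r"\[(.*?)\]"
    curly_bracketed = r"\{(.*?)\}"
    enclosed = '|'.join([parenthesized, bracketed, curly_bracketed])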
    # text replacement
    # replace unwanted stop period with <prd>
    # actual stop periods with <stop>
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites, lambda m: m.group().replace('.', '<prd>'), text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    if "..." in text: text = text.replace("...","<prd><prd><prd>")
    text = re.sub(r"\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    text = re.sub(section,"\\1<prd>",text)
    text = re.sub(item_number,"\\1<prd>",text)
    text = re.sub(abbreviations, "\\1<prd>",text)
    text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text)
    text = re.sub(enclosed, lambda m: m.group().replace('.', '<prd>'), text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")

    # Tokenize sentence based upon <stop>
    sentences = text.split("<stop>")
    if sentences[-1].isspace():
        # remove last since only whitespace
        sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]

    return sentences

# Example driver: s holds the input text to split
for index, token in enumerate(split_into_sentences(s), start=1):
    print(f'{index}) {token}')

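
The `enclosed` step in the function above passes a callback to `re.sub` so that periods inside brackets are never treated as sentence ends. The snippet below isolates that one step as a sketch; the helper name `protect_enclosed` and the sample string are assumptions added for illustration.

```python
import re

# Same three alternatives as in the function: (...), [...] and {...}.
ENCLOSED = r"\((.*?)\)|\[(.*?)\]|\{(.*?)\}"

def protect_enclosed(text):
    """Replace every period inside brackets with the <prd> placeholder."""
    # The callback receives each bracketed match and rewrites only that span.
    return re.sub(ENCLOSED, lambda m: m.group().replace(".", "<prd>"), text)

print(protect_enclosed("He left (approx. 5 p.m.) today."))
# He left (approx<prd> 5 p<prd>m<prd>) today.
```

Because the match is non-greedy, each bracket pair is handled separately and text between two pairs is left untouched.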
Tests

1. Input

(the sample text from the question)

Output

1) It is relevant to point that Case No. 778 - Martin H. v. The Woods, it was mentioned that death  to one cannot be generalised.
2) However, the High Court while enhancing the same from life to  death, in our view,has not assigned adequate and acceptable reasons.
3) In our opinion, it is not a  rarest of rare case where extreme penalty of death is called for instead sentence of  imprisonment for life as ordered by the trial Court would be appropriate.
4) 15) In the light of the  above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302. IPC,  award of extreme penalty of death by the High Court is set aside and we restore the sentence of  life imprisonment as directed by the trial Court.

2. Input

s = '''Mr. or Mrs. or Dr. (not sure of their title) Smith will be here in the morning at eight.He's arriving on flight No. 48213 out of Denver.He'll take the No. 2 bus from the airport.However, he may grab a taxi instead.'''

Output

1) Mr. or Mrs. or Dr. (not sure of their title) Smith will be here in the morning at eight.
2) He's arriving on flight No. 48213 out of Denver.
3) He'll take the No. 2 bus from the airport.
4) However, he may grab a taxi instead.

3. Input

s = '''The respondent, in his statement Ex.-73, which is accepted and found to be truthful. The passcode is either No.5, No. 5, No.-5, No.+5.'''

Output

1) The respondent, in his statement Ex.-73, which is accepted and found to be truthful.
2) The passcode is either No.5, No. 5, No.-5, No.+5.

4. Input

s = '''He went to New York. He is 10 years old.'''

Output

1) He went to New York.
2) He is 10 years old.

5. Input

s = '''15) In the light of  Ex. P the above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302 IPC, award of extreme penalty of death by the High Court is set aside and we restore the sentence of life imprisonment as directed by the trial Court. The appeal is allowed in part to the extent mentioned above.'''

Output

1) 15) In the light of  Ex. P the above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302 IPC, award of extreme penalty of death by the High Court is set aside and we restore the sentence of life imprisonment as directed by the trial Court.
2) The appeal is allowed in part to the extent mentioned above.

Comments

This works really well except for one thing. How can I make it ignore a period that comes before a capitalized word? Any idea how to do that, @DarryIG?

@Shawn -- the updated answer shows a solution for your expected problem case. Is this what you expected?

Hi Darry, I'm afraid not. What I want covers cases where a capitalized abbreviation comes before the period, for example Ex. or No. The problem is that I cannot add them manually, because I keep running into new ones. Here is a troublesome sentence:
The respondent, in his statement Ex.-73, which is accepted and found to be truthful.
This is one sentence, but it gets split into two. How can I stop that? I am okay if this causes errors on sentences like:
He went to New York. He is 10 years old.

@Shawn -- glad I could help. I thought it was an interesting challenge.

You are looking for the following regex:

'(?<=[^A-Z][a-z]\w)[/.] '

sent = re.split(r'(?<=[^A-Z][a-z]\w)[/.] ', j)
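
One detail of this lookbehind pattern is easy to miss: `re.split` discards whatever the pattern matches, and here the match includes the period itself. The sketch below shows that behaviour on a made-up string, next to a variant that moves the period into the lookbehind so it stays attached to the sentence. The names `CONSUMING` and `KEEPING`, the variant itself, and the sample text are assumptions for illustration, not part of the original answer.

```python
import re

# The pattern as given: the period (or slash) is part of the match,
# so re.split removes it together with the following space.
CONSUMING = r'(?<=[^A-Z][a-z]\w)[/.] '

# Hypothetical variant: the period sits inside the (still fixed-width)
# lookbehind, so only the space is consumed by the split.
KEEPING = r'(?<=[^A-Z][a-z]\w[/.]) '

text = ("Case No. 778 - Martin H. v. The Woods was cited. "
        "However, it was distinguished.")

print(re.split(CONSUMING, text))
# ['Case No. 778 - Martin H. v. The Woods was cited', 'However, it was distinguished.']

print(re.split(KEEPING, text))
# ['Case No. 778 - Martin H. v. The Woods was cited.', 'However, it was distinguished.']
```

Neither form splits after "No.", "H." or "v.", because the lookbehind requires a lowercase letter in the second-to-last position before the period, which short capitalized abbreviations and single initials do not have.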