Checking for elongated words in a sentence in Python

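I want to check whether a sentence contains any elongated words, for example sooo, toooo, thaaatttt, and so on. I don't know in advance what the user might type, because I have a list of sentences that may or may not contain elongated words. How can I check for this in Python? I am new to Python.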

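Well, you could make a list of every elongated word that could logically occur. Then loop through the words in the sentence and, for each one, through the words in the list, to find the elongated ones.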

sentence = "Hoow arre you doing?"
elongated = ["hoow", "arre", "youu", "yoou", "meee"]  # You will need a much larger list
for word in sentence.split():  # split into words; iterating the string yields characters
    word = word.lower()
    for e_word in elongated:
        if e_word == word:
            print("Found an elongated word!")

If you want to do what Hugh Bothwell suggested, then:

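sentence = "Hooow arrre you doooing?"
elongations = ["aaa", "ooo", "rrr", "bbb", "ccc"]  # continue for all the letters
for word in sentence.split():  # split into words; iterating the string yields characters
    for x in elongations:
        if x in word.lower():
            print('"' + word + '" is an elongated word')
            break  # report each word only once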
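
The "continue for all the letters" step does not have to be typed out by hand. As a minimal sketch, assuming only ASCII letters matter, the list of triples can be generated from string.ascii_lowercase; the names ELONGATIONS and find_elongated are made up for this illustration.

```python
import string

# One tripled string per lowercase ASCII letter: "aaa", "bbb", ..., "zzz"
ELONGATIONS = [letter * 3 for letter in string.ascii_lowercase]

def find_elongated(sentence):
    """Return the words in `sentence` that contain a tripled letter."""
    found = []
    for word in sentence.split():
        lowered = word.lower()
        if any(triple in lowered for triple in ELONGATIONS):
            found.append(word)
    return found

print(find_elongated("Hooow arrre you doooing?"))
# ['Hooow', 'arrre', 'doooing?']
```

Punctuation stays attached to the words because the sentence is split on whitespace only.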

Try this:

import re

s1 = "This has no long words"
s2 = "This has oooone long word"

def has_long(sentence):
    # A letter followed by two or more copies of itself
    elong = re.compile(r"([a-zA-Z])\1{2,}")
    return bool(elong.search(sentence))


print(has_long(s1))  # False
print(has_long(s2))  # True
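
Since the question mentions a list of sentences, the same check can be applied to each one in turn. The following is only a sketch under that assumption; the helper name sentences_with_elongation and the sample sentences are invented for the example, and re.IGNORECASE is added so that a run such as "Sss" also counts.

```python
import re

# A letter repeated three or more times in a row, ignoring case
ELONGATED = re.compile(r"([a-z])\1{2,}", re.IGNORECASE)

def sentences_with_elongation(sentences):
    """Return only the sentences that contain at least one elongated run."""
    return [s for s in sentences if ELONGATED.search(s)]

sample = [
    "This has no long words",
    "This has oooone long word",
    "Thaaatttt is toooo much",
]
print(sentences_with_elongation(sample))
# ['This has oooone long word', 'Thaaatttt is toooo much']
```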

@HughBothwell had a good idea. As far as I know, no English word has the same letter repeated three times in a row. So you can search for words that do:

>>> from re import search
>>> mystr = "word word soooo word tooo thaaatttt word"
>>> [x for x in mystr.split() if search(r'(?i)([a-z])\1\1+', x)]
['soooo', 'tooo', 'thaaatttt']
>>>

Any words you find will be elongated.
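
To see why a word was flagged, it can help to look at the runs the backreference actually matched. This is an illustrative sketch, not part of the answer above; elongated_runs is a hypothetical helper name.

```python
import re

# ([a-z]) captures one letter; \1\1+ then requires at least two more copies
# of that same letter, so each match is a run of three or more.
RUN = re.compile(r"(?i)([a-z])\1\1+")

def elongated_runs(text):
    """Map each elongated word in `text` to the letter runs that flagged it."""
    result = {}
    for word in text.split():
        runs = [m.group(0) for m in RUN.finditer(word)]
        if runs:
            result[word] = runs
    return result

print(elongated_runs("word soooo tooo thaaatttt"))
# {'soooo': ['oooo'], 'tooo': ['ooo'], 'thaaatttt': ['aaa', 'tttt']}
```

The capture group is essential: without the parentheses there is no group 1 for \1 to refer to, and the pattern fails to compile.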

You need a reference list of valid English words. On *NIX systems, you could use /etc/share/dict/words, /usr/share/dict/words, or an equivalent, and store all the words in a set object.

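Then, for every word in the sentence, you want to check:

  • that the word is not itself a valid word (i.e., word not in all_words); and
  • that, when you shorten all the consecutive sequences to one or two letters, the new word is a valid word.

Here is one way to try to extract all of the possibilities: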

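    import re
    import itertools

    # A word character followed by two more copies of itself
    regex = re.compile(r'(\w)\1\1')

    # get_all_words() is assumed to return the words from your reference list
    all_words = set(get_all_words())

    def without_elongations(word):
        # No run of three left: the word itself is the only candidate
        if regex.search(word) is None:
            return [word]
        # Shorten the first run to one letter and to two letters, then recurse
        replacing_with_one_letter = regex.sub(r'\1', word, count=1)
        replacing_with_two_letters = regex.sub(r'\1\1', word, count=1)
        return list(itertools.chain(
            without_elongations(replacing_with_one_letter),
            without_elongations(replacing_with_two_letters),
        ))

    sentence = "hooow arrre you doooing"  # example input
    for word in sentence.split():
        if word not in all_words:
            if any(w in all_words for w in without_elongations(word)):
                print('%(word)s is elongated' % {'word': word})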
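
The answer leaves get_all_words() undefined. One possible way to build the set, sketched here under the assumption that the reference file holds one word per line, is shown below; load_words is a hypothetical name, and the temporary file only stands in for a real word list so that the example runs anywhere.

```python
import os
import tempfile

def load_words(path):
    """Read one word per line from `path` into a set of lowercase words."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}

# Demo with a tiny temporary file; on a system that ships a word list you
# would pass its path instead (for example /usr/share/dict/words).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("so\ntoo\nthat\n")
all_words = load_words(tmp.name)
os.remove(tmp.name)

print("so" in all_words, "sooo" in all_words)
# True False
```

A set is used because membership tests on it take roughly constant time, which matters when every candidate spelling of every word is looked up.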
    

It seems to me you should look for any word containing the same letter at least three times in a row. I don't know of any actual English words that do that.
The first step is to define what an elongated word is, whether by reference to a known dictionary or by a rule such as the one @HughBothwell suggested.
@HughBothwell: KKK and WWW are the only "elongated" words.
@HughBothwell: (the) includes some elongated words, such as: woooosh, pfffted, Aaawww, unnt, Sssshoo, BANNNNNG.
The True if X else False pattern is usually written as bool(X).
Thanks, that's neater! Updated.
Note: [a-zA-Z] fails on mixed case: 'Sssshoo'.
\w picks up digits such as '8888'. It doesn't support non-ASCII letters such as 'ёёё', so you may want to replace it.
Note: the (?i) flag takes care of it. It also works for mixed case such as 'ssssshoo', i.e., A-Z is unnecessary here.
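
The points raised in these comments (digits, mixed case, non-ASCII letters) can be handled together. The following is one possible sketch for Python 3, where str patterns are Unicode-aware by default; the character class [^\W\d_] and the names LETTER_RUN and has_elongation are choices made for this illustration, not something proposed in the thread.

```python
import re

# [^\W\d_] means "a word character that is neither a digit nor an underscore",
# which leaves letters, including non-ASCII ones. IGNORECASE makes the
# backreference match regardless of case.
LETTER_RUN = re.compile(r"([^\W\d_])\1{2,}", re.IGNORECASE)

def has_elongation(word):
    """Return True if `word` contains a letter repeated three or more times."""
    return LETTER_RUN.search(word) is not None

for sample in ["Sssshoo", "8888", "ёёё", "hello"]:
    print(sample, has_elongation(sample))
# Sssshoo True
# 8888 False
# ёёё True
# hello False
```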