Python regex: finding substrings centered on a number

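I have a string. I want to cut it into substrings, each consisting of a number surrounded by (up to) 4 words on each side. If the substrings overlap, they should be merged.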

Sampletext = "by the way I know 54 how to take praise for 65 excellent questions 34 thank you for asking appreciated."
re.findall('(\s[*\s]){1,4}\d(\s[*\s]){1,4}', Sampletext)
desired output = ['the way I know 54 how to take praise', 'to take praise for 65 excellent questions 34 thank you for asking']

Overlapping Matches: Use Lookaheads

This will do it:

import re

subject = "by the way I know 54 how to take praise for 65 excellent questions 34 thank you for asking appreciated."
# The lookahead consumes nothing, so matches may overlap; group 1 holds the text.
for match in re.finditer(r"(?=((?:\b\w+\b ){4}\d+(?: \b\w+\b){4}))", subject):
    print(match.group(1))
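To see why the pattern is wrapped in a lookahead, here is a minimal sketch on a toy string. The string "12345" and the variable names are illustrative only and do not come from the question. A plain pattern consumes the characters it matches, so results cannot overlap. A capture group inside a zero-width lookahead is retried at every position.

```python
import re

digits = "12345"  # toy input, chosen only for illustration

# A consuming pattern: each match uses up its characters, so results cannot overlap.
consumed = re.findall(r"\d\d", digits)

# A capture group inside a zero-width lookahead: nothing is consumed, so the
# engine retries from the next position and overlapping results are reported.
overlapping = re.findall(r"(?=(\d\d))", digits)

print(consumed)     # ['12', '34']
print(overlapping)  # ['12', '23', '34', '45']
```

The same idea is what lets one nine-word window start inside the previous one.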
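What is a Word?

The output depends on your definition of a word. Here, I have allowed digits within a word (\w matches digits as well as letters). This produces the following output.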

Output (digits allowed in words)

the way I know 54 how to take praise
to take praise for 65 excellent questions 34 thank
for 65 excellent questions 34 thank you for asking

Option 2: No digits in words

import re

subject = "by the way I know 54 how to take praise for 65 excellent questions 34 thank you for asking appreciated."
# [a-z]+ with re.IGNORECASE accepts letters only, so a number no longer counts as a word.
for match in re.finditer(r"(?=((?:\b[a-z]+\b ){4}\d+(?: \b[a-z]+\b){4}))", subject, re.IGNORECASE):
    print(match.group(1))
Output 2

the way I know 54 how to take praise
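The difference between the two outputs comes down to which tokens each character class accepts as a word. A small sketch of that check follows. The token "abc123" and the helper names are made up for illustration.

```python
import re

tokens = ["know", "54", "abc123"]  # "abc123" is a hypothetical mixed token

classification = {}
for token in tokens:
    # \w+ accepts letters, digits and underscores, so numbers count as words.
    is_w_word = bool(re.fullmatch(r"\w+", token))
    # [a-z]+ with IGNORECASE accepts letters only.
    is_letter_word = bool(re.fullmatch(r"[a-z]+", token, re.IGNORECASE))
    classification[token] = (is_w_word, is_letter_word)
    print(token, is_w_word, is_letter_word)
```

With the letters-only class, a window that needs 65 or 34 to stand in as one of its four neighbouring words can no longer match, which is why Option 2 reports a single result.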
Option 3: Extend to four uninterrupted non-digit words

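Based on your comments, this option extends to the left and right of the pivot number until four uninterrupted non-digit words are matched. Commas are ignored.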

import re

subject = "by the way I know 54 how to take praise for 65 excellent questions 34 thank you for asking appreciated. One Two Three Four 55 Extend 66 a b c d AA BB CC DD 71 HH DD, JJ FF"
# Four words, then a number, then lazily extend until four letter-only words follow in a row.
for match in re.finditer(r"(?=((?:\b[a-z]+[ ,]+){4}(?:\d+ (?:[a-z]+ ){1,3}?)*?\d+.*?(?:[ ,]+[a-z]+){4}))", subject, re.IGNORECASE):
    print(match.group(1))
Output 3

the way I know 54 how to take praise
to take praise for 65 excellent questions 34 thank you for asking
One Two Three Four 55 Extend 66 a b c d
AA BB CC DD 71 HH DD, JJ FF
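The Option 3 expression is dense, so here is the same pattern laid out with re.VERBOSE as a reading aid. The comments are my interpretation of each part. The literal spaces are written as [ ] because verbose mode ignores bare whitespace. The names option3 and matches are illustrative.

```python
import re

subject = ("by the way I know 54 how to take praise for 65 excellent questions 34 "
           "thank you for asking appreciated. One Two Three Four 55 Extend 66 "
           "a b c d AA BB CC DD 71 HH DD, JJ FF")

option3 = re.compile(r"""
    (?=(                                  # lookahead + capture group: nothing is consumed
        (?:\b[a-z]+[ ,]+){4}              # four letter-only words, each followed by spaces or commas
        (?:\d+[ ](?:[a-z]+[ ]){1,3}?)*?   # lazily, zero or more: a number plus one to three words
        \d+                               # a number
        .*?                               # lazily take as little as possible...
        (?:[ ,]+[a-z]+){4}                # ...until four letter-only words follow in a row
    ))
""", re.IGNORECASE | re.VERBOSE)

matches = [m.group(1) for m in option3.finditer(subject)]
for text in matches:
    print(text)
```

On the extended subject string this should print the same four lines as Output 3.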

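Please see how I simplified the Option 3 expression so that a word is broadly defined as a group of non-whitespace characters (\S+) surrounded by whitespace (\s+). A number is any "word" that contains a digit.

By your definition, can a "word" also match a "number word"? You say "of course", but in that case, if your words can be number words as well as non-number words, what does it mean to add four more words beyond a number word? Sorry, I can't quite follow that right now. If you need a quick answer, please post a new question; if you do, I suggest you refer back to this first question. I may have time later, but I'm not sure. :)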
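One possible reading of the broader definition raised in the comments is sketched below without a regex, as a token-based alternative to the lookahead approach. It rests on assumptions rather than on anything confirmed in the thread. A word is any whitespace-delimited token. A number is any token containing a digit. Two windows are merged only when the later number lies within the `width` words that follow the earlier one, a rule inferred from the desired output in the question. The function name number_windows and its width parameter are hypothetical.

```python
def number_windows(text, width=4):
    """Return windows of up to `width` words around each number token.

    Assumption: numbers that lie within `width` words of the previous
    number are chained into a single window.
    """
    words = text.split()
    positions = [i for i, w in enumerate(words) if any(ch.isdigit() for ch in w)]

    groups = []  # each group is a list of chained number positions
    for pos in positions:
        if groups and pos - groups[-1][-1] <= width:
            groups[-1].append(pos)   # close enough: extend the current window
        else:
            groups.append([pos])     # too far away: start a new window

    return [
        " ".join(words[max(group[0] - width, 0): group[-1] + width + 1])
        for group in groups
    ]


sample = ("by the way I know 54 how to take praise for 65 excellent "
          "questions 34 thank you for asking appreciated.")
result = number_windows(sample)
for window in result:
    print(window)
```

On the sample text this yields the two substrings listed as the desired output. Other inputs may call for a different merge rule or a stricter notion of "number".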