Non-greedy regular expressions in Python


Given the text:

'Adf Adf asdf asdf asfdfhttps://.com/abcabcabc kdfja ladsjfladsjf LADSJF ladsjfl adsfadf adf asdf asdf asfdfhttps://.com/abcabcabc\n kdfja ladsjfladsjf LADSJF ladsjfl adsfhttps://.com/djflkajdsfl\n\n djldjfld djfladjf ldfdjlkfj ldfj'

How can I match any URL of the form https://<somepage>.com/subdir [up to the first space, newline, comma, or period]?

What I tried:

re.findall('http.*',s) 
['https://<somepage>.com/abcabcabc kdfja ladsjfladsjf ladksjf ladsjfl adsfadf adf asdf asdf asfdf https://<somepage>.com/abcabcabc', 'https://<somepage>.com/djflkajdsfl']

re.findall('http.* ',s) 
['https://<somepage>.com/abcabcabc kdfja ladsjfladsjf ladksjf ladsjfl adsfadf adf asdf asdf asfdf ']

re.findall('http.* ?',s) 
['https://<somepage>.com/abcabcabc kdfja ladsjfladsjf ladksjf ladsjfl adsfadf adf asdf asdf asfdf https://<somepage>.com/abcabcabc', 'https://<somepage>.com/djflkajdsfl']

re.findall('http.* {1}?',s) 
['https://<somepage>.com/abcabcabc kdfja ladsjfladsjf ladksjf ladsjfl adsfadf adf asdf asdf asfdf ']

re.findall('http.* +?',s) 
['https://<somepage>.com/abcabcabc kdfja ladsjfladsjf ladksjf ladsjfl adsfadf adf asdf asdf asfdf ']

re.findall('http.*[^ \n]',s) 
['https://<somepage>.com/abcabcabc kdfja ladsjfladsjf ladksjf ladsjfl adsfadf adf asdf asdf asfdf https://<somepage>.com/abcabcabc', 'https://<somepage>.com/djflkajdsfl']

re.findall('http.*[^ \\n]',s) 
['https://<somepage>.com/abcabcabc kdfja ladsjfladsjf ladksjf ladsjfl adsfadf adf asdf asdf asfdf https://<somepage>.com/abcabcabc', 'https://<somepage>.com/djflkajdsfl']

re.findall('http.*[^ \\\n]',s) 
['https://<somepage>.com/abcabcabc kdfja ladsjfladsjf ladksjf ladsjfl adsfadf adf asdf asdf asfdf https://<somepage>.com/abcabcabc', 'https://<somepage>.com/djflkajdsfl']

re.findall('http.* *?',s) 
['https://imgur.com/abcabcabc kdfja ladsjfladsjf ladksjf ladsjfl adsfadf adf asdf asdf asfdf https://imgur.com/abcabcabc', 'https://somepage.com/djflkajdsfl']

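Because you are using . (dot), neither lazy (*?) nor greedy (*) matching works for you. Lazy matching stops as early as it can (with nothing after it in the pattern, .*? matches no characters at all), while greedy matching keeps going to the end of the line.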
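
To make the difference concrete, here is a minimal sketch. The sample string is a shortened, made-up stand-in (example.com is an assumption, not the string from the question):

```python
import re

# Hypothetical shortened sample, not the exact string from the question.
sample = "see https://example.com/abc now\nand https://example.com/def\n"

# Greedy: '.*' takes everything up to the end of the line
# ('.' does not match a newline by default).
greedy = re.findall(r"http.*", sample)

# Lazy: '.*?' takes as little as possible; with nothing after it in the
# pattern, that is zero characters.
lazy = re.findall(r"http.*?", sample)

print(greedy)  # ['https://example.com/abc now', 'https://example.com/def']
print(lazy)    # ['http', 'http']
```

Neither result is the URL on its own, which is why the pattern needs a different building block than . (dot).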

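Instead, you want to specify the characters you do not want ([^ \n,]) and search on that. Because you want to stop at the first occurrence of one of those characters, a greedy search over the negated character class does exactly this.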

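Because the . (dot) character is legal in URLs, it is hard to cut the string off based on it. Since you always want to include the subdirectory, you can do it like this: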

re.findall(r'http[^ \n,]*', s)
re.findall(r'http[^ \n,]*/[^ \n,.]*', s)

You can see it in action.
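
As a rough illustration of how the two patterns differ, the sketch below runs both on a made-up sentence (the example.com URLs are assumptions chosen so that one URL is followed by a period and the other contains a dot inside its path):

```python
import re

# Hypothetical sample text.
sample = "Docs at https://example.com/guide. Also https://example.com/a.b/c, ok"

# Stops at space, newline, or comma; a trailing period is kept.
simple = re.findall(r"http[^ \n,]*", sample)

# Requires a '/', then stops the last path segment at a period as well.
with_subdir = re.findall(r"http[^ \n,]*/[^ \n,.]*", sample)

print(simple)       # ['https://example.com/guide.', 'https://example.com/a.b/c']
print(with_subdir)  # ['https://example.com/guide', 'https://example.com/a.b/c']
```

One consequence of the second pattern is that a dot in the final path segment also ends the match, so a URL ending in something like /abc.html would be cut to /abc.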

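The problem in your first example is not that the regexp matches too many spaces; it matches too many letters before the space. So instead of putting the "non-greedy" ? modifier after the space, put it after the .*, because that is what is currently matching too much: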

py3.7 >>> re.findall('http.*? ', s)
['https://.com/abcabcabc ']
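
The effect of moving the ? can be sketched on a short, made-up line (example.com is an assumption):

```python
import re

# Hypothetical one-line sample.
sample = "a https://example.com/abc b c\n"

# '?' after the space only makes the space optional; '.*' is still greedy.
misplaced = re.findall(r"http.* ?", sample)

# '?' after '.*' makes the star lazy, so it stops at the first space.
lazy_star = re.findall(r"http.*? ", sample)

print(misplaced)  # ['https://example.com/abc b c']
print(lazy_star)  # ['https://example.com/abc ']
```

Note that the lazy version still includes the trailing space in the match, and it finds nothing for a URL that is followed directly by a newline.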

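On the other hand, [^ \n] is not a modifier of any kind; it is a complete matching expression in its own right. Putting it after the existing expression therefore does not make that expression match less; you now have two matching expressions, and together they match more.

You have to use it in place of the expression that matches too much, i.e. in place of the . (dot):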

py3.7 >>> re.findall('http[^ \n]*', s)
['https://.com/abcabcabc', 'https://.com/abcabcabc', 'https://.com/djflkajdsfl']
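
A side-by-side sketch of appending the character class versus substituting it for the dot, again on a made-up two-line sample (example.com is an assumption):

```python
import re

# Hypothetical sample with one URL per line.
sample = "x https://example.com/abc y\nz https://example.com/def\n"

# Appended: '.*' still runs to the end of the line, then gives back one
# character so that '[^ \n]' can match it.
appended = re.findall(r"http.*[^ \n]", sample)

# Substituted: the class replaces '.', so the match ends at the first
# space or newline.
replaced = re.findall(r"http[^ \n]*", sample)

print(appended)  # ['https://example.com/abc y', 'https://example.com/def']
print(replaced)  # ['https://example.com/abc', 'https://example.com/def']
```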