Regular expression to extract strings containing a question mark (?) in Python

python html regex

How can I extract strings that contain a "?" (that is, links with parameters)? When I try:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
html = """
<script type='text/javascript' src='http://www.somesite.com/wp-content/themes/Dessa/scripts/jquery.easing.1.3.js?ver=1.3'></script>
<a href="http://www.somesite.com/hard-circuit-editor-double-layout-design-now/">
"""
print re.findall( r'(href=|src=)"([^"]*)"', html, re.U)
print re.findall( r'(href=|src=)"(.*?)"', html, re.U)

the string is ignored. It would be nice to capture `?ver=1.3` separately in a third group. Any help?

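It has nothing to do with the `?` character (I don't know why you would think it does).

You are not delimiting the URL with the `"` character but with the `'` character, and the pattern only looks for `"`. Just change the string to: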

html = """
<script type='text/javascript' src="http://www.somesite.com/wp-content/themes/Dessa/scripts/jquery.easing.1.3.js?ver=1.3"></script>
<a href="http://www.somesite.com/hard-circuit-editor-double-layout-design-now/">
"""
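To check that the quote character, not the `?`, decides whether the question's pattern matches, here is a minimal sketch. It is written for Python 3 (the thread itself uses Python 2 `print` statements), and the variable names are illustrative only.

```python
import re

# Same pattern as in the question: it only accepts double-quoted values.
pattern = re.compile(r'(href=|src=)"([^"]*)"')

url = "http://www.somesite.com/wp-content/themes/Dessa/scripts/jquery.easing.1.3.js?ver=1.3"
single_quoted = "<script src='%s'></script>" % url
double_quoted = '<script src="%s"></script>' % url

# No double quote around the value, so nothing is found.
print(pattern.findall(single_quoted))  # []

# Same URL, "?" included, but double-quoted: the pattern matches.
print(pattern.findall(double_quoted))
# [('src=', 'http://www.somesite.com/wp-content/themes/Dessa/scripts/jquery.easing.1.3.js?ver=1.3')]
```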

Attribute values are enclosed not only in `"` but also in `'`.

The regular expression needs to be modified:

print re.findall( r'''(href=|src=)["']([^"']*)["']''', html, re.U)

Use `["']` to match either `"` or `'`.
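As a quick Python 3 check of the character-class pattern, the sketch below runs it on the question's HTML. The second pattern is a variant that is not part of the answer: it assumes you also want the closing quote to be the same character as the opening one, which a backreference (`\2`) can enforce.

```python
import re

html = """
<script type='text/javascript' src='http://www.somesite.com/wp-content/themes/Dessa/scripts/jquery.easing.1.3.js?ver=1.3'></script>
<a href="http://www.somesite.com/hard-circuit-editor-double-layout-design-now/">
"""

# Character class from the answer: either quote may open or close the value.
either_quote = re.findall(r'''(href=|src=)["']([^"']*)["']''', html)

# Variant (an assumption, not from the answer): group 2 remembers the
# opening quote and \2 requires the same character to close the value.
same_quote = [
    (attr, value)
    for attr, _quote, value in re.findall(r'''(href=|src=)(["'])(.*?)\2''', html)
]

for attr, value in either_quote:
    print(attr, value)
```

On this input both patterns return the same two (attribute, URL) pairs; they would differ only for values that contain the other kind of quote.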

UPDATE

To get the `ver=1.3` part, it is better to use `urlparse.urlparse` (in Python 3.x, `urllib.parse.urlparse`):

>>> import re
>>> import urlparse
>>>
>>> html = """
... <script type='text/javascript' src='http://www.somesite.com/wp-content/themes/Dessa/scripts/jquery.easing.1.3.js?ver=1.3'></script>
... <a href="http://www.somesite.com/hard-circuit-editor-double-layout-design-now/">
... """
>>> for attrname, value in re.findall(r'''(href=|src=)["']([^"']*)["']''', html, re.U):
...     print value, urlparse.urlparse(value).query
...
http://www.somesite.com/wp-content/themes/Dessa/scripts/jquery.easing.1.3.js?ver=1.3 ver=1.3
http://www.somesite.com/hard-circuit-editor-double-layout-design-now/

I didn't use any `'` myself, but WordPress generates them. :( Upvoted anyway.

The purpose of my answer was to point out what the problem is, but if you can't change the string, falsetru's answer is clearly the way to go here.

This works, but when I change it to r'''(href=|src=)["']([^"']*)\?(.*?)["']''' to separate out the third group (the part after the ?), it no longer recognizes the other links. Can a non-mandatory group be added? Your answer is correct as it stands, but could you help?
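On the follow-up question about a non-mandatory third group: a pattern with a required `\?` can only match URLs that actually contain a `?`, which is why the other links disappear. One possible approach, sketched here for Python 3 and not taken from the thread, is to wrap the `?` and the query in an optional non-capturing group, `(?:...)?`, and to exclude `?` from the path part. The pattern and variable names below are illustrative assumptions.

```python
import re

html = """
<script type='text/javascript' src='http://www.somesite.com/wp-content/themes/Dessa/scripts/jquery.easing.1.3.js?ver=1.3'></script>
<a href="http://www.somesite.com/hard-circuit-editor-double-layout-design-now/">
"""

pattern = re.compile(
    r'''(href=|src=)   # group 1: attribute name
        ["']           # opening quote, either kind
        ([^"'?]*)      # group 2: URL up to (not including) any "?"
        (?:\?          # optional part: a literal "?" ...
           ([^"']*)    # group 3: ... followed by the query string
        )?
        ["']           # closing quote
    ''',
    re.VERBOSE,
)

# findall() yields '' for group 3 when the optional part is absent.
matches = pattern.findall(html)
for attr, path, query in matches:
    print(attr, path, repr(query))
```

For the sample HTML this gives `ver=1.3` as the third element for the script URL and an empty string for the plain link. The `urlparse` approach shown in the answer remains more robust for real URLs, since this pattern does not handle fragments or a `?` appearing elsewhere.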