Python: extract text between patterns using a regular expression

python regex


I need help with a regular expression in Python.

I have a large HTML file (about 400 lines) that contains the following pattern:

text here(div,span,img tags)

<!-- 3GP||Link|| --> 

text here(div,span,img tags)

The given pattern occurs only once in the HTML file.

>>> d = """
... Some text here(div,span,img tags)
...
... <!-- 3GP||**Some link**|| -->
...
... Some text here(div,span,img tags)
... """
>>> import re
>>> re.findall(r'\<!-- 3GP\|\|([^|]+)\|\| --\>',d)
['**Some link**']
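To make the pattern above easier to read, here is a minimal sketch that spells out the same expression with `re.VERBOSE`, one piece per line. The name `LINK_PATTERN` and the `sample` string are assumptions for illustration, and the optional backslashes before `<` and `>` are left out.

```python
import re

# Same idea as the pattern above, written with re.VERBOSE so each piece can be annotated.
# In verbose mode unescaped whitespace is ignored, so a literal space is written as [ ].
LINK_PATTERN = re.compile(r"""
    <!--[ ]3GP      # literal start of the comment marker
    \|\|            # two literal pipes (an unescaped pipe would mean alternation)
    ([^|]+)         # group 1: one or more characters that are not a pipe
    \|\|            # the closing pair of pipes
    [ ]-->          # literal end of the comment
""", re.VERBOSE)

sample = """
Some text here(div,span,img tags)

<!-- 3GP||**Some link**|| -->

Some text here(div,span,img tags)
"""

links = LINK_PATTERN.findall(sample)
print(links)
```

Because `[^|]+` cannot cross a pipe, the group stops at the first `||` that follows the link.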

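re.findall returns all non-overlapping matches of the pattern in the string. If the pattern contains a group, it returns the text captured by that group instead of the whole match.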
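The effect of groups on what `re.findall` returns can be seen in a small sketch. The `text` string with two comments is an assumed example, not taken from the question.

```python
import re

text = "<!-- 3GP||first|| --> and <!-- 3GP||second|| -->"

# No group: each result is the whole matched substring.
whole = re.findall(r"<!-- 3GP\|\|[^|]+\|\| -->", text)

# One group: each result is only the text captured by that group.
captured = re.findall(r"<!-- 3GP\|\|([^|]+)\|\| -->", text)

# Two groups: each result is a tuple with one entry per group.
pairs = re.findall(r"<!-- (\w+)\|\|([^|]+)\|\| -->", text)

print(whole)
print(captured)
print(pairs)
```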

import re
re.match(r"<!-- 3GP\|\|(.+?)\|\| -->", "<!-- 3GP||Link|| -->").group(1)

produces

"Link"
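One caveat is worth sketching: `re.match` only succeeds when the pattern fits at the very start of the string, so for a whole file, where the comment sits somewhere in the middle, `re.search` is likely the call you want. The `page` string below is an assumed stand-in for the HTML file.

```python
import re

pattern = re.compile(r"<!-- 3GP\|\|(.+?)\|\| -->")

page = """
Some text here(div,span,img tags)

<!-- 3GP||Link|| -->

Some text here(div,span,img tags)
"""

# re.match anchors at position 0, where this sample has a newline, so it fails.
at_start = pattern.match(page)

# re.search scans forward until the pattern fits somewhere in the string.
anywhere = pattern.search(page)
link = anywhere.group(1) if anywhere else None

print(at_start)
print(link)
```

Checking the match object before calling `.group(1)` avoids an `AttributeError` when the comment is missing.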

If you need to parse anything else in the document, you can also combine the regular expression with BeautifulSoup:

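import re
from bs4 import BeautifulSoup, Comment  # BeautifulSoup 4 (package bs4)

# html holds your HTML document as a string
soup = BeautifulSoup(html, "html.parser")
# the pattern is applied to the comment text only, without the <!-- and --> delimiters
link_regex = re.compile(r'\s+3GP\|\|(.*)\|\|\s+')
# find the first comment node whose text matches the pattern
comment = soup.find(string=lambda text: isinstance(text, Comment)
                    and link_regex.match(text))
link = link_regex.match(comment).group(1)
print(link)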

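Note that in this case the regular expression only needs to match the contents of the comment, because BeautifulSoup already takes care of extracting the text from the comment.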
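The same division of labour can be sketched with only the standard library: `html.parser.HTMLParser` hands each comment's text to `handle_comment`, so the regular expression again only has to match the comment contents. This swaps BeautifulSoup for the stdlib parser purely for illustration; the class name `CommentLinkFinder` and the sample markup are hypothetical.

```python
import re
from html.parser import HTMLParser

# Matches the comment contents only, e.g. " 3GP||Link|| ".
LINK_REGEX = re.compile(r"\s+3GP\|\|(.*)\|\|\s+")


class CommentLinkFinder(HTMLParser):
    """Hypothetical helper: collects links found in 3GP-style comments."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_comment(self, data):
        # data is the text between the comment delimiters, which are already removed.
        match = LINK_REGEX.match(data)
        if match:
            self.links.append(match.group(1))


finder = CommentLinkFinder()
finder.feed("<div>text</div><!-- 3GP||Link|| --><span>text</span><!-- other -->")
finder.close()
print(finder.links)
```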

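Thanks. It works. If you don't mind, could you explain what you did there?

I guess that, strictly speaking, < and > don't need to be escaped here, but it does no harm; they are metacharacters in some other pattern implementations.

Thanks. That's a good explanation. Can you recommend some good tutorials for learning regular expressions? The trouble is that there are too many to choose from.

Sadly, no. I suggest you read the Python documentation for the re module, experiment, and ask questions when you get stuck. A decent syntax highlighter may help.

My HTML is too malformed, which is why I'm not using Beautiful Soup.

I see. Then I agree that the best option is to use a regular expression.

Yes, that's what I'll do.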
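The point about escaping can be checked with a short sketch: in Python's `re`, a backslash before `<` or `>` just matches the literal character, so both spellings behave the same. The `sample` string is an assumed one-line input, and the third variant uses `re.escape` on the literal delimiters as an alternative to escaping by hand.

```python
import re

sample = "<!-- 3GP||**Some link**|| -->"

# With and without the optional backslashes before < and >.
escaped = re.findall(r"\<!-- 3GP\|\|([^|]+)\|\| --\>", sample)
unescaped = re.findall(r"<!-- 3GP\|\|([^|]+)\|\| -->", sample)

# re.escape adds whatever escaping the literal delimiters need.
built = re.compile(re.escape("<!-- 3GP||") + r"([^|]+)" + re.escape("|| -->"))
from_escape = built.findall(sample)

print(escaped == unescaped == from_escape)
```

The pipes are different: left unescaped, `|` means alternation, so those backslashes are required.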
