Python: using a regex to find a paragraph and then a string within that paragraph

python html regex

I have the following lines in an HTML page:

<div>
    <p class="match"> this sentence should match </p> 
    some text
    <a class="a"> some text </a>  
</div>
<div> 
    <p class="match"> this sentence shouldnt match</p> 
    some text
    <a class ="b"> some text </a> 
</div>

But I would like to know whether there is another (still efficient) way to do this in a single step?

Use an HTML parser such as BeautifulSoup. Find the a tag with class "a", then find its preceding sibling p tag with class "match":
from bs4 import BeautifulSoup

data = """
<div>
    <p class="match"> this sentence should match </p>
    some text
    <a class="a"> some text </a>
</div>
<div>
    <p class="match"> this sentence shouldn't match</p>
    some text
    <a class ="b"> some text </a>
</div>
"""

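# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, "html.parser")
a = soup.find('a', class_='a')
# Python 3 print function; .text keeps the whitespace surrounding the sentence.
print(a.find_previous_sibling('p', class_='match').text)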

See also the linked thread on why you should avoid parsing HTML with regular expressions.
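If the page may contain several matching blocks, or none at all, the same idea can be wrapped in a small helper. This is only a sketch: the function name sentences_before_link and its parameters are made up for illustration, and it assumes the p and a tags are siblings inside the same parent, as in the sample HTML.

```python
from bs4 import BeautifulSoup


def sentences_before_link(html, link_class="a", para_class="match"):
    """Collect the text of each <p class=para_class> that has a later
    sibling <a class=link_class> (hypothetical helper, not part of bs4)."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for link in soup.find_all("a", class_=link_class):
        # Walk backwards through the siblings of the link to find the paragraph.
        para = link.find_previous_sibling("p", class_=para_class)
        if para is not None:  # skip links that have no matching paragraph
            results.append(para.get_text(strip=True))
    return results


data = """
<div>
    <p class="match"> this sentence should match </p>
    some text
    <a class="a"> some text </a>
</div>
<div>
    <p class="match"> this sentence shouldn't match</p>
    some text
    <a class ="b"> some text </a>
</div>
"""

print(sentences_before_link(data))
```

Checking for None avoids the AttributeError that the one-liner would raise when no matching paragraph exists.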



You should use an HTML parser, but if you still want to use a regular expression, you could use something like this:
<div>\s*<p class="match">([\w\s]+)</p>[\w\s]+(?=<a class="a").*?</div>
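As a rough sketch of how this pattern could be applied from Python: the re.DOTALL flag is an assumption added here, because in the sample HTML the closing </div> sits on a different line from the a tag and "." does not match newlines by default. The variable names are illustrative only.

```python
import re

# The pattern from the answer, compiled with re.DOTALL (an assumption, see above)
# so that ".*?" can cross the newline before </div>.
PATTERN = re.compile(
    r'<div>\s*<p class="match">([\w\s]+)</p>[\w\s]+(?=<a class="a").*?</div>',
    re.DOTALL,
)

data = """
<div>
    <p class="match"> this sentence should match </p>
    some text
    <a class="a"> some text </a>
</div>
<div>
    <p class="match"> this sentence shouldnt match</p>
    some text
    <a class ="b"> some text </a>
</div>
"""

# findall returns the single capture group (the paragraph text) for each match.
matches = [text.strip() for text in PATTERN.findall(data)]
print(matches)
```

Note that [\w\s]+ only accepts word characters and whitespace, so a sentence containing punctuation such as an apostrophe would not be captured. This is one of the ways a regex-based approach is more fragile than a parser.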

Try Beautiful Soup 4 to parse the HTML file.

@user3683807 Please read the linked thread carefully: HTML parsers are tools built specifically for the task of parsing HTML. I would recommend using BeautifulSoup here; it makes HTML parsing simple and reliable.

@Jerry As I suggested in my answer, I would not use regular expressions to parse HTML. I posted the answer as an option that answers the question using a regex.

 <div>\s*\n\s*.*?<p class=.*?>(.*?)<\/p>\s*\n\s*.*?\s*\n\s*(?=(\<a class=\"a\"\>))