Python: get the text in a p that isn't inside another p

python web-scraping beautifulsoup

I have the following value inside a div:

<p>
Example text I would like to scrape
<p>Example text I do not want to scrape</p>
</p>


How can I return only "Example text I would like to scrape"?

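You can use the re regular-expression module if the text you are scraping contains a specific pattern. Here is a very basic example whose pattern is plain text: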

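import re

# re.sub() deletes whatever the pattern matches, so the pattern describes
# the nested paragraph to drop, not the text to keep.
pattern = re.compile(r"<p>Example text I do not want to scrape</p>")

html_elements = """<p>
Example text I would like to scrape
<p>Example text I do not want to scrape</p>
</p>
"""
# Drop the nested paragraph, then the remaining <p> tags and stray whitespace.
remaining = re.sub(pattern, "", html_elements)
print(re.sub(r"</?p>", "", remaining).strip())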

You can try:

from bs4 import BeautifulSoup

html_doc = """<p>
Example text I would like to scrape
<p>Example text I do not want to scrape</p>
</p>"""
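# The lxml parser is expected to close the first <p> when it meets the nested
# one, so soup.p should hold only the outer text.
soup = BeautifulSoup(html_doc, 'lxml')

# strip() trims the newlines that surround the text inside the tag.
print(soup.p.text.strip())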
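
The result of soup.p.text depends on how the chosen parser repairs the nested <p>. A minimal sketch that reads only the text nodes sitting directly inside the first <p>, assuming Python's built-in "html.parser" backend is acceptable; the names direct_strings and wanted are illustrative only:

```python
from bs4 import BeautifulSoup

html_doc = """<p>
Example text I would like to scrape
<p>Example text I do not want to scrape</p>
</p>"""

# "html.parser" ships with Python, so no extra parser has to be installed.
soup = BeautifulSoup(html_doc, "html.parser")

# recursive=False limits the search to direct children of the first <p>,
# so strings inside a nested <p> are never collected.
direct_strings = soup.p.find_all(string=True, recursive=False)

# Join the direct text nodes and trim the surrounding newlines.
wanted = "".join(direct_strings).strip()
print(wanted)
```

Because only direct children are inspected, this should give the same text whether the parser keeps the paragraphs nested or splits them into siblings.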

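Here is how I would handle it with a regular expression. We can match anything (.*) that is preceded by <p> and a newline (\n), and followed by a newline and <p>: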

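import re

# Lookbehind: the match must come right after "<p>" and a newline.
# Lookahead: the match must be followed by a newline and "<p>".
# "." does not match a newline, so ".*" stays on a single line.
pattern = re.compile(r"(?<=<p>\n).*(?=\n<p>)")

html_elements = """<p>
Example text I would like to scrape
<p>Example text I do not want to scrape</p>
</p>"""

result = pattern.search(html_elements).group()
print(result)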
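
Note that pattern.search(...) returns None when nothing matches, so calling .group() on it raises AttributeError for input with a different layout. A small sketch that tolerates that case and extra whitespace around the text; the helper name first_paragraph_text and the pattern OUTER_TEXT are illustrative assumptions, not part of the answer above:

```python
import re

# Capture the text between an opening <p> and the next tag,
# ignoring whitespace on either side of the text.
OUTER_TEXT = re.compile(r"<p>\s*([^<]+?)\s*<")


def first_paragraph_text(html):
    """Return the text right after the first matching <p>, or None."""
    match = OUTER_TEXT.search(html)
    return match.group(1) if match else None


html_elements = """<p>
Example text I would like to scrape
<p>Example text I do not want to scrape</p>
</p>"""

print(first_paragraph_text(html_elements))
print(first_paragraph_text("no paragraph here"))
```

A regular expression is still fragile on real-world HTML, so this is only reasonable when the markup is as regular as in the question.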

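Is the text you want always located before the nested paragraph? If so, it is a little less clean, but you could call .split("") on your text and then take the right index. A more flexible approach would be to use a regular expression.

@SimonR Thank you very much! I am trying it now. Actually, I think it is the other way around: the text I do not want to scrape is constant. I see your point about how to solve this. Thanks!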
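
The split idea from the comment could look roughly like this. The separator is not visible in the comment, so the sketch assumes it is the opening tag "<p>"; SEPARATOR, parts, and wanted are illustrative names:

```python
html_elements = """<p>
Example text I would like to scrape
<p>Example text I do not want to scrape</p>
</p>"""

# Assumed separator: the opening paragraph tag.
SEPARATOR = "<p>"

# parts[0] is the empty string before the first <p>,
# parts[1] is everything between the first and the nested <p>.
parts = html_elements.split(SEPARATOR)

wanted = parts[1].strip()
print(wanted)
```

This only works while the wanted text comes before the nested paragraph; if the order changes, the index has to change with it.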