Python 3.x: How to retrieve text data from multiple HTML tags?


python-3.x regex web-scraping xpath


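I have the following HTML snippet stored in a variable named content, of type bs4.element.Tag:

Alpha-tocopherol
see

str(content) outputs:

"\nAlpha-tocopherol\nsee\n\n"

I want to get the following output using Python:

['Alpha-tocopherol', 'vitamine']

I tried the following, but it is wrong: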

import re

regex = re.compile(r"(\w+\s+)\n")
regex.sub("", content.text).split()

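Both of these options will produce the list you want. The first, however, depends on how the HTML element is parsed: if the markup contains line breaks (\n), some additional parsing is needed.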

from bs4 import BeautifulSoup

html = '''<li class="item">Alpha-tocopherol<em>see</em><a href="https://medlineplus.gov/vitamine.html">Vitamin E</a></li>'''
soup = BeautifulSoup(html, "html.parser")

soup.text.split('see') # option 1, get all text and parse accordingly from soup object

soup.find('li', class_='item').text.split('see') # option 2, get text from li element (seems like it'd be less efficient to do this)
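
The additional parsing mentioned above for markup that contains line breaks could look roughly like the following. This is only a sketch: it assumes the multi-line form of the snippet and that the word "see" always separates the two values. The names item and parts are illustrative and not part of the original answer.

```python
from bs4 import BeautifulSoup

# Same list item as above, but spread over several lines,
# so the extracted text contains "\n" characters.
html = """
<li class="item">
Alpha-tocopherol
<em>see</em>
<a href="https://medlineplus.gov/vitamine.html">Vitamin E</a>
</li>
"""
soup = BeautifulSoup(html, "html.parser")

item = soup.find("li", class_="item")

# Split on the separator word, then strip the surrounding
# whitespace and newlines from each piece.
parts = [part.strip() for part in item.get_text().split("see")]
print(parts)
```

Note that splitting on "see" breaks as soon as either value itself contains that substring, so this only suits input where the separator is known to be unambiguous.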

You can use find_all() to get each li tag, then use find_next() to search for the a tag that follows it:

from bs4 import BeautifulSoup

html = """
<li class="item">
Alpha-tocopherol
<em>see</em>
<a href="https://medlineplus.gov/vitamine.html">Vitamin E</a>
</li>
"""
soup = BeautifulSoup(html, "html.parser")

for tag in soup.find_all("li", class_="item"):
    print([tag.contents[0].strip(), tag.find_next("a").text])

Use a DOM parser instead of a regular expression. The find_all() loop above prints:

['Alpha-tocopherol', 'Vitamin E']
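
The question asks for 'vitamine' rather than the link text 'Vitamin E'. If that value is meant to come from the file name in the link's href (an assumption, since the question does not say where it comes from), a sketch along these lines may get closer to the requested output. The names rows and slug are illustrative.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

from bs4 import BeautifulSoup

html = """
<li class="item">
Alpha-tocopherol
<em>see</em>
<a href="https://medlineplus.gov/vitamine.html">Vitamin E</a>
</li>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for tag in soup.find_all("li", class_="item"):
    link = tag.find("a")
    if link is None or not link.get("href"):
        continue  # skip items without a usable link

    # First child of the li is the leading text node.
    name = tag.contents[0].strip()

    # Assumption: the wanted value is the href's file name without
    # its extension, e.g. ".../vitamine.html" -> "vitamine".
    slug = PurePosixPath(urlparse(link["href"]).path).stem

    rows.append([name, slug])

print(rows)
```

Taking the value from the href avoids any text matching, but it relies on the site's URL naming staying consistent.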