Extracting the contents of span tags in Python


I want to extract 1 Pack, 4 Pack Gift Set, 1 Pencil with Erasers and 1 Pencil with Lead and Erasers from

[<span class="a-size-base">1 Pack</span>, <span class="a-size-base">4 Pack Gift Set</span>, <span class="a-size-base">1 Pencil with Erasers</span>, <span class="a-size-base">1 Pencil with Lead and Erasers</span>]

in Python.

Thanks

The easiest way is to use Beautiful Soup, a de facto standard Python library for parsing HTML. Install it with pip:
pip install bs4

from bs4 import BeautifulSoup

string = '[<span class="a-size-base">1 Pack</span>, <span class="a-size-base">4 Pack Gift Set</span>, <span class="a-size-base">1 Pencil with Erasers</span>, <span class="a-size-base">1 Pencil with Lead and Erasers</span>]'

# Represent the string as a nested data structure
soup = BeautifulSoup(string, "html.parser")
# Find all <span> tags in the BeautifulSoup object
spans = soup.find_all('span')
# Get the text inside the <span> tags
print([span.text for span in spans])
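
If installing a third-party package is not an option, roughly the same result can be had with html.parser from the standard library. This is only a sketch: the SpanTextCollector class and the extract_span_texts function are hypothetical names made up for this example, not part of any library, and the sample string is shortened.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text inside each top-level <span> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <span> tags currently open
        self.texts = []   # one entry per top-level <span>

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            if self.depth == 0:
                self.texts.append("")  # start a new entry
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text outside any <span> (brackets, commas) is ignored.
        if self.depth > 0:
            self.texts[-1] += data


def extract_span_texts(html):
    parser = SpanTextCollector()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.texts]


string = '[<span class="a-size-base">1 Pack</span>, <span class="a-size-base">4 Pack Gift Set</span>]'
print(extract_span_texts(string))
```

Beautiful Soup is still the more convenient choice for anything beyond flat markup like this; the sketch only shows that no extra dependency is strictly required.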
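Using re (regular expression operations) from the standard library:

import re

tag = '<span class="a-size-base">1 Pack</span>, <span class="a-size-base">4 Pack Gift Set</span>, <span class="a-size-base">1 Pencil with Erasers</span>, <span class="a-size-base">1 Pencil with Lead and Erasers</span>'

# Match any HTML tag, non-greedily
cleanr = re.compile('<.*?>')
# Remove every tag, leaving only the text between them
cleantext = re.sub(cleanr, '', tag)
print(cleantext)

The output is: 1 Pack, 4 Pack Gift Set, 1 Pencil with Erasers, 1 Pencil with Lead and Erasers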
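
Stripping the tags this way gives one comma-separated string. If a list of items is wanted instead, re.findall with a capture group may be a closer fit. A minimal sketch, assuming the spans are never nested; SPAN_TEXT is a name chosen for this example and the sample string is shortened:

```python
import re

tag = '<span class="a-size-base">1 Pack</span>, <span class="a-size-base">4 Pack Gift Set</span>'

# Non-greedy capture of whatever sits between a <span ...> and the next </span>.
# Assumption: spans are not nested inside each other.
SPAN_TEXT = re.compile(r'<span\b[^>]*>(.*?)</span>', re.DOTALL)

items = SPAN_TEXT.findall(tag)
print(items)
```

For real-world HTML a parser is usually more robust than a regular expression, so treat this as a shortcut for simple, well-formed input only.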

Could you elaborate on your question and your data structure? Assuming your data is a list of strings:

import re
l = ['<span class="a-size-base">1 Pack</span>', '<span class="a-size-base">4 Pack Gift Set</span>', '<span class="a-size-base">1 Pencil with Erasers</span>', '<span class="a-size-base">1 Pencil with Lead and Erasers</span>']
print([re.match(r'<([a-zA-Z]+).+>(.+)</\1>', i).group(2) for i in l])
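
One caveat with the list comprehension above: re.match returns None for a string that does not look like a tag, and calling .group(2) on None raises AttributeError. A hedged sketch that skips such entries follows. The inner_text helper and TAG_TEXT name are hypothetical, the 'not a tag' entry is made-up sample data, and the pattern is slightly tightened ([^>]* instead of .+ for the attributes), which is an assumption rather than the answer's exact regex.

```python
import re

# Opening tag, optional attributes, the text, then the matching closing tag.
TAG_TEXT = re.compile(r'<([a-zA-Z]+)[^>]*>(.+)</\1>')


def inner_text(fragment):
    """Return the text inside a single tag, or None if the fragment does not match."""
    match = TAG_TEXT.match(fragment)
    return match.group(2) if match else None


l = ['<span class="a-size-base">1 Pack</span>', 'not a tag', '<span class="a-size-base">4 Pack Gift Set</span>']
texts = [text for text in map(inner_text, l) if text is not None]
print(texts)
```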

If you want an answer, you have to provide a question. :) Can you show what you have tried?
import re

tag = '<span class="a-size-base">1 Pack</span>, <span class="a-size-base">4 Pack Gift Set</span>, <span class="a-size-base">1 Pencil with Erasers</span>, <span class="a-size-base">1 Pencil with Lead and Erasers</span>'

cleanr = re.compile('<.*?>')
cleantext = re.sub(cleanr, '', tag)
print(cleantext)