Python: parse HTML to remove tags and apply a text transformation to all text after a tag
Tags: python, regex, string, beautifulsoup
I am trying to detect strings that contain an HTML tag with certain words ("shared" or "amenities") inside the tag, and to append the word "shared" to every comma-separated substring that follows that tag. Is there a simple way to do this?
Example input:
</strong></p> swimming pool, barbecue <hr /> <p><strong class="title">SHARED CLUB AMENITIES</strong></p> beach, tennis courts <hr /> <p><strong class="title">
Desired output:
swimming pool, barbecue, beach shared, tennis courts shared
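There are a few libraries you can use for this; the common choices are BeautifulSoup and lxml. I prefer lxml because its XPath queries, much like regex, have implementations in most languages, so it feels like I get more out of the investment.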
from lxml import html
stuff = '</strong></p> swimming pool, barbecue <hr /> <p><strong class="title">SHARED CLUB AMENITIES</strong></p> beach, tennis courts <hr /> <p><strong class="title">'
stuff = html.fromstring(stuff)
ptag = stuff.xpath('//p/*[contains(text(),"AMENITIES") or contains(text(), "SHARED")]//text()')
print(ptag)
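The XPath query above returns only the heading text; the amenities themselves are the loose text between the tags. As one possible way to finish the task, here is a minimal sketch that swaps in the standard library's html.parser instead of lxml, so it has no third-party dependencies. The names AmenityTagger and tag_amenities are hypothetical, and treating every <strong> element as a section heading is an assumption based on the sample input.

```python
from html.parser import HTMLParser


class AmenityTagger(HTMLParser):
    """Collect comma-separated items, suffixing those that follow a matching heading."""

    def __init__(self, keywords=("SHARED", "AMENITIES"), suffix=" shared"):
        super().__init__()
        self.keywords = keywords   # assumption: same words as the XPath query
        self.suffix = suffix
        self.in_heading = False    # currently inside a <strong> element?
        self.shared = False        # did the most recent heading match a keyword?
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            # A new heading starts a new section; reset the flag.
            self.in_heading = True
            self.shared = False

    def handle_endtag(self, tag):
        if tag == "strong":
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading:
            # Heading text decides how the items after it are treated.
            if any(word in data.upper() for word in self.keywords):
                self.shared = True
            return
        for item in data.split(","):
            item = item.strip()
            if item:
                self.items.append(item + self.suffix if self.shared else item)


def tag_amenities(fragment):
    parser = AmenityTagger()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered at the end of the input
    return ", ".join(parser.items)


stuff = ('</strong></p> swimming pool, barbecue <hr /> '
         '<p><strong class="title">SHARED CLUB AMENITIES</strong></p> '
         'beach, tennis courts <hr /> <p><strong class="title">')
print(tag_amenities(stuff))
```

Because the flag is reset at every heading, a later heading without one of the keywords would switch the suffix off again, which the split-on-first-heading approach cannot do.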
I got this working with the code below. Any comments and suggestions are welcome.
import pandas as pd
from bs4 import BeautifulSoup

html_to_parse = '</strong></p> swimming pool, barbecue <hr /> <p><strong class="title">SHARED CLUB AMENITIES</strong></p> beach, tennis courts <hr /> <p><strong class="title">'
# Name the parser explicitly; html.parser does not add a <body>, so search the soup directly
soup = BeautifulSoup(html_to_parse, 'html.parser')
# Text of the first <strong class="title">, here "SHARED CLUB AMENITIES"
shared_indicator = soup.find('strong', 'title').get_text()
# Everything before the heading text is a non-shared amenity
non_shared_amenities = html_to_parse.split(shared_indicator, 1)[0]
non_shared_amenities = (BeautifulSoup(non_shared_amenities, 'html.parser')
                        .get_text()
                        .strip()
                        )
# Everything after the heading text is a shared amenity
shared_amenities = html_to_parse.split(shared_indicator, 1)[1]
shared_amenities_array = (pd.Series(BeautifulSoup(shared_amenities, 'html.parser')
                                    .get_text()
                                    .split(','))
                          .replace("[^A-Za-z0-9'`]+", " ", regex=True)  # collapse stray characters
                          .str.strip()
                          .apply(lambda x: "{}{}".format(x, ' shared'))
                          )
shared_amenities_tagged = ", ".join(shared_amenities_array)
print(non_shared_amenities + ', ' + shared_amenities_tagged)
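pandas is only used above for the per-item clean-up, so it may be heavier than needed. As a sketch of the same step with just the standard library (the helper name tag_items is hypothetical, and the character class is copied from the code above):

```python
import re


def tag_items(text, suffix=" shared"):
    """Split on commas, clean each item, and append the suffix to it."""
    cleaned = (
        # Replace runs of characters outside the allowed set with one space
        re.sub(r"[^A-Za-z0-9'`]+", " ", part).strip()
        for part in text.split(",")
    )
    # Skip items that are empty after cleaning
    return ", ".join(item + suffix for item in cleaned if item)


print(tag_items(" beach, tennis courts  "))
```

The input here is the text that remains after the tags have been stripped, for example the result of get_text() on the part of the string that follows the heading.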
Head start advice: don't parse HTML with regex ;)
@liborm your comment beat me to it.....