Python: parse HTML to remove tags and apply a text transformation to all text after a tag

python regex string beautifulsoup

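I'm trying to detect strings that contain HTML tags in which a tag holds certain words, such as "SHARED" or "AMENITIES", and then append the word "shared" to every comma-separated substring that follows that tag. Is there a simple way to do this?

Example input: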

 swimming pool, barbecue <hr /> SHARED CLUB AMENITIES beach, tennis courts <hr /> 

Expected output:

swimming pool, barbecue, beach shared, tennis courts shared

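There are a few libraries you can use for this; the common choices are BeautifulSoup and lxml. I prefer lxml because, much like regex, it has implementations in most languages, so I feel I get more out of the time invested in learning it.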

    from lxml import html

    stuff = ' swimming pool, barbecue <hr /> SHARED CLUB AMENITIES beach, tennis courts <hr /> '
    stuff = html.fromstring(stuff)
    # Text nodes under any child element of a <p> whose text contains "AMENITIES" or "SHARED"
    ptag = stuff.xpath('//p/*[contains(text(),"AMENITIES") or contains(text(), "SHARED")]//text()')
    print(ptag)
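The XPath above only locates the heading; it does not yet produce the requested output. A minimal sketch of the remaining steps with lxml follows. The markup in `SAMPLE` is an assumption (the tags of the example input were not preserved here, so the `<p>` and `<strong class="title">` structure is a guess based on the selectors used in the answers), and `split_items` and `tag_shared` are hypothetical helper names.

```python
from lxml import html

# ASSUMPTION: the original markup is unknown; this structure is a guess.
SAMPLE = (
    '<p>swimming pool, barbecue</p><hr />'
    '<p><strong class="title">SHARED CLUB AMENITIES</strong></p>'
    '<p>beach, tennis courts</p><hr />'
)
KEYWORDS = ("SHARED", "AMENITIES")


def split_items(text):
    """Split a comma-separated string into stripped, non-empty items."""
    return [item.strip() for item in text.split(",") if item.strip()]


def tag_shared(markup):
    """Append ' shared' to every item that follows the heading paragraph."""
    # Wrap the fragment so there is always a single root element.
    root = html.fromstring(f"<div>{markup}</div>")
    items = []
    shared = False
    for p in root.iter("p"):
        text = p.text_content().strip()
        # A <p> with a <strong> child containing a keyword is the heading:
        # skip it and mark everything after it as shared.
        if p.find("strong") is not None and any(k in text for k in KEYWORDS):
            shared = True
            continue
        for item in split_items(text):
            items.append(f"{item} shared" if shared else item)
    return ", ".join(items)


print(tag_shared(SAMPLE))
```

Walking the paragraphs in document order keeps the logic independent of string offsets, so it should still work if the heading text changes, as long as the assumed structure holds.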

I got this working with the code below. Any comments and suggestions are welcome.

    import pandas as pd
    from bs4 import BeautifulSoup

    html_to_parse = ' swimming pool, barbecue <hr /> SHARED CLUB AMENITIES beach, tennis courts <hr /> '
    soup = BeautifulSoup(html_to_parse, 'html.parser')

    # Text of the <strong class="title"> element that introduces the shared section
    shared_indicator = soup.find('strong', 'title').get_text()

    # Everything before the indicator is left untouched
    non_shared_amenities = html_to_parse.split(shared_indicator, 1)[0]
    non_shared_amenities = (BeautifulSoup(non_shared_amenities, 'html.parser')
                            .get_text()
                            .strip()
                            )

    # Everything after the indicator gets ' shared' appended to each item
    shared_amenities = html_to_parse.split(shared_indicator, 1)[1]
    shared_amenities_array = (pd.Series(BeautifulSoup(shared_amenities, 'html.parser')
                                        .get_text()
                                        .split(','))
                              .replace("[^A-Za-z0-9'`]+", " ", regex=True)
                              .str.strip()
                              .apply(lambda x: "{}{}".format(x, ' shared'))
                              )
    shared_amenities_tagged = ", ".join(shared_amenities_array)
    print(non_shared_amenities + ', ' + shared_amenities_tagged)
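As one possible suggestion, the same split-on-the-heading idea can be written without pandas. This is only a sketch: the markup in `SAMPLE_HTML` is an assumption (the example input's tags were not preserved, so the `<strong class="title">` heading is inferred from the `find('strong', 'title')` call), and `text_items` and `append_shared` are hypothetical helper names. It also skips the character-cleaning regex from the answer above.

```python
from bs4 import BeautifulSoup

# ASSUMPTION: the original markup is unknown; this structure is a guess.
SAMPLE_HTML = (
    '<p>swimming pool, barbecue</p><hr />'
    '<p><strong class="title">SHARED CLUB AMENITIES</strong></p>'
    '<p>beach, tennis courts</p><hr />'
)


def text_items(fragment):
    """Strip tags from an HTML fragment and return its comma-separated items."""
    # Joining text nodes with "," keeps adjacent paragraphs from fusing.
    text = BeautifulSoup(fragment, "html.parser").get_text(",")
    return [part.strip() for part in text.split(",") if part.strip()]


def append_shared(markup, marker_tag="strong", marker_class="title"):
    """Append ' shared' to every item that follows the marker element."""
    soup = BeautifulSoup(markup, "html.parser")
    marker = soup.find(marker_tag, class_=marker_class)
    if marker is None or not marker.get_text():
        # No heading found: nothing is shared.
        return ", ".join(text_items(markup))
    # Split the raw markup around the heading text, as in the answer above.
    before, after = markup.split(marker.get_text(), 1)
    items = text_items(before) + [f"{item} shared" for item in text_items(after)]
    return ", ".join(items)


print(append_shared(SAMPLE_HTML))
```

A plain list comprehension does the per-item work here, which avoids pulling in pandas for a single string transformation.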

Head start advice: don't parse HTML with regex ;)
@liborm your comment beats mine...