
Python: parse HTML to remove tags and apply a text transformation to all text after a tag

Tags: python, regex, string, beautifulsoup

I am trying to detect strings that contain an HTML tag whose text includes certain words ("SHARED" or "AMENITIES"), and to append the word "shared" to every comma-separated substring that follows that tag. Is there a simple way to accomplish this?

Example input:

</strong></p> swimming pool, barbecue <hr /> <p><strong class="title">SHARED CLUB AMENITIES</strong></p> beach, tennis courts <hr /> <p><strong class="title">
Desired output:

swimming pool, barbecue, beach shared, tennis courts shared

There are a few different libraries you can use for this; the common choices are BeautifulSoup and lxml. I prefer lxml since, much like regex, most languages have a similar implementation, so it feels like I get more out of the investment.

from lxml import html

stuff = '</strong></p> swimming pool, barbecue <hr /> <p><strong class="title">SHARED CLUB AMENITIES</strong></p> beach, tennis courts <hr /> <p><strong class="title">'
stuff = html.fromstring(stuff)
# Grab the text of any child of a <p> whose text mentions AMENITIES or SHARED
ptag = stuff.xpath('//p/*[contains(text(),"AMENITIES") or contains(text(), "SHARED")]//text()')
print(ptag)
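
To carry the idea through to the desired output, one possible extension is to locate the heading with the same XPath and then read the surrounding text straight off the tree. This is only a sketch: the variable names are mine, and it assumes lxml parses this particular fragment so that the text before the heading lands on the root element's .text and the text after it on the .tail of the heading's parent <p>.

from lxml import html

fragment = '</strong></p> swimming pool, barbecue <hr /> <p><strong class="title">SHARED CLUB AMENITIES</strong></p> beach, tennis courts <hr /> <p><strong class="title">'
tree = html.fromstring(fragment)

# The <strong> heading that marks the start of the shared section
heading = tree.xpath('//p/strong[contains(text(), "SHARED") or contains(text(), "AMENITIES")]')[0]

# Text that follows the heading's <p> sits on that element's .tail
shared = [item.strip() + ' shared'
          for item in (heading.getparent().tail or '').split(',')
          if item.strip()]

# Text before the first child element of the fragment sits on the root's .text
non_shared = [item.strip()
              for item in (tree.text or '').split(',')
              if item.strip()]

print(', '.join(non_shared + shared))
# expected: swimming pool, barbecue, beach shared, tennis courts shared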
I got this working with the code below. Any comments and suggestions are welcome.

import pandas as pd
from bs4 import BeautifulSoup

html_to_parse = '</strong></p> swimming pool, barbecue <hr /> <p><strong class="title">SHARED CLUB AMENITIES</strong></p> beach, tennis courts <hr /> <p><strong class="title">'

soup = BeautifulSoup(html_to_parse, 'html.parser')

# The <strong class="title"> heading text ("SHARED CLUB AMENITIES") marks
# where the shared section starts.
shared_indicator = soup.find('strong', 'title').get_text()

# Everything before the heading: strip the markup and keep the plain text.
non_shared_amenities = html_to_parse.split(shared_indicator, 1)[0]
non_shared_amenities = (BeautifulSoup(non_shared_amenities, 'html.parser')
                        .get_text()
                        .strip())

# Everything after the heading: strip the markup, split on commas,
# clean each item and append " shared".
shared_amenities = html_to_parse.split(shared_indicator, 1)[1]
shared_amenities_array = (pd.Series(BeautifulSoup(shared_amenities, 'html.parser')
                                    .get_text()
                                    .split(','))
                          .replace("[^A-Za-z0-9'`]+", " ", regex=True)
                          .str.strip()
                          .apply(lambda x: "{}{}".format(x, ' shared')))

shared_amenities_tagged = ", ".join(shared_amenities_array)

print(non_shared_amenities + ', ' + shared_amenities_tagged)
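
With the print added at the end, this should output the string asked for in the question: swimming pool, barbecue, beach shared, tennis courts shared.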
Head start advice - don't parse HTML with regex ;) @liborm your comment beats mine...