Cleaning a string that contains a specific type of junk using Python
I want to extract the meaningful text from this string. How can I clean this particular type of string?
'<div dir="auto">I booked a flight ticket from Trivandrum to Mumbai<div
dir="auto"><br></div><div dir="auto">Amount debited from my
account.</div><div dir="auto"><br></div><div dir="auto">But
ticket not received yet.</div><div dir="auto"><br></div><div
dir="auto">Please check</div></div> '
Expected output:
I booked a flight from Trivandrum to Mumbai Amount debited from my account. But
ticket not received yet. Please check
import re

def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext

cleanhtml(cleanr)
'<div dir="auto">I booked a flight ticket from Trivandrum to
Mumbai<div dir="auto"><br></div><div
dir="auto">Amount debited from my account.</div><div
dir="auto"><br></div><div dir="auto">But
ticket not received yet.</div><div
dir="auto"><br></div><div dir="auto">Please
check</div></div> '
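The string is not getting cleaned. Please suggest a solution.
What have you tried so far, and what went wrong with your attempt? Please include your code.
Does this answer your question?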
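One possible fix, offered as a minimal sketch rather than a definitive answer. Two things look off in the attempt: the call passes cleanr, which is a name local to cleanhtml, instead of the HTML string itself, and replacing each tag with an empty string glues neighbouring words together (for example "Mumbai" and "Amount"). The version below assumes the input is a simple fragment like the one in the question; the names TAG_RE, clean_html and raw are made up for illustration. It replaces every tag with a space and then collapses the whitespace.

```python
import re

# Matches one tag; [^>]* also spans line breaks, so a tag split across lines still matches.
TAG_RE = re.compile(r"<[^>]*>")

def clean_html(raw_html):
    # Replace each tag with a space so words on either side of a tag stay separated.
    text = TAG_RE.sub(" ", raw_html)
    # Collapse runs of whitespace (spaces and newlines) into single spaces.
    return " ".join(text.split())

# The string from the question, assigned to a variable so it can be passed in.
raw = ('<div dir="auto">I booked a flight ticket from Trivandrum to Mumbai<div\n'
       'dir="auto"><br></div><div dir="auto">Amount debited from my\n'
       'account.</div><div dir="auto"><br></div><div dir="auto">But\n'
       'ticket not received yet.</div><div dir="auto"><br></div><div\n'
       'dir="auto">Please check</div></div> ')

print(clean_html(raw))
```

A regular expression is fine for flat fragments like this one. For arbitrary HTML (comments, scripts, a ">" inside an attribute value) a real parser such as the standard library's html.parser module is likely the safer choice.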