
Cleaning a string that contains a specific type of junk using Python


I want to extract the meaningful text from this string. How do I clean this particular type of string?

'<div dir="auto">I booked a flight ticket from Trivandrum to Mumbai<div 
dir="auto"><br></div><div dir="auto">Amount debited from my 
account.</div><div dir="auto"><br></div><div dir="auto">But 
ticket not received yet.</div><div dir="auto"><br></div><div 
dir="auto">Please check</div></div>
'
Expected output:

I booked a flight from Trivandrum to Mumbai Amount debited from my account. But 
ticket not received yet. Please check

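import re

def cleanhtml(raw_html):
    # Remove anything between a literal '<' and the next '>' (non-greedy).
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext

# text is assumed to hold the string to be cleaned (shown below);
# the original call passed cleanr, which is not defined outside the function.
cleanhtml(text)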
'&lt;div dir=&quot;auto&quot;&gt;I booked a flight ticket from Trivandrum to 
Mumbai&lt;div dir=&quot;auto&quot;&gt;&lt;br&gt;&lt;/div&gt;&lt;div 
dir=&quot;auto&quot;&gt;Amount debited from&nbsp;my account.&lt;/div&gt;&lt;div 
dir=&quot;auto&quot;&gt;&lt;br&gt;&lt;/div&gt;&lt;div dir=&quot;auto&quot;&gt;But 
ticket not received yet.&lt;/div&gt;&lt;div 
dir=&quot;auto&quot;&gt;&lt;br&gt;&lt;/div&gt;&lt;div dir=&quot;auto&quot;&gt;Please 
check&lt;/div&gt;&lt;/div&gt;&#13;&#10;'

The string is not cleaned. Please suggest a solution.

What have you tried so far, and what went wrong with your attempt? Please include the code.

Does this answer your question?
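One likely cause, judging from the string in the question: the markup is HTML-entity-encoded (`&lt;div ...&gt;` rather than `<div ...>`), so the pattern `<.*?>` never finds a literal `<` and leaves the string unchanged. A minimal sketch of one way around this, assuming the input is always entity-encoded like the sample, is to decode the entities with the standard library's `html.unescape`, then strip the tags and collapse the whitespace. The names `clean_html`, `TAG_RE` and `sample` are illustrative, not from the original post, and `sample` is a shortened version of the string in the question.

```python
import html
import re

# Matches one tag: '<', then any characters except '>', then '>'.
TAG_RE = re.compile(r'<[^>]+>')

def clean_html(raw):
    # Decode entities such as &lt; &gt; &quot; &nbsp; &#13; into real characters.
    text = html.unescape(raw)
    # Replace each tag with a space so the words on either side do not fuse.
    text = TAG_RE.sub(' ', text)
    # Collapse all whitespace runs (including \r, \n and the non-breaking
    # space produced by &nbsp;) into single spaces and trim the ends.
    return ' '.join(text.split())

# Shortened, entity-encoded sample modelled on the string in the question.
sample = ('&lt;div dir=&quot;auto&quot;&gt;I booked a flight ticket&lt;div '
          'dir=&quot;auto&quot;&gt;&lt;br&gt;&lt;/div&gt;&lt;div dir=&quot;auto&quot;&gt;'
          'Amount debited from&nbsp;my account.&lt;/div&gt;&lt;/div&gt;&#13;&#10;')

print(clean_html(sample))
```

A regular expression is adequate for flat markup like this. For arbitrary or malformed HTML, a real parser such as BeautifulSoup's `get_text()` is probably a safer choice.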