How do I remove a word from a Python list when it is attached to another word?
I am trying to remove HTML tags from a string, so I tried the following:
def cleaner(raw):
    stopwords = ['<ul>', '</ul>', '<li>', '</li>']
    querywords = raw.split()
    resultwords = [word for word in querywords if word.lower() not in stopwords]
    result = ' '.join(resultwords)
    return result
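The problem with this code is that it cannot remove a tag that is attached to a word, as in: ..Drive. Is there any other way to handle this case?
Like this: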
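for a in stopwords:
    querywords = [word.replace(a, '') for word in querywords]
resultwords = [word for word in querywords if word]
In full:
def cleaner(raw):
    stopwords = ['<ul>', '</ul>', '<li>', '</li>']
    querywords = raw.split()
    # Strip one tag at a time so that every word has all tags removed.
    # A single comprehension with two `for` clauses would emit each word
    # once per tag, with only that one tag stripped.
    for a in stopwords:
        querywords = [word.replace(a, '') for word in querywords]
    # Drop words that consisted of nothing but tags.
    resultwords = [word for word in querywords if word]
    result = ' '.join(resultwords)
    return result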
Try this:
import re

def cleaner(raw):
    stopwords = ['<ul>', '</ul>', '<li>', '</li>']
    # One alternation pattern that matches any of the listed tags.
    replace_ = re.compile("|".join(stopwords))
    return " ".join([replace_.sub("", word) for word in raw.split()])

print(cleaner("<ul>test</ul> <li>Drive<li>"))  # test Drive
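The pattern above works because none of the listed tags contains a regex metacharacter. As a hedged variation, the sketch below escapes each tag with re.escape, matches case-insensitively (as the question's word.lower() suggests), and drops words that become empty. The name clean_tags and the default tag tuple are assumptions for illustration, not part of the original answer.

```python
import re

def clean_tags(raw, stopwords=('<ul>', '</ul>', '<li>', '</li>')):
    # re.escape keeps any regex metacharacters in a tag literal.
    pattern = re.compile("|".join(re.escape(tag) for tag in stopwords),
                         re.IGNORECASE)
    cleaned = (pattern.sub("", word) for word in raw.split())
    # Skip words that were nothing but tags, so no double spaces appear.
    return " ".join(word for word in cleaned if word)

print(clean_tags("<ul>test</ul> <li>Drive<li>"))
```

Filtering out empty words matters when a tag stands alone between spaces, as in "<ul> test </ul>".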
You can try using re to remove every HTML tag.
Output: "Test Drive Heading Body of the text"
This removes all tags:
import re

query = '<HTML><ul>list</ul>more text<li>list item</li>more html text</html>'

def cleaner(raw):
    # The non-greedy match removes anything that looks like a tag.
    result = re.sub(r'<.*?>', '', raw)  # or use ' ' if you need spaces
    return result
    # OR, to collapse multiple spaces when needed:
    # return re.sub(r' +', ' ', result)

print(cleaner(query))
> listmore textlist itemmore html text
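The output above fuses neighbouring words ("listmore", "textlist") because each tag is replaced by an empty string. A minimal sketch of the two alternatives mentioned in the comments, combined: substitute a space for each tag, then collapse runs of spaces. The name cleaner_spaced is an assumption for illustration.

```python
import re

def cleaner_spaced(raw):
    # Replace every tag-like span with a space so adjacent words stay apart.
    result = re.sub(r'<.*?>', ' ', raw)
    # Collapse runs of spaces and trim the ends.
    return re.sub(r' +', ' ', result).strip()

query = '<HTML><ul>list</ul>more text<li>list item</li>more html text</html>'
print(cleaner_spaced(query))
```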
This removes only the tags in the list:
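# Same query as in the previous snippet, so the output below matches.
query = '<HTML><ul>list</ul>more text<li>list item</li>more html text</html>'

def cleaner(raw):
    stopwords = ['<ul>', '</ul>', '<li>', '</li>']
    result = raw
    # Remove each listed tag; tags not in the list are left untouched.
    for stopword in stopwords:
        result = result.replace(stopword, '')
    return result

print(cleaner(query))
> <HTML>listmore textlist itemmore html text</html>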
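If your problem is that your query words are prefixed with HTML tags, I think you can iterate over the query words and check whether each word starts with any of the stop words: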
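temp = []
for each_word in querywords:
    # Keep the word only if it starts with none of the stop words.
    # Testing each stop word separately would append the word once per non-match.
    if not any(each_word.startswith(each_stop) for each_stop in stopwords):
        temp.append(each_word)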
This may not be efficient. We could replace it with a list comprehension.
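A hedged sketch of that list comprehension, assuming the intent is to keep only the words that start with none of the stop words. The function name filter_prefixed and the sample words are made up for illustration; str.startswith accepts a tuple of prefixes, which removes the inner loop.

```python
def filter_prefixed(querywords, stopwords):
    # str.startswith accepts a tuple and returns True if any prefix matches.
    prefixes = tuple(stopwords)
    return [word for word in querywords if not word.startswith(prefixes)]

stopwords = ['<ul>', '</ul>', '<li>', '</li>']
print(filter_prefixed(['<ul>test', 'plain', '<li>Drive', 'word'], stopwords))
```

Note that this drops a prefixed word entirely instead of stripping the tag from it, which is what the loop above does as well.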
- Shiva
Here is a simple example. Note: this requires:
This might help:
def cleaner(raw):
    stopwords = ['<ul>', '</ul>', '<li>', '</li>']
    result = ""
    for word in raw.split():
        for tag in stopwords:
            if tag in word:
                word = word.replace(tag, "")
        if word != "":
            result += word + " "
    return result.rstrip()
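Can you give an example of the query words? There may be a solution using regular expressions or str.replace.
Have you considered an HTML parser? A simple one does the job in just a few lines of code and spares you the trouble of searching strings for tags, with or without regex.
Or xml.etree.ElementTree. It is part of the standard library, so you do not need to install any extra packages.
I second @0buz's comment: using regex to process HTML is a no-go unless you want a headache.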
from bs4 import BeautifulSoup
my_html="""<div> This is my list:
<ul>
<li>Coffee</li>
<li>Tea</li>
<li>Milk</li>
</ul>
</div>"""
soup = BeautifulSoup(my_html, 'html.parser')
print(soup.text)
This is my list:
Coffee
Tea
Milk
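For the xml.etree.ElementTree route mentioned in the comments, a minimal sketch might look like the following. The function name strip_tags_xml and the wrapping root element are assumptions for illustration. The parser only accepts well-formed markup, so an unclosed tag such as a stray <li> raises ParseError instead of being stripped.

```python
import xml.etree.ElementTree as ET

def strip_tags_xml(fragment):
    # Wrap the fragment so several top-level elements still form one document.
    root = ET.fromstring("<root>" + fragment + "</root>")
    # itertext() yields every text and tail string in document order.
    text = " ".join(root.itertext())
    # Normalise whitespace left behind by line breaks and removed tags.
    return " ".join(text.split())

print(strip_tags_xml("<ul>test</ul> <li>Drive</li>"))
```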