How do I remove a word from a Python list when it is attached to another word?

I am trying to remove HTML tags from a string, so I tried the following:

def cleaner(raw):
    stopwords = ['<ul>', '</ul>', '<li>', '</li>']
    querywords = raw.split()

    resultwords  = [word for word in querywords if word.lower() not in stopwords]
    result = ' '.join(resultwords)

    return result
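The problem with this code is that it does not remove tags that are attached to a word, for example a tag fused to the word "Drive". Is there another way to handle this case?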

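Like this, applying each replacement to every word:

for a in stopwords:
    querywords = [word.replace(a, '') for word in querywords]
In full:

def cleaner(raw):
    stopwords = ['<ul>', '</ul>', '<li>', '</li>']
    querywords = raw.split()
    # Apply each replacement to every word in turn; a single nested
    # comprehension would emit one copy of each word per tag instead.
    for a in stopwords:
        querywords = [word.replace(a, '') for word in querywords]
    # Drop words that consisted of nothing but a tag
    resultwords = [word for word in querywords if word]
    result = ' '.join(resultwords)
    return result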
Try this:

import re

def cleaner(raw):
    stopwords = ['<ul>', '</ul>', '<li>', '</li>']
    replace_ = re.compile("|".join(stopwords))
    
    return " ".join([replace_.sub("", word) for word in raw.split()])

print(cleaner("<ul>test</ul> <li>Drive<li>")) # test Drive
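
The question compares `word.lower()` against the list, so matching tags regardless of case may be wanted. Below is a hedged variant of the function above under that assumption: `re.escape` makes sure each tag is matched literally, and `re.IGNORECASE` also catches `<UL>` or `<Li>`. The name `cleaner_ci` is made up for this sketch and does not appear in the original answer.

```python
import re

def cleaner_ci(raw):
    stopwords = ['<ul>', '</ul>', '<li>', '</li>']
    # Escape each tag so it is matched literally, then join them as alternatives.
    pattern = re.compile("|".join(re.escape(tag) for tag in stopwords), re.IGNORECASE)
    # split() followed by join() drops words that became empty and
    # collapses runs of whitespace into single spaces.
    return " ".join(pattern.sub("", raw).split())

print(cleaner_ci("<UL>test</ul> <li>Drive<LI> <li>"))  # test Drive
```

Substituting on the whole string before splitting means a word that was only a tag leaves no empty entry behind.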

You can try using re to remove every HTML tag.

Output:

"test Drive title body of text"

This removes all tags:

import re

query='<HTML><ul>list</ul>more text<li>list item</li>more html text</html>'

def cleaner(raw):
    # Remove anything that looks like a tag; use ' ' as the replacement if you need spaces
    result = re.sub(r'<.*?>', '', raw)
    return result
    # Or, to collapse runs of spaces when needed, return this instead:
    # return re.sub(r' +', ' ', result)
    
print(cleaner(query))
> listmore textlist itemmore html text
This removes only the tags in the list:

query='<HTML><ul>list</ul>more text<li>list item</li>more html text</html>'

def cleaner(raw):
    stopwords = ['<ul>', '</ul>', '<li>', '</li>']
    result = raw
    for stopword in stopwords:
        result = result.replace(stopword, '')
    return result
    
print(cleaner(query))
> <HTML>listmore textlist itemmore html text</html>

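If your problem is that the query words are prefixed with an HTML tag, I think you can iterate over the query words and check whether each word starts with any of the stop words.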

temp = []
for each_word in querywords:
    # Keep the word once, and only if it starts with none of the stop words
    if not any(each_word.startswith(each_stop) for each_stop in stopwords):
        temp.append(each_word)
This may not be efficient. It can be replaced with a list comprehension.
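
A minimal sketch of the list-comprehension form mentioned above, assuming `stopwords` is the list from the question. The sample input string is made up for illustration. `str.startswith` accepts a tuple of prefixes, so no inner loop is needed. Note that this filter drops the whole word, it does not strip the tag from it.

```python
stopwords = ['<ul>', '</ul>', '<li>', '</li>']
querywords = "<ul>test</ul> plain <li>Drive words".split()  # illustrative input

# Keep only the words that start with none of the stop words.
temp = [word for word in querywords if not word.startswith(tuple(stopwords))]

print(temp)  # ['plain', 'words']
```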


-Shiva

Here is a simple example. Note: this requires:

This might help:

def cleaner(raw):
    stopwords = ['<ul>', '</ul>', '<li>', '</li>']
    result = ""
    for word in raw.split():
        for tag in stopwords:
            if tag in word:
                word = word.replace(tag, "")
        
        if(word != ""):
            result += word +" "

    return result.rstrip()

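Can you give an example of the query words? There may be a solution using a regular expression or str.replace.
Have you considered an HTML parser? A simple one does this job in just a few lines of code and spares you the trouble of searching strings for tags, with or without regex.
Or xml.etree.ElementTree. It is part of the standard library, so you do not need to install any extra packages.
I second @0buz's comment: using regex to process HTML is a no-go unless you enjoy headaches.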
from bs4 import BeautifulSoup

my_html="""<div> This is my list:
<ul>
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ul>
</div>"""

soup = BeautifulSoup(my_html, 'html.parser')
print(soup.text)

Output (indentation and blank lines omitted):

This is my list:
Coffee
Tea
Milk
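
The comments above also mention xml.etree.ElementTree from the standard library. Here is a minimal sketch of that route, reusing the same sample markup. It assumes the input is well-formed XML; real-world HTML often is not, in which case `ET.fromstring` raises `ParseError`. The helper name `strip_tags` is hypothetical.

```python
import xml.etree.ElementTree as ET

my_html = """<div> This is my list:
<ul>
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ul>
</div>"""

def strip_tags(markup):
    # Hypothetical helper. ET.fromstring raises ET.ParseError when the
    # markup is not well-formed XML (for example an unclosed <li>).
    root = ET.fromstring(markup)
    # itertext() yields the text of every element in document order.
    text = "".join(root.itertext())
    # Collapse all runs of whitespace into single spaces.
    return " ".join(text.split())

print(strip_tags(my_html))  # This is my list: Coffee Tea Milk
```

For markup that is not well-formed, such as the unclosed `<li>` from the question, a forgiving HTML parser like the BeautifulSoup example is likely the safer choice.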