Python: sanitising a scraped HTML list

python web-scraping

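I'm trying to extract names from a wiki page. Using BeautifulSoup I can get a very dirty list (it includes many extraneous items) that I want to sanitise, but when I try to "clean" the list, it stays unchanged.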

#1).
#Retrieve the page
import requests
from bs4 import BeautifulSoup
weapons_url = 'https://escapefromtarkov.gamepedia.com/Weapons'
weapons_page = requests.get(weapons_url)
weapons_soup = BeautifulSoup(weapons_page.content, 'html.parser')

#2).
#Attain the data I need, plus a lot of unhelpful data
flithy_scraped_weapon_names = weapons_soup.find_all('td', href="", title="")

#3a).
#Identify keywords that recur in unhelpful/extraneous list items
dirt = ["mm", "predecessor", "File", "image"]
#3b). - Fails
#Remove extraneous data containing above-defined keywords
weapon_names_sanitised = [s for s in flithy_scraped_weapon_names\
    if not any(xs in s for xs in dirt)]

#4).
#Check data
print(weapon_names_sanitised)
#Returns  a list identical to flithy_scraped_weapon_names

The problem is in this part:

weapon_names_sanitised = [s for s in flithy_scraped_weapon_names\
    if not any(xs in s for xs in dirt)]

It should be:

weapon_names_sanitised = [s for s in flithy_scraped_weapon_names\
    if not any(xs in str(s) for xs in dirt)]

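The reason is that flithy_scraped_weapon_names contains Tag objects. These are converted to strings when printed, but they have to be converted to strings explicitly for the xs in str(s) test to work as expected.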
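The difference can be seen in a small offline sketch. The HTML below is made-up sample markup standing in for the wiki table (it is an assumption, not the real page), and the names cells, unchanged and cleaned are illustrative only. It relies on the fact that the in operator on a Tag checks the tag's direct children rather than doing a substring search.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the wiki table; not taken from the real page.
html = """
<table>
  <tr><td><a href="/AK-74N" title="AK-74N">AK-74N</a></td></tr>
  <tr><td><a href="/5.45x39mm" title="5.45x39mm">5.45x39mm</a></td></tr>
  <tr><td><a href="/File:ak.png" title="File:ak.png">icon</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")  # a list of Tag objects, not strings

dirt = ["mm", "predecessor", "File", "image"]

# `xs in tag` compares xs against the tag's direct children,
# so a keyword buried inside the markup never matches.
unchanged = [c for c in cells if not any(xs in c for xs in dirt)]

# `xs in str(tag)` is an ordinary substring search over the tag's markup.
cleaned = [c for c in cells if not any(xs in str(c) for xs in dirt)]

print(len(cells), len(unchanged), len(cleaned))
print([c.get_text(strip=True) for c in cleaned])
```

With this sample, the first filter keeps all three cells and the second keeps only the AK-74N cell. Note that str(c) searches attributes such as href and title as well as the visible text; if only the visible text should be checked, c.get_text() could be used in place of str(c).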
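Although I doubt it's the end result you're after, your immediate problem is if not any(xs in s for xs in dirt)]; it should be if not any(xs in str(s) for xs in dirt)], because s in that expression is not a string but a Tag.

That did successfully convert them to strings. This is actually quite a big step forward for me, so thank you. @Grismar by extending my dirt list I have mostly solved this, so if you post this as an answer I can mark it as correct.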