Python BeautifulSoup: can't filter list results by punctuation
Tags: python, beautifulsoup, python-3.7

I am trying to exclude question marks and colons from my results in Python, but they keep appearing in the final output. The results are filtered for None, but not for punctuation. Any help would be greatly appreciated.
import requests
from bs4 import BeautifulSoup

# Scrape BBC for headline text
url = 'https://www.bbc.co.uk/news'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
tags = soup.find_all(class_='gs-c-promo-heading__title')
#print(headlines)
headlines = list()
for i in tags:
    if i.string is not None:
        if i.string != ":":
            if i.string != "?":
                headlines.append(i.string)
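You are comparing the whole string against a single character, but what you want to know is whether the string contains that character. If that is really what you want, just use not in: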
if ':' not in i.string:
if '?' not in i.string:
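To see the difference on plain strings, here is a minimal sketch. The sample headlines are made up for illustration and are not taken from the BBC page.

```python
# Hypothetical sample headlines (not real scraped data).
samples = [
    "Budget 2020: what it means",
    "Is the economy slowing?",
    "Storm hits coast",
    ":",
]

# Equality test: only drops a string that is exactly ":" or "?".
kept_by_equality = [s for s in samples if s != ":" and s != "?"]

# Membership test: drops any string that contains ":" or "?".
kept_by_membership = [s for s in samples if ":" not in s and "?" not in s]

print(kept_by_equality)
print(kept_by_membership)
```

With the equality test, only the last sample is removed, which matches the behaviour described in the question.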
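The problem with that approach is that you skip those results entirely. I think it is better to clean each result inside the loop by replacing the characters: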
for i in tags:
    if i.string is not None:
        print(i.string.replace(':', '').replace('?', ''))
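The trade-off between skipping and cleaning can be sketched on plain strings. The sample data below is invented, and the None entry stands in for a tag whose .string is None.

```python
# Hypothetical sample values; None mimics a tag with no single string.
samples = [
    "Budget 2020: what it means",
    "Is the economy slowing?",
    "Storm hits coast",
    None,
]

# Skipping: any headline containing ":" or "?" is dropped altogether.
skipped = [
    s for s in samples
    if s is not None and ":" not in s and "?" not in s
]

# Cleaning: every headline is kept, with ":" and "?" removed.
cleaned = [
    s.replace(":", "").replace("?", "")
    for s in samples
    if s is not None
]

print(skipped)
print(cleaned)
```

Skipping leaves one headline here, while cleaning keeps all three.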
If you want to strip more characters, a regular expression is probably a better way to do it.
Example
import requests
from bs4 import BeautifulSoup

url = 'https://www.bbc.co.uk/news'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
tags = soup.find_all(class_='gs-c-promo-heading__title')

headlines = list()
for i in tags:
    # Keep only headlines that have text and contain neither ":" nor "?"
    if i.string is not None:
        if ':' not in i.string:
            if '?' not in i.string:
                headlines.append(i.string)

print(headlines)
Here is a regex formatting function that removes ? and : from a string:
import re

def hd_format(text):
    return re.sub(r"\?|\:", "", text)
You can add any other characters you want to exclude: separate them with | and escape special characters with \.
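As a rough sketch of extending the pattern, the alternation can be built from a list of characters so the escaping is handled by re.escape. The helper names build_pattern and strip_chars and the sample string are hypothetical and not part of the answer's code.

```python
import re

def build_pattern(chars):
    # Hypothetical helper: join the characters with "|" and let
    # re.escape add a backslash wherever a character is special.
    return re.compile("|".join(re.escape(c) for c in chars))

def strip_chars(text, pattern):
    # Hypothetical helper: remove every match of the pattern.
    return pattern.sub("", text)

pattern = build_pattern(["?", ":", "!"])
result = strip_chars("Wait: is this real?!", pattern)
print(result)  # Wait is this real
```

For single characters, a character class such as r"[?:!]" should behave the same way and is a little shorter to write.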
Full code
import re

import requests
from bs4 import BeautifulSoup

# Scrape BBC for headline text
url = 'https://www.bbc.co.uk/news'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
tags = soup.find_all(class_='gs-c-promo-heading__title')

headlines = []

def hd_format(text):
    # Remove every "?" and ":" from the text
    return re.sub(r"\?|\:", "", text)

for i in tags:
    if i.string is not None:
        headlines.append(hd_format(i.string))
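Unfortunately, this still lets through the punctuation I need to remove. Maybe it works with an older version of BeautifulSoup. On line 4, my version does not recognize the "bs" abbreviation. I really appreciate your help, though.

If you only want to remove ? and :, my code should work. "bs" is just how I imported it. I will update the code to match your import.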