Python: Why does index matching fail on a list of links?


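I'm writing a web scraper that returns a list of links to the articles on a page. Each link has a date in it (for example, one from August 8, 2019). I want to remove from the list the links that don't meet a given date parameter. My matching function doesn't work, and I don't know why.

I can loop over the list of links and print link[15:21] for each one, and that returns the correct values. So I don't think it's an indexing problem. I think it's a matching problem.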

from urllib.request import urlopen
from bs4 import BeautifulSoup 

# Create list for links

links = []

# pull the HTML

html = urlopen("https://ria.ru/search/?query=mcdonalds")
bsObj = BeautifulSoup(html)

# Collect all article links, which all have a 
# data-url attribute and are in span tags and add them to a list

for link in bsObj.findAll("span"):
    if 'data-url' in link.attrs:
        links.append(link.attrs['data-url'])

# Remove links that do not meet data parameters from the list
# This is the problematic code. 

for link in links:
    if (link[15:21]) != "201905":
        links.remove(link)

print(links)

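The list that comes back is shorter, but it still contains links that don't meet the date criterion.

For example:

['', '', '', '']

Thanks for your help.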

Try a list comprehension instead:

links = [link for link in links if link[15:21] == "201905"]
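
To try the comprehension without hitting the site, here is a minimal offline sketch. It assumes every link starts with "https://ria.ru/" followed by an eight-digit date; the two 2019 URLs are made up for illustration.

```python
# Sample data: the two 2019 URLs are hypothetical placeholders.
links = [
    "https://ria.ru/20190508/1000000001.html",  # hypothetical
    "https://ria.ru/20181115/1532878009.html",
    "https://ria.ru/20180927/1529462687.html",
    "https://ria.ru/20190521/1000000002.html",  # hypothetical
]

# "https://ria.ru/" is 15 characters long, so link[15:21] is the YYYYMM part.
# The comprehension builds a new list, so nothing is removed mid-iteration.
links = [link for link in links if link[15:21] == "201905"]

print(links)
```

Because a new list is built and only then bound to the name links, the loop never sees a list that is changing underneath it.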

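Basically, you're trying to remove items from the list while you're also iterating over it. After each removal the remaining items shift one position to the left, but the loop still moves on to the next position, so the item right after every removed one is never checked. That can leave you removing things you didn't mean to remove and failing to remove things you did.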
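
The skipping is easy to reproduce on a tiny list. This is only a sketch with made-up URLs (the paths a.html, b.html and c.html are hypothetical); the loop is the same shape as the one in the question, with an extra list that records which links the loop actually looked at.

```python
# Hypothetical URLs in the same layout as the site's links.
first = "https://ria.ru/20181115/a.html"   # wrong month, gets removed
second = "https://ria.ru/20180927/b.html"  # wrong month, but never checked
third = "https://ria.ru/20190508/c.html"   # right month, kept

links = [first, second, third]
checked = []  # every link the loop body actually saw

for link in links:
    checked.append(link)
    if link[15:21] != "201905":
        # Removing here shifts the later items left by one position,
        # so the loop's next step jumps over the item that moved up.
        links.remove(link)

print(checked)
print(links)
```

The second link survives even though its date is wrong, because the loop never visits it.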

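So here we simply enumerate the list and store the indices to delete once the iteration has finished. Then we delete those indices in reverse order, because deleting in forward order would change the index of everything after each deleted item. By deleting backwards, we don't affect the other indices that are still to be deleted.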

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Create list for links

links = []

# pull the HTML

html = urlopen("https://ria.ru/search/?query=mcdonalds")
bsObj = BeautifulSoup(html, "html.parser")

# Collect all article links, which all have a
# data-url attribute and are in span tags and add them to a list

for link in bsObj.findAll("span"):
    if 'data-url' in link.attrs:
        links.append(link.attrs['data-url'])

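# Remove links that do not meet the date parameter from the list.
# Record the indices to delete first, then delete after the loop.

remove = []
for index, link in enumerate(links):
    if link[15:21] != "201905":
        remove.append(index)

# Delete from the highest index down so that earlier deletions
# do not shift the positions of the ones still pending.
for index in reversed(remove):
    del links[index]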

print(links)

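Output as originally posted (these two URLs carry 2018 dates, so they do not match the "201905" filter shown above):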

['https://ria.ru/20181115/1532878009.html', 'https://ria.ru/20180927/1529462687.html']
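
Why the deletion order matters can be checked offline. The sketch below collects the indices once and then deletes them from two copies of the same list, once backwards and once forwards. The helper name indices_to_remove and the two 2019 URLs are hypothetical, introduced only for this illustration.

```python
def indices_to_remove(links, yyyymm):
    """Return the positions of links whose YYYYMM part differs from yyyymm."""
    return [i for i, link in enumerate(links) if link[15:21] != yyyymm]

sample = [
    "https://ria.ru/20181115/1532878009.html",
    "https://ria.ru/20190508/1000000001.html",  # hypothetical
    "https://ria.ru/20180927/1529462687.html",
    "https://ria.ru/20190521/1000000002.html",  # hypothetical
]

remove = indices_to_remove(sample, "201905")

# Backwards: each deletion only shifts items that are already dealt with.
backward = list(sample)
for index in reversed(remove):
    del backward[index]

# Forwards: the first deletion shifts everything after it, so the
# second stored index now points at a different link.
forward = list(sample)
for index in remove:
    del forward[index]

print(backward)
print(forward)
```

Deleting backwards leaves only the two 2019 links. Deleting forwards keeps one 2018 link and throws away a 2019 one.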

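Alternatively, you could just build a new list containing only the links you want, rather than deleting links from the existing list. But hopefully this helps you understand why it was happening.

Or do the filtering in the first loop.

As @johnnymapp commented, your problem is that you are removing items from the list while iterating over it, which is not safe.

Thank you so much!

No problem, glad I could help. If the answer resolved your question, please don't forget to select it as the accepted answer.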
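
The suggestion to filter in the first loop could look roughly like this. To keep the sketch runnable offline, the list span_attrs is a hypothetical stand-in for the attrs dict of each span tag; in the real scraper the loop would go over bsObj.findAll("span") and read link.attrs instead. The 2019 URL is made up.

```python
# Hypothetical stand-in for the attrs of each <span> on the page.
span_attrs = [
    {"data-url": "https://ria.ru/20190508/1000000001.html"},  # hypothetical
    {"class": ["share"]},  # a span without a data-url attribute
    {"data-url": "https://ria.ru/20181115/1532878009.html"},
]

links = []
for attrs in span_attrs:
    url = attrs.get("data-url")
    # Keep the link only if it exists and its YYYYMM part matches,
    # so no later removal pass over the list is needed.
    if url is not None and url[15:21] == "201905":
        links.append(url)

print(links)
```

Since unwanted links never enter the list, there is nothing to delete afterwards and the iteration problem cannot arise.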