Python: Using Beautiful Soup to extract a specific group of links from a blogspot website


python automation


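I want to extract, say, every year 7 link from a school website. In the archive they are easy to find with ctrl+f "year-7". With BeautifulSoup, though, it is not so easy. Maybe I'm doing it wrong.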

import requests
from bs4 import BeautifulSoup

URL = '~school URL~'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

for link in soup.find_all('a'):
    print(link.get('href'))

This gives me every link in the site's archive. Every link that matters to me looks like this:

~school URL~blogspot.com/2020/10/mathematics-activity-year-x.html

I tried storing link.get('href') in a variable and searching it for "year-x", but that didn't work.

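Any ideas on how I can search for it? The blog's own search is terrible. I'm doing this project to help kids in a poor area find their lessons more easily, because everything has been left on the site for the next school year and there are hundreds of untagged links for different school years. I'm trying to at least compile a list of links for each school year to help them.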

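If I understand correctly, you want to extract the year from the link. Try using a regular expression (re.search) to extract the year.

In your case, it would be: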

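import re
from bs4 import BeautifulSoup

# Sample anchor tag standing in for the real page
txt = """<a href="blogspot.com/2020/10/mathematics-activity-year-x.html"></a>"""
soup = BeautifulSoup(txt, "html.parser")

years = []

for tag in soup.find_all("a"):
    link = tag.get("href")
    if link is None:  # skip anchors that have no href attribute
        continue
    match = re.search(r"year-.?", link)
    if match:  # skip links that carry no year marker
        years.append(match.group())

print(years)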
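Since the goal is one list of links per school year, the same regular-expression idea can be extended to group the links instead of only collecting the year strings. The following is a minimal sketch, not the answerer's code: the `example.blogspot.com` URLs, the `year-10` entry, and the names `YEAR_PATTERN` and `group_by_year` are all assumptions made for illustration. It uses only the standard library and expects a list of hrefs such as the one produced by `tag.get("href")`.

```python
import re
from collections import defaultdict

# Hypothetical hrefs, as they might be collected with tag.get("href").
hrefs = [
    "https://example.blogspot.com/2020/10/math-activity-year-2.html",
    "https://example.blogspot.com/2020/10/math-activity-year-7.html",
    "https://example.blogspot.com/2020/10/history-activity-year-7.html",
    "https://example.blogspot.com/2020/10/science-activity-year-10.html",
    "https://example.blogspot.com/p/about.html",  # no year marker
    None,  # find_all("a") can yield tags without an href
]

# Capture the digits after "year-" so multi-digit years stay intact.
YEAR_PATTERN = re.compile(r"year-(\d+)")


def group_by_year(hrefs):
    """Map each year number to the list of hrefs that mention it."""
    groups = defaultdict(list)
    for href in hrefs:
        if not href:  # skip missing or empty hrefs
            continue
        match = YEAR_PATTERN.search(href)
        if match:  # ignore links without a year marker
            groups[int(match.group(1))].append(href)
    return dict(groups)


links_by_year = group_by_year(hrefs)
for year in sorted(links_by_year):
    print(year, links_by_year[year])
```

Capturing `\d+` rather than a single character keeps a link such as `year-10` from being filed under year 1, should two-digit years exist on the real site.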

EDIT: Try using a CSS selector to select all hrefs that end with year-7.html:

...
for tag in soup.select('a[href$="year-7.html"]'):
    print(tag)
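To repeat the selector for year 8, year 9, and so on, the year can be substituted into the selector string. The sketch below assumes years 1 through 9 and uses made-up `example.blogspot.com` markup in place of the real archive page; `links_for_year` is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the blog archive page.
html = """
<a href="https://example.blogspot.com/2020/10/math-activity-year-2.html">Math 2</a>
<a href="https://example.blogspot.com/2020/10/math-activity-year-7.html">Math 7</a>
<a href="https://example.blogspot.com/2020/10/history-activity-year-7.html">History 7</a>
<a href="https://example.blogspot.com/2020/10/english-activity-year-9.html">English 9</a>
<a name="no-href">anchor without href</a>
"""
soup = BeautifulSoup(html, "html.parser")


def links_for_year(soup, year):
    """Return the href of every <a> whose href ends with year-<year>.html."""
    selector = f'a[href$="year-{year}.html"]'
    return [tag["href"] for tag in soup.select(selector)]


# Assumption: the school years run from 1 to 9.
links_by_year = {year: links_for_year(soup, year) for year in range(1, 10)}
print(links_by_year[7])
```

Because `$=` matches the end of the attribute value, anchors without an href and links for other years are left out automatically.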

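So, if I understand correctly, you want to get all the links for year 7?

Yes! And then do the same for year 8, year 9...

Can you edit your question with other HTML links (ones you do and don't want)? Something like: ``~school URL~blogspot.com/2020/10/geography-activity-year-1.html ~school URL~blogspot.com/2020/10/history-activity-year-3.html ~school URL~blogspot.com/2020/10/english-activity-year-8.html`` and so on.

Everything is mixed together.

What are the actual HTML tags?

Sorry if I wasn't clear, I'll try to rephrase: after I print every URL in the site's archive, I have activities for every year, like this: ~URL~.blogspot.com/2020/10/math-activity-year-2.html ~URL~.blogspot.com/2020/10/math-activity-year-9.html ~URL~.blogspot.com/2020/10/math-activity-year-5.html. Multiply that by hundreds... a lot of links. So I want to search for every URL that contains "year-7", so that I can collect all the year-7 links in one place. How do I do that?

I ended up finishing this project with pandas, but thank you for working through it with me!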
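For the goal restated in the comments (collecting every year-7 URL in one place), another option is to pass a compiled regular expression as the `href` filter of `find_all`. This is only a sketch under assumptions: the `example.blogspot.com` markup is invented, and the pattern assumes the links end in `year-7.html` as in the examples quoted above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page would come from requests.get(URL).
html = """
<a href="https://example.blogspot.com/2020/10/math-activity-year-2.html">Math 2</a>
<a href="https://example.blogspot.com/2020/10/math-activity-year-7.html">Math 7</a>
<a href="https://example.blogspot.com/2020/10/math-activity-year-9.html">Math 9</a>
<a href="https://example.blogspot.com/2020/10/geography-activity-year-7.html">Geography 7</a>
<a>anchor without href</a>
"""
soup = BeautifulSoup(html, "html.parser")

# A compiled regex used as an attribute filter is matched against each
# tag's href; tags without an href never match.
year_7_tags = soup.find_all("a", href=re.compile(r"year-7\.html$"))
year_7_links = [tag["href"] for tag in year_7_tags]

print("\n".join(year_7_links))
```

The resulting list could then be written to a file or handed to another tool for each school year.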