Python: How to retrieve information from a website using BeautifulSoup?


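I have been given a task where I have to use a crawler to retrieve information from a website. (URL: )

The website lists a number of products. Each product has a link to its own page, and I want to collect all of those links.

For example, one of the products is named KNOTTY STUFF, and I expect to get the href /class/details/c026829364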

import requests
from bs4 import BeautifulSoup


def get_soup(url):
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, features="html.parser")
    return soup

url = "https://www.onepa.gov.sg/cat/adventure"
soup = get_soup(url)
for i in soup.findAll("a", {"target": "_blank"}):
    print(i.get("href"))

The output is:

https://tech.gov.sg/report_vulnerability
https://www.pa.gov.sg/feedback

This does not include what I am looking for: /class/details/c026829364

Any help is much appreciated, thank you.

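This happens because the page uses JavaScript to build the links, so the HTML returned by a normal requests call does not contain them yet.

Instead, you should use selenium with a webdriver so that all the links are loaded before you scrape them.

You can try downloading the ChromeDriver executable. If you put it in the same folder as your script, you can run: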
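# Written against Selenium 3. Selenium 4 replaced executable_path with a
# Service object and find_elements_by_css_selector with
# find_elements(By.CSS_SELECTOR, ...).
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import WebDriverException
import os

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--window-size=1920,1080")
chrome_options.add_argument("--headless")

chrome_driver = os.getcwd() + "\\chromedriver.exe"  # change this path if it is not in the same folder
driver = webdriver.Chrome(options=chrome_options, executable_path=chrome_driver)

url = "https://www.onepa.gov.sg/cat/adventure"
driver.get(url)

try:
    # Wait for the links to be ready
    WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, ".gridTitle > span > a"))
    )
except WebDriverException:
    print("Page offline")  # added because the page is very unstable :(

elements = driver.find_elements_by_css_selector(".gridTitle > span > a")
links = [elem.get_attribute('href') for elem in elements]
print(links)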
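
For readers on a recent Selenium release, here is a minimal sketch of the same idea in the Selenium 4 calling style. The helper names make_driver and collect_links are made up for illustration, and the chromedriver path is whatever applies on your machine. The CSS selector is the one used in the answer above.

```python
def make_driver(chromedriver_path):
    """Start headless Chrome the Selenium 4 way (Service object, no executable_path)."""
    # Imported lazily so collect_links can be exercised without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(service=Service(chromedriver_path), options=options)


def collect_links(driver, css_selector=".gridTitle > span > a"):
    """Return the href of every element matching css_selector."""
    # "css selector" is the string value of selenium's By.CSS_SELECTOR.
    elements = driver.find_elements("css selector", css_selector)
    hrefs = (elem.get_attribute("href") for elem in elements)
    # Skip anchors that have no href attribute.
    return [href for href in hrefs if href]
```

Assuming chromedriver.exe sits next to the script, you would call driver = make_driver("chromedriver.exe"), then driver.get(url), wait for the links as in the answer, call collect_links(driver), and finish with driver.quit().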
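
The website is loaded dynamically, so requests on its own cannot see the links. However, the links are available by sending a POST request to:

https://www.onepa.gov.sg/sitecore/shell/WebService/Card.asmx/GetCategoryCard

Try searching for the links using the built-in re (regex) module.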
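
The regex step might look roughly like the sketch below. The pattern is an assumption modelled on the paths quoted in this thread (/class/details/... and /interest/details/...), the function name extract_detail_links is hypothetical, and the sample text is invented to stand in for the response body, whose real format is not shown here.

```python
import re

# Assumed pattern: detail-page paths of the two kinds quoted in this thread.
DETAIL_LINK = re.compile(r"/(?:class|interest)/details/\w+")


def extract_detail_links(text):
    """Return detail-page paths in order of first appearance, without duplicates."""
    links = []
    for match in DETAIL_LINK.findall(text):
        if match not in links:
            links.append(match)
    return links


# Invented stand-in for the text of the POST response.
sample = "x /class/details/c026829364 y /interest/details/i000027991 z /class/details/c026829364"
print(extract_detail_links(sample))
```

With a real response you would pass response.text from requests.post(...) instead of sample. The form fields that the endpoint expects are not given in this thread, so they are left out here.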
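
Try searching for /class/details/c026829364 in plain_text. I would use .

Hi, I would like to know why I get a different output when running the same code you provided. The output is: ['/class/details/c026831851', '/class/details/c026831729', '/class/details/c026831728', '/class/details/c026831722', '/class/details/c026831721', '/class/details/c026831720', '/class/details/c026831718', '/class/details/c026831719', '/class/details/c0268316110', '/class/details/c02831600', '/class/details/c0268317280', '/class/details/c026832459', '/class/details/c026831585', '/class/details/c026831572', '/class/details/c026832456', '/class/details/c026832453', '/class/details/c026831154'...

Hi, does this require me to install the Chrome driver? The error I get is WebDriverException: Message: 'Desktop\chromedriver.exe' needs to be in PATH. Please see .

Sorry, the link in the answer was broken. You just need to place it in the same folder as your script.

Hi Arthur, I have successfully downloaded and run the program, but I have one more question. Some categories have multiple pages while the URL stays the same, e.g. the martial arts category (). This means I cannot loop over the pages through the URL. Is there a way to load all the elements at that URL so that I can collect all the information I need? Otherwise the code only collects information from the first page.

I don't think the site will let you load all 1k elements on a single page, so I suggest you look into using selenium.
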
['/class/details/c026829364', '/interest/details/i000027991', '/interest/details/i000009714']
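
The paths in this output are relative to the site root. If full URLs are needed, a small sketch with urllib.parse.urljoin could look like this. The base URL is an assumption taken from the host in the question's code, and to_absolute is a hypothetical helper name.

```python
from urllib.parse import urljoin

# Assumption: the site root, taken from the URL used in the question's code.
BASE_URL = "https://www.onepa.gov.sg"


def to_absolute(paths, base=BASE_URL):
    """Join each site-relative path onto the base URL."""
    return [urljoin(base, path) for path in paths]


paths = [
    "/class/details/c026829364",
    "/interest/details/i000027991",
    "/interest/details/i000009714",
]
for full_url in to_absolute(paths):
    print(full_url)
```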