How to extract a href content from a website using the BeautifulSoup package in Python

I have the following example:

<h2 class="m0 t-regular">
<a data-js-aid="jobID" data-js-link="" href="/en/qatar/jobs/executive-chef-4276199/" data-job-id="4276199">
Executive Chef  </a>
</h2>

Result: []

The question is: how do I return the link?

Once you have reached the link itself (the a tag), use ['href'] to get the URL.

Example:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    requests.get("https://www.bayt.com/en/international/jobs/executive-chef-jobs/").content,
    "lxml"
)
links = []
for a in soup.select("h2.m0.t-regular a"):
    if a['href'] not in links:
        links.append(a['href'])
links
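To try the same selector without a network request, it can be run against the snippet from the question. This is a minimal sketch, assuming the built-in "html.parser" backend instead of lxml; the variable names and the doubled snippet are illustrative only.

```python
from bs4 import BeautifulSoup

# Snippet from the question, repeated twice to show de-duplication.
html = """
<h2 class="m0 t-regular">
<a data-js-aid="jobID" data-js-link="" href="/en/qatar/jobs/executive-chef-4276199/" data-job-id="4276199">
Executive Chef  </a>
</h2>
""" * 2

soup = BeautifulSoup(html, "html.parser")

# a.get("href") returns None instead of raising KeyError when href is missing.
hrefs = [a.get("href") for a in soup.select("h2.m0.t-regular a")]

# dict.fromkeys keeps first-seen order while dropping duplicates.
links = list(dict.fromkeys(h for h in hrefs if h))
print(links)  # ['/en/qatar/jobs/executive-chef-4276199/']
```

Using dict.fromkeys avoids the repeated "not in links" scan, which may matter on pages with many links.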

To get the href links, you need the following code:

follow_links = [p.a["href"] for p in soup.find_all("h2", class_="m0 t-regular") if "#" not in p.a["href"]]

Prepend "https://www.bayt.com" if you need an absolute URL; otherwise you get only the relative value, e.g. href="/en/qatar/jobs/executive-chef-4276199/".
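Rather than concatenating strings by hand, the relative href can be resolved with urljoin from the standard library. A small sketch; the helper name to_absolute is hypothetical, and the base URL is assumed to be the site root mentioned above.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.bayt.com/"  # assumed site root


def to_absolute(href: str, base: str = BASE_URL) -> str:
    """Resolve a (possibly relative) href against the base URL."""
    return urljoin(base, href)


absolute = to_absolute("/en/qatar/jobs/executive-chef-4276199/")
print(absolute)  # https://www.bayt.com/en/qatar/jobs/executive-chef-4276199/
```

urljoin also leaves already-absolute URLs untouched, so it is safe to apply to every extracted href.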

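Try the following to get the href:

# Replace the class below with the class of your own heading tags.
tags = soup.find_all("h2", class_="m0 t-regular")
follow_links = []
for tag in tags:
    # Skip headings without a link and in-page anchors such as "#top".
    if tag.a is not None and "#" not in tag.a["href"]:
        follow_links.append(tag.a["href"])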

You are getting:

<h2 class="m0 t-regular">
<a data-job-id="4276199" data-js-aid="jobID" data-js-link="" href="/en/qatar/jobs/executive-chef-4276199/">
Executive Chef  </a>
</h2>

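According to your code, you are extracting the h2 tag. You should move on to the next tag after the h2, which is the a tag, and read only its href attribute from there.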

import time

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    requests.get("https://www.bayt.com/en/international/jobs/executive-chef-jobs/").content,
    "lxml"
)

follow_links = [a.find_next('a')['href'] for a in soup.find_all("h2", class_="m0 t-regular")]
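One thing to keep in mind with find_next: it searches everything that follows in the document, not only the children of the h2. The sketch below, using made-up HTML with one heading that has no link, illustrates how that differs from accessing .a directly.

```python
from bs4 import BeautifulSoup

# Illustrative HTML: the second h2 has no link of its own.
html = """
<h2 class="m0 t-regular"><a href="/en/qatar/jobs/executive-chef-4276199/">Executive Chef</a></h2>
<h2 class="m0 t-regular">No link here</h2>
<p><a href="#footer">Footer</a></p>
"""

soup = BeautifulSoup(html, "html.parser")
headings = soup.find_all("h2", class_="m0 t-regular")

# .a only looks inside each h2, so the empty heading is skipped.
child_links = [h.a["href"] for h in headings if h.a is not None]

# find_next("a") keeps scanning past the h2 and picks up the footer link.
next_links = [h.find_next("a")["href"] for h in headings if h.find_next("a") is not None]

print(child_links)  # ['/en/qatar/jobs/executive-chef-4276199/']
print(next_links)   # ['/en/qatar/jobs/executive-chef-4276199/', '#footer']
```

If every heading on the page is known to contain its own link, both approaches give the same result.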

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    requests.get("https://www.bayt.com/en/international/jobs/executive-chef-jobs/").content,"lxml")
# my solution
links = soup.select('h2.m0.t-regular')
for link in links:
    print(link.a['href'])

print(soup.find_all("h2", class_="m0 t-regular")[0])
follow_links = [
    tag_a.a["href"]
    for tag_a in soup.find_all("h2", class_="m0 t-regular")
    if "#" not in tag_a.a["href"]
]
print(follow_links)
