Python: scraping a site returns a different href for a link

python html web-scraping


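In Python, I am using the requests module and BS4 to search the web through duckduckgo.com. I manually searched for "hello" and used the developer tools to get the first result title. Now I am using the following code to get the href with Python: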

import requests
from bs4 import BeautifulSoup

html = requests.get('http://duckduckgo.com/html/?q=hello').content
soup = BeautifulSoup(html, 'html.parser')
result = soup.find('a', class_='result__a')['href']

However, the href looks like gibberish and is completely different from the one I see manually. Any idea why this happens?

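There are multiple DOM elements with the class name "result__a", so do not expect the first link you see in the browser to be the first link you get.

The "gibberish" you mention is an encoded URL. You need to decode and parse it to get the URL's query parameters (params).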

For example: "/l/?kh=-1&uddg=https%3A%2F%2Fwww.example.com"

The href above contains two parameters, kh and uddg. I think uddg is the actual link you need.
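To illustrate how the two parameters come apart, here is a minimal sketch that parses the example href above with the standard library only. The variable names are illustrative, and the href is the example string from this answer rather than a live result.

```python
from urllib.parse import urlparse, parse_qs

# Example href from the answer above (not fetched from the live site).
href = '/l/?kh=-1&uddg=https%3A%2F%2Fwww.example.com'

# urlparse splits off the query string; parse_qs turns it into a dict
# of lists and percent-decodes each value along the way.
params = parse_qs(urlparse(href).query)
print(params)

# Each parameter maps to a list, so take the first uddg value.
target = params['uddg'][0]
print(target)
```

Because parse_qs returns a list for every key, the `[0]` index is needed even when a parameter appears only once.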

The code below fetches all the URLs for that particular class and prints them unquoted.

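Can you also add the "gibberish" href you are talking about?

/l/?kh=-1&uddg=https%3A%2F%2Fwww.example.com

That's true. The first link you wanted is:

I checked it too, and it looks weird.

@shamilpython It is an encoded URL, my friend.

Besides urllib.parse, is there any other option?

@shamilpython urllib makes this job easier and is a popular library. If you really want an alternative, there is "curl".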

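import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs, unquote

html = requests.get('http://duckduckgo.com/html/?q=hello').content
soup = BeautifulSoup(html, 'html.parser')

# Walk every result anchor, not just the first one.
for anchor in soup.find_all('a', attrs={'class': 'result__a'}):
    link = anchor.get('href')
    # Split the href and read its query string as a dict of lists.
    url_obj = urlparse(link)
    parsed_url = parse_qs(url_obj.query).get('uddg', '')
    # Anchors without a uddg parameter are skipped.
    if parsed_url:
        # parse_qs has already percent-decoded the value; unquote is a
        # harmless extra pass for typical URLs.
        print(unquote(parsed_url[0]))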
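The loop above silently skips any anchor whose href has no uddg parameter. The sketch below wraps the decoding step in a small function that can be tried without a network request. The name `resolve_href` is hypothetical, and returning other hrefs unchanged rests on the assumption that they are already direct links, which the page does not guarantee.

```python
from urllib.parse import urlparse, parse_qs


def resolve_href(href):
    """Return the target URL for a result href (hypothetical helper).

    Assumption: redirect links carry the target in a `uddg` query
    parameter; any other href is returned unchanged.
    """
    query = parse_qs(urlparse(href).query)
    targets = query.get('uddg')
    # parse_qs has already percent-decoded the value, so no unquote here.
    return targets[0] if targets else href


# Redirect-style href from the answer above.
print(resolve_href('/l/?kh=-1&uddg=https%3A%2F%2Fwww.example.com'))
# An href without uddg is passed through as-is.
print(resolve_href('https://www.example.com/page'))
```

Inside the scraping loop, this would replace the urlparse and parse_qs lines with a single call such as `resolve_href(anchor.get('href'))`.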