Scraping URLs in Python

Tags: python, ajax, web-scraping, beautifulsoup, python-requests

I am trying to scrape the Adidas shoes from a search page, and I can't figure out what I am doing wrong.

I tried
tags = soup.find("section", {"class": "productList"}).findAll("a")
but it doesn't work :(

I also tried printing all the
href
values, but the link I need isn't among them :(

This is the link I expect to be printed:

https://www.tennisexpress.com/adidas-mens-adizero-ubersonic-50-yrs-ltd-tennis-shoes-off-white-and-signal-blue-62138


from bs4 import BeautifulSoup
import requests

url = "https://www.tennisexpress.com/search.cfm?searchKeyword=BB6892"

# Getting the webpage, creating a Response object.
response = requests.get(url)

# Extracting the source code of the page.
data = response.text

# Passing the source code to BeautifulSoup to create a BeautifulSoup object for it.
soup = BeautifulSoup(data, 'lxml')

# Extracting all the <a> tags into a list.
tags = soup.find("section", {"class": "productList"}).findAll("a")

# Extracting URLs from the attribute href in the <a> tags.
for tag in tags:
    print(tag.get('href'))

So your code would be something like this:

import urllib.request
from bs4 import BeautifulSoup

url = "https://www.tennisexpress.com/search.cfm?searchKeyword=BB6892"

r = urllib.request.urlopen(url).read()
soup = BeautifulSoup(r, "html.parser")
# find_all returns a list of matching sections, so extract the text
# from each one instead of calling get_text() on the list itself.
productMarkup = soup.find_all("section", class_="productList")
product = " ".join(section.get_text() for section in productMarkup)
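
If the productList section really is present in the fetched HTML, a minimal sketch for pulling the href values out of it (reusing the soup object from above) could look like this:

# Sketch, assuming the soup object built above and that the static HTML
# actually contains the productList section (the answer below explains
# why it may not).
section = soup.find("section", class_="productList")
if section is not None:
    for a in section.find_all("a"):
        href = a.get("href")
        if href:
            print(href)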

Here is a solution:

import requests
from bs4 import BeautifulSoup as bs

url = "https://www.tennisexpress.com/mens-adidas-tennis-shoes"
req = requests.get(url)
soup = bs(req.text, 'lxml')  # lxml because the page is more XML than HTML
arts = soup.find_all("a", class_="product")

This will give you a list of links to all the Adidas tennis shoes! I am sure you can take it from there.
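
To turn those anchor tags into plain URLs, a small follow-up sketch (assuming the arts variable from the snippet above) is a list comprehension over the href attributes:

# Sketch: collect the href of every matched <a class="product"> tag.
links = [art.get("href") for art in arts if art.get("href")]
for link in links:
    print(link)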

By looking at the Network tab in Chrome DevTools, you can see that the products you searched for are fetched after a request is made to https://tennisexpress-com.ecomm-nav.com/search.js. You can look at a sample response there. As you can see, it is quite a mess, so I would not take that approach.

In your code you cannot see the products because the request is made by JavaScript (running in the browser) after the initial page load. Neither requests on its own nor BeautifulSoup can render that content. You can, however, use requests-html, which supports JavaScript rendering (it uses Chromium under the hood), to achieve this.

Code:

from itertools import chain
from requests_html import HTMLSession

session = HTMLSession()
url = 'https://www.tennisexpress.com/search.cfm?searchKeyword=adidas+boost'
r = session.get(url)
# Render the page with Chromium so the JavaScript-inserted products show up.
r.html.render()

# Gather the absolute links from every element matching the .product selector.
links = list(chain(*[prod.absolute_links for prod in r.html.find('.product')]))
I used chain to join all the sets of absolute links together and create a single list out of them:

>>> links
['https://www.tennisexpress.com/adidas-mens-barricade-2018-boost-tennis-shoes-black-and-night-metallic-62110',
 'https://www.tennisexpress.com/adidas-mens-barricade-2018-boost-tennis-shoes-white-and-matte-silver-62109',
 ...
 'https://www.tennisexpress.com/adidas-mens-supernova-glide-7-running-shoes-black-and-white-41636',
 'https://www.tennisexpress.com/adidas-womens-adizero-boston-6-running-shoes-solar-yellow-and-midnight-gray-45268']

Don't forget to install requests-html first with
pip install requests-html
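
If you only want the single shoe from the question, you could then filter the collected links; the 'adizero-ubersonic' substring below is just an illustrative filter, not something taken from the original page:

# Sketch: keep only the links that mention the model from the question.
wanted = [link for link in links if "adizero-ubersonic" in link]
for link in wanted:
    print(link)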

Comments:

Can I get the complete code? Or at least point me to where it should go in the code above? Thanks!

Have you tested it?

@PaulaThomas I did test it, no luck. Maybe I didn't put it in correctly, which is why I asked for the complete code or for guidance on where to place it in my code. Thank you, Paula, and please also check the tag identifier I mentioned.

@AnotherUser31 I couldn't get it to work either. I suspect the page uses AJAX or something similar.

Are you saying I can't get the shoes from that search page? Because for me it would be easier to search for just the product I need instead of browsing for it. I'm sure we can work something out. I'll be back in 30 minutes, maybe we can chat, but I don't know how to set up a chat room. Please help me solve this; I can't get it to give me a list of shoes :( confused

Just do links = [art['href'] for art in arts] to get the "list of shoes", but that solution still doesn't answer your question and does something completely different.

Your answer really did clear things up! I tried to run your code but got an error: ModuleNotFoundError: No module named 'requests_html'

Sorry, I forgot to mention that you need to install that package. I have updated my answer.

Hmmm... having problems installing pip install requests-html on Mac Sierra 10.12.6 with Python 3.6... it says "Failed building wheel for websockets".

I guess you are installing your packages globally and there is some conflict. You could look into managing packages per project instead. Maybe you just forgot print()?