Scraping URLs in Python

Tags: python, ajax, web-scraping, beautifulsoup, python-requests

I am trying to scrape the Adidas shoes from a search page, and I can't figure out what I am doing wrong.

I tried
tags = soup.find("section", {"class": "productList"}).findAll("a")
but it doesn't work :(

I also tried printing all the
href
values, but the link I need isn't among them :(

This is the link I expect to be printed:

https://www.tennisexpress.com/adidas-mens-adizero-ubersonic-50-yrs-ltd-tennis-shoes-off-white-and-signal-blue-62138


from bs4 import BeautifulSoup
import requests

url = "https://www.tennisexpress.com/search.cfm?searchKeyword=BB6892"

# Getting the webpage, creating a Response object.
response = requests.get(url)

# Extracting the source code of the page.
data = response.text

# Passing the source code to BeautifulSoup to create a BeautifulSoup object for it.
soup = BeautifulSoup(data, 'lxml')

# Extracting all the <a> tags into a list.
tags = soup.find("section", {"class": "productList"}).findAll("a")

# Extracting URLs from the attribute href in the <a> tags.
for tag in tags:
    print(tag.get('href'))

So your code would be something like this:

import urllib.request
from bs4 import BeautifulSoup

url = "https://www.tennisexpress.com/search.cfm?searchKeyword=BB6892"

r = urllib.request.urlopen(url).read()
soup = BeautifulSoup(r, "html.parser")
# find_all returns a list of matching sections, so extract the text
# from each one instead of calling get_text() on the list itself.
productMarkup = soup.find_all("section", class_="productList")
product = " ".join(section.get_text() for section in productMarkup)
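
If the productList section really is present in the fetched HTML, a minimal sketch for pulling the href values out of it (reusing the soup object from above) could look like this:

# Sketch, assuming the soup object built above and that the static HTML
# actually contains the productList section (the answer below explains
# why it may not).
section = soup.find("section", class_="productList")
if section is not None:
    for a in section.find_all("a"):
        href = a.get("href")
        if href:
            print(href)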

Here is a solution:

import requests
from bs4 import BeautifulSoup as bs

url = "https://www.tennisexpress.com/mens-adidas-tennis-shoes"
req = requests.get(url)
soup = bs(req.text, 'lxml')  # lxml because the page is more XML than HTML
arts = soup.find_all("a", class_="product")

This will give you a list of links to all the Adidas tennis shoes! I am sure you can take it from there.
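
To turn those anchor tags into plain URLs, a small follow-up sketch (assuming the arts variable from the snippet above) is a list comprehension over the href attributes:

# Sketch: collect the href of every matched <a class="product"> tag.
links = [art.get("href") for art in arts if art.get("href")]
for link in links:
    print(link)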

By looking at the Network tab in Chrome DevTools, you can see that the products you searched for are fetched after a request is made to https://tennisexpress-com.ecomm-nav.com/search.js. You can look at a sample response there. As you can see, it is quite a mess, so I would not take that approach.

In your code you cannot see the products because the request is made by JavaScript (running in the browser) after the initial page load. Neither requests on its own nor BeautifulSoup can render that content. You can, however, use requests-html, which supports JavaScript rendering (it uses Chromium under the hood), to achieve this.

Code:

from itertools import chain
from requests_html import HTMLSession

session = HTMLSession()
url = 'https://www.tennisexpress.com/search.cfm?searchKeyword=adidas+boost'
r = session.get(url)
# Render the page with Chromium so the JavaScript-inserted products show up.
r.html.render()

# Gather the absolute links from every element matching the .product selector.
links = list(chain(*[prod.absolute_links for prod in r.html.find('.product')]))
I used chain to join all the sets of absolute links together and create a single list out of them:

>>> links
['https://www.tennisexpress.com/adidas-mens-barricade-2018-boost-tennis-shoes-black-and-night-metallic-62110',
 'https://www.tennisexpress.com/adidas-mens-barricade-2018-boost-tennis-shoes-white-and-matte-silver-62109',
 ...
 'https://www.tennisexpress.com/adidas-mens-supernova-glide-7-running-shoes-black-and-white-41636',
 'https://www.tennisexpress.com/adidas-womens-adizero-boston-6-running-shoes-solar-yellow-and-midnight-gray-45268']

Don't forget to install requests-html first with
pip install requests-html
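
If you only want the single shoe from the question, you could then filter the collected links; the 'adizero-ubersonic' substring below is just an illustrative filter, not something taken from the original page:

# Sketch: keep only the links that mention the model from the question.
wanted = [link for link in links if "adizero-ubersonic" in link]
for link in wanted:
    print(link)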

Comments:

Can I get the complete code? Or at least point me to where it should go in the code above? Thanks!

Have you tested it?

@PaulaThomas I did test it, no luck. Maybe I didn't put it in correctly, which is why I asked for the complete code or for guidance on where to place it in my code. Thank you, Paula, and please also check the tag identifier I mentioned.

@AnotherUser31 I couldn't get it to work either. I suspect the page uses AJAX or something similar.

Are you saying I can't get the shoes from that search page? Because for me it would be easier to search for just the product I need instead of browsing for it. I'm sure we can work something out. I'll be back in 30 minutes, maybe we can chat, but I don't know how to set up a chat room. Please help me solve this; I can't get it to give me a list of shoes :( confused

Just do links = [art['href'] for art in arts] to get the "list of shoes", but that solution still doesn't answer your question and does something completely different.

Your answer really did clear things up! I tried to run your code but got an error: ModuleNotFoundError: No module named 'requests_html'

Sorry, I forgot to mention that you need to install that package. I have updated my answer.

Hmmm... having problems installing pip install requests-html on Mac Sierra 10.12.6 with Python 3.6... it says "Failed building wheel for websockets".

I guess you are installing your packages globally and there is some conflict. You could look into managing packages per project instead. Maybe you just forgot print()?