Python: unable to find links in a product page

python web-scraping


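I am trying to list the links in a product page.

I have multiple links, and from each of them I want to get the link to the product page.

I am posting the code for just one link: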

import requests
from bs4 import BeautifulSoup

r = requests.get("https://funskoolindia.com/products.php?search=9723100")
soup = BeautifulSoup(r.content)
for a_tag in soup.find_all('a', class_='product-bg-panel', href=True):
    print('href: ', a_tag['href'])

This is what it should print:

https://funskoolindia.com/product_inner_page.php?product_id=1113

Try this:

print('href:', a_tag.get('href'))

and add

features="lxml"

to the BeautifulSoup constructor.

The site is dynamic, so you can use selenium:

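from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
# Selenium 4 takes the chromedriver path through a Service object
d = webdriver.Chrome(service=Service('/path/to/chromedriver'))
d.get('https://funskoolindia.com/products.php?search=9723100')
# Collect the unique hrefs of the product tiles from the rendered page
results = [*{i.a['href'] for i in soup(d.page_source, 'html.parser').find_all('div', {'class':'product-media light-bg'})}]
d.quit()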

Output:

['product_inner_page.php?product_id=1113']
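
The href above is relative, while the question expects the full URL. A minimal sketch of one way to resolve it, assuming the page URL from the question is the right base; the variable names page_url and absolute_links are only illustrative:

```python
from urllib.parse import urljoin

# Page the links were scraped from (taken from the question)
page_url = 'https://funskoolindia.com/products.php?search=9723100'
# Relative hrefs, as returned by the selenium snippet
results = ['product_inner_page.php?product_id=1113']

# urljoin resolves each href against the page URL, as a browser would
absolute_links = [urljoin(page_url, href) for href in results]
print(absolute_links)
# ['https://funskoolindia.com/product_inner_page.php?product_id=1113']
```

Using urljoin rather than string concatenation also handles hrefs that are already absolute or that start with a slash.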

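The data is loaded dynamically via JavaScript from a different URL. One solution is to use selenium, which executes the JavaScript and loads the links.

Another solution is to extract the checkboxKey value with the re module and request the data URL yourself: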

import re
import requests
from bs4 import BeautifulSoup

url = 'https://funskoolindia.com/products.php?search=9723100'
data_url = 'https://funskoolindia.com/admin/load_data.php'

# Form fields sent to the data URL; checkboxKey is read from the
# inline JavaScript of the search page
data = {'page':'1',
    'sort_val':'new',
    'product_view_val':'grid',
    'show_list':'12',
    'brand_id':'',
    'checkboxKey': re.findall(r'var checkboxKey = "(.*?)";', requests.get(url).text)[0]}

# The data URL answers the POST with the product list as HTML
soup = BeautifulSoup(requests.post(data_url, data=data).text, 'lxml')

# The hrefs are relative, so prepend the site root
for a in soup.select('#list-view .product-bg-panel > a[href]'):
    print('https://funskoolindia.com/' + a['href'])

Prints:

https://funskoolindia.com/product_inner_page.php?product_id=1113
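
The `[0]` index in the snippet raises IndexError if the page ever stops embedding the key. A small sketch of a more defensive extraction, using the same regular expression; the helper name extract_checkbox_key and the sample strings are made up for illustration and do not come from the site:

```python
import re
from typing import Optional

# Same pattern as in the answer, compiled once
KEY_PATTERN = re.compile(r'var checkboxKey = "(.*?)";')

def extract_checkbox_key(page_text: str) -> Optional[str]:
    """Return the checkboxKey value, or None when the page has no such variable."""
    match = KEY_PATTERN.search(page_text)
    return match.group(1) if match else None

# Made-up stand-ins for requests.get(url).text
sample_with_key = '<script>var checkboxKey = "abc123";</script>'
sample_without_key = '<html><body>No script here</body></html>'

print(extract_checkbox_key(sample_with_key))     # abc123
print(extract_checkbox_key(sample_without_key))  # None
```

Checking for None before building the POST data makes it clear when the page layout has changed, instead of failing with an unrelated-looking IndexError.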

For dynamically loaded pages, I would almost always say it is worth using Scrapy instead.

@AeroBlue You are probably right, though; I don't use Scrapy :)

Can you post a solution using Scrapy?

What should the driver path be? I don't see any chrome driver on my disk.

…this gives me an error.

This works fine, but now I have to get the product details from the extracted URLs, and I think those will be dynamic too. What should I do? Will this re approach work on the extracted links, or do I have to use selenium?

@jamesjoyce You can experiment. selenium has its overhead, so it is slower than the requests + re approach. I recommend looking at the Chrome/Firefox developer tools to see where the page loads its data from, and then using that URL with requests.
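
One way to run the experiment suggested in the last comment is to check whether the value you need is already present in the raw HTML that requests receives. If it is, requests with BeautifulSoup or re is enough; if it is not, the page probably fills it in with JavaScript, and you need either the underlying data URL or selenium. This is only a rough sketch: the helper name needs_javascript and the sample HTML strings are invented for illustration and are not taken from the site.

```python
def needs_javascript(raw_html: str, expected_text: str) -> bool:
    """Return True when expected_text is absent from the raw (unrendered) HTML."""
    return expected_text not in raw_html

# Made-up responses, standing in for requests.get(product_url).text
static_page = '<h1 class="title">Sample product</h1>'
dynamic_page = '<div id="product"></div><script src="load.js"></script>'

print(needs_javascript(static_page, 'Sample product'))   # False
print(needs_javascript(dynamic_page, 'Sample product'))  # True
```

In practice you would pass in the text of a real response and a product name or price that you can see in the browser.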