Python: can't find the right web page to scrape data from (web scraping)


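I want to scrape the prices for a range of courses on this site.

However, I'm struggling to find a page that shows the full list of courses together with their prices.

I was able to come up with the code below, which pulls the price for a single course: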

import requests
from bs4 import BeautifulSoup

url = "https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education"
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
# The price is the ".even" element inside a wrapper whose class contains "field-price"
price = soup.select_one("[class*='field-price'] .even").text
print(price)

Thanks for any help or suggestions.

You can find the price by anchoring the search on the item's parent wrapper:

import requests
from bs4 import BeautifulSoup as soup
d = soup(requests.get('https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education').text, 'html.parser')
prices = [i.find_all('div', {'class':'field-item even'})[2].text for i in d.find_all('fieldset', {'class':' group-overview field-group-fieldset panel panel-default form-wrapper'})]

Output:

['5141.00']

Search for the items by keyword, then collect all the URLs from the search results. Once you have the URLs, loop over them:

from bs4 import BeautifulSoup
import requests
Search_key='pinnacle'
url = "https://www.learningconnection.philips.com/en/search/site/{}".format(Search_key)
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
urls=[item['href'] for item in soup.select('h3.title > a')]
price=[]
for url in urls:
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    if soup.select_one("[class*='field-price'] .even"):
        price.append(soup.select_one("[class*='field-price'] .even").text)
print(price)

Output:

['5171.00', '5171.00', '3292.00', '5141.00', '4309.00', '2130.00', '2130.00', '2130.00']
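The search URL above is built by dropping the keyword straight into the path, which is fine for a single word such as 'pinnacle'. As a minimal sketch, assuming the site accepts percent-encoded keywords in the same path position, the keyword could be encoded first so that spaces or characters like ³ do not break the URL. The helper name build_search_url is hypothetical and not part of the answer.

```python
from urllib.parse import quote

SEARCH_BASE = "https://www.learningconnection.philips.com/en/search/site/"


def build_search_url(keyword):
    """Return the search URL for a keyword (hypothetical helper)."""
    # safe="" also encodes "/", so the keyword stays a single path segment.
    return SEARCH_BASE + quote(keyword, safe="")


print(build_search_url("pinnacle"))
print(build_search_url("basic planning"))
```

The resulting string can be passed to requests.get in place of the url built with format().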

You can also print the course titles:

from bs4 import BeautifulSoup
import requests
Search_key='pinnacle'
url = "https://www.learningconnection.philips.com/en/search/site/{}".format(Search_key)
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
urls=[item['href'] for item in soup.select('h3.title > a')]
price=[]
title=[]
for url in urls:
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    if soup.select_one("[class*='field-price'] .even"):
        title.append(soup.select_one("h1#page-title").text)
        price.append(soup.select_one("[class*='field-price'] .even").text)
print(title)
print(price)

Output:

['Pinnacle³ Auto Segmentation with SPICE', 'Pinnacle³ Dynamic Planning', 'Pinnacle³ Additional Education', 'Pinnacle³ Advanced Planning Education', 'Pinnacle³ Basic Planning Education', 'Pinnacle³ Physics Modeling', 'Pinnacle³ Level I Basic Planning Education', 'Pinnacle³ Level II Education']
['5171.00', '5171.00', '3292.00', '5141.00', '4309.00', '2130.00', '2130.00', '2130.00']
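The two lists printed above line up by position, so they can be paired into one record per course. The following is only a sketch using the standard library: the helper names to_rows and to_csv are hypothetical, and the sample data is just the first two entries of the output above.

```python
import csv
import io
from decimal import Decimal


def to_rows(titles, prices):
    """Pair each title with the price at the same index (hypothetical helper)."""
    return [{"Title": t, "Price": Decimal(p)} for t, p in zip(titles, prices)]


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["Title", "Price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


title = ['Pinnacle³ Auto Segmentation with SPICE', 'Pinnacle³ Dynamic Planning']
price = ['5171.00', '5171.00']
rows = to_rows(title, price)
print(to_csv(rows))
```

Decimal keeps the two decimal places exactly as scraped; the pairing is only reliable because the loop appends a title and a price together for the same page.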

Edited:

from bs4 import BeautifulSoup
import requests
Search_key='biomed'
url = "https://www.learningconnection.philips.com/en/search/site/{}".format(Search_key)
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
urls=[item['href'] for item in soup.select('h3.title > a')]
print(len(urls))
price=[]
title=[]
for url in urls:
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    if soup.select_one("[class*='field-price'] .even"):
        title.append(soup.select_one("h1#page-title").text)
        price.append(soup.select_one("[class*='field-price'] .even").text)
print(title)
print(price)

Output:

28
['NETWORK CONCEPTS (BIOMED)']
['4875.00']

Here is a way to loop over the areas of interest. It requires bs4 4.7.1+ for access to the :contains pseudo-class.

import requests
from bs4 import BeautifulSoup as bs

base = 'https://www.learningconnection.philips.com'
url = f'{base}/en/catalog/profession/biomedical-engineers'
courses = []
results = []

with requests.Session() as s:
    r = s.get(url)
    soup = bs(r.content, 'lxml')
    links = [base + i['href'] for i in soup.select('h3 a')]
    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        courses+=[i['href'] for i in soup.select('.title a')]
    for course in courses:
        r = s.get(course)
        soup = bs(r.content, 'lxml')
        price = soup.select_one('em:contains("Tuition:")')
        if price is None:
            price = 'Not listed'
        else:
            price = price.text.replace('\xa0',' ')
        result = {'Title':soup.select_one('#page-title').text.replace('\xa0',' ')
                 ,'Description': soup.select_one('.field-item p').text.replace('\xa0',' ')
                 ,'Price': price
                 , 'Url':course}
        results.append(result)

print(results)
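The code above prepends base to the catalog links but uses the course links as they are, which implies the first set is relative and the second absolute. A minimal sketch, assuming nothing more about the site than that, is to pass every href through urljoin from the standard library, which handles both cases. The helper name absolute and the sample course path are hypothetical.

```python
from urllib.parse import urljoin

base = 'https://www.learningconnection.philips.com'


def absolute(href):
    """Resolve an href against the site root (hypothetical helper)."""
    # Relative hrefs are joined onto base; absolute hrefs are returned unchanged.
    return urljoin(base + '/', href)


print(absolute('/en/catalog/profession/biomedical-engineers'))
print(absolute('https://www.learningconnection.philips.com/en/course/example'))
```

With this, both list comprehensions could call absolute(i['href']) without having to know which kind of link each page emits.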

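Thanks for the reply. This seems to be another way of getting the price for this one course? What I'm looking for is a page from which I can scrape all of the courses.

There are a lot of courses. You would need a crawler that follows the links down, and you may want to apply set() to the results. For example, is there overlap between the Clinical Informatics and Biomedical Engineers courses? Or do you have a specific profession/product/clinical focus, so that you are only interested in courses within that area? It may be worth looking at the site and deciding which filters need to be applied.

Yes, I'm specifically interested in the "Biomedical Engineers" courses: there are 3 sub-categories under "Biomedical Engineers", with about 24 courses in total.

Thanks, but when I change the search key from "pinnacle" to "biomed", I only get one result. If I search on the website, I get 28…

@Vic: That's because, apart from one URL, none of the URLs has a price tag. If you print(len(urls)) you will get all 28 links.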
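The set() suggestion in the comments addresses courses that appear under more than one sub-category. As a sketch only: a plain set() removes duplicates but loses the crawl order, whereas dict.fromkeys keeps the first occurrence of each URL in order. The helper name unique_in_order and the example.com URLs are hypothetical.

```python
def unique_in_order(urls):
    """Drop duplicate URLs while keeping first-seen order (hypothetical helper)."""
    # dict keys preserve insertion order (Python 3.7+), so the first occurrence wins.
    return list(dict.fromkeys(urls))


courses = [
    'https://example.com/en/course/a',
    'https://example.com/en/course/b',
    'https://example.com/en/course/a',  # same course listed under a second category
]
print(unique_in_order(courses))
```

Applying this to the courses list before the second loop would avoid requesting the same course page twice.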