Python: some content not loaded when scraping a website with Beautiful Soup

python web-scraping

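I'm trying to get the ratings from New York Times Cooking recipes, but I'm having trouble getting the content I need. When I view the source of the NYT page, I see the following: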

<div class="ratings-rating">
    <span class="ratings-header ratings-content">194 ratings</span>

    <div class="ratings-stars-wrap">
      <div class="ratings-stars ratings-content four-star-rating avg-rating">

The code I'm using is:

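from requests import get
from bs4 import BeautifulSoup as soup

url = 'https://cooking.nytimes.com/recipes/1020049-lemony-chicken-soup-with-fennel-and-dill'
headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder; the headers actually used were not shown

r = get(url, headers=headers, timeout=15)
page_soup = soup(r.text, 'html.parser')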

Any idea why this information isn't coming through?

Try the code below:

import re

import requests
from lxml import html

url = "https://cooking.nytimes.com/recipes/1019706-spiced-roasted-cauliflower-with-feta-and-garlic?action=click&module=Recirculation%20Band%20Recipe%20Card&region=More%20recipes%20from%20Alison%20Roman&pgType=recipedetails&rank=1"

r = requests.get(url)
tree = html.fromstring(r.content)

# the ratings are assigned in an inline script; its position (14th here) may change with the page layout
t = tree.xpath('/html/body/script[14]')[0]

# look for the value of bootstrap.recipe.avg_rating (it runs up to the next semicolon)
m = re.search(r"bootstrap\.recipe\.avg_rating = ", t.text)
semicolon = re.search(";", t.text[m.end():])
rating = t.text[m.end():m.end() + semicolon.start()]
print(rating)

# look for the value of bootstrap.recipe.num_ratings
n = re.search(r"bootstrap\.recipe\.num_ratings = ", t.text)
semicolon2 = re.search(";", t.text[n.end():])
num_ratings = t.text[n.end():n.end() + semicolon2.start()]
print(num_ratings)
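
The same lookup can be done without depending on the script's position in the page. This is only a minimal sketch, assuming the values still appear somewhere in the response as assignments of the form bootstrap.recipe.<name> = <value>; the helper name extract_bootstrap_value and the sample string are made up for illustration and are not part of the answer above:

```python
import re


def extract_bootstrap_value(page_text, name):
    """Return the text assigned to bootstrap.recipe.<name>, or None if absent."""
    # Hypothetical helper: matches "bootstrap.recipe.<name> = <value>;"
    pattern = r"bootstrap\.recipe\." + re.escape(name) + r"\s*=\s*([^;]*);"
    match = re.search(pattern, page_text)
    return match.group(1).strip() if match else None


# Made-up sample standing in for r.text; the real page may look different
sample = "bootstrap.recipe.avg_rating = 4; bootstrap.recipe.num_ratings = 194;"
print(extract_bootstrap_value(sample, "avg_rating"))
print(extract_bootstrap_value(sample, "num_ratings"))
```

Searching the whole response text avoids the hard-coded script[14] index, and returning None makes a missing value easy to detect instead of failing on m.end().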

It is easier to use attribute = value selectors to grab the values from the spans that hold the ratings metadata:

import requests
from bs4 import BeautifulSoup

data = requests.get('https://cooking.nytimes.com/recipes/1020049-lemony-chicken-soup-with-fennel-and-dill')
soup = BeautifulSoup(data.content, 'lxml')
rating = soup.select_one('[itemprop=ratingValue]').text
ratingCount = soup.select_one('[itemprop=ratingCount]').text
print(rating, ratingCount)
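
One caveat with the snippet above: select_one returns None when nothing matches, so .text would raise an AttributeError if the markup changes. The following is a rough sketch of a guarded version, run against made-up sample markup rather than the live page; the helper name text_or_none and the reviewCount lookup are illustrative assumptions only:

```python
from bs4 import BeautifulSoup


def text_or_none(soup, selector):
    """Return the stripped text of the first match, or None when nothing matches."""
    node = soup.select_one(selector)  # select_one returns None when there is no match
    return node.get_text(strip=True) if node is not None else None


# Made-up markup standing in for the downloaded page
sample_html = """
<div class="ratings-rating">
  <span itemprop="ratingValue">4</span>
  <span itemprop="ratingCount">194</span>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
rating = text_or_none(soup, '[itemprop="ratingValue"]')
rating_count = text_or_none(soup, '[itemprop="ratingCount"]')
review_count = text_or_none(soup, '[itemprop="reviewCount"]')  # not present in the sample
print(rating, rating_count, review_count)
```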

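You may have to use, for example, a headless browser with Selenium to execute the JavaScript, because this content is loaded dynamically; the page source only has the template at the end.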
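
If the headless-browser route is needed, it could look roughly like the sketch below. It assumes Selenium 4 with Chrome and a matching driver available on the machine; the function name fetch_rendered_html and the choice to wait on the .ratings-rating class (taken from the markup in the question) are assumptions, not something confirmed to work against the current site:

```python
def fetch_rendered_html(url, css_selector, timeout=15):
    """Load url in headless Chrome, wait for css_selector, return the rendered HTML."""
    # Imported here so the function can be defined even where Selenium is not installed
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the dynamically inserted element exists (raises TimeoutException otherwise)
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if the wait times out


# Usage (needs Chrome and a driver), assuming the class name from the question:
# html = fetch_rendered_html(url, ".ratings-rating")
```

The returned HTML can then be passed to BeautifulSoup in the usual way.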

Sorry for the delay, I only just had a chance to test this. It works and it's simpler, so I'm accepting it. Thanks for the help!
