Python web scraping: trouble getting the element hierarchy right in my code (Python, HTML, Web Scraping, BeautifulSoup)

python html web-scraping

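Goal:
I am trying to scrape 100+ web pages, specifically the recipe ingredients on each page. As one example, take the page with a recipe for an egg salad sandwich. I am using several Python dependencies, including BeautifulSoup, splinter.Browser, and ChromeDriverManager.

Expected output:
Once I have scraped the ingredients, I want to store them in a dictionary. Example below: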

recipes = {"quick_and_easy_egg_salad_sandwich_recipe":
['1-2 tablespoons mayonnaise (to taste)',
 '2 tablespoons chopped celery',
 '2 slices white, wheat, multigrain, or rye bread, toasted or plain']}
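A key in the shape shown above could be derived from a recipe title rather than typed by hand. The following is only a minimal sketch: the helper name recipe_key and the example title are assumptions for illustration and do not come from the page itself.

```python
import re


def recipe_key(title: str) -> str:
    """Turn a recipe title into a lowercase, underscore-separated key."""
    # Replace every run of non-alphanumeric characters with one underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", title.lower())
    # Drop underscores left over from leading or trailing punctuation.
    return slug.strip("_")


recipes = {}
# Assumed title, chosen so the key matches the expected output above.
recipes[recipe_key("Quick and Easy Egg Salad Sandwich Recipe")] = [
    "1-2 tablespoons mayonnaise (to taste)",
    "2 tablespoons chopped celery",
]
print(recipes)
```

Lowercasing and collapsing punctuation keeps the keys uniform across pages whose titles differ in case or spacing.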

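What I have achieved so far:
1. I have been able to "roughly" identify (via the Web Inspector) what I need to target.
It looks like each ingredient has its own element (my code targets li elements with the class "ingredient"), but I may be misunderstanding the hierarchy, or my code may be incorrect.

2. My code is below: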

import time

from bs4 import BeautifulSoup
from splinter import Browser
from webdriver_manager.chrome import ChromeDriverManager

executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path)

webpage_url = 'https://www.simplyrecipes.com/recipes/egg_salad_sandwich/'
browser.visit(webpage_url)
time.sleep(1)
website_html = browser.html
website_soup = BeautifulSoup(website_html, 'html.parser')
ingredients = website_soup.find('h3', class_="Ingredients")
ingredientsList = ingredients.find('li', class_ = "ingredient")
print({ingredients})

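When I try to print {ingredients}, I get:

AttributeError: 'NoneType' object has no attribute 'find'

I know my code is flawed, but I am not sure how to approach the problem. Does anyone have any suggestions?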

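Try this:

import requests
from bs4 import BeautifulSoup

# Download the page with requests instead of driving a browser
resp = requests.get("https://www.simplyrecipes.com/recipes/egg_salad_sandwich/")

soup = BeautifulSoup(resp.text, "html.parser")
# The ingredients sit inside the div with class "recipe-callout"
div_ = soup.find("div", attrs={"class": "recipe-callout"})

# Key: the h2 title joined with underscores; value: the text of every ingredient li
recipes = {"_".join(div_.find("h2").text.split()):
               [x.text for x in div_.find_all("li", attrs={"class": "ingredient"})]}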
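The snippet above handles a single page and would fail with the same AttributeError if the div or the h2 were missing. For the 100+ pages mentioned in the question, one option is to put the parsing into a function that returns an empty result instead of crashing. This is a sketch under assumptions: the function name parse_recipe and the sample markup are made up for illustration, and the class names "recipe-callout" and "ingredient" are taken from the answer, not verified against every page.

```python
from bs4 import BeautifulSoup


def parse_recipe(html: str) -> dict:
    """Return {recipe_key: [ingredient, ...]} for one page, or {} if nothing matches."""
    soup = BeautifulSoup(html, "html.parser")
    callout = soup.find("div", class_="recipe-callout")
    if callout is None:
        return {}
    heading = callout.find("h2")
    if heading is None:
        return {}
    # Lowercase the title and join its words with underscores.
    key = "_".join(heading.get_text().lower().split())
    items = [li.get_text(strip=True)
             for li in callout.find_all("li", class_="ingredient")]
    return {key: items}


# Made-up markup standing in for a downloaded page (hypothetical structure).
SAMPLE_HTML = """
<div class="recipe-callout">
  <h2>Egg Salad Sandwich Recipe</h2>
  <ul>
    <li class="ingredient">1-2 tablespoons mayonnaise (to taste)</li>
    <li class="ingredient">2 tablespoons chopped celery</li>
  </ul>
</div>
"""

recipes = {}
# In real use each string would come from requests.get(url).text.
for page_html in [SAMPLE_HTML, "<p>no recipe here</p>"]:
    recipes.update(parse_recipe(page_html))
print(recipes)
```

Keeping the download step outside the function means the parsing can be checked against saved HTML without hitting the site again.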


After removing the unnecessary h3 lookup, your code should look like this:

executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path)

webpage_url = 'https://www.simplyrecipes.com/recipes/egg_salad_sandwich/'
browser.visit(webpage_url)
time.sleep(1)
website_html = browser.html
website_soup = BeautifulSoup(website_html, 'html.parser')
# Search the whole page for ingredient items; find_all returns every match
ingredientsList = website_soup.find_all('li', class_="ingredient")
# Print the text of each item rather than the raw tags
print([li.get_text(strip=True) for li in ingredientsList])

You were trying to find an h3 element with Ingredients as its class name, but no such element exists.
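To see why this produces the AttributeError from the question, it helps to reproduce it on a small piece of markup. The HTML below is made up for illustration (it assumes an h3 without a class attribute, which is what the answer describes); it is not copied from the real page.

```python
from bs4 import BeautifulSoup

# Made-up markup: the h3 exists but carries no class attribute.
html = """
<h3>Ingredients</h3>
<ul>
  <li class="ingredient">2 tablespoons chopped celery</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# No h3 has class "Ingredients", so find() returns None instead of a tag.
heading = soup.find("h3", class_="Ingredients")
print(heading)

# Calling heading.find(...) on None would raise AttributeError, so check first.
items = [] if heading is None else heading.find_all("li", class_="ingredient")

# Searching the whole document does not depend on the h3 at all.
all_items = [li.get_text(strip=True)
             for li in soup.find_all("li", class_="ingredient")]
print(all_items)
```

Note also that the li items here are not children of the h3, so even a successful h3 lookup would not have led to them in markup shaped like this sample.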


Just curious, why use Splinter? The h3 element does not have the class Ingredients, so this line is wrong:

ingredients = website_soup.find('h3', class_="Ingredients")
