Python web scraping: trouble getting the element hierarchy right in my code
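Goal: I am trying to scrape more than 100 web pages, specifically the recipe ingredients on each page. As an example, take the page with the egg salad sandwich recipe. I am using several Python dependencies, including BeautifulSoup, splinter.Browser, and ChromeDriverManager.
Expected output:
Once I have scraped the ingredients, I want to store them in a dictionary. An example is below: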
recipes = {"quick_and_easy_egg_salad_sandwich_recipe":
    ['1-2 tablespoons mayonnaise (to taste)',
     '2 tablespoons chopped celery',
     '2 slices white, wheat, multigrain, or rye bread, toasted or plain']}
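What I have achieved so far:
1. Using the Web Inspector, I have been able to roughly identify what I need to target. It looks like each ingredient has its own element, but I may be misreading the hierarchy, or my code may be wrong.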
2. My code is below:
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path)
webpage_url = 'https://www.simplyrecipes.com/recipes/egg_salad_sandwich/'
browser.visit(webpage_url)
time.sleep(1)
website_html = browser.html
website_soup = BeautifulSoup(website_html, 'html.parser')
ingredients = website_soup.find('h3', class_="Ingredients")
ingredientsList = ingredients.find('li', class_ = "ingredient")
print({ingredients})
When I try to print {ingredients}, I get an AttributeError: 'NoneType' object has no attribute 'find'.
I know my code is flawed, but I don't know how to fix it. Does anyone have any suggestions?
Try this:
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://www.simplyrecipes.com/recipes/egg_salad_sandwich/")
soup = BeautifulSoup(resp.text, "html.parser")
# Container div that holds the recipe heading and the ingredient list
div_ = soup.find("div", attrs={"class": "recipe-callout"})
# Key: the h2 text joined with underscores; value: the text of every ingredient <li>
recipes = {"_".join(div_.find("h2").text.split()):
           [x.text for x in div_.find_all("li", attrs={"class": "ingredient"})]}
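The key in the question's expected output is lowercase with underscores, while joining the heading text with "_" keeps the original capitalisation and any punctuation. A minimal sketch of normalising a title into that key format follows. The helper name `to_key` and the sample title are assumptions made for illustration; they are not taken from the page itself.

```python
import re


def to_key(title: str) -> str:
    """Lowercase a title and join its alphanumeric runs with underscores."""
    # findall keeps only runs of letters and digits, so spaces,
    # hyphens and other punctuation all act as separators.
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "_".join(words)


# Hypothetical heading text, chosen to match the key shown in the question.
print(to_key("Quick and Easy Egg Salad Sandwich Recipe"))
```

Calling `to_key(div_.find("h2").text)` in place of the `"_".join(...)` expression would be one way to use it.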
With the unnecessary h3 lookup removed, your code should look like this:
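import time
from bs4 import BeautifulSoup
from splinter import Browser
from webdriver_manager.chrome import ChromeDriverManager

executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path)
webpage_url = 'https://www.simplyrecipes.com/recipes/egg_salad_sandwich/'
browser.visit(webpage_url)
time.sleep(1)
website_html = browser.html
website_soup = BeautifulSoup(website_html, 'html.parser')
# find_all returns every matching <li>; find would return only the first one
ingredientsList = website_soup.find_all('li', class_="ingredient")
print([li.text for li in ingredientsList])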
You were trying to find an h3 element with Ingredients as its class name, but no such element exists.
Just curious, why use Splinter? The h3 element does not have the class Ingredients, so this line is wrong: ingredients = website_soup.find('h3', class_="Ingredients")
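The AttributeError in the question comes from `find` returning `None` when nothing matches, after which `.find` is called on that `None`. A small sketch of this behaviour is below. The HTML fragment is a made-up stand-in for the recipe markup (an assumption, not the real page source), reusing the `recipe-callout` and `ingredient` class names from the answers above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the h3 has the text "Ingredients" but no class attribute.
SAMPLE_HTML = """
<div class="recipe-callout">
  <h2>Egg Salad Sandwich</h2>
  <h3>Ingredients</h3>
  <ul>
    <li class="ingredient">1-2 tablespoons mayonnaise (to taste)</li>
    <li class="ingredient">2 tablespoons chopped celery</li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_ matches the class attribute, not the visible text, so this finds nothing.
missing = soup.find("h3", class_="Ingredients")
print(missing)

# Selecting the <li> elements directly avoids the failed lookup.
items = [li.get_text(strip=True) for li in soup.find_all("li", class_="ingredient")]
print(items)
```

Checking `if missing is None` before chaining another `.find` call is a simple guard when scraping many pages whose markup may differ.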