
Python Web Scraping - Challenge Clarifying the Hierarchy in My Code

Tags: python, html, web-scraping, beautifulsoup

Goal:
I'm trying to scrape 100+ web pages, specifically the recipe ingredients on each page. Taking as an example a page that contains an egg salad sandwich recipe (the URL used in the code below), I'm using several Python dependencies, including:
BeautifulSoup
splinter's Browser
webdriver_manager's ChromeDriverManager

Expected output:
Once I've scraped the ingredients, I'd like to store them in a dictionary. Example below:

recipes = {"quick_and_easy_egg_salad_sandwich_recipe":
['1-2 tablespoons mayonnaise (to taste)',
 '2 tablespoons chopped celery',
 '2 slices white, wheat, multigrain, or rye bread, toasted or plain']}
What I've achieved so far:
1. I've been able to roughly identify (via the Web Inspector) what I need to target:
it looks like each ingredient has its own <li class="ingredient"> element, but I may have misunderstood the hierarchy, or my code may be incorrect.

    2. My code is below:

    import time

    from bs4 import BeautifulSoup
    from splinter import Browser
    from webdriver_manager.chrome import ChromeDriverManager

    executable_path = {'executable_path': ChromeDriverManager().install()}
    browser = Browser('chrome', **executable_path)
    
    webpage_url = 'https://www.simplyrecipes.com/recipes/egg_salad_sandwich/'
    browser.visit(webpage_url)
    time.sleep(1)
    website_html = browser.html
    website_soup = BeautifulSoup(website_html, 'html.parser')
    ingredients = website_soup.find('h3', class_="Ingredients")
    ingredientsList = ingredients.find('li', class_ = "ingredient")
    print({ingredients})
    
    When I try to print {ingredients}, I get:
    AttributeError: 'NoneType' object has no attribute 'find'

    I know my code is flawed, but I'm not sure how to fix it. Does anyone have any suggestions?
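
    For context: BeautifulSoup's find() returns None when no tag matches, so the chained .find() call is what raises the AttributeError. A minimal guard, purely for illustration and reusing website_soup from the code above, would be:

    # Illustrative check: find() returns None when nothing matches the selector,
    # so calling .find() on that result raises AttributeError.
    heading = website_soup.find('h3', class_="Ingredients")
    if heading is None:
        print("No <h3 class='Ingredients'> element found - adjust the selector")
    else:
        print(heading.find('li', class_="ingredient"))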

    Try this:

    import requests
    from bs4 import BeautifulSoup
    
    resp = requests.get("https://www.simplyrecipes.com/recipes/egg_salad_sandwich/")
    
    soup = BeautifulSoup(resp.text, "html.parser")
    div_ = soup.find("div", attrs={"class": "recipe-callout"})
    
    recipes = {"_".join(div_.find("h2").text.split()):
                   [x.text for x in div_.findAll("li", attrs={"class": "ingredient"})]}
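
    Since the stated goal is 100+ pages, the same approach could be extended as sketched below. The URL list is a placeholder, and the selectors assume every page uses the same "recipe-callout" markup as this example:

    import requests
    from bs4 import BeautifulSoup

    # Placeholder list - replace with the actual 100+ recipe URLs.
    recipe_urls = [
        "https://www.simplyrecipes.com/recipes/egg_salad_sandwich/",
    ]

    recipes = {}
    for url in recipe_urls:
        resp = requests.get(url)
        soup = BeautifulSoup(resp.text, "html.parser")
        div_ = soup.find("div", attrs={"class": "recipe-callout"})
        if div_ is None:
            # Skip pages whose markup differs from the assumed layout.
            continue
        key = "_".join(div_.find("h2").text.split())
        recipes[key] = [li.text for li in div_.find_all("li", attrs={"class": "ingredient"})]

    print(recipes)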
    

    After removing the unnecessary h3 lookup, your code should look like this:

    executable_path = {'executable_path': ChromeDriverManager().install()}
    browser = Browser('chrome', **executable_path)
    
    webpage_url = 'https://www.simplyrecipes.com/recipes/egg_salad_sandwich/'
    browser.visit(webpage_url)
    time.sleep(1)
    website_html = browser.html
    website_soup = BeautifulSoup(website_html, 'html.parser')
    ingredientsList = website_soup.find('li', class_="ingredient")
    print(ingredientsList)
    

    You were trying to find an h3 element with "Ingredients" as the class name, but no such element exists.
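
    Note that find() only returns the first matching element; to capture the full ingredient list, find_all() collects every one. A short sketch, reusing website_soup from the code above:

    # Collect every ingredient <li>, not just the first one.
    ingredient_items = website_soup.find_all('li', class_="ingredient")
    ingredients_text = [li.get_text(strip=True) for li in ingredient_items]
    print(ingredients_text)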


    Just curious, why are you using Splinter? The h3 element doesn't have the class "Ingredients", so this line is wrong:
    ingredients = website_soup.find('h3', class_="Ingredients")
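
    A quick way to confirm which class names actually exist (illustrative, reusing website_soup from the question's code) is to print the class attribute of every h3 on the page:

    # Print the class list of each <h3> to see which class names the page really uses.
    for h3 in website_soup.find_all('h3'):
        print(h3.get('class'), h3.get_text(strip=True))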