Web scraping: how do I get my BeautifulSoup scraper to go through all the links and record the ingredients, nutritional information, and instructions?

So far, my code for collecting the links looks like this:

from bs4 import BeautifulSoup
import urllib.request
import re

# Fetch the recipe archive page
diabetesFile = urllib.request.urlopen("http://www.diabetes.org/mfa-recipes/recipes/recipes-archive.html?referrer=http://www.diabetes.org/mfa-recipes/recipes/")
diabetesHtml = diabetesFile.read()
diabetesFile.close()

soup = BeautifulSoup(diabetesHtml, "html.parser")

# Match anchors whose href contains a dated recipe path like /recipes/20xx-...
find = re.compile('/recipes/20(.*?)"')
for link in soup.find_all('a', attrs={'href': re.compile("/recipes/20")}):
    searchRecipe = re.search(find, str(link))
    recipe = searchRecipe.group(1)
    print(recipe)
Here is an example of one of the pages that will be scraped:

import bs4 as bs
import urllib.request

# Fetch a single recipe page
sauce = urllib.request.urlopen('http://www.diabetes.org/mfa-recipes/recipes/2017-02-dijon-chicken-and-broccoli-and-noodles.html').read()
soup = bs.BeautifulSoup(sauce, 'html.parser')

# Print each of the three sections of interest
for div in soup.find_all('div', class_='ingredients'):
    print(div.text)
for div in soup.find_all('div', class_='nutritional_info'):
    print(div.text)
for div in soup.find_all('div', class_='instructions'):
    print(div.text)

My main goal is to use the site from the first piece of code to collect all of the links from all 680 archive pages, then go into each of those pages and gather the information extracted by the second piece of code. Finally, I am trying to write all of that information to a text file. Thanks in advance.

I'm not going to write the whole scraper for you, but here is an outline of what you can do.

These are the packages:

from bs4 import BeautifulSoup
import requests  # I use this instead of urllib
import re
The code to get the page:

req = requests.Session()
# A requests response has no .read() method; use .text to get the HTML
sauce = req.get('http://www.diabetes.org/mfa-recipes/recipes/2017-02-dijon-chicken-and-broccoli-and-noodles.html').text
soup = BeautifulSoup(sauce, 'lxml')

search_link = re.compile("/recipes/20")  # no need to escape forward slashes in Python regexes

all_links_find = soup.find_all("a", href=search_link)
# Take the href attribute of each anchor; get_text() would return the link's visible text instead
all_links_get = [link.get("href") for link in all_links_find]
Depending on the href values, you may need to prepend the base URL: if a href already starts with http it is an absolute URL and you don't need to do anything, otherwise you should do something like this:

all_links = [baseurl + link for link in all_links_get]  # baseurl would be "http://www.diabetes.org" here
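
Alternatively, urllib.parse.urljoin from the standard library handles both cases for you; a minimal sketch, assuming the base URL of the site being scraped:

from urllib.parse import urljoin

baseurl = "http://www.diabetes.org"
# urljoin leaves absolute hrefs untouched and resolves relative ones against baseurl
all_links = [urljoin(baseurl, link) for link in all_links_get]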
For the recipe pages themselves, you can repeat the logic above with the find method on the divs from your question, but this time, instead of get_text(strip=True), use something like get_text("\n", strip=True) for pretty printing.
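
To tie it together, here is a minimal sketch of the whole pipeline, including writing everything to a text file. The page query parameter used to walk the archive is an assumption (the question doesn't show how the 680 archive pages are addressed), so inspect the real archive URLs and adjust; the div class names come from the question's second snippet.

import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://www.diabetes.org"
ARCHIVE_URL = BASE_URL + "/mfa-recipes/recipes/recipes-archive.html"
search_link = re.compile("/recipes/20")

req = requests.Session()

def recipe_links(page_number):
    # NOTE: the "page" query parameter is an assumption -- check how the
    # archive really paginates and adjust this request accordingly.
    sauce = req.get(ARCHIVE_URL, params={"page": page_number}).text
    soup = BeautifulSoup(sauce, "html.parser")
    # Build absolute URLs for every anchor matching the recipe pattern
    return [urljoin(BASE_URL, a["href"])
            for a in soup.find_all("a", href=search_link)]

def scrape_recipe(url):
    soup = BeautifulSoup(req.get(url).text, "html.parser")
    sections = []
    for class_ in ("ingredients", "nutritional_info", "instructions"):
        div = soup.find("div", class_=class_)
        if div:  # be defensive: a page might be missing a section
            sections.append(div.get_text("\n", strip=True))
    return "\n\n".join(sections)

with open("recipes.txt", "w", encoding="utf-8") as out:
    for page in range(1, 681):  # the question mentions 680 archive pages
        for url in recipe_links(page):
            out.write(url + "\n" + scrape_recipe(url) + "\n\n")

Doing all the fetches through one requests.Session reuses the underlying connection, which matters when you are hitting several thousand pages on the same host.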