Parsing JavaScript returned from BeautifulSoup

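I want to parse this web page to get today's lunch menu. (I built an Adafruit IoT thermal printer, and I'd like to print the menu automatically every day.)

I initially tried this with BeautifulSoup, but it turns out that most of the data is loaded by JavaScript, and I'm not sure BeautifulSoup can handle that. If you view the page source, you'll see the relevant data stored in bootstrapData['menuMonthWeeks'].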

import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://dcsd.nutrislice.com/menu/meadow-view/lunch/"
soup = BeautifulSoup(urllib2.urlopen(url).read())
That is a simple way to fetch the page source and inspect it.

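My question is: what is the simplest way to extract this data so that I can do something with it? Literally all I want is a string like:

Southwest Cheese Omelet, Potato Wedges, The Harvest Bar (THB), THB - Cheesy Pesto Bread, Ham Deli Sandwich, Red Pepper Sticks, Strawberries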


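I considered using webkit to process the page and get the resulting HTML (that is, doing what a browser does), but that seems unnecessarily complex. I'd rather just find something that can parse the bootstrapData['menuMonthWeeks'] data.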

Something like PhantomJS may be more robust, but here is some basic Python code that extracts the full menu:

import json
import re
import urllib2

text = urllib2.urlopen('http://dcsd.nutrislice.com/menu/meadow-view/lunch/').read()
menu = json.loads(re.search(r"bootstrapData\['menuMonthWeeks'\]\s*=\s*(.*);", text).group(1))

print menu
After that, you need to search the menu for the date you are interested in.

Edit: I went a little overboard:

import itertools
import json
import re
import urllib2

text = urllib2.urlopen('http://dcsd.nutrislice.com/menu/meadow-view/lunch/').read()
menus = json.loads(re.search(r"bootstrapData\['menuMonthWeeks'\]\s*=\s*(.*);", text).group(1))

days = itertools.chain.from_iterable(menu['days'] for menu in menus)

day = next(itertools.dropwhile(lambda day: day['date'] != '2014-01-13', days), None)

if day:
    print '\n'.join(item['food']['description'] for item in day['menu_items'])
else:
    print 'Day not found.'
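
To get the single comma-separated string the question asks for, the lookup above could be wrapped in a small helper. This is a minimal Python 3 sketch, not the answerer's code: `menu_string_for` and `sample_menus` are hypothetical, and the data shape (`days`, `date`, `menu_items`, `food`, `description`) is assumed from the snippet above rather than from any documented schema.

```python
def menu_string_for(menus, date_str):
    """Return the descriptions for date_str joined by ', ', or None if absent.

    `menus` is assumed to be the decoded bootstrapData['menuMonthWeeks'] list:
    one dict per week, each holding a 'days' list.
    """
    for week in menus:
        for day in week.get('days', []):
            if day.get('date') == date_str:
                return ', '.join(
                    item['food']['description']
                    for item in day.get('menu_items', [])
                    if item.get('food')  # assumption: some entries may carry no food
                )
    return None


# Hypothetical sample data mirroring the assumed structure.
sample_menus = [
    {'days': [
        {'date': '2014-01-13', 'menu_items': [
            {'food': {'description': 'Southwest Cheese Omelet'}},
            {'food': None},
            {'food': {'description': 'Potato Wedges'}},
        ]},
    ]},
]

print(menu_string_for(sample_menus, '2014-01-13'))
```

For a daily print job, the date argument could come from `datetime.date.today().isoformat()`, which produces the same YYYY-MM-DD form as the hard-coded date above.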

All you need is a bit of string slicing:

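import json
import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://dcsd.nutrislice.com/menu/meadow-view/lunch/"
soup = BeautifulSoup(urllib2.urlopen(url).read())
# Take the second <script> tag on the page, which is where this answer expects the data.
script = soup.findAll('script')[1].string
# Keep what follows the assignment, then drop everything from the last semicolon on.
data = script.split("bootstrapData['menuMonthWeeks'] = ", 1)[-1].rsplit(';', 1)[0]
data = json.loads(data)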

After all, JSON is a subset of JavaScript.
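
Because the value is a JSON literal embedded in a JavaScript statement, another option is to let the JSON decoder decide where the literal ends instead of trimming at the last semicolon. The following is a minimal Python 3 sketch using `json.JSONDecoder.raw_decode` from the standard library; `extract_assigned_json` and `sample_script` are hypothetical names, and the sample script text is made up for illustration.

```python
import json


def extract_assigned_json(script_text, marker):
    """Decode the JSON value assigned to `marker` inside JavaScript source."""
    start = script_text.index(marker) + len(marker)
    start = script_text.index('=', start) + 1
    # raw_decode does not skip leading whitespace, so advance past it first.
    while script_text[start].isspace():
        start += 1
    value, _end = json.JSONDecoder().raw_decode(script_text, start)
    return value


# Made-up script text; note the ';' and '=' inside the JSON string value.
sample_script = (
    "var bootstrapData = {};\n"
    "bootstrapData['menuMonthWeeks'] = [{\"days\": [], \"note\": \"a; b = c\"}];\n"
    "bootstrapData['other'] = 1;\n"
)

weeks = extract_assigned_json(sample_script, "bootstrapData['menuMonthWeeks']")
print(weeks[0]['note'])
```

The trade-off is a few more lines in exchange for not depending on which semicolon happens to be last in the script.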

Without BeautifulSoup, there is a simple way to do it:

import urllib2
import json
url = "http://dcsd.nutrislice.com/menu/meadow-view/lunch/"
for line in urllib2.urlopen(url):
    if "bootstrapData['menuMonthWeeks']" in line:
        # Split on the first '=' only, so any '=' inside the JSON is kept.
        data = json.loads(line.split("=", 1)[1].strip('\n;'))
        print data[0]["last_updated"]
Output:

2013-11-11T11:18:13.636

A more general approach exists, but if you would rather not involve json at all, you can try the following (not recommended):

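import urllib2
import re

url = "http://dcsd.nutrislice.com/menu/meadow-view/lunch/"
# Brittle: assumes the assignment sits at line index 60 of the page.
data = urllib2.urlopen(url).readlines()[60].partition('=')[2].strip()

foodlist = []

# Walk every double-quoted token, remembering the previous one;
# a token that directly follows "name" is treated as a food name.
prev = 'name'
for i in re.findall('"([^"]*)"', data):
    # Skip these tokens so that what follows them is not taken as a name.
    if "The Harvest Bar (THB)" in i or i == "description" or i == "start_date":
        prev = i
        continue
    if prev == 'name':
        if i.startswith("THB - "):
            i = i[6:]  # drop the "THB - " prefix
        foodlist.append(i)
    prev = i

print '\n'.join(foodlist)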
I think this is what you ultimately need:

Orange Chicken Bowl
Roasted Veggie Pesto Pizza
Cheese Sandwich & Yogurt Tube
Steamed Peas
Peaches
Southwest Cheese Omelet
Potato Wedges
Cheesy Pesto Bread
Ham Deli Sandwich
Red Pepper Sticks
Strawberries
Hamburger
Cheeseburger
Potato Wedges
Chicken Minestrone Soup
Veggie Deli Sandwich
Baked Beans
Green Beans
Fruit Cocktail
Cheese Pizza
Pepperoni Pizza
Diced Chicken w/ Cornbread
Turkey Deli Sandwich
Celery Sticks
Blueberries
Cowboy Mac
BYO Asian Salad
Sunbutter Sandwich
Stir Fry Vegetables
Pineapple Tidbits
Enchilada Blanco
Sausage & Black Olive Pizza
Cheese Sandwich & Yogurt Tube
Southwest Black Beans
Red Pepper Sticks
Applesauce
BBQ Roasted Chicken.
Hummus Cup w/  Pita bread
Ham Deli Sandwich
Mashed potatoes w/ gravy
Celery Sticks
Kiwi
Popcorn Chicken Bowl
Tuna Salad w/  Pita Bread
Veggie Deli Sandwich
Corn Niblets
Blueberries
Cheese Pizza
Pepperoni Pizza
BYO Chef Salad
BYO Vegetarian Chef Salad
Turkey Deli Sandwich
Steamed Cauliflower
Banana, Whole
Bosco Sticks
Chicken Egg Roll & Chow Mein Noodles
Sunbutter Sandwich
California Blend Vegetables
Fresh Pears
Baked Mac & Cheese
Italian Dunker
Ham Deli Sandwich
Red Pepper Sticks
Pineapple Tidbits
Hamburger
Cheeseburger
Baked Fries
BYO Taco Salad
Veggie Deli Sandwich
Baked Beans
Coleslaw
Fresh Grapes
Cheese Pizza
Pepperoni Pizza
Diced Chicken w/ Cornbread
Turkey Deli Sandwich
Steamed Cauliflower
Fruit Cocktail
French Dip w/ Au Jus
Baked Fries
Turkey Noodle Soup
Sunbutter Sandwich
Green Beans
Warm Cinnamon Apples
Rotisserie Chicken
Mashed potatoes w/ gravy
Bacon Cheeseburger Pizza
Cheese Sandwich & Yogurt Tube
Steamed Peas
Apple Wedges
Turkey Chili 
Cornbread Muffins
BYO Chef Salad
BYO Vegetarian Chef Salad
Ham Deli Sandwich
Celery Sticks
Fresh Pears
Beef, Bean & Red Chili Burrito
Popcorn Chicken & Breadstick
Veggie Deli Sandwich
California Blend Vegetables
Strawberries
Cheese Pizza
Pepperoni Pizza
Hummus Cup w/  Pita bread
Turkey Deli Sandwich
Green Beans
Orange Wedges
Bosco Sticks
Cheesy Bean Soft Taco Roll Up
Sunbutter Sandwich
Pinto Bean Cup
Baby Carrots
Blueberries
Using json:

import urllib2
import json
url = "http://dcsd.nutrislice.com/menu/meadow-view/lunch/"
for line in urllib2.urlopen(url):
    if "bootstrapData['menuMonthWeeks']" in line:
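        # Split on the first '=' only, so any '=' inside the JSON is kept.
        data = json.loads(line.split("=", 1)[1].strip('\n;'))
        print data[0]["name"]
        break  # stop scanning once the menu line has been handled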
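
The snippets in this thread are Python 2 (`urllib2`, `print` statements). A rough Python 3 equivalent of the line-scanning approach might look like the sketch below. `find_menu_weeks` and `fetch_menu_weeks` are hypothetical helpers; the sketch assumes, as the code above does, that the whole assignment sits on one line of the page, and that the page is UTF-8 encoded. It does not verify that the page still embeds the data this way.

```python
import json
from urllib.request import urlopen

MARKER = "bootstrapData['menuMonthWeeks']"


def find_menu_weeks(lines):
    """Return the decoded JSON from the first line containing MARKER, else None."""
    for line in lines:
        if MARKER in line:
            # Split on the first '=' only, so '=' inside the JSON is preserved.
            payload = line.split('=', 1)[1].strip().rstrip(';')
            return json.loads(payload)
    return None


def fetch_menu_weeks(url):
    """Download the page and scan it line by line (performs a network request)."""
    with urlopen(url) as response:
        # Assumption: the page is UTF-8 encoded.
        return find_menu_weeks(raw.decode('utf-8') for raw in response)


# Offline demonstration on made-up lines:
sample_lines = [
    "<script>\n",
    "bootstrapData['menuMonthWeeks'] = [{\"name\": \"a=b\"}];\n",
    "</script>\n",
]
print(find_menu_weeks(sample_lines)[0]['name'])
```

Keeping the parsing in a function that accepts any iterable of lines makes it possible to try the logic on saved page source without touching the network.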

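I realize this is about four years late, but nutrislice (at least now) has an API from which you can get the JSON directly. Your kid's lunch from a few days ago: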

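Very useful! It needs a few more imports and a URL definition, but in the end this also gets the value nicely. Four years later for them, six years later for me: I was looking around to see how I could quickly knock something together to grab the next day's menu. Thanks a lot.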