Python: unable to separate certain fields from each ingredient container

python python-3.x regex web-scraping

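I'm trying to separate three fields, name, unit and measure, from a set of ingredient containers, where each container holds all three fields in a single string. I use BeautifulSoup to parse the ingredient containers and then the re module to separate the unit (the number) from the measure (the word that follows it). The page I want to get these three fields from is the one whose URL appears in the script below.

This is how I've tried so far: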

import re
import requests
from bs4 import BeautifulSoup

link = 'https://www.delicious.com.au/recipes/gnocchi-walnut-rosemary-pecorino-pesto/1b0defa9-53c8-4e9c-8c93-fb96a5348b31?r=recipes/gallery/opvo6a3l'

def get_content(s,link):
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    for item in soup.select("ul.ingredient > li"):
        ingr_container = item.get_text(strip=True)
        ingr_unit_container = re.search(r"[\d.⁄a-z]+",ingr_container).group(0)
        ingr_name = re.sub(ingr_unit_container,"",ingr_container).strip()
        ingr_unit = re.sub(r"[a-z]+","",ingr_unit_container).strip()
        ingr_measure = re.sub(r"[\d.⁄]+","",ingr_unit_container).strip()
        yield ingr_name,ingr_unit,ingr_measure

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
        for item in get_content(s,link):
            print(item)

The ingredient containers look like this:

500g potato gnocchi
2 tbs extra virgin olive oil
Finely grated zest and juice of 1 lemon
1⁄2 bunch basil, leaves picked
1 tbs finely chopped rosemary, plus fried rosemary leaves to serve
2 garlic cloves, crushed
50g grated pecorino, (or parmesan) plus extra to serve
50g roasted and chopped walnuts, plus extra to serve
100ml extra virgin olive oil

Current output the script produces from the containers above:

('potato gnocchi', '500', 'g')
('tbs extra virgin olive oil', '2', '')
('F grated zest and juice of 1 lemon', '', 'inely')
('bunch basil, leaves picked', '1⁄2', '')
('tbs finely chopped rosemary, plus fried rosemary leaves to serve', '1', '')
('garlic cloves, crushed', '2', '')
('grated pecorino, (or parmesan) plus extra to serve', '50', 'g')
('roasted and chopped walnuts, plus extra to serve', '50', 'g')
('extra virgin olive oil', '100', 'ml')

Expected output:

('potato gnocchi', '500', 'g')
('extra virgin olive oil', '2', 'tbs')
('Finely grated zest and juice of', '1', 'lemon')
('basil, leaves picked', '1⁄2', 'bunch')
('finely chopped rosemary, plus fried rosemary leaves to serve', '1', 'tbs')
('cloves, crushed', '2', 'garlic')
('grated pecorino, (or parmesan) plus extra to serve', '50', 'g')
('roasted and chopped walnuts, plus extra to serve', '50', 'g')
('extra virgin olive oil', '100', 'ml')

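So one solution is to search the text for the number, which is the quantity (the unit field in the question). It gets a bit tricky because the measure is sometimes attached to the number (500g) and sometimes separated from it by a space (2 tbs). You can handle both cases with a condition (there is probably a regex solution as well).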
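The condition-based idea can be sketched roughly as follows. This is a minimal illustration, not the answerer's original code: the function name split_ingredient and the hard-coded sample list are assumptions, and the strings are taken from the question so the sketch runs without fetching the page.

```python
# Characters allowed in the quantity: digits, dot, fraction slash (U+2044), plain slash.
QUANTITY_CHARS = set("0123456789.⁄/")


def split_ingredient(text):
    """Return (name, unit, measure) for one ingredient string (hypothetical helper)."""
    tokens = text.split()
    for i, token in enumerate(tokens):
        if token[0] not in "0123456789":
            continue  # not the quantity token, keep looking
        # Length of the leading run of quantity characters in this token.
        j = 0
        while j < len(token) and token[j] in QUANTITY_CHARS:
            j += 1
        ingr_unit, suffix = token[:j], token[j:]
        if suffix:
            # Measure is glued to the number, e.g. "500g".
            ingr_measure = suffix
            rest = tokens[:i] + tokens[i + 1:]
        elif i + 1 < len(tokens):
            # Measure is the next word, e.g. "2 tbs".
            ingr_measure = tokens[i + 1]
            rest = tokens[:i] + tokens[i + 2:]
        else:
            # Number is the last word, so there is no measure.
            ingr_measure = ""
            rest = tokens[:i]
        return " ".join(rest), ingr_unit, ingr_measure
    # No number found: everything is the name.
    return text, "", ""


if __name__ == "__main__":
    samples = [
        "500g potato gnocchi",
        "2 tbs extra virgin olive oil",
        "Finely grated zest and juice of 1 lemon",
        "1⁄2 bunch basil, leaves picked",
        "1 tbs finely chopped rosemary, plus fried rosemary leaves to serve",
        "2 garlic cloves, crushed",
        "50g grated pecorino, (or parmesan) plus extra to serve",
        "50g roasted and chopped walnuts, plus extra to serve",
        "100ml extra virgin olive oil",
    ]
    for sample in samples:
        print(split_ingredient(sample))
```

In the scraper, the same helper could be applied to item.get_text(strip=True) for each li element.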

Output:

('potato gnocchi', '500', 'g')
('extra virgin olive oil', '2', 'tbs')
('Finely grated zest and juice of', '1', 'lemon')
('basil, leaves picked', '1⁄2', 'bunch')
('finely chopped rosemary, plus fried rosemary leaves to serve', '1', 'tbs')
('cloves, crushed', '2', 'garlic')
('grated pecorino, (or parmesan) plus extra to serve', '50', 'g')
('roasted and chopped walnuts, plus extra to serve', '50', 'g')
('extra virgin olive oil', '100', 'ml')

My regex skills are not good at all. However, I found that the following implementation works:

import re
import requests
from bs4 import BeautifulSoup

link = 'https://www.delicious.com.au/recipes/gnocchi-walnut-rosemary-pecorino-pesto/1b0defa9-53c8-4e9c-8c93-fb96a5348b31?r=recipes/gallery/opvo6a3l'

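def get_content(s,link):
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    for item in soup.select("ul.ingredient > li"):
        ingr_container = item.get_text(strip=True)
        # a number, optional whitespace, then the first word after it, e.g. "500g" or "2 tbs"
        match = re.search(r'[\d.⁄]+\s*[a-zA-Z]+',ingr_container)
        if match is None:
            # no quantity in this item: keep the whole text as the name
            yield ingr_container,"",""
            continue
        unit_container = match.group(0)
        # remove only the first occurrence so repeated text in the name survives
        ingr_name = ingr_container.replace(unit_container,"",1).strip()
        ingr_unit = re.search(r'[\d.⁄]+',unit_container).group(0)
        ingr_measure = unit_container.replace(ingr_unit,"",1).strip()
        yield ingr_name,ingr_unit,ingr_measure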

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
        for item in get_content(s,link):
            print(item)
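For comparison, here is a sketch of a variant that captures the number and the word after it in one pattern with named groups, and builds the name from the text around the match instead of calling str.replace. The names QTY_MEASURE and parse_ingredient are illustrative assumptions, and the function works on plain strings so it can be tried without fetching the page.

```python
import re

# Number (digits, dots, fraction slash U+2044 or plain slash), optional
# whitespace, then the first word after it. Pattern is an assumption
# modelled on the ingredient strings shown in the question.
QTY_MEASURE = re.compile(r"(?P<unit>\d[\d.⁄/]*)\s*(?P<measure>[A-Za-z]+)")


def parse_ingredient(text):
    """Return (name, unit, measure); unit and measure are empty if no number is found."""
    match = QTY_MEASURE.search(text)
    if match is None:
        return text.strip(), "", ""
    # Keep everything before and after the matched "number + word" as the name,
    # then collapse any doubled whitespace left at the join.
    name = " ".join((text[:match.start()] + " " + text[match.end():]).split())
    return name, match.group("unit"), match.group("measure")


if __name__ == "__main__":
    for sample in ("500g potato gnocchi",
                   "Finely grated zest and juice of 1 lemon",
                   "1⁄2 bunch basil, leaves picked"):
        print(parse_ingredient(sample))
```

Slicing around match.start() and match.end() removes exactly the matched span, so a name that happens to repeat the same text elsewhere is left intact.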
