Python/BeautifulSoup: extract all text from <li> tags between h2 and h3 tags
Tags: python, web-scraping, beautifulsoup

What I'm trying to do: This website has 3 lists of food additives, and I am trying to extract them to get 3 separate lists. They sit in <ul> and <li> tags, between <h2> and <h3> tags.
I want to find the first h2, extract all the lis below it into a list, start a new list when I reach the next h tag (h3) and extract all the lis below that one, and then continue with the third list.
What I have already tried: I read around and found a question very similar to mine.
I tried to apply the logic of that answer, but it did not work for me.
Before I start building the lists, I am running print statements to see what the output is:
import urllib.request as request
import bs4 as bs
sauce = request.urlopen("https://www.foodadditivesworld.com/articles/banned-food-additives.html").read()
soup = bs.BeautifulSoup(sauce, 'lxml')
firstH2 = soup.find('h2') # Start here
# print(firstH2.text)
# print(firstH2.findNextSiblings())
uls = []
for sib in firstH2.findNextSiblings():
    # print(child.name)
    if sib.name == 'h3':
        print(sib)
        break
    elif sib.name == 'div':
        print(sib.text)
        continue
    for c in sib.descendants:
        if c.name == 'li':
            print(c)
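What happens: The code basically does what I want, but it should break the first time it encounters an h3 tag. It does not; it carries on to the second h3 tag before stopping. Why does it miss the first occurrence?

You can scrape the h2, h3, and ul tags and then use itertools.groupby: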
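import requests, itertools, re
from bs4 import BeautifulSoup as soup

d = soup(requests.get('https://www.foodadditivesworld.com/articles/banned-food-additives.html').text, 'html.parser')
# Collect every h2/h3/ul in document order; drop the first match (the menu dropdown's ul)
_, *data = [[i.name, i] for i in d.find_all(re.compile('h2|h3|ul'))]
# Split into alternating runs of consecutive headings (key True) and consecutive ul tags (key False)
new_data = [[a, list(b)] for a, b in itertools.groupby(data, key=lambda x:x[0] == 'h2' or x[0] == 'h3')]
# Pair the last heading of each heading run with the li texts of the ul run that follows it
new_result = [[new_data[i][-1][0][-1].text, [c.text for b in new_data[i+1][-1] for c in b[-1].find_all('li')]] for i in range(0, len(new_data), 2)]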
Output:
[['Banned Food Additives in US', ['Calamus extract', 'Calamus oil', 'Calcium cyclamate', 'Chlorofluorocarbons', 'cinnamyl anthranilate', 'Cobaltous chloride', 'Cobalt sulfate', 'Coumarin', 'Cyclamate', 'Diethyl pyrocarbonatec', 'Dulcin', 'Fd&c green no. 1', 'Fd&c green no. 2', 'Fd&c red no. 3, aluminum lake', 'CFd&c red no. 3, calcium lake', 'Fd&c red no. 1', 'Fd&c red no. 2', 'Fd&c red no. 4', 'Fd&c violet no. 1', 'Magnesium cyclamate', 'Nordihydroguaiaretic acid', 'Potassium cyclamate', 'P-4000', 'Safrole', 'Sodium cyclamate', 'Thiourea']], ['UK Food Additives Banned in Australia and New Zealand', ['E131 Patent Blue V', 'E154 Brown FK', 'E161g Canthaxanthin', 'E180 Litholrubine BK']], ['Preservatives', ['E214 Â\xa0 Ethyl p-hydroxybenzoate', 'E215 Â\xa0 Sodium ethyl p-hydroxybenzoate', 'E219 Â\xa0 Sodium methyl p-hydroxybenzoate', 'E226 Â\xa0 Calcium sulphite', 'E227 Â\xa0 Calcium hydrogen sulphite', 'E230 Â\xa0 Biphenyl; diphenyl', 'E231 Â\xa0 Orthophenyl phenol', 'E232 Â\xa0 Sodium orthophenyl phenol', 'E239 Â\xa0 Hexamethylene tetramine', 'E284 Â\xa0 Boric acid', 'E285 Â\xa0 Sodium tetraborate; borax', 'E356 Â\xa0 Sodium adipate antioxidant']], ['Stabilisers, Thickeners and Gelling Agents Emulsifiers', ['E417 Â\xa0 Tara gum', 'E425 Â\xa0 Konjac', 'E426 Â\xa0 Soybean hemicellulose', 'E226 Â\xa0 Calcium sulphite', 'E432 Polyoxyethylene sorbitan monolaurate; Polysorbate 20', 'E434 Polyoxyethylene sorbitan monopalmitate; Polysorbate 40', 'E459 Â\xa0 Beta-cyclodextrin', 'E462 Â\xa0 Ethyl cellulose', 'E468 Â\xa0 Crosslinked sodium carboxy methyl cellulose', 'E472d Â\xa0 Tartaric acid esters of mono- and diglycerides of fatty acids', 'E474 Â\xa0 Sucroglycerides', 'E483 Â\xa0 Stearyl tartrate', 'E493 Â\xa0 Sorbitan monolaurate', 'E494 Â\xa0 Sorbitan monooleate', 'E495 Â\xa0 Sorbitan monopalmitate', 'E513 Â\xa0 Sulphuric acid', 'E517 Â\xa0 Ammonium sulphate', 'E520 Â\xa0 Aluminium sulphate', 'E521 Â\xa0 Aluminium sodium sulphate', 'E522 Â\xa0 Aluminium potassium 
sulphate', 'E523 Â\xa0 Aluminium ammonium sulphate', 'E524 Â\xa0 Sodium hydroxide', 'E525 Â\xa0 Potassium hydroxide', 'E527 Â\xa0 Ammonium hydroxide', 'E528 Â\xa0 Magnesium hydroxide', 'E538 Â\xa0 Calcium ferrocyanide', 'E553a Â\xa0 (i) Magnesium silicate', 'E553b Â\xa0 Talc E574 Â\xa0 Gluconic acid', 'E576 Â\xa0 Sodium gluconate', 'E585 Â\xa0 Ferrous lactate', 'E626 Â\xa0 Guanylic acid', 'E628 Â\xa0 Dipotassium guanylate', 'E629 Â\xa0 Calcium guanylate', 'E630 Â\xa0 lnosinic acid', 'E632 Â\xa0 Dipotassium inosinate', 'E633 Â\xa0 Calcium inosinate', "E634 Â\xa0 Calcium 5'-ribonucleotides", 'E650 Â\xa0 Zinc acetate', 'E900 Â\xa0 Dimethylpolysiloxane', 'E902 Â\xa0 Candelilla wax', 'E905 Â\xa0 Microcrystalline wax', 'E912 Â\xa0 Montan acid esters', 'E927b Â\xa0 Carbamide', 'E938 Â\xa0 Argon', 'E939 Â\xa0 Helium', 'E948 Â\xa0 Oxygen', 'E949 Â\xa0 Hydrogen', 'E959 Â\xa0 Neohesperidine DC', 'E962 Â\xa0 Salt of aspartame-acesulfame', 'E999 Â\xa0 Quillaia extract', 'E1103 Â\xa0 Invertase', 'E1202 Â\xa0 Polyvinylpolypyrrolidone', 'E1204 Â\xa0 Pullulan', 'E1451 Â\xa0 Acetylated oxidised starch', 'E1452 Â\xa0 Starch aluminium Octenyl succinate', 'Annatto ExtractM', 'Anthocyanins', 'Lake Allura Red', 'Lake Amaranth', 'Solvent Black 5', 'Solvent Black 7', 'Pigment Fast Yellow G', 'Pigment Green B', 'FD&C; Blue No.2 ', 'FD&C; Blue No.1 ', 'Beverages ', 'Confectionery ', 'Anticaking Agents ', 'Color Retention Agents ']]]
To print the results:
print('\n\n'.join(' {}\n{}'.format(a, '\n'.join(f'\t-{i}' for i in b)) for a, b in new_result))
Output:
Banned Food Additives in US
-Calamus extract
-Calamus oil
-Calcium cyclamate
-Chlorofluorocarbons
-cinnamyl anthranilate
-Cobaltous chloride
-Cobalt sulfate
-Coumarin
-Cyclamate
-Diethyl pyrocarbonatec
-Dulcin
-Fd&c green no. 1
-Fd&c green no. 2
-Fd&c red no. 3, aluminum lake
-CFd&c red no. 3, calcium lake
-Fd&c red no. 1
-Fd&c red no. 2
-Fd&c red no. 4
-Fd&c violet no. 1
-Magnesium cyclamate
-Nordihydroguaiaretic acid
-Potassium cyclamate
-P-4000
-Safrole
-Sodium cyclamate
-Thiourea
UK Food Additives Banned in Australia and New Zealand
-E131 Patent Blue V
-E154 Brown FK
-E161g Canthaxanthin
-E180 Litholrubine BK
Preservatives
-E214 Â Ethyl p-hydroxybenzoate
-E215 Â Sodium ethyl p-hydroxybenzoate
-E219 Â Sodium methyl p-hydroxybenzoate
-E226 Â Calcium sulphite
-E227 Â Calcium hydrogen sulphite
-E230 Â Biphenyl; diphenyl
-E231 Â Orthophenyl phenol
-E232 Â Sodium orthophenyl phenol
-E239 Â Hexamethylene tetramine
-E284 Â Boric acid
-E285 Â Sodium tetraborate; borax
-E356 Â Sodium adipate antioxidant
Stabilisers, Thickeners and Gelling Agents Emulsifiers
-E417 Â Tara gum
-E425 Â Konjac
-E426 Â Soybean hemicellulose
-E226 Â Calcium sulphite
-E432 Polyoxyethylene sorbitan monolaurate; Polysorbate 20
-E434 Polyoxyethylene sorbitan monopalmitate; Polysorbate 40
-E459 Â Beta-cyclodextrin
-E462 Â Ethyl cellulose
-E468 Â Crosslinked sodium carboxy methyl cellulose
-E472d  Tartaric acid esters of mono- and diglycerides of fatty acids
-E474 Â Sucroglycerides
-E483 Â Stearyl tartrate
-E493 Â Sorbitan monolaurate
-E494 Â Sorbitan monooleate
-E495 Â Sorbitan monopalmitate
-E513 Â Sulphuric acid
-E517 Â Ammonium sulphate
-E520 Â Aluminium sulphate
-E521 Â Aluminium sodium sulphate
-E522 Â Aluminium potassium sulphate
-E523 Â Aluminium ammonium sulphate
-E524 Â Sodium hydroxide
-E525 Â Potassium hydroxide
-E527 Â Ammonium hydroxide
-E528 Â Magnesium hydroxide
-E538 Â Calcium ferrocyanide
-E553a  (i) Magnesium silicate
-E553b  Talc E574  Gluconic acid
-E576 Â Sodium gluconate
-E585 Â Ferrous lactate
-E626 Â Guanylic acid
-E628 Â Dipotassium guanylate
-E629 Â Calcium guanylate
-E630 Â lnosinic acid
-E632 Â Dipotassium inosinate
-E633 Â Calcium inosinate
-E634 Â Calcium 5'-ribonucleotides
-E650 Â Zinc acetate
-E900 Â Dimethylpolysiloxane
-E902 Â Candelilla wax
-E905 Â Microcrystalline wax
-E912 Â Montan acid esters
-E927b  Carbamide
-E938 Â Argon
-E939 Â Helium
-E948 Â Oxygen
-E949 Â Hydrogen
-E959 Â Neohesperidine DC
-E962 Â Salt of aspartame-acesulfame
-E999 Â Quillaia extract
-E1103 Â Invertase
-E1202 Â Polyvinylpolypyrrolidone
-E1204 Â Pullulan
-E1451 Â Acetylated oxidised starch
-E1452 Â Starch aluminium Octenyl succinate
-Annatto ExtractM
-Anthocyanins
-Lake Allura Red
-Lake Amaranth
-Solvent Black 5
-Solvent Black 7
-Pigment Fast Yellow G
-Pigment Green B
-FD&C; Blue No.2
-FD&C; Blue No.1
-Beverages
-Confectionery
-Anticaking Agents
-Color Retention Agents
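
The groupby step may be easier to follow on hand-written data than on the live page. The following is only a minimal sketch: the `tags` list is a hypothetical stand-in for the [tag name, element] pairs built from find_all, with plain strings and lists in place of the real BeautifulSoup elements, and `is_heading` is an illustrative helper name. It shows how consecutive headings and lists are split into alternating runs and then paired up.

```python
import itertools

# Hypothetical stand-in for the [tag name, element] pairs;
# strings and lists replace the real BeautifulSoup elements.
tags = [
    ["h2", "Heading A"],
    ["ul", ["a1", "a2"]],
    ["h3", "Heading B"],
    ["ul", ["b1"]],
    ["ul", ["b2", "b3"]],
]

def is_heading(pair):
    """Return True for heading entries, False for list entries."""
    return pair[0] in ("h2", "h3")

# groupby yields one run per stretch of consecutive items sharing the same key,
# so the runs alternate: headings, lists, headings, lists, ...
runs = [(key, list(group)) for key, group in itertools.groupby(tags, key=is_heading)]

result = []
# Step through the runs two at a time: a heading run, then the list run after it.
for i in range(0, len(runs) - 1, 2):
    heading_run, list_run = runs[i][1], runs[i + 1][1]
    title = heading_run[-1][1]  # last heading before the lists
    items = [item for _, ul in list_run for item in ul]  # flatten all lists in the run
    result.append([title, items])

print(result)
```

The sketch assumes the data starts with a heading. Its loop stops one run short of the end, so a trailing heading with no list after it is skipped rather than raising an IndexError.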
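That is some really concise code! It will take me a while to understand it. Thanks for your answer.
@Jimmy9zz Glad to help!
Just to clarify: why use "_, *data" as the variable?
@Jimmy9zz _ is a throwaway variable; in this case it is a way to discard the first result of [[i.name, i] for i in d.find_all(re.compile('h2|h3|ul'))] entirely. That first result is actually the ul holding the menu dropdown, so it is not needed.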