Python: scraping data between two ul tags
Hi, I am trying to scrape the content between tags. Below I have attached part of the source that I want to scrape. If you look closely, there are 3 ul tags. The first ul tag has class="listGroup". I am trying to extract the text of the second "ul" tag, the idea being that it is followed by another "ul" tag with the class "listGroup". Please share how I can do this.
<ul class="listGroup" id="ul_e6d09fbd-19fe-49ac-9b47-bd857c0d411b"><li class="acces-listitems"><a href="https://order.store.mayoclinic.com/books/gnweb43?utm_source=MC-DotOrg-PS&utm_medium=Link&utm_campaign=FamilyHealth-Book&utm_content=FHB">Book: Mayo Clinic Family Health Book, 5th Edition</a></li><li class="acces-hide-listitems"><a href="https://order.store.mayoclinic.com/hl/hldiged?utm_source=MC-DotOrg-PS&utm_medium=Link&utm_campaign=HealthLetter-Digital&utm_content=HLDE">Newsletter: Mayo Clinic Health Letter — Digital Edition</a></li></ul>
<ul>
<li>Osteoporosis</li>
<li>Kidney stones</li>
<li>Excessive urination</li>
<li>Abdominal pain</li>
<li>Tiring easily or weakness</li>
<li>Depression or forgetfulness</li>
<li>Bone and joint pain</li>
<li>Frequent complaints of illness with no apparent cause</li>
<li>Nausea, vomiting or loss of appetite</li>
</ul>
<ul>
<li>A noncancerous growth (adenoma) on a gland is the most common cause.</li>
<li>Enlargement (hyperplasia) of two or more parathyroid glands accounts for most other cases.</li>
<li>A cancerous tumor is a very rare cause of primary hyperparathyroidism.</li>
</ul>
You can use a CSS selector:
ul.listGroup + ul li
This selects every li tag inside the ul tag that immediately follows the ul tag with class "listGroup".
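This seems like a natural use case for CSS selectors, namely:
ul.listGroup+ul li
selects all li tags inside the first ul tag that immediately follows each ul tag with class listGroup. Replacing + with ~ selects all li tags inside every ul tag (2 in this case) that comes after a ul tag with class listGroup as a later sibling.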
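The difference between the two combinators can be checked on a small snippet. This is a minimal sketch: the markup is a shortened, hypothetical stand-in for the page in the question, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the markup in the question.
html = """
<ul class="listGroup"><li>Book</li></ul>
<ul><li>Osteoporosis</li><li>Kidney stones</li></ul>
<ul><li>Adenoma</li></ul>
"""

soup = BeautifulSoup(html, "html.parser")

# "+" (adjacent sibling): only the ul directly after ul.listGroup.
adjacent = [li.get_text() for li in soup.select("ul.listGroup + ul li")]

# "~" (general sibling): every later ul sibling of ul.listGroup.
general = [li.get_text() for li in soup.select("ul.listGroup ~ ul li")]

print(adjacent)
print(general)
```

Both combinators only relate elements that share the same parent, so neither will reach a ul nested in a different container.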
To implement this answer in your script, replace find_all with select and update the selector with the relevant CSS selector:
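import requests
import pandas
from bs4 import BeautifulSoup
for link in ['/diseases-conditions/hyperparathyroidism/symptoms-causes/syc-20356194']:
    page = requests.get(f"https://www.mayoclinic.org{link}")
    soup = BeautifulSoup(page.content, "html.parser")
    # select() takes a CSS selector; find_all() does not
    for each in soup.select("ul.listGroup+ul li"):
        print(each.text)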
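Perhaps you should consider using a regular expression to capture it. You say you are looking for the text of the "second" ul tag, the idea being that it is followed by another "ul" tag with the class "listGroup"; but in your example the third ul tag has no class.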
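For illustration, the regular-expression idea from the comment above might look like the sketch below. It is not a robust HTML parser: it assumes the target list is a plain `<ul>` with no attributes, directly after the listGroup list, and that no lists are nested. The markup is a shortened, hypothetical copy of the sample in the question.

```python
import re

# Shortened, hypothetical copy of the markup from the question.
html = '''<ul class="listGroup" id="x"><li><a href="#">Book</a></li></ul>
<ul>
<li>Osteoporosis</li>
<li>Kidney stones</li>
</ul>
<ul>
<li>Adenoma</li>
</ul>'''

# Capture the body of the plain <ul> that directly follows the listGroup <ul>.
pattern = re.compile(
    r'<ul[^>]*class="listGroup"[^>]*>.*?</ul>\s*<ul>(.*?)</ul>',
    re.DOTALL,
)
match = pattern.search(html)

# Pull the text of each <li> out of the captured body.
items = re.findall(r"<li>(.*?)</li>", match.group(1)) if match else []
print(items)
```

A change in attribute order, extra attributes on the second ul, or nested lists would break this pattern, which is why the selector-based answers are usually the safer choice.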
Original code from the question:
import requests
import pandas
from bs4 import BeautifulSoup
for link in ['/diseases-conditions/hyperparathyroidism/symptoms-causes/syc-20356194']:
    page = requests.get(f"https://www.mayoclinic.org{link}")
    soup = BeautifulSoup(page.content, "html.parser")
    for each in soup.find_all("ul"):
        print(each)
Example using the sample HTML from the question:
from bs4 import BeautifulSoup
txt = '''<ul class="listGroup" id="ul_e6d09fbd-19fe-49ac-9b47-bd857c0d411b"><li class="acces-listitems"><a href="https://order.store.mayoclinic.com/books/gnweb43?utm_source=MC-DotOrg-PS&utm_medium=Link&utm_campaign=FamilyHealth-Book&utm_content=FHB">Book: Mayo Clinic Family Health Book, 5th Edition</a></li><li class="acces-hide-listitems"><a href="https://order.store.mayoclinic.com/hl/hldiged?utm_source=MC-DotOrg-PS&utm_medium=Link&utm_campaign=HealthLetter-Digital&utm_content=HLDE">Newsletter: Mayo Clinic Health Letter — Digital Edition</a></li></ul>
<ul>
<li>Osteoporosis</li>
<li>Kidney stones</li>
<li>Excessive urination</li>
<li>Abdominal pain</li>
<li>Tiring easily or weakness</li>
<li>Depression or forgetfulness</li>
<li>Bone and joint pain</li>
<li>Frequent complaints of illness with no apparent cause</li>
<li>Nausea, vomiting or loss of appetite</li>
</ul>
<ul>
<li>A noncancerous growth (adenoma) on a gland is the most common cause.</li>
<li>Enlargement (hyperplasia) of two or more parathyroid glands accounts for most other cases.</li>
<li>A cancerous tumor is a very rare cause of primary hyperparathyroidism.</li>
</ul>'''
soup = BeautifulSoup(txt, 'html.parser')
for li in soup.select('ul.listGroup + ul li'):
    print(li.text)
Prints:
Osteoporosis
Kidney stones
Excessive urination
Abdominal pain
Tiring easily or weakness
Depression or forgetfulness
Bone and joint pain
Frequent complaints of illness with no apparent cause
Nausea, vomiting or loss of appetite
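The same list can also be reached without a CSS selector, by navigating from the listGroup element to its next ul sibling. This is only a sketch under the same assumptions as the answers above; the markup is a shortened, hypothetical stand-in and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the markup in the question.
html = """
<ul class="listGroup"><li>Book</li></ul>
<ul><li>Osteoporosis</li><li>Kidney stones</li></ul>
<ul><li>Adenoma</li></ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the anchor list, then step to the next ul sibling after it.
anchor = soup.find("ul", class_="listGroup")
target = anchor.find_next_sibling("ul") if anchor else None

# Collect the text of each li, or an empty list if nothing was found.
items = [li.get_text(strip=True) for li in target.find_all("li")] if target else []
print(items)
```

Unlike the + combinator, find_next_sibling("ul") skips over any non-ul siblings in between, so the two approaches can differ if other elements sit between the lists.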