Python 在h2标记的两个选择器之间提取文本

Python 在h2标记的两个选择器之间提取文本,python,xpath,web-scraping,scrapy,css-selectors,Python,Xpath,Web Scraping,Scrapy,Css Selectors,我试图从中提取文本。其基本思想是将h2标记作为键,将该特定h2标记下的所有文本作为值。我想做的是: for cnt, h2 in enumerate(response.xpath('//h2'), start=1): print h2.xpath('./following-sibling::node()[count(preceding-sibling::h2)=%d]/descendant-or-self::text()' % cnt).getall() 但这会生成包含文本的列表,我认

我试图从中提取文本。其基本思想是将h2标记作为键,将该特定h2标记下的所有文本作为值。我想做的是:

for cnt, h2 in enumerate(response.xpath('//h2'), start=1):
    print h2.xpath('./following-sibling::node()[count(preceding-sibling::h2)=%d]/descendant-or-self::text()' % cnt).getall()
但这会生成包含文本的列表,我认为循环需要嵌套文本,因此我尝试了以下代码:

h2 = response.xpath('//h2//text()').getall()
p={}
for cnt, h3 in enumerate(response.xpath('//h2'), start=1):
 text=''
 for t in h3.xpath('./following-sibling::node()[count(preceding-sibling::h2)=%d]//text()' % cnt).getall():
    if t in h2:
        p[t] = text
        text=''
    else:
        text=text+t       
它在某种程度上产生了一些结果,但不可接受

更新:

预期产出如下:

{'Contact your healthcare provider if:': 'You have pain in your lower back.You are a man and have pain in your testicles.You have pain when you urinate.You have questions or concerns about your condition or care.',
 'Follow up with your healthcare provider within 24 hours or as directed:': 'Write down your questions so you remember to ask them during your visits.',
 'Further information': 'Always consult your healthcare provider to ensure the information displayed on this page applies to your personal circumstances.',
 'Medicines:': '\xc2\xa9 Copyright IBM Corporation 2020 Information is for End Users use only and may not be sold, redistributed or otherwise used for commercial purposes. All illustrations and images included in CareNotes\xc2\xae are the copyrighted property of A.D.A.M., Inc. or IBM Watson Health.Medicines may be given to calm your stomach and prevent vomiting or to decrease pain. Ask how to take pain medicine safely.Take your medicine as directed. Contact your healthcare provider if you think your medicine is not helping or if you have side effects. Tell him of her if you are allergic to any medicine. Keep a list of the medicines, vitamins, and herbs you take. Include the amounts, and when and why you take them. Bring the list or the pill bottles to follow-up visits. Carry your medicine list with you in case of an emergency.The above information is an educational aid only. It is not intended as medical advice for individual conditions or treatments. Talk to your doctor, nurse or pharmacist before following any medical regimen to see if it is safe and effective for you.',
 'Return to the emergency department if:': 'You have new chest pain or shortness of breath.You have pulsing pain in your upper abdomen or lower back that suddenly becomes constant.Your pain is in the right lower abdominal area and worsens with movement.You have a fever over 100.4\xc2\xb0F (38\xc2\xb0C) or shaking chills.You are vomiting and cannot keep food or liquids down.Your pain does not improve or gets worse over the next 8 to 12 hours.You see blood in your vomit or bowel movements, or they look black and tarry.Your skin or the whites of your eyes turn yellow.You are a woman and have a large amount of vaginal bleeding that is not your monthly period.',
 'WHAT YOU NEED TO KNOW:': 'Abdominal pain can be dull, achy, or sharp. You may have pain in one area of your abdomen, or in your entire abdomen. Your pain may be caused by a condition such as constipation, food sensitivity or poisoning, infection, or a blockage. Abdominal pain can also be from a hernia, appendicitis, or an ulcer. Liver, gallbladder, or kidney conditions can also cause abdominal pain. The cause of your abdominal pain may be unknown.'}

您可以使用此示例将
及其值提取到字典:

import requests
from bs4 import BeautifulSoup


url = 'https://www.drugs.com/cg/abdominal-pain-aftercare-instructions.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# extract Ads that mess with the article:
for ad in soup.select('.contentAd'):
    ad.extract()

out = {}
for tag in soup.select('#content h2 + *:not(h2), #moreResources'):
    if tag.get('id', '') == 'moreResources':
        break # we don't want anything after this tag
    out[tag.find_previous('h2').text] = tag.get_text(strip=True)

# pretty print the output to screen:
from pprint import pprint
pprint(out, width=120)
印刷品:

{'Contact your healthcare provider if:': 'You have pain in your lower back.You are a man and have pain in your '
                                         'testicles.You have pain when you urinate.You have questions or concerns '
                                         'about your condition or care.',
 'Follow up with your healthcare provider within 24 hours or as directed:': 'Write down your questions so you remember '
                                                                            'to ask them during your visits.',
 'Further information': 'Always consult your healthcare provider to ensure the information displayed on this page '
                        'applies to your personal circumstances.',
 'Medicines:': 'Medicinesmay be given to calm your stomach and prevent vomiting or to decrease pain. Ask how to take '
               'pain medicine safely.Take your medicine as directed.Contact your healthcare provider if you think your '
               'medicine is not helping or if you have side effects. Tell him of her if you are allergic to any '
               'medicine. Keep a list of the medicines, vitamins, and herbs you take. Include the amounts, and when '
               'and why you take them. Bring the list or the pill bottles to follow-up visits. Carry your medicine '
               'list with you in case of an emergency.',
 'Return to the emergency department if:': 'You have new chest pain or shortness of breath.You have pulsing pain in '
                                           'your upper abdomen or lower back that suddenly becomes constant.Your pain '
                                           'is in the right lower abdominal area and worsens with movement.You have a '
                                           'fever over 100.4°F (38°C) or shaking chills.You are vomiting and cannot '
                                           'keep food or liquids down.Your pain does not improve or gets worse over '
                                           'the next 8 to 12 hours.You see blood in your vomit or bowel movements, or '
                                           'they look black and tarry.Your skin or the whites of your eyes turn '
                                           'yellow.You are a woman and have a large amount of vaginal bleeding that is '
                                           'not your monthly period.',
 'WHAT YOU NEED TO KNOW:': 'Abdominal pain can be dull, achy, or sharp. You may have pain in one area of your abdomen, '
                           'or in your entire abdomen. Your pain may be caused by a condition such as constipation, '
                           'food sensitivity or poisoning, infection, or a blockage. Abdominal pain can also be from a '
                           'hernia, appendicitis, or an ulcer. Liver, gallbladder, or kidney conditions can also cause '
                           'abdominal pain. The cause of your abdominal pain may be unknown.'}

编辑:新版本:

import requests
from bs4 import BeautifulSoup


url = 'https://www.drugs.com/mcd/asthma-attack'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# extract Ads that mess with the article:
for ad in soup.select('.contentAd'):
    ad.extract()

out = {}
tag = soup.select_one('h2')
current_header = tag.text
while True:
    tag = tag.find_next_sibling()
    if not tag:
        break
    if tag.get('id', '') == 'moreResources':
        break # we don't want anything after this tag
    if tag.name == 'h2':
        current_header = tag.text
    else:
        out.setdefault(current_header, '')
        out[current_header] += tag.get_text(strip=True)

您可以使用BeautifulSoup完成任务吗?@AndrejKesely您能告诉我如何使用BeautifulSoup吗?您能编辑您的问题并将预期结果放在那里吗?@AndrejKesely已更新!非常感谢你的努力。我刚刚检查了这个解决方案是否只适用于给定的网页。如果你注意到了这一点,它就不会产生预期的效果。@TaimorMughal我编辑了我的答案,你可以试试新版本。