Python: parsing HTML stored in a dictionary


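I am trying to extract tabular data from a website (the URL appears in the code below).

Although the page has no table tag, I found that each section of the table sits in a common tag, div class="accord-con".

I built a dictionary whose keys are graduation years (2019, 2018, and so on) and whose values are the HTML of each div with that class.

I am stuck on how to parse the HTML inside the dictionary. My goal is a separate list of specialties, hospitals, and locations for each year, and I don't know how to proceed.

Here is my working code: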

import numpy as np
import bs4 as bs
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd

sauce = urllib.request.urlopen('https://msih.bgu.ac.il/md-program/residency-placements/').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

headers = soup.find_all('div', class_={'accord-head'})
grad_yr_list = []
for header in headers:
    grad_yr_list.append(header.h2.text[-4:])

rez_classes = soup.find_all('div', class_={'accord-con'})

data_dict = dict(zip(grad_yr_list, rez_classes))

Here is a sample of my dictionary:

{'2019': <div class="accord-con"><h4>Anesthesiology</h4><ul><li>University at Buffalo School of Medicine, Buffalo, NY</li></ul><h4>Emergency Medicine</h4><ul><li>Aventura Hospital, Aventura, Fl</li></ul><h4>Family Medicine</h4><ul><li>Louisiana State University School of Medicine, New Orleans, LA</li><li>UT St Thomas Hospitals, Murfreesboro, TN</li><li>Sea Mar Community Health Center, Seattle, WA</li></ul><h4>Internal Medicine</h4><ul><li>Oregon Health and Science University, Portland, OR</li><li>St Joseph Hospital, Denver, CO </li></ul><h4>Obstetrics-Gynecology</h4><ul><li>Jersey City Medical Center, Jersey City, NJ</li><li>New York Presbyterian Brooklyn Methodist Hospital, Brooklyn, NY</li></ul><h4>Pediatrics</h4><ul><li>St Louis Children’s Hospital, St Louis, MO</li><li>University of Maryland Medical Center, Baltimore, MD</li><li>St Christopher’s Hospital, Philadelphia, PA</li></ul><h4>Surgery</h4><ul><li>Mountain Area Health Education Center, Asheville, NC</li></ul><p></p></div>,
 '2018': <div class="accord-con"><h4>Anesthesiology</h4><ul><li>NYU School of Medicine, New York, NY</li></ul><h4>Emergency Medicine</h4><ul><li>Kent Hospital, Warwick, Rhode Island</li><li>University of Connecticut School of Medicine, Farmington, CT</li><li>University of Texas Health Science Center at San Antonio, San Antonio, TX</li><li>Vidant Medical Center East Carolina University, Greenville, NC</li></ul><h4>Family Medicine</h4><ul><li>University of Kansas Medical Center, Wichita, KS</li><li>Ellis Hospital, Schenectady, NY</li><li>Harrison Medical Center, Seattle, WA</li><li>St Francis Hospital, Wilmington, DE </li><li>University of Virginia, Charlottesville, VA</li><li>Valley Medical Center, Renton, WA</li></ul><h4>Internal Medicine</h4><ul><li>Oregon Health and Science University, Portland, OR</li><li>Virginia Commonwealth University Health Systems, Richmond, VA</li><li>University of Chicago Medical Center, Chicago, IL</li></ul><h4>Obstetrics-Gynecology</h4><ul><li>St Francis Hospital, Hartford, CT</li></ul><h4>Pediatrics</h4><ul><li>Case Western University Hospitals Cleveland Medical Center, Cleveland, OH</li><li>Jersey Shore University Medical Center, Neptune City, NJ</li><li>University of Maryland Medical Center, Baltimore, MD</li><li>University of Virginia, Charlottesville, VA</li><li>Vidant Medical Center East Carolina University, Greenville, NC</li></ul><h4>Preliminary Medicine Neurology</h4><ul><li>Howard University Hospital, Washington, DC</li></ul><h4>Preliminary Medicine Radiology</h4><ul><li>Maimonides Medical Center, Bronx, NY</li></ul><h4>Preliminary Medicine Surgery</h4><ul><li>Providence Park Hospital, Southfield, MI</li></ul><h4>Psychiatry</h4><ul><li>University of Maryland Medical Center, Baltimore, MI</li></ul><p></p></div>,

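My ultimate goal is to pull this data into a pandas DataFrame with the following columns: graduation year, specialty, hospital, location.

Your dictionary already holds BeautifulSoup elements ('bs4.element.Tag'), so you don't have to parse them again.

You can call find() and find_all() on the values directly: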
for key, value in data_dict.items():
    print(type(value), key, value.find('h4').text)

Result:

<class 'bs4.element.Tag'> 2019 Anesthesiology
<class 'bs4.element.Tag'> 2018 Anesthesiology
<class 'bs4.element.Tag'> 2017 Anesthesiology
<class 'bs4.element.Tag'> 2016 Emergency Medicine
<class 'bs4.element.Tag'> 2015 Emergency Medicine
<class 'bs4.element.Tag'> 2014 Anesthesiology
<class 'bs4.element.Tag'> 2013 Anesthesiology
<class 'bs4.element.Tag'> 2012 Emergency Medicine
<class 'bs4.element.Tag'> 2011 Emergency Medicine
<class 'bs4.element.Tag'> 2010 Dermatology
<class 'bs4.element.Tag'> 2009 Emergency Medicine
<class 'bs4.element.Tag'> 2008 Family Medicine
<class 'bs4.element.Tag'> 2007 Anesthesiology
<class 'bs4.element.Tag'> 2006 Triple Board (Pediatrics/Adult Psychiatry/Child Psychiatry)
<class 'bs4.element.Tag'> 2005 Family Medicine
<class 'bs4.element.Tag'> 2004 Anesthesiology
<class 'bs4.element.Tag'> 2003 Emergency Medicine
<class 'bs4.element.Tag'> 2002 Family Medicine
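
Note that find('h4') returns only the first heading in each block. To collect every specialty and its hospitals, one option is to loop over find_all('h4') and read the list that follows each heading. The sketch below is a minimal illustration, not the only way to do it: it assumes every h4 is immediately followed by its own ul, as in the sample dictionary in the question. The names sample_html, sample_dict and placements_by_specialty are made up for the example; sample_dict stands in for data_dict so that the snippet runs without network access.

```python
from bs4 import BeautifulSoup

# Shortened copy of one year's block from the dictionary shown in the question.
sample_html = (
    '<div class="accord-con">'
    '<h4>Anesthesiology</h4>'
    '<ul><li>University at Buffalo School of Medicine, Buffalo, NY</li></ul>'
    '<h4>Family Medicine</h4>'
    '<ul><li>Louisiana State University School of Medicine, New Orleans, LA</li>'
    '<li>UT St Thomas Hospitals, Murfreesboro, TN</li></ul>'
    '</div>'
)

# Hypothetical stand-in for data_dict: year -> bs4.element.Tag
sample_dict = {'2019': BeautifulSoup(sample_html, 'html.parser').div}


def placements_by_specialty(block):
    """Map each <h4> specialty to the <li> texts of the <ul> that follows it."""
    result = {}
    for h4 in block.find_all('h4'):
        ul = h4.find_next_sibling('ul')
        if ul is None:  # heading with no list after it
            continue
        result[h4.get_text(strip=True)] = [
            li.get_text(strip=True) for li in ul.find_all('li')
        ]
    return result


parsed = {year: placements_by_specialty(tag) for year, tag in sample_dict.items()}
print(parsed['2019']['Family Medicine'])
```

With the real data, passing data_dict instead of sample_dict should give one such mapping per graduation year.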

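Your code is close to the final result. Once you have paired the years with the placement data, simply apply an extraction function to the latter: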

from bs4 import BeautifulSoup as soup
import re
from selenium import webdriver
_d = webdriver.Chrome('/path/to/chromedriver')
_d.get('https://msih.bgu.ac.il/md-program/residency-placements/')
d = soup(_d.page_source, 'html.parser')
def placement(block):
   r = block.find_all(re.compile('ul|h4'))
   return {r[i].text:[b.text for b in r[i+1].find_all('li')] for i in range(0, len(r)-1, 2)}

result = {i.h2.text:placement(i) for i in d.find_all('div', {'class':'accord-head'})}
print(result['Class of 2019'])

Output:

{'Anesthesiology': ['University at Buffalo School of Medicine, Buffalo, NY'], 'Emergency Medicine': ['Aventura Hospital, Aventura, Fl'], 'Family Medicine': ['Louisiana State University School of Medicine, New Orleans, LA', 'UT St Thomas Hospitals, Murfreesboro, TN', 'Sea Mar Community Health Center, Seattle, WA'], 'Internal Medicine': ['Oregon Health and Science University, Portland, OR', 'St Joseph Hospital, Denver, CO\xa0'], 'Obstetrics-Gynecology': ['Jersey City Medical Center, Jersey City, NJ', 'New York Presbyterian Brooklyn Methodist Hospital, Brooklyn, NY'], 'Pediatrics': ['St Louis Children’s Hospital, St Louis, MO', 'University of Maryland Medical Center, Baltimore, MD', 'St Christopher’s Hospital, Philadelphia, PA'], 'Surgery': ['Mountain Area Health Education Center, Asheville, NC']}

Note: I ended up using selenium because, for me, the HTML response returned by requests.get did not include the rendered placement data.

Once you have the soup, you can move to pandas and parse out the necessary information there:

df = pd.DataFrame(soup)
df['grad_year'] = df[0].map(lambda x: x.text[-4:])
df['specialty'] = df[1].map(lambda x: [i.text for i in x.find_all('h4')])
df['hospital'] = df[1].map(lambda x: [i.text for i in x.find_all('li')])
df['location'] = df[1].map(lambda x: [''.join(i.text.split(',')[1:]) for i in x.find_all('li')])

After that, you will have to do some pandas magic.
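
As an alternative to the column-wise lambdas, here is a sketch that builds one row per placement first and only then hands the rows to pandas, which keeps specialty, hospital and location aligned. It rests on two assumptions that are not confirmed by the page itself: each h4 is directly followed by its ul, and everything before the first comma in a list item is the hospital name (a hospital name that itself contains a comma would be split in the wrong place). The names sample_html, blocks and rows are illustrative; blocks stands in for the data_dict from the question.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Shortened copy of one year's block from the dictionary shown in the question.
sample_html = (
    '<div class="accord-con">'
    '<h4>Anesthesiology</h4>'
    '<ul><li>University at Buffalo School of Medicine, Buffalo, NY</li></ul>'
    '<h4>Family Medicine</h4>'
    '<ul><li>Louisiana State University School of Medicine, New Orleans, LA</li>'
    '<li>UT St Thomas Hospitals, Murfreesboro, TN</li></ul>'
    '</div>'
)

# Hypothetical stand-in for data_dict: year -> bs4.element.Tag
blocks = {'2019': BeautifulSoup(sample_html, 'html.parser').div}

rows = []
for year, block in blocks.items():
    for h4 in block.find_all('h4'):
        ul = h4.find_next_sibling('ul')
        if ul is None:  # heading with no list after it
            continue
        for li in ul.find_all('li'):
            text = li.get_text(strip=True)
            # Assumption: the text before the first comma is the hospital,
            # the rest ("City, State") is the location.
            hospital, _, location = text.partition(',')
            rows.append({
                'grad_year': year,
                'specialty': h4.get_text(strip=True),
                'hospital': hospital.strip(),
                'location': location.strip(),
            })

df = pd.DataFrame(rows, columns=['grad_year', 'specialty', 'hospital', 'location'])
print(df)
```

Each row of df then holds a single placement, so filtering or grouping by grad_year or specialty needs no further unpacking of lists.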

I don't know pandas. The code below gets the data out of the table; I'm not sure whether it meets your needs.

import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc 
url = 'https://msih.bgu.ac.il/md-program/residency-placements/'
response = requests.get(url)
doc = SimplifiedDoc(response.text)
divs = doc.getElementsByClass('accord-head')
datas={}
for div in divs:
  grad_year = div.h2.text[-4:]
  rez_classe = div.getElementByClass('accord-con')
  h4s = rez_classe.h4s # get h4
  for h4 in h4s:
    if not h4.next: 
      continue
    lis = h4.next.lis
    specialty = h4.text
    hospital = [li.text for li in lis]
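    # append instead of assigning, so earlier specialties for the same year are not overwritten
    datas.setdefault(grad_year, []).append({'specialty':specialty,'hospital':hospital})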
for data in datas:
  print (data,datas[data])

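Comments:

"I am stuck on how to parse the HTML inside the dictionary." Just parse it from a string:
>>> soup = BeautifulSoup(myHtmlString)

Did you check what is actually in the dict? As far as I can tell there is nothing to parse: the values are not strings but BS objects, so you can use value.find(), value.find_all(), and so on with them.

How do I get all the locations for each year? With your method I only get the first match for each year.

Thanks! I tried your solution and got the following error message: 'Doctype' object has no attribute 'text'.

Which line fails?

The second line fails.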