Python: How can we get all the text after the <br> tags, including the <u> tag, using XPath?

Sample HTML:

<div class="apPageBottom">
   <div id="refs">References<span class="icon-search"></span></div>
   <div id="refStash">
      Medically reviewed by Joseph T. Palermo, DO; Board Certified Internal Medicine/Geriatric Medicine
      <br /><br />
      REFERENCES:<br /><br />
      Braunwald, Eugene, et al. <u>Harrisons's Principles of Internal Medicine</u>. 15th ed. McGraw-Hill, 2010.<br /><br />
      FDA.gov. Computed Tomography.
      <br /><br />
      Tramma, Simone, et al. "Helical CT Scans and Lung Cancer Screening." CDC NIOSH Science Blog. 10 Jan. 2011. 
   </div>
</div>
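  • We tried using XPath, but the data inside the <u> tag is missing from the result

  • Expected output: a list of dicts, one per reference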
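
To illustrate the problem described above: the expression that was tried is not shown, so the following is only a guess at the cause. A common reason for losing the <u> text is selecting text() on the child axis, which returns only the div's direct text nodes. The sketch below uses a trimmed copy of the sample HTML and compares the child axis with the descendant axis; the names SAMPLE, direct and nested are illustrative only.

```python
from lxml import html

# Trimmed copy of the sample HTML from the question.
SAMPLE = """<div id="refStash">
   REFERENCES:<br /><br />
   Braunwald, Eugene, et al. <u>Harrisons's Principles of Internal Medicine</u>. 15th ed. McGraw-Hill, 2010.<br /><br />
   FDA.gov. Computed Tomography.
</div>"""

tree = html.fromstring(SAMPLE)

# Child axis: only text nodes that are direct children of the div,
# so the title wrapped in <u> is skipped.
direct = [t.strip() for t in tree.xpath('//div[@id="refStash"]/text()') if t.strip()]

# Descendant axis: every text node under the div, including the one inside <u>.
nested = [t.strip() for t in tree.xpath('//div[@id="refStash"]//text()') if t.strip()]

print(direct)
print(nested)
```

If the missing data really comes from the child axis, switching to //text() (or to the following:: axis with an ancestor filter) should bring the <u> text back.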


You can use the following XPath expressions with lxml (the first XPath is the safest option, the last one is the least safe):

Output:

Braunwald, Eugene, et al. Harrisons's Principles of Internal Medicine. 15th ed. McGraw-Hill, 2010. FDA.gov. Computed Tomography. Tramma, Simone, et al. "Helical CT Scans and Lung Cancer Screening." CDC NIOSH Science Blog. 10 Jan. 2011.
{
  'references:': [
    {
      'text:': [
        'Braunwald, Eugene, et al.',
        "Harrisons's Principles of Internal Medicine",
        '. 15th ed. McGraw-Hill, 2010.'
      ]
    },
    {
      'text:': [
        'FDA.gov. Computed Tomography.'
      ]
    },
    {
      'text:': [
        'Tramma, Simone, et al. "Helical CT Scans and Lung Cancer Screening." CDC NIOSH Science Blog. 10 Jan. 2011.'
      ]
    }
  ]
}

Or get the whole text directly (you can call split on it afterward):

Output:

Braunwald, Eugene, et al. Harrisons's Principles of Internal Medicine. 15th ed. McGraw-Hill, 2010. FDA.gov. Computed Tomography. Tramma, Simone, et al. "Helical CT Scans and Lung Cancer Screening." CDC NIOSH Science Blog. 10 Jan. 2011.
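
As a sketch of the "whole text, then split" idea (not necessarily the expression used in the answer): normalize-space combined with substring-after returns everything after the REFERENCES: label as one string. Because that string no longer contains the <br> positions, the second half marks each <br> with a placeholder before extracting the text. MARK is a hypothetical marker chosen for this example, and SAMPLE is a trimmed copy of the sample HTML.

```python
from lxml import html

SAMPLE = """<div id="refStash">
   Medically reviewed by Joseph T. Palermo, DO
   <br /><br />
   REFERENCES:<br /><br />
   Braunwald, Eugene, et al. <u>Harrisons's Principles of Internal Medicine</u>. 15th ed. McGraw-Hill, 2010.<br /><br />
   FDA.gov. Computed Tomography.
   <br /><br />
   Tramma, Simone, et al. "Helical CT Scans and Lung Cancer Screening." CDC NIOSH Science Blog. 10 Jan. 2011.
</div>"""

tree = html.fromstring(SAMPLE)

# 1) Whole text after the "REFERENCES:" label, with whitespace collapsed.
whole = tree.xpath(
    'normalize-space(substring-after(//div[@id="refStash"], "REFERENCES:"))'
)

# 2) To split per reference, mark every <br> first (MARK is an assumption of
#    this sketch), then split the text content on that marker.
MARK = "@@BR@@"
div = tree.xpath('//div[@id="refStash"]')[0]
for br in div.findall("br"):
    br.tail = MARK + (br.tail or "")

parts = [" ".join(p.split()) for p in div.text_content().split(MARK)]
parts = [p for p in parts if p]                  # drop the empty pieces between <br><br>
refs = parts[parts.index("REFERENCES:") + 1:]    # keep only what follows the label

print(whole)
print(refs)
```

Note that the second step modifies the parsed tree in place, so the XPath call for the whole text is made first.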

Edit: after some testing, to generate the list of dicts we can do the following:

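from lxml import html
import requests

dic = {"references:": []}

page = requests.get('https://www.medicinenet.com/cat_scan/article.htm')
tree = html.fromstring(page.content)

# Number of <br> children of the div, plus 2: exclusive upper bound for the range below.
br_limit = int(tree.xpath('count(//div[@id="refStash"]/br)+2'))

# The <br> tags come in pairs; the first reference starts after the 4th <br>.
for var in range(4, br_limit, 2):
    # Text nodes inside the div that have exactly `var` preceding <br> children of the div,
    # i.e. the text between the var-th <br> and the next one (text inside <u> included).
    result = [el.strip() for el in tree.xpath('//div[@id="refStash"]/br[$var]/following::text()[ancestor::div[@id="refStash"]][count(preceding::br[parent::div[@id="refStash"]])=$var]', var=var)]
    dic["references:"].append({"text:": result})

print(dic)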
Output:

{
  'references:': [
    {
      'text:': [
        'Braunwald, Eugene, et al.',
        "Harrisons's Principles of Internal Medicine",
        '. 15th ed. McGraw-Hill, 2010.'
      ]
    },
    {
      'text:': [
        'FDA.gov. Computed Tomography.'
      ]
    },
    {
      'text:': [
        'Tramma, Simone, et al. "Helical CT Scans and Lung Cancer Screening." CDC NIOSH Science Blog. 10 Jan. 2011.'
      ]
    }
  ]
}

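Thanks for your answer, but I would like the result as a list of dicts, in the format we mentioned above.

The post has been edited with a solution. The script can be optimized, but I think it's a good starting point. :)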
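
On the optimization mentioned in the last comment, one possible variant (a sketch only, checked against the sample HTML rather than the live page) makes a single pass over the div and groups its text nodes by the <br> elements between them, instead of running one XPath query per reference. The helper name group_by_br is hypothetical, and the sketch assumes the references follow a text node that reads exactly REFERENCES:.

```python
from lxml import html

SAMPLE = """<div id="refStash">
   Medically reviewed by Joseph T. Palermo, DO
   <br /><br />
   REFERENCES:<br /><br />
   Braunwald, Eugene, et al. <u>Harrisons's Principles of Internal Medicine</u>. 15th ed. McGraw-Hill, 2010.<br /><br />
   FDA.gov. Computed Tomography.
   <br /><br />
   Tramma, Simone, et al. "Helical CT Scans and Lung Cancer Screening." CDC NIOSH Science Blog. 10 Jan. 2011.
</div>"""


def group_by_br(div):
    """Group the non-empty text nodes under `div`, starting a new group at each <br>."""
    groups, current = [], []
    # The union yields <br> elements and text nodes together, in document order.
    for node in div.xpath('.//br | .//text()'):
        if isinstance(node, str):        # text node (lxml returns a str subclass)
            text = node.strip()
            if text:
                current.append(text)
        else:                            # <br> element: close the current group
            if current:
                groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


tree = html.fromstring(SAMPLE)
div = tree.xpath('//div[@id="refStash"]')[0]

groups = group_by_br(div)
start = groups.index(["REFERENCES:"]) + 1   # skip everything up to the label
dic = {"references:": [{"text:": g} for g in groups[start:]]}

print(dic)
```

Unlike the range(4, ...) loop, this does not depend on the references starting after the 4th <br>, but it does depend on the REFERENCES: label being present.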