Python: How can we get all the text after <br> tags, including <u> tags, using XPath?

python html linux xpath

Sample HTML:

<div class="apPageBottom">
   <div id="refs">References<span class="icon-search"></span></div>
   <div id="refStash">
      Medically reviewed by Joseph T. Palermo, DO; Board Certified Internal Medicine/Geriatric Medicine
      <br /><br />
      REFERENCES:<br /><br />
      Braunwald, Eugene, et al. <u>Harrisons's Principles of Internal Medicine</u>. 15th ed. McGraw-Hill, 2010.<br /><br />
      FDA.gov. Computed Tomography.
      <br /><br />
      Tramma, Simone, et al. "Helical CT Scans and Lung Cancer Screening." CDC NIOSH Science Blog. 10 Jan. 2011. 
   </div>
</div>

We tried the XPath above, but the <u> tag data is missing.
Expected output:

You can use the following XPath expressions with lxml (the first XPath is the safest option, the last one is the least safe):

Output:

Braunwald, Eugene, et al. Harrisons's Principles of Internal Medicine. 15th ed. McGraw-Hill, 2010. FDA.gov. Computed Tomography. Tramma, Simone, et al. "Helical CT Scans and Lung Cancer Screening." CDC NIOSH Science Blog. 10 Jan. 2011.

{
  'references:': [
    {
      'text:': [
        'Braunwald, Eugene, et al.',
        "Harrisons's Principles of Internal Medicine",
        '. 15th ed. McGraw-Hill, 2010.'
      ]
    },
    {
      'text:': [
        'FDA.gov. Computed Tomography.'
      ]
    },
    {
      'text:': [
        'Tramma, Simone, et al. "Helical CT Scans and Lung Cancer Screening." CDC NIOSH Science Blog. 10 Jan. 2011.'
      ]
    }
  ]
}
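
Why the `<u>` title goes missing can be seen on a trimmed copy of the sample HTML. This is a minimal sketch, assuming lxml is installed; `SAMPLE` and the variable names are illustrative and not part of the original answer. A child step (`/text()`) only sees text nodes that sit directly under the div, while a descendant step (`//text()`) also reaches the text inside `<u>`.

```python
from lxml import html

# Trimmed copy of the sample HTML from the question (assumption: used here
# so the example runs offline instead of fetching the live page).
SAMPLE = """
<div id="refStash">
  Medically reviewed by Joseph T. Palermo, DO
  <br /><br />
  REFERENCES:<br /><br />
  Braunwald, Eugene, et al. <u>Harrisons's Principles of Internal Medicine</u>. 15th ed. McGraw-Hill, 2010.<br /><br />
  FDA.gov. Computed Tomography.
</div>
"""

tree = html.fromstring(SAMPLE)

# /text(): only text nodes that are direct children of the div,
# so the title wrapped in <u> is skipped.
direct = [t.strip() for t in tree.xpath('//div[@id="refStash"]/text()') if t.strip()]

# //text(): text nodes at any depth, so the <u> text is included.
nested = [t.strip() for t in tree.xpath('//div[@id="refStash"]//text()') if t.strip()]

# Keep only the pieces that come after the "REFERENCES:" marker.
refs = nested[nested.index("REFERENCES:") + 1:]
print(refs)
```

Here `refs` is a flat list of text pieces; it does not yet group them per reference.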

Or get the whole text directly (you can split it afterwards):

Output:

Braunwald, Eugene, et al. Harrisons's Principles of Internal Medicine. 15th ed. McGraw-Hill, 2010. FDA.gov. Computed Tomography. Tramma, Simone, et al. "Helical CT Scans and Lung Cancer Screening." CDC NIOSH Science Blog. 10 Jan. 2011.
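
One way to get the whole text as a single string is sketched below. It assumes lxml and a trimmed copy of the sample HTML; the expression is an illustration, not necessarily the one used in the original answer. The XPath 1.0 functions `string()` and `normalize-space()` concatenate all descendant text and collapse whitespace, and the split on the `REFERENCES:` marker is done in Python.

```python
from lxml import html

# Trimmed copy of the sample HTML from the question (assumption: offline test data).
SAMPLE = """
<div id="refStash">
  Medically reviewed by Joseph T. Palermo, DO
  <br /><br />
  REFERENCES:<br /><br />
  Braunwald, Eugene, et al. <u>Harrisons's Principles of Internal Medicine</u>. 15th ed. McGraw-Hill, 2010.<br /><br />
  FDA.gov. Computed Tomography.
</div>
"""

tree = html.fromstring(SAMPLE)

# string() joins every descendant text node (including the <u> text);
# normalize-space() trims the result and collapses whitespace runs.
whole = tree.xpath('normalize-space(string(//div[@id="refStash"]))')

# Keep only what follows the "REFERENCES:" marker.
after_marker = whole.split("REFERENCES:", 1)[1].strip()
print(after_marker)
```

Because the `<br>` boundaries are gone once the text is flattened, splitting this string into individual references afterwards depends on finding some other reliable delimiter.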


Edit: after some testing, to generate a list of dicts, we can do the following:

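from lxml import html
import requests

# Result container: {"references:": [{"text:": [...]}, ...]}
dic = {"references:": []}

page = requests.get('https://www.medicinenet.com/cat_scan/article.htm')
tree = html.fromstring(page.content)

# Number of <br> children of div#refStash, plus 2 so that range() below
# still reaches the last even <br> index.
br_limit = int(tree.xpath('count(//div[@id="refStash"]/br)+2'))

# Each reference follows a pair of <br> tags; the first one starts after the 4th <br>.
for var in range(4, br_limit, 2):
    # Text nodes inside div#refStash that are preceded by exactly $var of its <br> children
    result = [el.strip() for el in tree.xpath('//div[@id="refStash"]/br[$var]/following::text()[ancestor::div[@id="refStash"]][count(preceding::br[parent::div[@id="refStash"]])=$var]', var=var)]
    dic["references:"].append({"text:": result})

print(dic)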

Output:


{
  'references:': [
    {
      'text:': [
        'Braunwald, Eugene, et al.',
        "Harrisons's Principles of Internal Medicine",
        '. 15th ed. McGraw-Hill, 2010.'
      ]
    },
    {
      'text:': [
        'FDA.gov. Computed Tomography.'
      ]
    },
    {
      'text:': [
        'Tramma, Simone, et al. "Helical CT Scans and Lung Cancer Screening." CDC NIOSH Science Blog. 10 Jan. 2011.'
      ]
    }
  ]
}
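
The same list of dicts can also be built without positional `<br>` indices, by walking the children of `div#refStash` once and starting a new group at every `<br>`. This is a sketch under assumptions: `group_references` and `SAMPLE` are hypothetical names, the input is a trimmed copy of the sample HTML, and the live page may be structured differently.

```python
from lxml import html

# Trimmed copy of the sample HTML from the question (assumption: offline test data).
SAMPLE = """
<div id="refStash">
  Medically reviewed by Joseph T. Palermo, DO
  <br /><br />
  REFERENCES:<br /><br />
  Braunwald, Eugene, et al. <u>Harrisons's Principles of Internal Medicine</u>. 15th ed. McGraw-Hill, 2010.<br /><br />
  FDA.gov. Computed Tomography.
  <br /><br />
  Tramma, Simone, et al. "Helical CT Scans and Lung Cancer Screening." CDC NIOSH Science Blog. 10 Jan. 2011.
</div>
"""


def group_references(fragment, marker="REFERENCES:"):
    """Return {"references:": [{"text:": [...]}, ...]} for div#refStash."""
    div = html.fromstring(fragment).xpath('//div[@id="refStash"]')[0]
    groups = [[]]  # groups[-1] is the group currently being filled

    if div.text and div.text.strip():
        groups[-1].append(div.text.strip())

    for child in div:
        if child.tag == "br":
            groups.append([])  # every <br> starts a new group
        else:
            text = child.text_content().strip()  # e.g. the <u> title
            if text:
                groups[-1].append(text)
        if child.tail and child.tail.strip():
            groups[-1].append(child.tail.strip())

    groups = [g for g in groups if g]  # drop groups left empty by double <br>
    if [marker] in groups:
        # Keep only the groups that come after the marker line.
        groups = groups[groups.index([marker]) + 1:]
    return {"references:": [{"text:": g} for g in groups]}


result = group_references(SAMPLE)
print(result)
```

This variant trades the single XPath expression for a plain loop, which may be easier to adjust if the number of `<br>` tags between entries varies.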

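Thanks for your answer, but I would like the result to be a list of dicts in the format we mentioned above.

The post has been edited with a solution. The script could be optimized, but I think it is a good starting point. :)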