Python: How can we get all the text after the <br> tags, including the <u> tag, using XPath?
Sample HTML:
<div class="apPageBottom">
<div id="refs">References<span class="icon-search"></span></div>
<div id="refStash">
Medically reviewed by Joseph T. Palermo, DO; Board Certified Internal Medicine/Geriatric Medicine
<br /><br />
REFERENCES:<br /><br />
Braunwald, Eugene, et al. <u>Harrisons's Principles of Internal Medicine</u>. 15th ed. McGraw-Hill, 2010.<br /><br />
FDA.gov. Computed Tomography.
<br /><br />
Tramma, Simone, et al. "Helical CT Scans and Lung Cancer Screening." CDC NIOSH Science Blog. 10 Jan. 2011.
</div>
</div>
We tried using the XPath above, but the <u> tag data is missing.
Expected output: a list of dicts, one per reference.
You can use the following XPath expressions with lxml (the first XPath is the safest option, the last is the least safe):
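The expressions themselves did not survive in this copy of the answer. As a minimal sketch of what such expressions could look like, assuming the sample HTML from the question: the first anchors on the REFERENCES: text node, the second on the position of the 4th <br>, which is more fragile if the page layout changes. SAMPLE, INSIDE, clean and the variable names are illustrative and not taken from the original answer.

```python
from lxml import html

# Sample markup copied from the question (assumption: the live page matches it).
SAMPLE = """<div class="apPageBottom">
<div id="refs">References<span class="icon-search"></span></div>
<div id="refStash">
Medically reviewed by Joseph T. Palermo, DO; Board Certified Internal Medicine/Geriatric Medicine
<br /><br />
REFERENCES:<br /><br />
Braunwald, Eugene, et al. <u>Harrisons's Principles of Internal Medicine</u>. 15th ed. McGraw-Hill, 2010.<br /><br />
FDA.gov. Computed Tomography.
<br /><br />
Tramma, Simone, et al. "Helical CT Scans and Lung Cancer Screening." CDC NIOSH Science Blog. 10 Jan. 2011.
</div>
</div>"""

tree = html.fromstring(SAMPLE)

# Keep only text nodes located anywhere inside div#refStash,
# so text nested in <u> is included as well.
INSIDE = '[ancestor::div[@id="refStash"]]'

# Option 1: anchor on the "REFERENCES:" label (does not depend on <br> counts).
by_label = ('//div[@id="refStash"]/text()[contains(., "REFERENCES:")]'
            '/following::text()' + INSIDE)

# Option 2: anchor on the 4th <br> (breaks if the number of <br> tags changes).
by_position = '//div[@id="refStash"]/br[4]/following::text()' + INSIDE


def clean(nodes):
    """Strip whitespace and drop empty text nodes."""
    return [str(n).strip() for n in nodes if n.strip()]


refs_by_label = clean(tree.xpath(by_label))
refs_by_position = clean(tree.xpath(by_position))

# For comparison: direct child text nodes only, which skips the text inside <u>.
direct_only = clean(tree.xpath('//div[@id="refStash"]/text()'))

print(refs_by_label)
```

Selecting only `/text()` children of the div is one plausible reason for the missing <u> data in the question: the title inside <u> is a grandchild text node, so it needs `//text()` or an axis such as `following::text()`.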
Output:
Braunwald, Eugene, et al. Harrisons's Principles of Internal Medicine. 15th ed. McGraw-Hill, 2010. FDA.gov. Computed Tomography. Tramma, Simone, et al. "Helical CT Scans and Lung Cancer Screening." CDC NIOSH Science Blog. 10 Jan. 2011.
Or get the whole text directly (you can split it afterwards):
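The expression for this variant is also missing here. One way to sketch it, assuming the sample HTML from the question, is to let XPath's string() function concatenate every text node under the div and then split in Python. SAMPLE and the variable names are illustrative, and the per-line split relies on the assumption that each reference sits on its own line in the page source.

```python
from lxml import html

# The refStash block from the question (assumption: the live page matches it).
SAMPLE = """<div id="refStash">
Medically reviewed by Joseph T. Palermo, DO; Board Certified Internal Medicine/Geriatric Medicine
<br /><br />
REFERENCES:<br /><br />
Braunwald, Eugene, et al. <u>Harrisons's Principles of Internal Medicine</u>. 15th ed. McGraw-Hill, 2010.<br /><br />
FDA.gov. Computed Tomography.
<br /><br />
Tramma, Simone, et al. "Helical CT Scans and Lung Cancer Screening." CDC NIOSH Science Blog. 10 Jan. 2011.
</div>"""

tree = html.fromstring(SAMPLE)

# string() returns the concatenated text of the div, including text inside <u>.
whole_text = tree.xpath('string(//div[@id="refStash"])')

# Keep only what follows the "REFERENCES:" label.
references_part = whole_text.split("REFERENCES:", 1)[1]

# Collapse all whitespace into single spaces for one flat string.
flat = " ".join(references_part.split())
print(flat)

# Split afterwards: one entry per non-empty source line.
lines = [line.strip() for line in references_part.splitlines() if line.strip()]
print(lines)
```

This loses the boundaries between the plain text and the <u> title, so it suits cases where one string per reference is enough.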
Output:
Braunwald, Eugene, et al. Harrisons's Principles of Internal Medicine. 15th ed. McGraw-Hill, 2010. FDA.gov. Computed Tomography. Tramma, Simone, et al. "Helical CT Scans and Lung Cancer Screening." CDC NIOSH Science Blog. 10 Jan. 2011.
Edit: after some testing, to generate the list of dicts we can do the following:
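    from lxml import html
    import requests

    # Container for the result: one dict per reference.
    dic = {"references:": []}

    page = requests.get('https://www.medicinenet.com/cat_scan/article.htm')
    tree = html.fromstring(page.content)

    # XPath count() returns a float; add 2 so the range below still reaches the last <br>.
    br_limit = int(tree.xpath('count(//div[@id="refStash"]/br)+2'))

    # The references start after the 4th <br>, and each one ends with two <br> tags.
    for var in range(4, br_limit, 2):
        # Text nodes inside refStash with exactly `var` of the div's <br> children before them,
        # i.e. the text between <br> number var and the next <br>, including text inside <u>.
        result = [el.strip() for el in tree.xpath(
            '//div[@id="refStash"]/br[$var]/following::text()'
            '[ancestor::div[@id="refStash"]]'
            '[count(preceding::br[parent::div[@id="refStash"]])=$var]',
            var=var)]
        dic["references:"].append({"text:": result})

    print(dic)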
Output:
{
'references:': [
{
'text:': [
'Braunwald, Eugene, et al.',
"Harrisons's Principles of Internal Medicine",
'. 15th ed. McGraw-Hill, 2010.'
]
},
{
'text:': [
'FDA.gov. Computed Tomography.'
]
},
{
'text:': [
'Tramma, Simone, et al. "Helical CT Scans and Lung Cancer Screening." CDC NIOSH Science Blog. 10 Jan. 2011.'
]
}
]
}
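The script above evaluates one XPath per reference and depends on the exact number of <br> tags. As a possible simplification, here is a sketch that walks the children of the div once and starts a new group at every <br>. It runs against the sample HTML from the question rather than the live page; SAMPLE and group_text_by_br are hypothetical names introduced for illustration.

```python
from lxml import html

# The refStash block from the question (assumption: the live page matches it).
SAMPLE = """<div id="refStash">
Medically reviewed by Joseph T. Palermo, DO; Board Certified Internal Medicine/Geriatric Medicine
<br /><br />
REFERENCES:<br /><br />
Braunwald, Eugene, et al. <u>Harrisons's Principles of Internal Medicine</u>. 15th ed. McGraw-Hill, 2010.<br /><br />
FDA.gov. Computed Tomography.
<br /><br />
Tramma, Simone, et al. "Helical CT Scans and Lung Cancer Screening." CDC NIOSH Science Blog. 10 Jan. 2011.
</div>"""


def group_text_by_br(div):
    """Split the text inside `div` into groups separated by <br> children."""
    groups = [[]]

    def keep(text):
        # Append non-empty, stripped text to the group currently being filled.
        if text and text.strip():
            groups[-1].append(text.strip())

    keep(div.text)                      # text before the first child
    for child in div:
        if child.tag == "br":
            groups.append([])           # a <br> closes the current group
        elif isinstance(child.tag, str):
            keep(child.text_content())  # inline element such as <u>
        keep(child.tail)                # text that follows the child
    return [g for g in groups if g]     # consecutive <br> leave empty groups


tree = html.fromstring(SAMPLE)
div = tree.xpath('//div[@id="refStash"]')[0]
groups = group_text_by_br(div)

# Everything after the "REFERENCES:" label is treated as a reference.
start = groups.index(["REFERENCES:"]) + 1
dic = {"references:": [{"text:": g} for g in groups[start:]]}
print(dic)
```

Anchoring on the label instead of a fixed <br> index means extra line breaks before the reference list do not shift the result, at the cost of assuming the label text is exactly REFERENCES:.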
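Thanks for your answer, but I would like the result as a list of dicts in the format we mentioned above.
The post has been edited with a solution. The script could be optimized, but I think it is a good starting point. :)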