How do I select HTML content into a dictionary in Python?


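Please help me write an XPath expression.

HTML:

Product Composition

93% Polyamide 7% Elastane

Lining: 100% Polyester

Dress Length: 90 cm

Product Attributes;

: Boat Neck, Long Sleeve, Midi, Zip, Concealed, Laced, Side

Lining Type: Full Lining

From this HTML I need to build the following dictionary: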

data['Product Composition'] = '93% Polyamide 7% Elastane Lining: 100% Polyester</p><p>Dress Length: 90 cm'
data['Product Attributes;'] = ': Boat Neck, Long Sleeve, Midi, Zip, Concealed, Laced, Side Lining Type: Full Lining'

It is important that the number of elements can vary, i.e. I need a general solution.

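Get every strong tag inside a p, then take its parent and walk the parent's following siblings until you reach another p tag that contains a strong tag, or until there are no more siblings: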

from lxml.html import fromstring


html_data = """<div class="TabItem">
    <p><strong>Product Composition</strong></p>
    <p>93% Polyamide 7% Elastane</p>
    <p>Lining: 100% Polyester</p><p>Dress Length: 90 cm</p>

    <p><strong>Product Attributes;</strong></p>
    <p>: Boat Neck, Long Sleeve, Midi, Zip, Concealed, Laced, Side</p>
    <p>Lining Type: Full Lining</p>
</div>"""

tree = fromstring(html_data)
data = {}
# Each <p><strong>...</strong></p> is a heading that starts a new section.
for strong in tree.xpath('//p/strong'):
    parent = strong.getparent()

    description = []
    next_p = parent.getnext()
    # Walk the following siblings until the next heading or the last sibling.
    while next_p is not None and not next_p.xpath('.//strong'):
        # text_content() also covers nested tags; plain .text may be None.
        description.append(next_p.text_content().strip())
        next_p = next_p.getnext()

    data[strong.text] = " ".join(description)

print(data)

But the number of elements may differ. Right now there are 2, but there could be 10, or just 1.
The script prints (wrapped here for readability):
{'Product Composition': '93% Polyamide 7% Elastane Lining: 100% Polyester Dress Length: 90 cm',
 'Product Attributes;': ': Boat Neck, Long Sleeve, Midi, Zip, Concealed, Laced, Side Lining Type: Full Lining'}
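
On the concern that the number of elements can vary: the loop does not depend on how many headings or paragraphs there are, because each heading just collects siblings until the next heading or the end. A minimal sketch of the same logic wrapped in a function and run on two made-up inputs follows. The name sections_to_dict and both sample snippets are assumptions for illustration, not part of the original question.

```python
from lxml.html import fromstring


def sections_to_dict(html_data):
    """Map each <p><strong>heading</strong></p> to the text of the <p> siblings after it."""
    tree = fromstring(html_data)
    data = {}
    for strong in tree.xpath('//p/strong'):
        description = []
        next_p = strong.getparent().getnext()
        # Collect siblings until the next heading paragraph or the last sibling.
        while next_p is not None and not next_p.xpath('.//strong'):
            description.append(next_p.text_content().strip())
            next_p = next_p.getnext()
        data[strong.text] = " ".join(description)
    return data


# Hypothetical input: a single section with a single paragraph.
one = sections_to_dict(
    '<div><p><strong>Care</strong></p><p>Hand wash only</p></div>'
)
print(one)

# Hypothetical input: three sections, the middle one with no paragraphs.
three = sections_to_dict(
    '<div>'
    '<p><strong>A</strong></p><p>a1</p><p>a2</p><p>a3</p>'
    '<p><strong>B</strong></p>'
    '<p><strong>C</strong></p><p>c1</p>'
    '</div>'
)
print(three)
```

A heading with nothing after it maps to an empty string rather than being dropped, which may or may not be what you want. Filter those keys out afterwards if not.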