XPath: how to extract only elements and text (stripping attributes, classes, inline CSS)


Running this:

hxs.select('//*[@id="column_one"]/h2/following-sibling::div[1]').extract()

gives this sample output:

<div class="OneLinkNoTx">
    <strong>Location:</strong> 
    Abu Dhabi, United Arab Emirates
</div>
<div class="OneLinkNoTx">
    <strong>Travel Percentage:</strong> 
    None
</div>
<div align="justify">
    Salary: 100k
</div>

I would like the output to look like this instead:

<div>
    <strong>Location:</strong> 
    Abu Dhabi, United Arab Emirates
</div>
<div>
    <strong>Travel Percentage:</strong> 
    None
</div>
<div>
    Salary: 100k
</div>

I want only the HTML elements, without any HTML attributes. Is that possible with Scrapy/XPath?

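You can use lxml.html.clean.Cleaner with an empty safe_attrs set, as in the IPython session further down. On lxml 5.2 and later the cleaner ships separately as the lxml_html_clean package, which must be installed for the import to work.

Note that clean modifies the element in place.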

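XPath cannot modify a document, but you can use it to select all attributes and possibly remove them using Scrapy. The XPath expression that selects every attribute in the document would be
//@*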

The Cleaner approach, as an IPython session:

In [1]: import lxml.html

In [2]: import lxml.html.clean

In [3]: html = """<div class="OneLinkNoTx">
    <strong>Location:</strong> 
    Abu Dhabi, United Arab Emirates
</div>
<div class="OneLinkNoTx">
    <strong>Travel Percentage:</strong> 
    None
</div>
<div align="justify">
    Salary: 100k
</div>"""

In [4]: doc = lxml.html.fromstring(html)

In [5]: clean = lxml.html.clean.Cleaner(safe_attrs=frozenset())

In [6]: clean(doc)

In [7]: print(lxml.html.tostring(doc, encoding='unicode'))
<div><div>
    <strong>Location:</strong> 
    Abu Dhabi, United Arab Emirates
</div>
<div>
    <strong>Travel Percentage:</strong> 
    None
</div>
<div>
    Salary: 100k
</div></div>
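
fromstring wraps the three sibling divs in an extra div. To avoid the wrapper, parse with fragments_fromstring and clean each element separately:

In [28]: elements = lxml.html.fragments_fromstring(html)

In [29]: for el in elements:
    ...:     clean(el)
    ...:

In [30]: print(''.join(lxml.html.tostring(el, encoding='unicode') for el in elements))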
<div>
    <strong>Location:</strong> 
    Abu Dhabi, United Arab Emirates
</div>
<div>
    <strong>Travel Percentage:</strong> 
    None
</div>
<div>
    Salary: 100k
</div>