XPath: how to extract only elements and text (filtering out attributes, classes, and inline CSS)
Running this:
hxs.select('//*[@id="column_one"]/h2/following-sibling::div[1]').extract()
gives this sample output:
<div class="OneLinkNoTx">
<strong>Location:</strong>
Abu Dhabi, United Arab Emirates
</div>
<div class="OneLinkNoTx">
<strong>Travel Percentage:</strong>
None
</div>
<div align="justify">
Salary: 100k
</div>
I would like the output to look like this:
<div>
<strong>Location:</strong>
Abu Dhabi, United Arab Emirates
</div>
<div>
<strong>Travel Percentage:</strong>
None
</div>
<div>
Salary: 100k
</div>
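I only want the HTML elements, without any HTML attributes. Is this possible with Scrapy/XPath?

You can use lxml.html.clean.Cleaner with an empty safe_attrs set; the IPython session further below shows it. Note that clean modifies the element in place.

XPath by itself cannot change a document, but you can use it to select all attributes and then possibly remove them using Scrapy. The XPath expression that selects all attributes is //@*.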
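As a rough illustration of the XPath route, the sketch below selects every attribute with //@* and deletes each one from the element that owns it. It uses plain lxml rather than Scrapy's own selector, and the function name strip_attributes and the sample string are made up for this example. It relies on lxml returning attribute results as "smart strings" that expose getparent() and attrname.

```python
import lxml.html


def strip_attributes(html_fragment):
    """Return html_fragment re-serialized with every attribute removed.

    Hypothetical helper for illustration; not part of lxml or Scrapy.
    """
    # fromstring() returns the element itself when there is a single
    # top-level element, and wraps several top-level elements in one <div>.
    root = lxml.html.fromstring(html_fragment)

    # //@* selects every attribute node in the parsed document. lxml returns
    # each one as a "smart string" that remembers its owner and its name.
    for attr in root.xpath("//@*"):
        owner = attr.getparent()
        del owner.attrib[attr.attrname]

    return lxml.html.tostring(root, encoding="unicode")


sample = '<div class="OneLinkNoTx"><strong id="loc">Location:</strong> Abu Dhabi</div>'
print(strip_attributes(sample))
```

XPath only does the selecting here; the actual removal happens through the Python element API, which is why XPath alone is not enough.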
In [1]: import lxml.html
In [2]: import lxml.html.clean  # on lxml 5.2 and later this needs the separate lxml_html_clean package
In [3]: html = """<div class="OneLinkNoTx">
<strong>Location:</strong>
Abu Dhabi, United Arab Emirates
</div>
<div class="OneLinkNoTx">
<strong>Travel Percentage:</strong>
None
</div>
<div align="justify">
Salary: 100k
</div>"""
In [4]: doc = lxml.html.fromstring(html)
In [5]: clean = lxml.html.clean.Cleaner(safe_attrs=frozenset())
In [6]: clean(doc)
In [7]: print(lxml.html.tostring(doc, encoding='unicode'))
<div><div>
<strong>Location:</strong>
Abu Dhabi, United Arab Emirates
</div>
<div>
<strong>Travel Percentage:</strong>
None
</div>
<div>
Salary: 100k
</div></div>
Because the input has several top-level elements, fromstring wraps them in an extra <div>. To avoid the wrapper, parse with fragments_fromstring and clean each element separately:
In [28]: elements = lxml.html.fragments_fromstring(html)
In [29]: for el in elements:  # map() is lazy in Python 3, so loop explicitly
   ....:     clean(el)
   ....:
In [30]: print(''.join(lxml.html.tostring(el, encoding='unicode') for el in elements))
<div>
<strong>Location:</strong>
Abu Dhabi, United Arab Emirates
</div>
<div>
<strong>Travel Percentage:</strong>
None
</div>
<div>
Salary: 100k
</div>
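To tie this back to the original Scrapy call: .extract() returns a list of HTML strings, one per matched element, so the cleanup can run as a post-processing step on that list. The sketch below is one possible way to do it without the Cleaner class, by clearing the attrib mapping of every element. The function name strip_extracted is hypothetical, and it assumes each string holds exactly one top-level element, as strings produced by .extract() should.

```python
import lxml.html
from lxml import etree


def strip_extracted(html_strings):
    """Strip all attributes from each HTML string in a list.

    Hypothetical helper for illustration; each string is assumed to
    contain exactly one top-level element.
    """
    cleaned = []
    for html in html_strings:
        element = lxml.html.fragment_fromstring(html)
        # iter(etree.Element) visits the element and its element
        # descendants, skipping comments and processing instructions.
        for el in element.iter(etree.Element):
            el.attrib.clear()
        cleaned.append(lxml.html.tostring(element, encoding="unicode"))
    return cleaned


extracted = [
    '<div class="OneLinkNoTx">\n<strong>Location:</strong>\nAbu Dhabi\n</div>',
    '<div align="justify">\nSalary: 100k\n</div>',
]
for fragment in strip_extracted(extracted):
    print(fragment)
```

In a spider this would be called on the result of the selector query, for example strip_extracted(hxs.select(...).extract()); in current Scrapy releases the equivalent query is response.xpath(...).getall().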