Python: parsing sections delimited by tags

python xpath web-scraping scrapy

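I need to store elements that are separated by headers. I'm struggling to come up with an XPath expression, or a simple parser, that groups my items into the sections given by the header tags.

I know how to scrape lists when the elements are all on the same level, or when each group is wrapped in its own container, but I can't figure out how to parse the data when the groups are delimited by sibling elements instead of containers. For example: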

<div>
<h1>section a</h1>
<item>221</item>
<item>453</item>
<item>473</item>
<h1>section b</h1>
<item>430</item>
<item>493</item>
<h1>section c</h1>
<item>694</item>
<item>931</item>
</div>

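Is there an idiomatic way to recover this record structure with XPath? Is there a way to iterate over a Scrapy selector to walk the DOM and detect where each section starts and stops?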

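One solution using XPath is to count the preceding h1 siblings of each node under div, keeping only the nodes that are not themselves h1. The nodes that belong to the i-th section are exactly those with i preceding h1 siblings.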

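Alternatively, walk the children of the div in document order and start a new group at each header:

header = null
items = []

for each element in div
    if element is a header
        if header is not null
            process header, items
        header = the element text
        items = []
    else
        items append element text
end
if header is not null
    process header, items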
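
The loop above can be sketched in Python. This is a minimal sketch, not the answerer's code: it assumes the fragment is well-formed XML so that the standard library's xml.etree.ElementTree can parse it (the original thread uses Scrapy selectors), and the names SAMPLE and group_by_header are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question (assumed to be well-formed XML).
SAMPLE = """<div>
<h1>section a</h1>
<item>221</item>
<item>453</item>
<item>473</item>
<h1>section b</h1>
<item>430</item>
<item>493</item>
<h1>section c</h1>
<item>694</item>
<item>931</item>
</div>"""


def group_by_header(div, header_tag="h1"):
    """Return a list of (header text, [item texts]) in document order."""
    sections = []
    header, items = None, []
    for element in div:  # direct children only, in document order
        text = (element.text or "").strip()
        if element.tag == header_tag:
            # A new header closes the previous section, if there was one.
            if header is not None:
                sections.append((header, items))
            header, items = text, []
        elif header is not None:
            # Assumption: elements before the first header are ignored.
            items.append(text)
    # Flush the last section.
    if header is not None:
        sections.append((header, items))
    return sections


sections = group_by_header(ET.fromstring(SAMPLE))
for header, items in sections:
    print(header, items)
```

A single pass keeps the work linear in the number of children, whereas evaluating one XPath query per header rescans the siblings each time.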

The XPath approach, tried in an IPython session (Python 2 syntax):

$ ipython
Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
Type "copyright", "credits" or "license" for more information.

IPython 1.2.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import scrapy

In [2]: selector = scrapy.Selector(text="""
<div>
<h1>section a</h1>
<item>221</item>
<item>453</item>
<item>473</item>
<h1>section b</h1>
<item>430</item>
<item>493</item>
<h1>section c</h1>
<item>694</item>
<item>931</item>
</div>""")

In [3]: for i, header in enumerate(selector.xpath('.//div/h1'), start=1):
   ...:     print header.xpath('normalize-space()').extract()
   ...:     between = selector.xpath(""".//div/node()[count(preceding-sibling::h1)=%d]
   ...:                                              [not(self::h1)]""" % i)
   ...:     print between.extract()
   ...:     
[u'section a']
[u'\n', u'<item>221</item>', u'\n', u'<item>453</item>', u'\n', u'<item>473</item>', u'\n']
[u'section b']
[u'\n', u'<item>430</item>', u'\n', u'<item>493</item>', u'\n']
[u'section c']
[u'\n', u'<item>694</item>', u'\n', u'<item>931</item>', u'\n']

See also my answer to a similar question at: