Python: Getting attribute names with Scrapy XPath

python xpath scrapy


I'm trying to get the keys and values of the attributes of a tag in an XML file (using Scrapy and XPath).

The tag looks something like:

<element attr1="value1" attr2="value2" ...>


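You need @*, which means "any attribute". The XPath expression //element/@* selects all attribute nodes of every element named element. In Scrapy, extracting those nodes gives you the attribute values.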
Short version
>>> for element in selector.xpath('//element'):
...     attributes = []
...     # loop over all attribute nodes of the element
...     for index, attribute in enumerate(element.xpath('@*'), start=1):
...         # use XPath's name() string function on each attribute,
...         # using their position
...         attribute_name = element.xpath('name(@*[%d])' % index).extract_first()
...         # Scrapy's extract() on an attribute returns its value
...         attributes.append((attribute_name, attribute.extract()))
... 
>>> attributes # list of (attribute name, attribute value) tuples
[(u'attr1', u'value1'), (u'attr2', u'value2')]
>>> dict(attributes)
{u'attr2': u'value2', u'attr1': u'value1'}
>>> 

Long version

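XPath has a name() function for getting node names:
"The name function returns a string containing a QName representing the expanded-name of the node in the argument node-set that is first in document order. (...) If the argument is omitted, it defaults to a node-set with the context node as its only member."
(Source: the XPath 1.0 specification.)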
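Chaining name() with no argument works on element nodes, but doing the same on the attribute nodes selected by @* returns nothing.
Note: I don't know whether this is a limitation of lxml/libxml2, which Scrapy uses under the hood, or whether the XPath spec disallows it. (I don't see why it would.)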
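What you can do, though, is use the name(node-set) form, i.e. pass a non-empty node-set as the argument. If you read the part of the XPath 1.0 spec quoted above carefully, name(node-set), like the other string functions, only takes into account the first node in the node-set (in document order). So name(@*) evaluated on an element returns only the name of its first attribute.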
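Attribute nodes also have positions, so you can loop over all attributes by position. Here we have 2 (the result of count(@*) on the context node; that session is reproduced at the end of this answer).
Now you can guess what we can do: call name() on each @*[i]: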

>>> for element in selector.xpath('//element'):
...     for i in range(1, 2+1):
...         print element.xpath('name(@*[%d])' % i).extract_first()
... 
attr1
attr2
>>> 

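Putting it all together, and assuming that @* returns attributes in document order (which I don't think the XPath 1.0 spec states, but it is what I see happening with lxml), you end up with this: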
>>> attributes = []
>>> for element in selector.xpath('//element'):
...     for index, attribute in enumerate(element.xpath('@*'), start=1):
...         attribute_name = element.xpath('name(@*[%d])' % index).extract_first()
...         attributes.append((attribute_name, attribute.extract()))
... 
>>> attributes
[(u'attr1', u'value1'), (u'attr2', u'value2')]
>>> dict(attributes)
{u'attr2': u'value2', u'attr1': u'value1'}
>>> 

The count(@*) and positional @*[i] lookups referred to in the long version:
>>> for element in selector.xpath('//element'):
...     print element.xpath('count(@*)').extract_first()
... 
2.0
>>> for element in selector.xpath('//element'):
...     for i in range(1, 2+1):
...         print element.xpath('@*[%d]' % i).extract_first()
... 
value1
value2
>>> 
