
Python: how to extract a list of label/value pairs with Scrapy when HTML markup is missing

Tags: python, xpath, scrapy

I'm currently working with a document that contains label/value pairs marked up like this:

<b> label1 </b>
value1 <br>
<b> label2 </b>
value2 <br>
....
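
The obvious way to pair these up is to select labels and values into two parallel lists and zip them by index. A minimal sketch of that approach (my illustration of the method the question refers to; "section" is assumed to be a Scrapy selector for the enclosing element):

# Hypothetical sketch: `section` is a Scrapy selector for the enclosing element
labels = section.select("b/text()").extract()
values = section.select("text()[preceding-sibling::b]").extract()
# pair the two parallel lists purely by position
pairs = zip(labels, values)

Any stray text node shifts the two lists out of alignment, which is exactly the weakness discussed next.
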
But I don't like this approach of matching the nodes of the two lists by index. I would rather iterate over one list (values or labels) and query for the matching node with a relative XPath. For example:

values = section.select("text()[preceding-sibling::b/text()]")
for value in values:
    value.select("/preceding-sibling::b/text()")
I keep tweaking this expression, but it never returns any matches.

Update

I'm looking for a robust approach that tolerates "noise" in the markup, for example:

garbage1
<b> label1 </b> value1 <br>
<b> label2 </b> value2 <br>
garbage2
<b> label3 </b> value3 <br>
garbage
Edit: sorry, I used lxml here, but the same works with Scrapy's own selectors.

For the specific HTML given, this works:

>>> s = """<b> label1 </b>
... value1 <br>
... <b> label2 </b>
... value2 <br>
... """
>>> 
>>> import lxml.html
>>> lxml.html.fromstring(s)
<Element span at 0x10fdcadd0>
>>> soup = lxml.html.fromstring(s)
>>> soup.xpath("//text()")
[' label1 ', '\nvalue1 ', ' label2 ', '\nvalue2 ']
>>> res = soup.xpath("//text()")
>>> for i in range(0, len(res), 2):
...     print(res[i:i+2])
... 
[' label1 ', '\nvalue1 ']
[' label2 ', '\nvalue2 ']
>>> 
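
Since the same is claimed to work with Scrapy's own selectors, here is a rough equivalent using the Selector API (a sketch of mine; the whitespace filter is added because Scrapy's HTML wrapper can introduce stray whitespace-only text nodes):

from scrapy.selector import Selector

s = """<b> label1 </b>
value1 <br>
<b> label2 </b>
value2 <br>
"""

sel = Selector(text=s)
# //text() flattens the fragment into alternating label/value text nodes;
# whitespace-only nodes are dropped so the pairing by position stays intact
res = [t for t in sel.xpath("//text()").extract() if t.strip()]
pairs = [res[i:i + 2] for i in range(0, len(res), 2)]
print(pairs)  # [[' label1 ', '\nvalue1 '], [' label2 ', '\nvalue2 ']]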

Also worth mentioning: if you loop over selected elements and run an XPath inside the for loop, you want "./foo", not "/foo".
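
To illustrate that pitfall (a minimal sketch of mine using Scrapy's Selector; the same applies to "/foo" versus "./foo"):

from scrapy.selector import Selector

sel = Selector(text="<div><p>a</p></div><div><p>b</p></div>")
for div in sel.xpath("//div"):
    # "//p" restarts from the document root and matches every <p>,
    # while ".//p" stays relative to the current <div>
    all_ps = div.xpath("//p/text()").extract()   # ['a', 'b'] on every iteration
    own_ps = div.xpath(".//p/text()").extract()  # ['a'], then ['b']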


So, what exactly are you trying to achieve? I want a robust way to match labels with values, with no risk of mismatching them.
This implementation differs from mine, but it is not actually more robust. See the update in the question.
Here is a variant that tolerates the noise: select the candidate value nodes, then pair each one with its own label through getparent().

>>> # `tree` is the noisy sample from the update, parsed with lxml.html.fromstring()
>>> bs = tree.xpath("//text()[preceding-sibling::b/text()]")
>>> for b in bs:
...     # for a tail text node, getparent() returns the element the text follows,
...     # so this keeps only values that directly follow a <b> label and skips the garbage
...     if b.getparent().tag == "b":
...         print([b.getparent().text, b])
...
[' label1 ', '\nvalue1 ']
[' label2 ', '\nvalue2 ']
[' label3 ', '\nvalue3 ']
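
For reference, a self-contained sketch that runs the same idea end to end against the noisy sample and builds a label-to-value dict (the dict-building and variable names are my additions):

import lxml.html

noisy = """garbage1
<b> label1 </b> value1 <br>
<b> label2 </b> value2 <br>
garbage2
<b> label3 </b> value3 <br>
garbage
"""

tree = lxml.html.fromstring(noisy)

pairs = {}
for node in tree.xpath("//text()[preceding-sibling::b/text()]"):
    parent = node.getparent()
    # keep only text that is the tail of a <b>, i.e. a value directly
    # following its label; garbage text hangs off <br> or the wrapper instead
    if parent.tag == "b":
        pairs[parent.text.strip()] = node.strip()

print(pairs)  # {'label1': 'value1', 'label2': 'value2', 'label3': 'value3'}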