Python: Scraping text with Scrapy from a node with multiple descendants


I'm trying to use Scrapy to scrape rows from an HTML table that look like this:

<tr bgcolor="#F3F1E6">

  <td class="htable_eng_text" align="center">
    <a href="results.asp?racedate=02/02/2014&amp;raceno=08&amp;venue=ST" class="htable_eng_text">
      368
    </a>
  </td>

  <td class="htable_eng_text" align="center">
    02/02/14
  </td>

  <td class="htable_eng_text" align="center" nowrap="">
    ST / 
    <font title="TURF">
      "Turf" / 
    </font>
    "C         "
  </td>

  <td class="htable_eng_text" align="center">
    <font class="htable_eng_rpnarrow_text">
      4
    </font>
    <font class="htable_eng_rpnarrow_text">
      &nbsp;&nbsp;4
    </font>
    <font class="htable_eng_rpnarrow_text">
      &nbsp;&nbsp;3
    </font>
    <font class="htable_eng_rpnarrow_text">
      &nbsp;&nbsp;2
    </font>
    <font class="htable_eng_rpnarrow_text">
      &nbsp;&nbsp;5
    </font>
  </td>
</tr>
My current XPath attempt looks like this:

sel.xpath('td//text()[normalize-space()]').extract()
This works fine when the text sits directly inside the `<td>` tag, or when the nested tags don't branch (e.g. the first and second cells). But it breaks for cells with multiple descendants (e.g. the third and fourth cells): my XPath returns a separate element for each descendant's text, whereas I want them joined together.

How can I do this?

>>> h = '''
... <table>
... <tr bgcolor="#F3F1E6">
... ...
... </tr>
... </table>
... '''
>>>
>>> from scrapy.selector import Selector
>>> import re
>>> def normalize(xs):
...     text = ''.join(xs)
...     text = text.strip()
...     return re.sub(r'[\s\xa0]+', ' ', text)
...
>>> root = Selector(text=h, type='html')
>>> print [normalize(x.xpath('.//text()').extract()) for x in root.xpath('.//td')]
[u'368', u'02/02/14', u'ST / "Turf" / "C "', u'4 4 3 2 5']
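The session above uses Python 2 syntax (`print` statement, `u''` literals). For reference, here is the same whitespace-normalizing helper as a Python 3 sketch, exercised on fragments like those `.//text()` would return for the fourth cell (the fragment values below are illustrative, not taken from a live scrape):

```python
import re

def normalize(text_fragments):
    """Join the text fragments of one cell, then collapse runs of
    whitespace (including non-breaking spaces, \\xa0) to single spaces."""
    text = ''.join(text_fragments).strip()
    return re.sub(r'[\s\xa0]+', ' ', text)

# Roughly what .//text() yields for the cell holding the five <font> tags:
fragments = ['\n      4\n    ', '\n      \xa0\xa04\n    ',
             '\n      \xa0\xa03\n    ', '\n      \xa0\xa02\n    ',
             '\n      \xa0\xa05\n    ']
print(normalize(fragments))  # → 4 4 3 2 5
```

Note that in Python 3, `\s` in a Unicode pattern already matches `\xa0`, so the explicit `\xa0` in the character class is redundant but harmless.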