Python: scraping text from a node with multiple descendants using Scrapy
I am trying to use Scrapy to scrape rows from an HTML table that look like this:
<tr bgcolor="#F3F1E6">
<td class="htable_eng_text" align="center">
<a href="results.asp?racedate=02/02/2014&raceno=08&venue=ST" class="htable_eng_text">
368
</a>
</td>
<td class="htable_eng_text" align="center">
02/02/14
</td>
<td class="htable_eng_text" align="center" nowrap="">
ST /
<font title="TURF">
"Turf" /
</font>
"C "
</td>
<td class="htable_eng_text" align="center">
<font class="htable_eng_rpnarrow_text">
4
</font>
<font class="htable_eng_rpnarrow_text">
4
</font>
<font class="htable_eng_rpnarrow_text">
3
</font>
<font class="htable_eng_rpnarrow_text">
2
</font>
<font class="htable_eng_rpnarrow_text">
5
</font>
</td>
</tr>
My current XPath attempt looks like this:
sel.xpath('td//text()[normalize-space()]').extract()
This works fine when the text sits directly inside the <td> tag, or when the nested tags do not branch (as in the first and second cells). But it breaks when a cell contains multiple descendants (as in the third and fourth cells), because my XPath returns a separate element for each descendant text node, whereas I want them joined together into one string.
How can I do this?
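For context, the fragmentation can be reproduced with nothing but the standard library; here ElementTree's `itertext()` stands in for the `.//text()` XPath step, applied to a cell modelled on the fourth `<td>` in the question:

```python
import xml.etree.ElementTree as ET

# a minimal cell modelled on the fourth <td> in the question
h = '<td><font>4</font><font>4</font><font>3</font></td>'
td = ET.fromstring(h)

# each descendant text node comes back as a separate string...
parts = list(td.itertext())
print(parts)            # ['4', '4', '3']

# ...so joining them has to be an explicit extra step
joined = ' '.join(parts)
print(joined)           # '4 4 3'
```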
>>> h = '''
... <table>
... <tr bgcolor="#F3F1E6">
... ...
... </tr>
... </table>
... '''
>>>
>>> from scrapy.selector import Selector
>>> import re
>>> def normalize(xs):
... text = ''.join(xs)
... text = text.strip()
... return re.sub(r'[\s\xa0]+', ' ', text)
...
>>> root = Selector(text=h, type='html')
>>> print [normalize(x.xpath('.//text()').extract()) for x in root.xpath('.//td')]
[u'368', u'02/02/14', u'ST / "Turf" / "C "', u'4 4 3 2 5']
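The same join-then-collapse technique also works without Scrapy installed. As a library-free sketch, here is the answer's `normalize()` applied via the standard library's ElementTree (`itertext()` playing the role of the `.//text()` step) on a simplified, well-formed copy of the table from the question:

```python
import re
import xml.etree.ElementTree as ET

# simplified, well-formed fragment modelled on the table in the question
h = '''<table>
<tr>
<td><a href="results.asp">
368
</a></td>
<td>
02/02/14
</td>
<td>ST /
<font title="TURF">"Turf" /</font>
"C "
</td>
</tr>
</table>'''

def normalize(parts):
    # join all text fragments of one cell, then collapse whitespace runs
    # (including non-breaking spaces, \xa0) into single spaces
    text = ''.join(parts).strip()
    return re.sub(r'[\s\xa0]+', ' ', text)

root = ET.fromstring(h)
cells = [normalize(td.itertext()) for td in root.iter('td')]
print(cells)  # ['368', '02/02/14', 'ST / "Turf" / "C "']
```

With a modern Scrapy/parsel selector the per-cell call would be `normalize(td.xpath('.//text()').getall())`; the whitespace-collapsing regex matters because tables like this one are full of newlines and non-breaking spaces between the text fragments.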
>>>
>>>从scrapy.selector导入选择器
>>>进口稀土
>>>def正常化(xs):
... text=''.join(xs)
... text=text.strip()
... 返回re.sub(r'[\s\xa0]+','',文本)
...
>>>root=选择器(text=h,type='html')
>>>在root.xpath('.//td')中为x打印[normalize(x.xpath('.//text()')).extract())]
[u'368',u'02/02/14',u'ST/“草皮”/“C”,u'4425']