Python: scraping text from a node with multiple descendants using Scrapy
I am trying to use Scrapy to scrape rows from an HTML table that look like this:
<tr bgcolor="#F3F1E6">
<td class="htable_eng_text" align="center">
<a href="results.asp?racedate=02/02/2014&raceno=08&venue=ST" class="htable_eng_text">
368
</a>
</td>
<td class="htable_eng_text" align="center">
02/02/14
</td>
<td class="htable_eng_text" align="center" nowrap="">
ST /
<font title="TURF">
"Turf" /
</font>
"C "
</td>
<td class="htable_eng_text" align="center">
<font class="htable_eng_rpnarrow_text">
4
</font>
<font class="htable_eng_rpnarrow_text">
4
</font>
<font class="htable_eng_rpnarrow_text">
3
</font>
<font class="htable_eng_rpnarrow_text">
2
</font>
<font class="htable_eng_rpnarrow_text">
5
</font>
</td>
</tr>
My current XPath attempt looks like this:
sel.xpath('td//text()[normalize-space()]').extract()
This works fine when the text sits directly inside the <td> tag, or when the nested tags do not branch (as in the first and second cells). But it breaks when a cell contains multiple descendants (as in the third and fourth cells), because my XPath returns a separate element for each descendant text node, whereas I want them joined together into one string.
How can I do this?
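For context, the fragmentation can be reproduced with nothing but the standard library; here ElementTree's `itertext()` stands in for the `.//text()` XPath step, applied to a cell modelled on the fourth `<td>` in the question:

```python
import xml.etree.ElementTree as ET

# a minimal cell modelled on the fourth <td> in the question
h = '<td><font>4</font><font>4</font><font>3</font></td>'
td = ET.fromstring(h)

# each descendant text node comes back as a separate string...
parts = list(td.itertext())
print(parts)            # ['4', '4', '3']

# ...so joining them has to be an explicit extra step
joined = ' '.join(parts)
print(joined)           # '4 4 3'
```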
>>> h = '''
... <table>
... <tr bgcolor="#F3F1E6">
... ...
... </tr>
... </table>
... '''
>>>
>>> from scrapy.selector import Selector
>>> import re
>>> def normalize(xs):
... text = ''.join(xs)
... text = text.strip()
... return re.sub(r'[\s\xa0]+', ' ', text)
...
>>> root = Selector(text=h, type='html')
>>> print [normalize(x.xpath('.//text()').extract()) for x in root.xpath('.//td')]
[u'368', u'02/02/14', u'ST / "Turf" / "C "', u'4 4 3 2 5']
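The same join-then-collapse technique also works without Scrapy installed. As a library-free sketch, here is the answer's `normalize()` applied via the standard library's ElementTree (`itertext()` playing the role of the `.//text()` step) on a simplified, well-formed copy of the table from the question:

```python
import re
import xml.etree.ElementTree as ET

# simplified, well-formed fragment modelled on the table in the question
h = '''<table>
<tr>
<td><a href="results.asp">
368
</a></td>
<td>
02/02/14
</td>
<td>ST /
<font title="TURF">"Turf" /</font>
"C "
</td>
</tr>
</table>'''

def normalize(parts):
    # join all text fragments of one cell, then collapse whitespace runs
    # (including non-breaking spaces, \xa0) into single spaces
    text = ''.join(parts).strip()
    return re.sub(r'[\s\xa0]+', ' ', text)

root = ET.fromstring(h)
cells = [normalize(td.itertext()) for td in root.iter('td')]
print(cells)  # ['368', '02/02/14', 'ST / "Turf" / "C "']
```

With a modern Scrapy/parsel selector the per-cell call would be `normalize(td.xpath('.//text()').getall())`; the whitespace-collapsing regex matters because tables like this one are full of newlines and non-breaking spaces between the text fragments.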
>>>
>>>从scrapy.selector导入选择器
>>>进口稀土
>>>def正常化(xs):
... text=''.join(xs)
... text=text.strip()
... 返回re.sub(r'[\s\xa0]+','',文本)
...
>>>root=选择器(text=h,type='html')
>>>在root.xpath('.//td')中为x打印[normalize(x.xpath('.//text()')).extract())]
[u'368',u'02/02/14',u'ST/“草皮”/“C”,u'4425']