
Python Scrapy for scraping table columns and rows


I'm new to Python and this is my first time learning Scrapy. I've done data mining successfully with Perl before, but this is a whole different world.

I'm trying to scrape a table and grab the columns of each row. My code is below.

items.py

from scrapy.item import Item, Field

class Cio100Item(Item):
    company = Field()
    person = Field()
    industry = Field()
    url = Field()
scrape.py (the spider)
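(The spider source itself did not survive the extraction of this page. For context, here is a minimal skeleton consistent with the start URL and the XPath quoted below; the class name, import paths, and loop body are assumptions, not the original code.)

import scrapy
from scrapy.selector import Selector

from items import Cio100Item  # import path assumed


class Cio100Spider(scrapy.Spider):
    name = 'scrape'
    allowed_domains = ['www.cio.co.uk']
    start_urls = ['http://www.cio.co.uk/cio100/2013/cio/']

    def parse(self, response):
        sel = Selector(response)
        # the selection the question identifies as the likely problem
        tables = sel.xpath('//table[@class="bgWhite listTable"]//h2')
        for table in tables:
            item = Cio100Item()
            # field assignments went here (not recoverable from the page)
            yield item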

I'm having trouble understanding how to express the XPath selections correctly.

I think this line is the problem:

      tables = sel.xpath('//table[@class="bgWhite listTable"]//h2')
When I run the scraper as above, I get results like this in the terminal:

2014-01-13 22:13:29-0500 [scrape] DEBUG: Scraped from <200 http://www.cio.co.uk/cio100/2013/cio/>

{'company': [u"\nDomino's Pizza\n"],
 'industry': [u"\nDomino's Pizza\n"],
 'person': [u"\nDomino's Pizza\n"],
 'url': [u'/cio100/2013/dominos-pizza/']}

2014-01-13 22:13:29-0500 [scrape] DEBUG: Scraped from <200 http://www.cio.co.uk/cio100/2013/cio/>
{'company': [u'\nColin Rees\n'],
 'industry': [u'\nColin Rees\n'],
 'person': [u'\nColin Rees\n'],
 'url': [u'/cio100/2013/dominos-pizza/']}
If I adjust the XPath, I get:

2014-01-13 22:16:46-0500 [scrape] DEBUG: Scraped from <200 http://www.cio.co.uk/cio100/2013/cio/>
{'company': [u'\nRetail\n'],
 'industry': [u'\nRetail\n'],
 'person': [u'\nRetail\n'],
 'url': [u'/cio100/2013/dominos-pizza/']}
2014-01-13 22:16:46-0500 [scrape] DEBUG: Scraped from <200 http://www.cio.co.uk/cio100/2013/cio/>
{'company': [u'\nDetails\n'],
 'industry': [u'\nDetails\n'],
 'person': [u'\nDetails\n'],
 'url': [u'/cio100/2013/dominos-pizza/']}
Here it only produces one block, and the industry and url are captured correctly, but it doesn't get the company name or the person.

Any help would be much appreciated.


Thanks
$ scrapy shell http://www.cio.co.uk/cio100/2013/cio/
...
>>> for tr in sel.xpath('//table[@class="bgWhite listTable"]/tr'):
...     item = Cio100Item()
...     item['company'] = tr.xpath('td[2]//a/text()').extract()[0].strip()
...     item['person'] = tr.xpath('td[3]//a/text()').extract()[0].strip()
...     item['industry'] = tr.xpath('td[4]//a/text()').extract()[0].strip()
...     item['url'] = tr.xpath('td[4]//a/@href').extract()[0].strip()
...     print item
... 
{'company': u'LOCOG',
 'industry': u'Leisure and entertainment',
 'person': u'Gerry Pennell',
 'url': u'/cio100/2013/locog/'}
{'company': u'Laterooms.com',
 'industry': u'Leisure and entertainment',
 'person': u'Adam Gerrard',
 'url': u'/cio100/2013/lateroomscom/'}
{'company': u'Vodafone',
 'industry': u'Communications and IT services',
 'person': u'Albert Hitchcock',
 'url': u'/cio100/2013/vodafone/'}
...

Aside from that, you are better off yielding items one by one rather than accumulating them in a list.

And instead of calling extract() and then taking the 0th element, you can call extract_first().
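Putting those suggestions together, a sketch of what such a parse() callback could look like (not the original answer's code; the table XPath and item fields come from above, while the import path and the row-skipping logic are assumptions):

import scrapy

from items import Cio100Item  # import path assumed


class Cio100Spider(scrapy.Spider):
    name = 'scrape'
    start_urls = ['http://www.cio.co.uk/cio100/2013/cio/']

    def parse(self, response):
        # Select whole table rows, then read each cell relative to the row.
        for tr in response.xpath('//table[@class="bgWhite listTable"]/tr'):
            company = tr.xpath('td[2]//a/text()').extract_first()
            if not company:
                continue  # skip header/spacer rows that have no link in td[2]
            item = Cio100Item()
            item['company'] = company.strip()
            item['person'] = (tr.xpath('td[3]//a/text()').extract_first() or '').strip()
            item['industry'] = (tr.xpath('td[4]//a/text()').extract_first() or '').strip()
            item['url'] = tr.xpath('td[4]//a/@href').extract_first()
            yield item  # yield each item as it is built instead of returning a list

extract_first() returns None rather than raising an IndexError when a cell has no match, which is why rows without a company link are skipped explicitly before building the item.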