Python Scrapy only crawls page 1, not the other pages
Hi, I am working on a Scrapy project in which I need to scrape business details from a business directory.

The problem I am facing: when I try to crawl, my spider only fetches the details on the first page, while I also need the details of the remaining 9 pages; 10 pages in total. My spider code, items.py and settings.py are shown below. Please have a look at my code and help me fix it.

Spider code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from project2.items import Project2Item

class ProjectSpider(BaseSpider):
    name = "project2spider"
    allowed_domains = ["http://directory.thesun.co.uk/"]
    start_urls = [
        "http://directory.thesun.co.uk/find/uk/computer-repair"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="abTbl "]')
        items = []
        for site in sites:
            item = Project2Item()
            item['Catogory'] = site.select('span[@class="icListBusType"]/text()').extract()
            item['Bussiness_name'] = site.select('a/@title').extract()
            item['Description'] = site.select('span[last()]/text()').extract()
            item['Number'] = site.select('span[@class="searchInfoLabel"]/span/@id').extract()
            item['Web_url'] = site.select('span[@class="searchInfoLabel"]/a/@href').extract()
            item['adress_name'] = site.select('span[@class="searchInfoLabel"]/span/text()').extract()
            item['Photo_name'] = site.select('img/@alt').extract()
            item['Photo_path'] = site.select('img/@src').extract()
            items.append(item)
        return items
My items.py code looks like this:
from scrapy.item import Item, Field

class Project2Item(Item):
    Catogory = Field()
    Bussiness_name = Field()
    Description = Field()
    Number = Field()
    Web_url = Field()
    adress_name = Field()
    Photo_name = Field()
    Photo_path = Field()
My settings.py is:
BOT_NAME = 'project2'
SPIDER_MODULES = ['project2.spiders']
NEWSPIDER_MODULE = 'project2.spiders'
Please help.
You can extract the details from the other pages as well… If you check the page links, they look as follows: you can loop over the pages using urllib2 with a page variable:
import urllib2

# Loop over the ten result pages; the page number is appended to the URL.
for page in range(1, 11):
    response = urllib2.urlopen('http://directory.thesun.co.uk/find/uk/computer-repair/page/%d' % page)
    html = response.read()
and scrape the HTML. Below is working code. Paging should be handled by studying the website and its pagination structure, and applying that accordingly. In this case, the site exposes "/page/x", where x is the page number:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from project2spider.items import Project2Item
from scrapy.http import Request

class ProjectSpider(BaseSpider):
    name = "project2spider"
    allowed_domains = ["http://directory.thesun.co.uk"]
    current_page_no = 1
    start_urls = [
        "http://directory.thesun.co.uk/find/uk/computer-repair"
    ]

    def get_next_url(self, fired_url):
        if '/page/' in fired_url:
            url, page_no = fired_url.rsplit('/page/', 1)
        else:
            if self.current_page_no != 1:
                # end of scroll
                return
        self.current_page_no += 1
        return "http://directory.thesun.co.uk/find/uk/computer-repair/page/%s" % self.current_page_no

    def parse(self, response):
        fired_url = response.url
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="abTbl "]')
        for site in sites:
            item = Project2Item()
            item['Catogory'] = site.select('span[@class="icListBusType"]/text()').extract()
            item['Bussiness_name'] = site.select('a/@title').extract()
            item['Description'] = site.select('span[last()]/text()').extract()
            item['Number'] = site.select('span[@class="searchInfoLabel"]/span/@id').extract()
            item['Web_url'] = site.select('span[@class="searchInfoLabel"]/a/@href').extract()
            item['adress_name'] = site.select('span[@class="searchInfoLabel"]/span/text()').extract()
            item['Photo_name'] = site.select('img/@alt').extract()
            item['Photo_path'] = site.select('img/@src').extract()
            yield item
        next_url = self.get_next_url(fired_url)
        if next_url:
            yield Request(next_url, self.parse, dont_filter=True)
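The next-page logic above can be sanity-checked without running the spider or hitting the site. The sketch below is a hypothetical standalone rewrite of the same URL-building logic (the function name `next_page_url` and the explicit page-number argument are illustration only, not part of the original code):

```python
# Hypothetical standalone version of the answer's pagination logic.
BASE = "http://directory.thesun.co.uk/find/uk/computer-repair"

def next_page_url(fired_url, current_page_no):
    """Return (next_url, page_no) for the page after fired_url,
    or (None, page_no) once pagination has ended."""
    if '/page/' in fired_url:
        # Trust the page number embedded in the URL we just crawled.
        current_page_no = int(fired_url.rsplit('/page/', 1)[1])
    elif current_page_no != 1:
        # A URL without /page/ after page 1 means the scroll has ended.
        return None, current_page_no
    current_page_no += 1
    return "%s/page/%d" % (BASE, current_page_no), current_page_no
```

Isolating the logic like this makes it easy to confirm that the start URL leads to /page/2, /page/2 leads to /page/3, and so on.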
I tried the code @nizam.sp posted. It only yields 2 records: 1 record from the main page (the last one) and 1 record from the second page (a random one), and then it stops.
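One possible explanation (an assumption, not verified against your setup): "only the last record of each page" is the classic symptom of a `yield item` that ended up outside the `for` loop when the code was re-indented after copy-pasting. A minimal sketch of the difference, using hypothetical function names and plain dicts in place of `Project2Item`:

```python
def items_inside(sites):
    # yield inside the loop: one item per matched site
    for site in sites:
        yield {'name': site}

def items_outside(sites):
    # yield accidentally dedented outside the loop:
    # only the last site of the page survives
    for site in sites:
        item = {'name': site}
    yield item
```

For example, `list(items_outside(['a', 'b', 'c']))` produces a single item for `'c'`, while `items_inside` produces all three, so it is worth double-checking the indentation of the `yield item` line in your pasted spider.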