This solution is similar to @wuliang's. I started with @Alexis de Tréglodé's method but ran into a problem: the fact that your start_requests() method returns a list of URLs,

return [Request(url=start_url) for start_url in start_urls]

causes the output to be non-sequential (asynchronous). If the return is a single request instead, the requirement can be met by creating an alternative other_urls list. In addition, other_urls can be used to append URLs scraped from other pages.
from scrapy import log
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from practice.items import MlboddsItem

log.start()

class PracticeSpider(BaseSpider):
    name = "sbrforum.com"
    allowed_domains = ["sbrforum.com"]
    other_urls = [
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/",
    ]

    def start_requests(self):
        log.msg('Starting Crawl!', level=log.INFO)
        start_urls = "http://www.sbrforum.com/mlb-baseball/odds-scores/20110327/"
        return [Request(start_urls, meta={'items': []})]

    def parse(self, response):
        log.msg("Begin Parsing", level=log.INFO)
        log.msg("Response from: %s" % response.url, level=log.INFO)
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("//*[@id='moduleData8460']")
        items = response.meta['items']
        for site in sites:
            item = MlboddsItem()
            item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()
            item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text()').extract()
            items.append(item)
        # here we .pop(0) the next URL in line
        if self.other_urls:
            return Request(self.other_urls.pop(0), meta={'items': items})
        return items
Scrapy now has a priority attribute on Request. If you have multiple requests in a function and want a specific request to be processed first, you can set:
def parse(self, response):
    url = 'http://www.example.com/first'
    yield Request(url=url, callback=self.parse_data, priority=1)
    url = 'http://www.example.com/second'
    yield Request(url=url, callback=self.parse_data)
Scrapy will process the request with priority=1 first. I personally like @user1460015's implementation, having worked out my own solution. My solution is to use Python's subprocess to call scrapy url by url until all URLs have been processed. In my code, if the user does not specify that the URLs should be parsed sequentially, we can start the spider in the normal way:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({'USER_AGENT': 'Mozilla/4.0 (compatible; \
    MSIE 7.0; Windows NT 5.1)'})
process.crawl(Spider, url=args.url)
process.start()
If the user specifies that the URLs need to be processed sequentially, we can do this:
import subprocess

for url in urls:
    process = subprocess.Popen('scrapy runspider scrapper.py -a url='\
        + url + ' -o ' + outputfile)
    process.wait()
Please note: this implementation does not handle errors. There is a simpler way to make Scrapy follow the order of start_urls: you can just uncomment and change CONCURRENT_REQUESTS in settings.py to 1.
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1
Can you show us how you are calling your spider? > I have a spider crawling multiple sites. Do you mean multiple start URLs? All the feedback was great, thanks everyone for the help; this one comes closest to what I want to do. I have a related question: suppose I want to specify a list of URLs, where the first is a site's homepage and the second is a list of pages. How would I do that? @prakhamohansrivastava, put them in start_urls? Or you can add:
custom_settings = {'CONCURRENT_REQUESTS': '1'}

right below class DmozSpider(BaseSpider): name = "dmoz". That way you don't need an extra settings.py file. settings.py is the default file in a Scrapy project structure, not an extra one.
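As a sketch of the per-spider override described in that comment (assuming a Scrapy version that supports custom_settings, i.e. 1.0 or later; the spider name and URLs here are placeholders, not taken from the question):

```python
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    # Per-spider settings override the project-wide settings.py,
    # so no separate settings-file change is needed for this spider.
    custom_settings = {'CONCURRENT_REQUESTS': 1}

    # With only one request in flight at a time, these are
    # fetched in list order (subject to retries).
    start_urls = [
        "http://www.example.com/",        # homepage first
        "http://www.example.com/pages/",  # then the page list
    ]

    def parse(self, response):
        yield {'url': response.url}
```

Because CONCURRENT_REQUESTS is 1, the scheduler only ever dispatches the next URL after the previous response has been handled, which is what makes the output sequential.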