This solution is sequential.
It is similar to @wuliang's solution.

I started with @Alexis de Tréglodé's method but ran into a problem:
the fact that your start_requests() method returns a list of URLs,

return [Request(url=start_url) for start_url in start_urls]

causes the output to be non-sequential (asynchronous).

If the return is a single Response instead, then creating an alternative
other_urls list fulfills the requirement. In addition, other_urls can be
used to add on URLs scraped from other web pages.

from scrapy import log
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from practice.items import MlboddsItem

log.start()

class PracticeSpider(BaseSpider):
    name = "sbrforum.com"
    allowed_domains = ["sbrforum.com"]

    other_urls = [
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/",
           ]

    def start_requests(self):
        log.msg('Starting Crawl!', level=log.INFO)
        start_urls = "http://www.sbrforum.com/mlb-baseball/odds-scores/20110327/"
        return [Request(start_urls, meta={'items': []})]

    def parse(self, response):
        log.msg("Begin Parsing", level=log.INFO)
        log.msg("Response from: %s" % response.url, level=log.INFO)
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("//*[@id='moduleData8460']")
        items = response.meta['items']
        for site in sites:
            item = MlboddsItem()
            item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()
            item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text()').extract()
            items.append(item)

        # here we .pop(0) the next URL in line
        if self.other_urls:
            return Request(self.other_urls.pop(0), meta={'items': items})

        return items
Scrapy now has a priority attribute on Request.

If you have several Requests in one function and want a particular
request to be processed first, you can set:

def parse(self, response):
    url = 'http://www.example.com/first'
    yield Request(url=url, callback=self.parse_data, priority=1)

    url = 'http://www.example.com/second'
    yield Request(url=url, callback=self.parse_data)

Scrapy will process the request with priority 1 first.
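
As a follow-up (my own sketch, not part of the answer above), the same priority idea can be applied to a whole list of URLs by giving earlier URLs a higher priority. Note that priority only controls scheduling order; responses can still come back out of order unless concurrency is also limited, e.g. with CONCURRENT_REQUESTS = 1. The URLs here are placeholders.

import scrapy

class OrderedSpider(scrapy.Spider):
    name = "ordered"

    start_urls = [
        "http://www.example.com/first",
        "http://www.example.com/second",
        "http://www.example.com/third",
    ]

    def start_requests(self):
        for i, url in enumerate(self.start_urls):
            # priority decreases down the list; higher-priority requests
            # are dequeued by the scheduler first
            yield scrapy.Request(url, callback=self.parse, priority=len(self.start_urls) - i)

    def parse(self, response):
        self.logger.info("Parsed %s", response.url)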

I personally like @user1460015's implementation, although I ended up with my own workaround.

My solution is to use Python's subprocess module to call scrapy url by url until all URLs have been taken care of.

In my code, if the user does not specify that the URLs should be parsed sequentially, we can start the spider in the normal way:

from scrapy.crawler import CrawlerProcess

# Spider is your spider class; args.url comes from the script's argument parser
process = CrawlerProcess({'USER_AGENT': 'Mozilla/4.0 (compatible; \
    MSIE 7.0; Windows NT 5.1)'})
process.crawl(Spider, url = args.url)
process.start()
If the user specifies that the URLs need to be processed sequentially, we can do the following:

import subprocess

for url in urls:
    # pass the arguments as a list so no shell is needed; each crawl finishes
    # before the next one starts
    process = subprocess.Popen(
        ['scrapy', 'runspider', 'scrapper.py', '-a', 'url=' + url, '-o', outputfile])
    process.wait()

Please note: this implementation does not handle errors.
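
For reference, the answer does not show scrapper.py itself, so the sketch below is only a guess at what that spider might look like; the spider name and the yielded fields are assumptions. The value passed with -a url=... arrives as a keyword argument in the spider's constructor.

# scrapper.py -- hypothetical sketch; the real spider is not shown in the answer.
# 'scrapy runspider scrapper.py -a url=...' passes url to __init__ as a kwarg.
import scrapy

class SingleUrlSpider(scrapy.Spider):
    name = "single_url"

    def __init__(self, url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # each subprocess invocation crawls exactly one URL
        self.start_urls = [url] if url else []

    def parse(self, response):
        # emit a minimal item; '-o outputfile' appends it to the output file
        yield {"url": response.url, "title": response.css("title::text").get()}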

There is a much easier way to make Scrapy follow the order of start_urls: just uncomment and change CONCURRENT_REQUESTS in settings.py to 1.

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1

Can you show us how you are calling your spider? > I have a spider crawling multiple sites; do you mean multiple start URLs?

All the feedback was great, thanks everyone for the help. This one is closest to what I want to do. I have a related question: suppose I want to specify a list of URLs, where the first is the homepage of a site and the second is a list of web pages. How would I do that? @prakharmohansrivastava, put them in the list?

Or you can add custom_settings = {'CONCURRENT_REQUESTS': '1'} right below class DmozSpider(BaseSpider): name = "dmoz". That way you don't need an extra settings.py file.

settings.py is the default file in a Scrapy project structure, not an extra file.
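
To make that last suggestion concrete, here is a minimal sketch using the modern scrapy.Spider base class (the DmozSpider name is taken from the comment above; the URLs are placeholders):

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    # per-spider override, applied before the crawl starts; with a single
    # concurrent request the start URLs are fetched in the order they are listed
    custom_settings = {'CONCURRENT_REQUESTS': 1}

    start_urls = [
        "http://www.example.com/first",
        "http://www.example.com/second",
    ]

    def parse(self, response):
        yield {"url": response.url}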