Scrapy reverses the order of URL parameters in Python

python asp.net python-2.7 scrapy

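I'm using Scrapy to crawl through a set of office rosters.

The office roster addresses have the form agent-search.aspx?p=agentResults.asp&OfficeID=XXXX, but Scrapy requests a URL in which the two parts after .aspx are swapped, and that URL is a dead page.

I even loaded every address explicitly as a start URL, but the swap still happens.

I'm using the latest Scrapy on Python 2.7, Windows 8.1.

Code example: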

class JLSSpider(CrawlSpider):

    name = 'JLS'
    allowed_domains = ['johnlscott.com']
    # start_urls = ['http://www.johnlscott.com/agent-search.aspx']

    rules = (
        Rule(callback="parse_start_url", follow=True),)

    def start_requests(self):
        with open('hrefnums.csv', 'rbU') as ifile:
            read = csv.reader(ifile)
            for row in read:
                for col in row:
                    # I have a csv of the office IDs: (Just letting it crawl through them creates the same issue)
                    yield self.make_requests_from_url("http://www.johnlscott.com/agent-search.aspx?p=agentResults.asp&OfficeID=%s" % col)


    def parse_start_url(self, response):
        items = []
        sel = Selector(response)
        sections = sel.xpath("//tr/td/table[@id='tbAgents']/tr")
        for section in sections:
            item = JLSItem()
            item['name'] = section.xpath("td[2]/text()")[0].extract().replace(u'\xa0', ' ').strip()         
            items.append(item)
        return(items)
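To rule out the URL-building step as the source of the swap, the CSV loop above can be pulled out into a plain function and checked without running Scrapy. This is only a sketch: the `roster_urls` helper and the `ROSTER_URL` constant are hypothetical names, the in-memory rows stand in for `hrefnums.csv`, and the ID `1234` is made up for illustration (8627 and 7859 are the IDs that appear elsewhere in this thread).

```python
import csv

# URL template copied from the spider above; the parameter order is fixed here.
ROSTER_URL = "http://www.johnlscott.com/agent-search.aspx?p=agentResults.asp&OfficeID=%s"


def roster_urls(lines):
    """Yield one roster URL per non-empty CSV cell (hypothetical helper)."""
    for row in csv.reader(lines):
        for cell in row:
            office_id = cell.strip()
            if office_id:
                yield ROSTER_URL % office_id


# In-memory rows standing in for the contents of hrefnums.csv.
urls = list(roster_urls(["8627,7859", "", " 1234 "]))
for url in urls:
    print(url)
```

If every printed URL has p=agentResults.asp before OfficeID, the string formatting is not what reorders the parameters, and the swap must happen later in the crawl.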

Crawling like this works without any problem:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request


class JLSSpider(CrawlSpider):
    name = 'JLS'
    allowed_domains = ['johnlscott.com']

    def start_requests(self):
        yield Request("http://www.johnlscott.com/agent-search.aspx?p=agentResults.asp&OfficeID=8627", callback=self.parse_item)

    def parse_item(self, response):
        print response.body

You can pass the option canonicalize=False to the LinkExtractor in your code to prevent the parts of the URL from being swapped:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class JLSSpider(CrawlSpider):

    name = 'JLS'
    allowed_domains = ['johnlscott.com']
    start_urls = ['http://www.johnlscott.com/agent-search.aspx']

    rules = (
        # http://www.johnlscott.com/agent-search.aspx?p=agentResults.asp&OfficeID=7859
        Rule(
            LinkExtractor(
                allow=('p=agentResults.asp&OfficeID=', 
                ), 
                canonicalize=False
            ),
            callback='parse_roster',
            follow=True),
    )

    def parse_roster(self, response):
        pass
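Why canonicalization swaps the two parts: as far as I know, URL canonicalization sorts the query arguments, and "OfficeID" sorts before "p". The snippet below is a rough standard-library imitation of that sorting step only, not Scrapy's actual implementation; `sort_query` is a hypothetical name, and the try/except import is there so it runs on both Python 2 and Python 3.

```python
try:
    # Python 3
    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode
except ImportError:
    # Python 2
    from urlparse import urlsplit, urlunsplit, parse_qsl
    from urllib import urlencode


def sort_query(url):
    """Return url with its query arguments sorted (hypothetical helper)."""
    parts = urlsplit(url)
    # Sort the (key, value) pairs; uppercase "O" orders before lowercase "p".
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(pairs), parts.fragment)
    )


original = "http://www.johnlscott.com/agent-search.aspx?p=agentResults.asp&OfficeID=8627"
swapped = sort_query(original)
print(swapped)
# http://www.johnlscott.com/agent-search.aspx?OfficeID=8627&p=agentResults.asp
```

The site evidently treats the two orderings differently, which is why turning canonicalization off on the link extractor keeps the links in the order the page wrote them.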

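Please improve your question, improve the spider, and post correct code that can be tested.

I'm confused. I stated the problem clearly in the body and the title: when Scrapy crawls any of these URLs, "p=agentResults.asp" and "OfficeID=XXXX" get switched. I included every part of the spider code that I thought was relevant. I'm not a professional programmer, but I have scraped dozens of large sites by adapting the tutorials. Can you be more specific about what I need to do, or what I'm doing wrong here?

I've added as much as I can. This is a spider built from the base in the Scrapy tutorial. To make it crawl, comment out the start_requests function and uncomment the start_urls list. You also have to create JLSItem, but it is easy to see from the log that it swaps the parts of the URL.

With exactly the same code, I get lots of lines like

2015-04-22 12:59:09-0700 [JLS] DEBUG: Crawled (200) (referer: http://www.johnlscott.com/agent-search.aspx)

so you can add your parsing code to parse_roster to get the data.

Out of frustration, I ended up copying and pasting my code into yours... and it worked. The only thing I can think of is that I had more imports, such as the LinkExtractor import.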