Scrapy callback for crawling paginated pages

pagination scrapy

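I wrote a spider to crawl a website. I am able to generate all the page URLs (pagination). I need help crawling all of these pages and then printing the response.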

url_string="http://website.com/ct-50658/page-"

class SpiderName(Spider):
    name="website"
    allowed_domains=["website.com"]
    start_urls=["http://website.com/page-2"]

    def printer(self, response):
        hxs=HtmlXPathSelector(response)
        x=hxs.select("//span/a/@title").extract()
        with open('website.csv', 'wb') as csvfile:
            spamwriter = csv.writer(csvfile, delimiter=' ', quotechar='|', quoting=csv.QUOTE_MINIMAL)
            for i in x:
                spamwriter.writerow(i)

    def parse(self,response):
        hxs=HtmlXPathSelector(response)
        #sel=Selector(response)
        pages = hxs.select("//div[@id='srchpagination']/a/@href").extract()
        total_pages = int(pages[-2][-2:])
        j=0
        url_list=[]
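(The rest of the question's code is cut off in the source after "while (j".)

You are recreating the 'website.csv' file for the response to every one_url request. You should probably create it once (in __init__, for example) and keep a reference to the CSV writer in an attribute of your spider, then refer to something like self.csvwriter in def printer.

Also, in the "for one_url in url_list:" loop, you should use yield Request(one_url, callback=self.printer). As written, you only return the last request.

Here is a sample spider with those modifications and some code simplifications: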
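# Assumes a current Scrapy release on Python 3.
import csv

from scrapy import Request, Spider

url_string = "http://website.com/ct-50658/page-"


class SpiderName(Spider):
    name = "website"
    allowed_domains = ["website.com"]
    start_urls = ["http://website.com/page-2"]

    def __init__(self, category=None, *args, **kwargs):
        super(SpiderName, self).__init__(*args, **kwargs)
        # Open the CSV file once and keep the writer on the spider, so that
        # every callback appends to the same file instead of recreating it.
        self.csvfile = open('website.csv', 'w', newline='')
        self.spamwriter = csv.writer(self.csvfile,
                                     delimiter=' ',
                                     quotechar='|',
                                     quoting=csv.QUOTE_MINIMAL)

    def closed(self, reason):
        # Scrapy calls this hook when the spider finishes.
        self.csvfile.close()

    def printer(self, response):
        # response.xpath(...) replaces the older HtmlXPathSelector(response).select(...)
        for title in response.xpath("//span/a/@title").getall():
            # writerow expects a sequence of fields; a bare string would be
            # written one character per column.
            self.spamwriter.writerow([title])

    def parse(self, response):
        pages = response.xpath("//div[@id='srchpagination']/a/@href").getall()
        total_pages = int(pages[-2][-2:])
        # Yield one Request per page. The range is kept as in the original
        # answer (0 .. total_pages - 1); shift it if the site's pages start at 1.
        for j in range(0, total_pages):
            yield Request(url_string + str(j), callback=self.printer)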
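
The point about recreating 'website.csv' for every response can be checked without Scrapy. The following is only a minimal sketch: the temporary file path, the made-up titles, and the helper names write_reopening, write_once and read_rows are assumptions for illustration and are not part of the spider.

```python
import csv
import os
import tempfile

# Hypothetical data: the titles extracted from two separate responses.
path = os.path.join(tempfile.mkdtemp(), "website.csv")
pages = [["title a", "title b"], ["title c"]]


def write_reopening(path, pages):
    # Reopen the file in 'w' mode for every "response", as the question's
    # printer() did. Each open truncates whatever was written before.
    for titles in pages:
        with open(path, "w", newline="") as csvfile:
            writer = csv.writer(csvfile)
            for title in titles:
                writer.writerow([title])


def write_once(path, pages):
    # Open the file once and reuse the same writer for every "response".
    with open(path, "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        for titles in pages:
            for title in titles:
                writer.writerow([title])


def read_rows(path):
    with open(path, newline="") as csvfile:
        return [row[0] for row in csv.reader(csvfile)]


write_reopening(path, pages)
rows_reopening = read_rows(path)  # only the last response's titles remain

write_once(path, pages)
rows_once = read_rows(path)       # titles from every response
```

With the reopening variant only the rows from the last response are left in the file, which is why the answer keeps a single writer on the spider.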

Could you please explain in more detail?
Thanks a lot, Paul. This works for me.
This is exactly the problem I was having.
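
To spell out the remark about only the last request being returned, here is a small sketch in plain Python, with no Scrapy involved. The tuples stand in for Request objects, and the function names and the three-page range are assumptions made for illustration.

```python
def build_requests_with_return(urls):
    """Mimics the question's loop: each iteration overwrites `request`."""
    request = None
    for one_url in urls:
        request = ("GET", one_url)  # stand-in for Request(one_url, callback=...)
    return request  # only the value from the final iteration survives


def build_requests_with_yield(urls):
    """Mimics the fix: every iteration hands one request to the caller."""
    for one_url in urls:
        yield ("GET", one_url)


# Hypothetical page range, chosen only for the demonstration.
urls = ["http://website.com/ct-50658/page-" + str(j) for j in range(1, 4)]

last_only = build_requests_with_return(urls)
all_requests = list(build_requests_with_yield(urls))

print(last_only)
print(len(all_requests))
```

Scrapy iterates over whatever the callback produces, so a generator that yields inside the loop hands it one request per page, while the return version hands it a single request.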