Python Scrapy spider does not collect data from the first page, and the first item on each page may also be wrong

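This spider pulls the titles off the pages of the funny subreddit. I think the problem may be with the allowed URL pattern, because the /r/funny/ main page does not match it. If I add '/r/funny/' to the allow list, it goes crazy and crawls far too much. I am also not sure what to make of the first item on each page being wrong (sometimes it may be the last item from the previous page).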

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http.response import Response

class Lesson1Spider(CrawlSpider):
    name = 'lesson1'
    allowed_domains = ['www.reddit.com']
    start_urls = ['http://www.reddit.com/r/funny/']

    rules = [
        Rule(LinkExtractor(
            allow=[r'/r/funny/\?count=\d*&after=\w*', ]),
            callback='parse_item',
            follow=True),
    ]

    def parse_item(self, response):
        print(response.xpath('//p[@class="title"]/a/text()').extract())

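For the first part of your question, your rule seems to conflict with your start URL. The first page, http://www.reddit.com/r/funny/, does not contain /r/funny/\?count=\d*&after=\w*, so it is probably being skipped. You may get better results by using the prev/next buttons at the bottom of each page to pick out the next page.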
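To make the mismatch concrete, here is a minimal sketch using only the standard `re` module. It assumes the allow pattern is applied as a search anywhere in the URL; the two longer URLs are hypothetical examples, not real reddit links.

```python
import re

# Pattern copied from the question's rule, written as a raw string.
STRICT_PATTERN = r'/r/funny/\?count=\d*&after=\w*'
# The looser pattern the asker tried adding to the allow list.
LOOSE_PATTERN = r'/r/funny/'


def matches(pattern: str, url: str) -> bool:
    """Return True if the pattern is found anywhere in the URL."""
    return re.search(pattern, url) is not None


start_url = 'http://www.reddit.com/r/funny/'
# Hypothetical URLs, made up for illustration only.
next_url = 'http://www.reddit.com/r/funny/?count=25&after=t3_abc123'
comments_url = 'http://www.reddit.com/r/funny/comments/abc123/some_post/'

for url in (start_url, next_url, comments_url):
    print(url, matches(STRICT_PATTERN, url), matches(LOOSE_PATTERN, url))
```

The strict pattern matches only the paginated URL, so the start page never reaches the rule's callback. The loose pattern matches all three, including the comment thread, which is consistent with the crawl running away once '/r/funny/' is allowed.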

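As for the second part, it may be that the ranking on reddit changed between the time you scraped and the time you checked, or that there is something in the page source (do other elements share that class name?) that you have not accounted for.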
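If the same title really does show up on two consecutive pages, one possible guard is to remember what has already been seen. This is only a sketch under that assumption; `drop_repeats` is a hypothetical helper, not part of Scrapy, and it does not explain why the repeat happens.

```python
def drop_repeats(pages):
    """Yield each title once, in first-seen order, across a sequence of pages.

    `pages` is any iterable of lists of title strings, one list per page.
    """
    seen = set()
    for titles in pages:
        for title in titles:
            if title not in seen:
                seen.add(title)
                yield title


# Made-up data: the last title of page 1 reappears at the top of page 2.
page1 = ['cat picture', 'dog picture', 'bird picture']
page2 = ['bird picture', 'fish picture']
print(list(drop_repeats([page1, page2])))
```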

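The best solution I could find was to run two separate spiders in the same file: one for the first page and another for page 2 onward. It is probably not the most efficient code, but as a beginner I am happy that all the pieces work. Any suggestions for cleaning it up are welcome.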

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http.response import Response
import csv
import sys

print("Which subreddit do you want to scrape?")
subreddit = sys.stdin.readline()
subreddit = subreddit.strip()

class Lesson1Spider(scrapy.Spider):
    name = "lesson1"
    allowed_domains = ['www.reddit.com']
    start_urls = ['http://www.reddit.com/r/%s/' % subreddit]

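    def parse(self, response):
        # XPath for the link text inside each <p class="title"> element
        SET_SELECTOR = '//p[@class="title"]/a/text()'
        # newline='' lets the csv module control line endings itself
        with open('redditscan8.csv', 'a', encoding='utf-8', newline='') as csvfile:
            kelly = csv.writer(csvfile, dialect='excel')
            # All titles from this page go into a single CSV row
            kelly.writerow(response.xpath(SET_SELECTOR).extract())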


class Lesson2Spider(CrawlSpider):
    name = 'lesson1x'
    allowed_domains = ['www.reddit.com']
    start_urls = ['http://www.reddit.com/r/%s/' % subreddit]

    rules = [
        Rule(LinkExtractor(
            # Raw string so the regex escapes are not treated as string escapes
            allow=[r'/r/%s/\?count=\d*&after=\w*' % subreddit, ]),
            callback='parse_item',
            follow=True),
    ]

    def parse_item(self, response):
        with open('redditscan8.csv', 'a', encoding='utf-8', newline='') as csvfile:
            joe = csv.writer(csvfile, dialect='excel')
            joe.writerow(response.xpath('//p[@class="title"]/a/text()').extract())

process = CrawlerProcess()
process.crawl(Lesson1Spider)
process.crawl(Lesson2Spider)
process.start()

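Yes, I agree. The rule is based on the "next" button at the bottom of the page. I assume that is why the start page is not being read (because it does not match the rule). I think I need to add something to the rule along the lines of "always accept the first page, then follow the existing rule."

You can also yield a new request for the next page; see the "Following links" section. (If this answer solved your problem, please accept it.)

Thanks for the input. I could not get my head around the concept of yield, but I did come up with another solution, which I posted as an answer to my original question.

Have a look at how to generate the next page. The example they give:

    next_page = response.css('li.next a::attr(href)').extract_first()
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)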
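Since `yield` was the sticking point, here is a plain-Python sketch of the same "handle this page, then move to the next link" idea with no Scrapy involved. The `PAGES` dictionary and its paths are made up for illustration; in a real spider the framework would fetch each page and call the callback again.

```python
# Hypothetical in-memory "site": each page has some titles and a next link.
PAGES = {
    '/page1': {'titles': ['A', 'B'], 'next': '/page2'},
    '/page2': {'titles': ['C'], 'next': '/page3'},
    '/page3': {'titles': ['D', 'E'], 'next': None},
}


def crawl(url):
    """Yield every title from `url`, then from each page that follows it."""
    while url is not None:
        page = PAGES[url]
        # yield hands items to the caller one at a time and then resumes here
        yield from page['titles']
        # A next link of None on the last page ends the loop
        url = page['next']


print(list(crawl('/page1')))
```

Because the first page goes through the same function as every later page, nothing special is needed to include it, which is the main appeal of following the next link over matching URLs with a rule.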