Scrapy: repeated data


scrapy web-crawler


This is my spider code. My problem is that when I run the crawl with JSON output, the result contains a lot of repeated data from the site.

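Your crawler looks fine. Can you explain in more detail what you mean by "repeated data"? Do you mean the same review data showing up more than once? And are you scraping book quotes from TripAdvisor?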

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://www.tripadvisor.com/Restaurant_Review-g298006-d740275-Reviews-Deniz_Restaurant-Izmir_Izmir_Province_Turkish_Aegean_Coast.html',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': response.xpath("//div[contains(@class, 'member_info')]//div/text()").extract(),
                'rating': response.xpath("//span[contains(@class,'ui_bubble_rating')]/@alt").extract(),
                'comment_tag': response.xpath("//span[contains(@class, 'noQuotes')]/text()").extract(),
                'comment': response.xpath('//div[@class="entry"]/p/text()').extract()
            }

        next_page = response.xpath("//div[contains(@class, 'unified')]/a[contains(@class, 'next')]/@href").extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)