Python: Remove duplicate items from a Scrapy pipeline

python scrapy web-crawler

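My Scrapy crawler collects data from the PTT website and uses gspread to write the crawled data into a Google spreadsheet. My PTT spider parses the latest 40 posts on PTT every day, and now I want to drop duplicates among those 40 posts: for example, if a post's title or link is the same as one from yesterday, that post should not be written to the Google spreadsheet again.
I know I should use DropItem in Scrapy, but I don't really know how to fix my code (I'm new to Python), so I'd like to ask for help. Thanks.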

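This is my PTT spider code

This is my pipeline

Thanks to sharmiko, I rewrote it, but it doesn't seem to work. What should I fix?

This is the code that exports to Google Sheets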

You should modify the process_item function to check for duplicate elements; if one is found, you can simply drop it:

from scrapy.exceptions import DropItem
...
def process_item(self, item, spider):
    if [your duplicate check logic goes here]:
        raise DropItem('Duplicate element found')
    else:
        self.exporter.export_item(item)
        return item

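Dropped items are no longer passed on to further pipeline components. You can read more about item pipelines in the Scrapy documentation.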

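I also wrote exporter.py, which describes how to connect to the Google Sheets API and write the data to a Google Sheet. Should I fix the code in exporter.py as well? I rewrote the process_item function, but it doesn't seem to work.

Can you share the Google Sheet code and paste the full error message?

Hi! I added the code to the post, please take a look. Thank you.

You commented out #class DuplicatesTitlePipeline(object): is that line also in your code, or is it just a typo on Stack Overflow? Also, don't forget to update the ITEM_PIPELINES list in settings.py every time you add a new pipeline.

Of course. I added class DuplicatesTitlePipeline(object): to try running it and also added this pipeline to settings.py, but it didn't seem to work, so I commented it out with #.

If it doesn't contain any sensitive information, can you paste the error traceback? Did you run the code and add duplicate elements? Or is this a previously collected list?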
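For the ITEM_PIPELINES point raised in the comments, a settings.py entry might look like the sketch below. The module path myFirstScrapyProject.pipelines is an assumption (only myFirstScrapyProject.exporters appears in the post), as is the idea that DuplicatesTitlePipeline is enabled as its own class. The numbers are arbitrary priorities; Scrapy runs pipelines with lower values first.

```python
# settings.py (sketch; the module path below is assumed, not taken from the post)
# Lower numbers run first, so the duplicate filter sees each item
# before the pipeline that writes to Google Sheets.
ITEM_PIPELINES = {
    'myFirstScrapyProject.pipelines.DuplicatesTitlePipeline': 100,
    'myFirstScrapyProject.pipelines.MyfirstscrapyprojectPipeline': 300,
}
```

With this ordering, an item dropped by the filter never reaches the exporting pipeline.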

Original pipeline:

from myFirstScrapyProject.exporters import GoogleSheetItemExporter
from scrapy.exceptions import DropItem

class MyfirstscrapyprojectPipeline(object):
    def open_spider(self, spider):
        self.exporter = GoogleSheetItemExporter()
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

Rewritten pipeline (the one that does not work):

from myFirstScrapyProject.exporters import GoogleSheetItemExporter
from scrapy.exceptions import DropItem

class MyfirstscrapyprojectPipeline(object):

    def open_spider(self, spider):
        self.exporter = GoogleSheetItemExporter()
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()

#    def process_item(self, item, spider):
#        self.exporter.export_item(item)
#        return item

#class DuplicatesTitlePipeline(object):
    def __init__(self):
        self.article = set()
    def process_item(self, item, spider):
        href = item['href'] 
        if href in self.article:
            raise DropItem('duplicates href found %s', item)
        self.exporter.export_item(item)
        return(item)
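
Two things stand out in the rewritten pipeline above: href is never added to self.article, so the membership test can never be true, and DropItem receives the format string and the item as separate arguments instead of one formatted message. A minimal sketch of a corrected filter follows, assuming the check lives in its own pipeline class (the name DuplicatesPipeline is hypothetical) that is registered in ITEM_PIPELINES ahead of the exporting pipeline. The try/except fallback is there only so the snippet can be tried without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch can be run without Scrapy installed.
    class DropItem(Exception):
        pass


class DuplicatesPipeline:
    """Drop items whose href or title was already seen in this crawl."""

    def __init__(self):
        self.seen_hrefs = set()
        self.seen_titles = set()

    def process_item(self, item, spider):
        href, title = item['href'], item['title']
        if href in self.seen_hrefs or title in self.seen_titles:
            raise DropItem('Duplicate article found: %s' % href)
        # Remember this item so later copies are recognised.
        self.seen_hrefs.add(href)
        self.seen_titles.add(title)
        return item
```

This only remembers items seen during a single run; it does not know what was written to the sheet on earlier days.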

Google Sheets exporter:

import gspread
from oauth2client.service_account import ServiceAccountCredentials
from scrapy.exporters import BaseItemExporter

class GoogleSheetItemExporter(BaseItemExporter):
    def __init__(self):
        scope = ['https://spreadsheets.google.com/feeds','https://www.googleapis.com/auth/drive']
        credentials = ServiceAccountCredentials.from_json_keyfile_name('pythonupload.json', scope)
        gc = gspread.authorize(credentials)
        self.spreadsheet = gc.open('Community')
        self.worksheet = self.spreadsheet.get_worksheet(1)

    def export_item(self, item):
        self.worksheet.append_row([item['push'], item['title'], 
        item['href'],item['date'],item['author']])
