Python crawler program produces duplicates when run twice?


python web-crawler scrapy


I am using the crawler framework "scrapy" in Python, and I use the pipelines.py file to store my items in JSON format in a file. The code for doing this is given below:

import json

class AYpiPipeline(object):
    def __init__(self):
        # Append mode: every run adds to whatever the file already holds.
        self.file = open("a11ypi_dict.json", "a+")

    # This method is called to process an item after it has been scraped.
    def process_item(self, item, spider):
        d = {}
        i = 0
        # Iterate over the scraped fields and build a dictionary of dictionaries.
        try:
            while i < len(item["foruri"]):
                d.setdefault(item["foruri"][i], {}).setdefault(item["rec"][i], {})[item["foruri_id"][i]] = item["thisurl"] + ":" + item["thisid"][i]
                i += 1
        except IndexError:
            print("Index out of range")
        # Write the dictionary to the file.
        json.dump(d, self.file)
        return item

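You can filter out duplicates by using some custom middleware. To actually use it in your spider, though, you will need two more things: some way of assigning IDs to items so that the filter can identify duplicates, and some way of persisting the set of visited IDs between spider runs. The second is easy: you could use something Pythonic like shelve, or one of the many key-value stores that are popular these days. The first part is going to be harder, though, and will depend on the problem you are trying to solve.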
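The two pieces named in that answer (an ID per item and a set of seen IDs that survives between runs) could be sketched roughly as follows. This is only an illustration, not the linked middleware: `item_id`, `PersistentDedupPipeline`, and the `seen_items.db` path are hypothetical names, and hashing the whole item is an assumed ID scheme that may not suit every crawl.

```python
import hashlib
import json
import shelve

try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch can be tried without Scrapy installed.
    class DropItem(Exception):
        pass


def item_id(item):
    """Hypothetical ID scheme: hash of the item's fields in canonical JSON form."""
    canonical = json.dumps(dict(item), sort_keys=True)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


class PersistentDedupPipeline(object):
    """Drops items whose ID was already recorded in an earlier or the current run."""

    def __init__(self, seen_path="seen_items.db"):
        # Assumed location of the shelve file that holds the seen IDs.
        self.seen_path = seen_path

    def open_spider(self, spider):
        self.seen = shelve.open(self.seen_path)

    def close_spider(self, spider):
        self.seen.close()

    def process_item(self, item, spider):
        key = item_id(item)
        if key in self.seen:
            raise DropItem("Duplicate item: %s" % key)
        self.seen[key] = True
        return item
```

For this to have any effect, such a pipeline would need to run before the one that writes the JSON file, which in Scrapy means giving it a lower order number in the `ITEM_PIPELINES` setting.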
I think the solution is to prevent multiple instances of the script from running at the same time. You can use a file lock for that (either inside the script, or externally, with a utility like flock). What is the reason for running multiple crawler instances?
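An in-script file lock of the kind this comment mentions might look like the sketch below. It assumes a POSIX system (the `fcntl` module is not available on Windows), and `acquire_single_instance_lock` and the lock path are hypothetical names. Note that it only guards against overlapping runs; two runs started one after the other would both get the lock.

```python
import fcntl


def acquire_single_instance_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if it is already held."""
    handle = open(lock_path, "w")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting for the lock.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except (IOError, OSError):
        handle.close()
        return None
    return handle
```

A script would call this once at startup, exit if it gets `None`, and keep the returned file object open for as long as the crawl runs; closing it releases the lock.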
import json

class AYpiPipeline(object):
    def __init__(self):
        # "a+" appends on write; seek(0) is needed before reading the file back.
        self.file = open("a11ypi_dict.json", "a+")
        self.file.seek(0)
        try:
            # This parses only if the file holds exactly one JSON document.
            self.temp = json.loads(self.file.read())
        except ValueError:
            # Empty file, or several dumps appended back to back.
            self.temp = None

    # This method is called to process an item after it has been scraped.
    def process_item(self, item, spider):
        d = {}
        i = 0
        # Iterate over the scraped fields and build a dictionary of dictionaries.
        try:
            while i < len(item["foruri"]):
                d.setdefault(item["foruri"][i], {}).setdefault(item["rec"][i], {})[item["foruri_id"][i]] = item["thisurl"] + ":" + item["thisid"][i]
                i += 1
        except IndexError:
            print("Index out of range")
        # Write to the file only if the new data differs from what was already in it.
        if d != self.temp:
            json.dump(d, self.file)
        return item
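One limitation of comparing against the whole file is that several `json.dump` calls in append mode leave back-to-back JSON documents, which `json.loads` cannot parse in one go. A possible alternative, which is a different technique from the code above and is not taken from the original post, is to store one JSON object per line and compare line by line. `load_seen` and `append_if_new` are hypothetical helper names.

```python
import json
import os


def load_seen(path):
    """Return the set of records already stored, one canonical JSON string per line."""
    if not os.path.exists(path):
        return set()
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def append_if_new(path, record, seen):
    """Append record as a single line unless an identical record is already stored."""
    # sort_keys gives equal dictionaries the same text regardless of key order.
    line = json.dumps(record, sort_keys=True)
    if line in seen:
        return False
    with open(path, "a") as handle:
        handle.write(line + "\n")
    seen.add(line)
    return True
```

In a pipeline, `load_seen` would be called once when the spider opens and `append_if_new` once per item, so the file can still be opened in append mode while a second run over the same start URL writes nothing new.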