Memory leak in a Python script using Spark


python apache-spark


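I have just started using Spark, for an OCR task. I have a folder of PDF files containing scanned text documents that I want to convert to plain text. I first create a parallelized dataset of all the PDFs in the folder and run a map operation to create the images, using Wand's Image for this. Finally, with foreach, I do the OCR with pytesseract, which is a wrapper for Tesseract.

The problem with this approach is that memory usage keeps growing with each new document, and eventually I get an error saying the OS cannot allocate memory. My feeling is that the complete Img objects are being kept in memory, when all I need is the list of temporary file locations. It works if I run it with a few PDF files, but with more than 5 files the system crashes.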

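import os
import sys

import pytesseract
from PIL import Image
from pyspark import SparkContext
from wand.image import Image as WandImage

# Assumption: `path` is not defined in the original post. It is taken here to be
# the input folder passed on the command line, ending with a slash.
path = sys.argv[1]


def toImage(f):
    documentName = f[:-4]

    def imageList(img):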
        #get list of generated images
        imagePrefix = "{}tmp/{}/{}".format(path,documentName,documentName)

        if len(img.sequence) > 1:   
            images = [ ("{}-{}.jpg".format(imagePrefix, x.index), documentName) for x in img.sequence]
        else:
            images = [("{}.jpg".format(imagePrefix), documentName)]
        return images

    #store images for each file in tmp directory
    with WandImage(filename=path + f, resolution=300) as img:
        #create tmp directory
        if not os.path.exists(path + "tmp/" +  documentName):
            os.makedirs(path + "tmp/" +  documentName)

        #save images in tmp directory
        img.format = 'jpeg'
        img.save(filename=path + "tmp/" +  documentName + '/' + documentName + '.jpg')  
        imageL =  imageList(img)
        return imageL


def doOcr(imageList):
    print(imageList[0][1])
    content = "\n\n***NEWPAGE***\n\n".join([pytesseract.image_to_string(Image.open(fullPath), lang='nld') for fullPath, documentName in imageList])
    with open(path + "/txt/" + imageList[0][1] + ".txt", "w") as text_file:
        text_file.write(content)

sc = SparkContext(appName="OCR")
pdfFiles = sc.parallelize([f for f in os.listdir(sys.argv[1]) if f.endswith(".pdf")])
# foreach returns None, so there is nothing to assign
pdfFiles.map(toImage).foreach(doOcr)
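
The file names returned by `imageList` depend only on the document name and the number of pages. Below is a minimal sketch of building the same list from a page count alone, without iterating over `img.sequence`. The name `page_paths` and its parameters are hypothetical, and the `name-0.jpg` pattern is copied from the code above rather than checked against ImageMagick. Whether this changes memory use has not been tested here.

```python
def page_paths(prefix, document_name, page_count):
    """Return (image_path, document_name) tuples for a converted PDF.

    Mirrors the naming used in toImage: a single-page document is saved as
    "<prefix>.jpg", a multi-page one as "<prefix>-0.jpg", "<prefix>-1.jpg", ...
    """
    if page_count > 1:
        return [("{}-{}.jpg".format(prefix, i), document_name)
                for i in range(page_count)]
    return [("{}.jpg".format(prefix), document_name)]


# Example with made-up values; inside toImage the count could come from
# len(img.sequence).
print(page_paths("/data/tmp/report/report", "report", 3))
```

Only plain strings leave the function, so nothing returned to Spark refers back to the image object.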

I am using Ubuntu with 8 GB of memory, Java 7 and Python 3.5.

I found a solution. The problem seems to be in the part where I create the image list. Using:

def imageList(imgObject):
    # List the JPEG files actually written to the tmp directory instead of
    # iterating over img.sequence; imgObject is no longer used.
    # Requires: from natsort import natsorted
    fullPath = "{}tmp/{}/".format(path, documentName)
    images = [(fullPath + f, documentName) for f in os.listdir(fullPath) if f.endswith(".jpg")]

    # Natural sort keeps page 10 after page 2
    return natsorted(images, key=lambda y: y[0])

This works fine, but I don't know why. Everything was closed, yet it still stayed in memory.
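
A note on the `natsorted` call in the fix: `os.listdir` returns entries in arbitrary order, and plain string sorting would put page 10 before page 2. The sketch below shows the same ordering idea using only the standard library. `natural_key` is an illustrative helper, not part of natsort, and the file names are made up.

```python
import re


def natural_key(path):
    """Split a string into text and integer chunks so 'doc-10' sorts after 'doc-2'."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", path)]


pages = ["doc-10.jpg", "doc-2.jpg", "doc-0.jpg", "doc-1.jpg"]
print(sorted(pages))                   # plain string order
print(sorted(pages, key=natural_key))  # numeric page order
```

Without the natural ordering, the pages joined by `doOcr` could end up out of sequence in the output text file.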

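What makes you think there is a memory leak? If you read files in parallel, you will see multiple files loaded at the same time. Also, processing large objects is generally not the best use case for Spark, especially when resources are constrained. Finally, how do you set spark.python.worker.memory?

I have set the memory in the conf/spark-defaults.conf file. I think there is a memory leak because I can see files finishing the OCR part while the memory in use does not go down.

What versions of ImageMagick & Wand are you using? Several memory leaks have been fixed in recent years.

I have been using the latest versions of both. I have finally found a solution, and I will update my question with this solution/fix.

@Chrisp If you have answered your own question, please don't update the question. Post the solution as an answer instead; that will help future users.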
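
One way to check whether memory really grows per document, rather than several files simply being loaded at once, is to log the process's peak resident set size after each item. The sketch below uses the standard `resource` module (Unix only). The function names and the stand-in workload are hypothetical. Because PySpark runs Python workers as separate processes, a measurement like this would have to be taken inside the function passed to `map` or `foreach`, not in the driver.

```python
import resource


def peak_rss_kb():
    """Peak resident set size of this process (kilobytes on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def run_with_memory_log(items, work):
    """Call work(item) for each item and record the peak RSS after each call."""
    log = []
    for item in items:
        work(item)
        log.append((item, peak_rss_kb()))
    return log


# Stand-in workload (hypothetical): allocates a 10 MB buffer per "document"
for name, peak in run_with_memory_log(["a.pdf", "b.pdf"],
                                      lambda f: bytearray(10 * 1024 * 1024)):
    print(name, peak)
```

`ru_maxrss` is a high-water mark, so it never decreases. A peak that keeps climbing with every document points at memory that is not being released, while a peak that levels off after the first few files does not.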