Why does running this Python script use up all my disk space?
I am running the Python script shown below for reference. It uses pytesseract to convert the text in images extracted from a PDF into a JSON file containing the text as strings along with page numbers, etc. But every time I run it, my disk space runs out after a while, and it is only freed after I restart the computer. For example, my machine currently has 20 GB free, but after the script runs for a while the disk is full, and I don't understand why. I tried releasing local variables with `del`, and I also tried forcing collection with `gc.collect()`, but neither had any effect. What am I doing wrong, and how can I improve this?
```python
import gc
import io
import json
import time
import uuid
from os import listdir

from PIL import Image
import pytesseract
from wand.image import Image as wi


def generate_id(code):
    increment_no = str(uuid.uuid4().int)[5:12]
    _id = code + increment_no
    return _id


def pdf_to_json(pdf_path):
    """Take the path of a PDF and generate a JSON object with the attributes:
    Company (name of company), id (unique id), and Pages, a list holding the
    text of each page in that specific PDF."""
    data = {}
    pdf = wi(filename=pdf_path, resolution=300)
    data['company'] = str(pdf_path.split('/')[-1])
    countrycode = str(pdf_path.split('/')[-2].split('_')[0])
    data['id'] = generate_id(countrycode)
    pdfImg = pdf.convert('jpeg')
    del pdf
    gc.collect()
    imgBlobs = []
    for img in pdfImg.sequence:
        page = wi(image=img)
        imgBlobs.append(page.make_blob('jpeg'))
        del page
        gc.collect()
    del pdfImg
    gc.collect()
    Pages = []
    for imgBlob in imgBlobs:
        im = Image.open(io.BytesIO(imgBlob))
        text = pytesseract.image_to_string(im, lang='eng')
        Pages.append(text)
        del text
        im.close()
        del im
        gc.collect()
    del imgBlobs
    gc.collect()
    data['Pages'] = Pages
    with open('/Users/rishabh/Desktop/CyberBoxer/hawaii_pdf/' + data['id'] + '.json',
              'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)
    del data
    gc.collect()
    del Pages
    gc.collect()


onlyfiles = [f for f in listdir('/Users/rishabh/Desktop/CyberBoxer/iowa_pdf/')]
j = 1
for i in onlyfiles:
    if '.pdf' in i:
        start = time.time()
        pdf_path = '/Users/rishabh/Desktop/CyberBoxer/iowa_pdf/' + i
        pdf_to_json(pdf_path)
        print(j)
        j += 1
        end = time.time()
        print(end - start)
        gc.collect()
```
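One thing worth checking before blaming Python objects: ImageMagick, which Wand wraps, spills its pixel cache to scratch files in the system temp directory when large PDFs are rasterized at high resolution, and those files are only reclaimed when the process releases them, which matches the "freed only after reboot" symptom. A quick diagnostic sketch (the `magick-*` filename pattern is an assumption about a typical ImageMagick build; adjust it for yours):

```python
import glob
import os
import tempfile


def magick_temp_usage(pattern="magick-*"):
    """Sum the sizes of ImageMagick scratch files in the system temp dir.

    The 'magick-*' glob is an assumption about the scratch-file naming;
    check your temp directory to confirm the pattern for your build.
    """
    tmp = tempfile.gettempdir()
    files = glob.glob(os.path.join(tmp, pattern))
    return sum(os.path.getsize(f) for f in files if os.path.isfile(f))


print(magick_temp_usage())  # bytes currently held by scratch files
```

Running this while the script is processing PDFs shows whether the missing space is sitting in temp files rather than in Python's heap.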
I figured out why this was happening: with Python's Wand `Image` module, you have to release the objects through Wand's own destroy mechanism (its `destroy()` method or a `with` block), rather than relying on `del` or `gc.collect()`. Here is the updated function:
"""This function takes in the path of pdf to generate a json object with the following attributes"""
"""Company (Name of company), id (Unique Id), Page_*No. (Example Page_1, Page_2 etc.) with each page containing text in that speicifc pdf page"""
data = {}
#pdf=wi(filename=pdf_path,resolution=300)
data['company'] = str(pdf_path.split('/')[-1:][0])
countrycode = str(pdf_path.split('/')[-2:-1][0].split('_')[0:1][0])
data['id'] = generate_id(countrycode)
#pdfImg=pdf.convert('jpeg')
#del pdf
#gc.collect()
#imgBlobs=[]
#for img in pdfImg.sequence:
# page=wi(image=img)
# gc.collect()
# imgBlobs.append(page.make_blob('jpeg'))
# del page
# gc.collect()
req_image = []
with WI(filename=pdf_path, resolution=150) as image_jpeg:
image_jpeg.compression_quality = 99
image_jpeg = image_jpeg.convert('jpeg')
for img in image_jpeg.sequence:
with WI(image=img) as img_page:
req_image.append(img_page.make_blob('jpeg'))
image_jpeg.destroy()
i=1
Pages = []
for imgBlob in req_image:
im=Image.open(io.BytesIO(imgBlob))
text=pytesseract.image_to_string(im,lang='eng')
Pages.append(text)
im.close()
del im
data['Pages'] = Pages
with open('/Users/rishabh/Desktop/CyberBoxer/iowa_pdf/'+data['id']+'.json', 'w', encoding='utf-8') as f:
json.dump(data, f, ensure_ascii=False, indent=4)```
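The fix above can be stated generically: resources with their own `destroy()` method need deterministic cleanup that runs even if OCR raises mid-page, which is exactly what a context manager provides. A minimal sketch of that pattern (the `destroying` helper and `FakeImage` class are illustrative stand-ins, not part of Wand's API):

```python
from contextlib import contextmanager


@contextmanager
def destroying(resource):
    # Yield the resource, guaranteeing its destroy() runs even on error.
    try:
        yield resource
    finally:
        resource.destroy()


class FakeImage:
    """Illustrative stand-in for a wand Image; only destroy() is modeled."""
    def __init__(self):
        self.destroyed = False

    def destroy(self):
        self.destroyed = True


img = FakeImage()
try:
    with destroying(img):
        raise RuntimeError("OCR failed mid-page")
except RuntimeError:
    pass
print(img.destroyed)  # True: cleanup ran despite the exception
```

This is why the `with wi(...)` blocks in the updated function release disk-backed image data promptly where `del` and `gc.collect()` did not.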