Why does running this Python script use up all my disk space?
I am running the Python script shown below for reference. It uses pytesseract to convert the text in images extracted from a PDF into a JSON file containing the text as strings along with page numbers, etc. But every time I run it, my disk space runs out after a while, and it is only freed after I restart the computer. For example, my machine currently has 20 GB free, but after the script runs for a while the disk is full, and I don't understand why. I tried releasing local variables with `del`, and I also tried forcing collection with `gc.collect()`, but neither had any effect. What am I doing wrong, and how can I improve this?
```python
import gc
import io
import json
import time
import uuid
from os import listdir

from PIL import Image
import pytesseract
from wand.image import Image as wi


def generate_id(code):
    increment_no = str(uuid.uuid4().int)[5:12]
    _id = code + increment_no
    return _id


def pdf_to_json(pdf_path):
    """Take the path of a PDF and generate a JSON object with the attributes:
    Company (name of company), id (unique id), and Pages, a list holding the
    text of each page in that specific PDF."""
    data = {}
    pdf = wi(filename=pdf_path, resolution=300)
    data['company'] = str(pdf_path.split('/')[-1])
    countrycode = str(pdf_path.split('/')[-2].split('_')[0])
    data['id'] = generate_id(countrycode)
    pdfImg = pdf.convert('jpeg')
    del pdf
    gc.collect()
    imgBlobs = []
    for img in pdfImg.sequence:
        page = wi(image=img)
        imgBlobs.append(page.make_blob('jpeg'))
        del page
        gc.collect()
    del pdfImg
    gc.collect()
    Pages = []
    for imgBlob in imgBlobs:
        im = Image.open(io.BytesIO(imgBlob))
        text = pytesseract.image_to_string(im, lang='eng')
        Pages.append(text)
        del text
        im.close()
        del im
        gc.collect()
    del imgBlobs
    gc.collect()
    data['Pages'] = Pages
    with open('/Users/rishabh/Desktop/CyberBoxer/hawaii_pdf/' + data['id'] + '.json',
              'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)
    del data
    gc.collect()
    del Pages
    gc.collect()


onlyfiles = [f for f in listdir('/Users/rishabh/Desktop/CyberBoxer/iowa_pdf/')]
j = 1
for i in onlyfiles:
    if '.pdf' in i:
        start = time.time()
        pdf_path = '/Users/rishabh/Desktop/CyberBoxer/iowa_pdf/' + i
        pdf_to_json(pdf_path)
        print(j)
        j += 1
        end = time.time()
        print(end - start)
        gc.collect()
```
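One thing worth checking before blaming Python objects: ImageMagick, which Wand wraps, spills its pixel cache to scratch files in the system temp directory when large PDFs are rasterized at high resolution, and those files are only reclaimed when the process releases them, which matches the "freed only after reboot" symptom. A quick diagnostic sketch (the `magick-*` filename pattern is an assumption about a typical ImageMagick build; adjust it for yours):

```python
import glob
import os
import tempfile


def magick_temp_usage(pattern="magick-*"):
    """Sum the sizes of ImageMagick scratch files in the system temp dir.

    The 'magick-*' glob is an assumption about the scratch-file naming;
    check your temp directory to confirm the pattern for your build.
    """
    tmp = tempfile.gettempdir()
    files = glob.glob(os.path.join(tmp, pattern))
    return sum(os.path.getsize(f) for f in files if os.path.isfile(f))


print(magick_temp_usage())  # bytes currently held by scratch files
```

Running this while the script is processing PDFs shows whether the missing space is sitting in temp files rather than in Python's heap.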
I figured out why this was happening: with Python's Wand `Image` module, you have to release the objects through Wand's own destroy mechanism (its `destroy()` method or a `with` block), rather than relying on `del` or `gc.collect()`. Here is the updated function:
"""This function takes in the path of pdf to generate a json object with the following attributes"""
"""Company (Name of company), id (Unique Id), Page_*No. (Example Page_1, Page_2 etc.) with each page containing text in that speicifc pdf page"""
data = {}
#pdf=wi(filename=pdf_path,resolution=300)
data['company'] = str(pdf_path.split('/')[-1:][0])
countrycode = str(pdf_path.split('/')[-2:-1][0].split('_')[0:1][0])
data['id'] = generate_id(countrycode)
#pdfImg=pdf.convert('jpeg')
#del pdf
#gc.collect()
#imgBlobs=[]
#for img in pdfImg.sequence:
# page=wi(image=img)
# gc.collect()
# imgBlobs.append(page.make_blob('jpeg'))
# del page
# gc.collect()
req_image = []
with WI(filename=pdf_path, resolution=150) as image_jpeg:
image_jpeg.compression_quality = 99
image_jpeg = image_jpeg.convert('jpeg')
for img in image_jpeg.sequence:
with WI(image=img) as img_page:
req_image.append(img_page.make_blob('jpeg'))
image_jpeg.destroy()
i=1
Pages = []
for imgBlob in req_image:
im=Image.open(io.BytesIO(imgBlob))
text=pytesseract.image_to_string(im,lang='eng')
Pages.append(text)
im.close()
del im
data['Pages'] = Pages
with open('/Users/rishabh/Desktop/CyberBoxer/iowa_pdf/'+data['id']+'.json', 'w', encoding='utf-8') as f:
json.dump(data, f, ensure_ascii=False, indent=4)```
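The fix above can be stated generically: resources with their own `destroy()` method need deterministic cleanup that runs even if OCR raises mid-page, which is exactly what a context manager provides. A minimal sketch of that pattern (the `destroying` helper and `FakeImage` class are illustrative stand-ins, not part of Wand's API):

```python
from contextlib import contextmanager


@contextmanager
def destroying(resource):
    # Yield the resource, guaranteeing its destroy() runs even on error.
    try:
        yield resource
    finally:
        resource.destroy()


class FakeImage:
    """Illustrative stand-in for a wand Image; only destroy() is modeled."""
    def __init__(self):
        self.destroyed = False

    def destroy(self):
        self.destroyed = True


img = FakeImage()
try:
    with destroying(img):
        raise RuntimeError("OCR failed mid-page")
except RuntimeError:
    pass
print(img.destroyed)  # True: cleanup ran despite the exception
```

This is why the `with wi(...)` blocks in the updated function release disk-backed image data promptly where `del` and `gc.collect()` did not.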