Extract images from PDF without resampling, in Python?
Tags: python, image, pdf, extract, pypdf
How can I extract all images from a PDF document at their native resolution and format? (That is, extract TIFF as TIFF, JPEG as JPEG, and so on, without resampling.) Layout is unimportant; I don't care where the source image is located on the page.
I'm using Python 2.7, but can use 3.x if required.
Often in a PDF, the image is simply stored as-is. For example, a PDF with a JPG inserted will have a range of bytes somewhere in the middle that, when extracted, is a valid JPG file. You can use this to extract byte ranges from the PDF very simply. I wrote about this some time ago, with sample code.
Libpoppler comes with a tool called "pdfimages" that does exactly this. (On Ubuntu systems it is in the poppler-utils package.) Windows binaries are available as well.
I installed the conversion tool on my server (the code below invokes convert, the ImageMagick command) and then run command-line calls through Popen:
#!/usr/bin/python
import sys
import os
import subprocess
import settings  # project settings module that defines MEDIA_ROOT

IMAGE_PATH = os.path.join(settings.MEDIA_ROOT, 'pdf_input')

def extract_images(pdf):
    output = 'temp.png'
    # Pass the arguments as a list so that paths containing spaces stay intact
    cmd = ['convert', os.path.join(IMAGE_PATH, pdf), os.path.join(IMAGE_PATH, output)]
    subprocess.Popen(cmd, stderr=subprocess.STDOUT, stdout=subprocess.PIPE)
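This creates one image per page and stores them as temp-0.png, temp-1.png, and so on. It only counts as "extraction" if the PDF contains nothing but images and no text, because each page is rendered as a whole.
In Python, with the PyPDF2 and Pillow libraries, it is simple: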
import PyPDF2
from PIL import Image

if __name__ == '__main__':
    input1 = PyPDF2.PdfFileReader(open("input.pdf", "rb"))
    page0 = input1.getPage(0)
    xObject = page0['/Resources']['/XObject'].getObject()

    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"

            if xObject[obj]['/Filter'] == '/FlateDecode':
                img = Image.frombytes(mode, size, data)
                img.save(obj[1:] + ".png")
            elif xObject[obj]['/Filter'] == '/DCTDecode':
                img = open(obj[1:] + ".jpg", "wb")
                img.write(data)
                img.close()
            elif xObject[obj]['/Filter'] == '/JPXDecode':
                img = open(obj[1:] + ".jp2", "wb")
                img.write(data)
                img.close()
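The snippet above falls back to mode "P" for anything that is not /DeviceRGB, which will not suit grayscale or CMYK data. As a rough sketch (the helper names pil_mode_for and expected_raw_size are made up for illustration, and 8 bits per component is assumed), the colour-space name can be mapped to a Pillow mode and the decoded byte count checked before calling Image.frombytes:

```python
# Hypothetical helpers, not part of PyPDF2 or Pillow.
# Assumes simple device colour spaces and 8 bits per component.

PIL_MODES = {
    "/DeviceGray": ("L", 1),     # one byte per pixel
    "/DeviceRGB": ("RGB", 3),    # three bytes per pixel
    "/DeviceCMYK": ("CMYK", 4),  # four bytes per pixel
}


def pil_mode_for(color_space):
    """Return (mode, channels) for a device colour space, or None if unknown."""
    return PIL_MODES.get(str(color_space))


def expected_raw_size(width, height, channels):
    """Number of decoded bytes Image.frombytes needs for an 8-bit image."""
    return width * height * channels


mode, channels = pil_mode_for("/DeviceRGB")
print(mode, expected_raw_size(150, 150, channels))
```

If the decoded data is shorter than the expected size, Image.frombytes raises an error, so a mismatch usually means the colour space or bit depth was guessed wrong.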
With PyPDF2, for the CCITTFaxDecode filter:
import PyPDF2
import struct

"""
Links:
PDF format: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
CCITT Group 4: https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.6-198811-I!!PDF-E&type=items
Extract images from pdf: http://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python
Extract images coded with CCITTFaxDecode in .net: http://stackoverflow.com/questions/2641770/extracting-image-from-pdf-with-ccittfaxdecode-filter
TIFF format and tags: http://www.awaresystems.be/imaging/tiff/faq.html
"""

def tiff_header_for_CCITT(width, height, img_size, CCITT_group=4):
    tiff_header_struct = ...  # the remainder of this function is cut off in the source
I added all of those together in PyPDFTK.
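Because the CCITTFaxDecode code above is cut off, here is a minimal sketch of how a TIFF header for a single CCITT-compressed strip could be packed with struct. It is not the original author's function: the name tiff_header_for_ccitt and the exact field layout are my own assumptions, based on the baseline TIFF tag numbers.

```python
import struct


def tiff_header_for_ccitt(width, height, img_size, ccitt_group=4):
    """Build a minimal little-endian TIFF header for one CCITT-compressed strip."""
    # 8-byte file header, 2-byte tag count, eight 12-byte tags, 4-byte next-IFD offset
    fmt = "<2sHL" + "H" + "HHLL" * 8 + "L"
    return struct.pack(
        fmt,
        b"II", 42, 8,            # little-endian marker, TIFF magic number, offset of first IFD
        8,                       # number of tags in the IFD
        256, 4, 1, width,        # ImageWidth (LONG)
        257, 4, 1, height,       # ImageLength (LONG)
        258, 3, 1, 1,            # BitsPerSample (SHORT): 1 bit per pixel
        259, 3, 1, ccitt_group,  # Compression (SHORT): 3 = Group 3, 4 = Group 4
        262, 3, 1, 0,            # PhotometricInterpretation (SHORT): 0 = WhiteIsZero
        273, 4, 1, struct.calcsize(fmt),  # StripOffsets (LONG): data follows the header
        278, 4, 1, height,       # RowsPerStrip (LONG): the whole image is one strip
        279, 4, 1, img_size,     # StripByteCounts (LONG): length of the compressed data
        0,                       # no further IFDs
    )


header = tiff_header_for_ccitt(width=1728, height=2200, img_size=4096)
print(len(header))
```

With PyPDF2 the inputs would presumably come from the image object itself: /Width, /Height, the length of the raw stream, and the /K entry of /DecodeParms (a negative K indicates Group 4). Writing the header followed by the raw stream bytes should then give a file that TIFF readers can open.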
My own contribution is the handling of /Indexed images, for example:
for obj in xObject:
    if xObject[obj]['/Subtype'] == '/Image':
        size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
        color_space = xObject[obj]['/ColorSpace']
        if isinstance(color_space, pdf.generic.ArrayObject) and color_space[0] == '/Indexed':
            color_space, base, hival, lookup = [v.getObject() for v in color_space]  # pg 262
        mode = img_modes[color_space]

        if xObject[obj]['/Filter'] == '/FlateDecode':
            data = xObject[obj].getData()
            img = Image.frombytes(mode, size, data)
            if color_space == '/Indexed':
                img.putpalette(lookup.getData())
                img = img.convert('RGB')
            img.save("{}{:04}.png".format(filename_prefix, i))
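Note that when /Indexed images are found, you can't just compare /ColorSpace to a string, because it comes as an ArrayObject. So we have to check the array, retrieve the indexed palette (lookup in the code) and set it in the PIL Image object; otherwise it stays uninitialized (all zeros) and the whole image shows as black.
My first instinct was to save them as GIFs (which is an indexed format), but my tests showed that PNGs were smaller and looked the same.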
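To make the palette point concrete, here is a small pure-Python sketch of what an indexed image stores. The function expand_indexed is a hypothetical illustration, not a PyPDF2 or Pillow call, and it assumes 8-bit indices with a /DeviceRGB base colour space (three palette bytes per entry):

```python
def expand_indexed(indices, palette):
    """Map 8-bit palette indices to RGB bytes.

    Assumes a /DeviceRGB base colour space, so the palette holds 3 bytes per entry.
    """
    out = bytearray()
    for index in indices:
        start = index * 3
        out += palette[start:start + 3]
    return bytes(out)


palette = bytes([255, 0, 0,   0, 0, 255])   # entry 0 = red, entry 1 = blue
pixels = bytes([0, 1, 1, 0])                # a 2x2 indexed image
rgb = expand_indexed(pixels, palette)
black = expand_indexed(pixels, bytes(6))    # all-zero palette: every pixel is black
print(rgb.hex(), black.hex())
```

The pixel data only holds indices, so with an all-zero palette every index resolves to black, which matches the symptom described above.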
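I found these types of images when printing to PDF with the Foxit Reader PDF Printer.
I started from the code of @sylvain. It had some flaws, such as getData raising NotImplementedError: unsupported filter /DCTDecode, or the code failing to find images on some pages because they sit at a deeper level than the page.
Here is my code: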
import PyPDF2
from PIL import Image
import sys
from os import path
import warnings
warnings.filterwarnings("ignore")

number = 0

def recurse(page, xObject):
    global number

    xObject = xObject['/Resources']['/XObject'].getObject()

    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            # Raw (still encoded) stream bytes; this avoids getData() failing on /DCTDecode
            data = xObject[obj]._data
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"

            imagename = "%s - p. %s - %s" % (abspath[:-4], page, obj[1:])

            if xObject[obj]['/Filter'] == '/FlateDecode':
                # Image.frombytes needs decompressed pixels, so decode this stream first
                img = Image.frombytes(mode, size, xObject[obj].getData())
                img.save(imagename + ".png")
                number += 1
            elif xObject[obj]['/Filter'] == '/DCTDecode':
                img = open(imagename + ".jpg", "wb")
                img.write(data)
                img.close()
                number += 1
            elif xObject[obj]['/Filter'] == '/JPXDecode':
                img = open(imagename + ".jp2", "wb")
                img.write(data)
                img.close()
                number += 1
        else:
            # Not an image: descend into the nested object's own resources
            recurse(page, xObject[obj])

try:
    _, filename, *pages = sys.argv
    *pages, = map(int, pages)
    abspath = path.abspath(filename)
except BaseException:
    print('Usage :\nPDF_extract_images file.pdf page1 page2 page3 …')
    sys.exit()

file = PyPDF2.PdfFileReader(open(filename, "rb"))

for p in pages:
    page0 = file.getPage(p-1)
    recurse(p, page0)

print('%s extracted images' % number)
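The recursion used above can be illustrated without any PDF library. This is only a sketch on plain dictionaries standing in for PDF resource dictionaries; iter_images and the sample page structure are made up for illustration. Unlike the script above, it skips nested objects that have no /Resources entry instead of failing on them.

```python
def iter_images(resources, path=()):
    """Yield (path, xobject) for every image XObject, descending into forms.

    `resources` is a plain-dict stand-in for a PDF /Resources dictionary.
    """
    for name, xobj in resources.get("/XObject", {}).items():
        if xobj.get("/Subtype") == "/Image":
            yield path + (name,), xobj
        elif "/Resources" in xobj:
            # A form XObject carries its own resources, which may hold more images
            yield from iter_images(xobj["/Resources"], path + (name,))


page = {
    "/XObject": {
        "/Im0": {"/Subtype": "/Image", "/Width": 10, "/Height": 10},
        "/Fm0": {
            "/Subtype": "/Form",
            "/Resources": {"/XObject": {"/Im1": {"/Subtype": "/Image"}}},
        },
    }
}
for found_path, _ in iter_images(page):
    print(" > ".join(found_path))
```

Keeping the path of parent names is optional, but it helps produce unique file names when the same image name (for example /Im1) appears inside several form objects.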
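A much simpler solution:
Use the poppler-utils package. To install it, use Homebrew (Homebrew is specific to macOS, but poppler-utils packages for Windows and Linux exist as well). The first line below installs poppler with Homebrew. Once it is installed, the second line (run from the command line) extracts the images from a PDF file and names them "image*". To run this from within Python, use the os or subprocess module. The last two lines use the os module; the subprocess module works too (its run() function needs Python 3.5 or later).
brew install poppler
pdfimages file.pdf image
import os
os.system('pdfimages file.pdf image')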
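For the subprocess variant mentioned in this answer, a minimal sketch could look like the following. The function names build_pdfimages_cmd and extract_with_pdfimages are my own, and the -all flag (which asks pdfimages to keep each image in its stored format where it can) is taken from another answer in this thread rather than from this one.

```python
import shutil
import subprocess


def build_pdfimages_cmd(pdf_path, prefix, native=True):
    """Return the argument list for a pdfimages call."""
    cmd = ["pdfimages"]
    if native:
        cmd.append("-all")  # keep each image in its stored format where possible
    cmd += [pdf_path, prefix]
    return cmd


def extract_with_pdfimages(pdf_path, prefix):
    """Run pdfimages if it is on PATH; return True on success, False if it is missing."""
    if shutil.which("pdfimages") is None:
        return False
    subprocess.run(build_pdfimages_cmd(pdf_path, prefix), check=True)
    return True


print(build_pdfimages_cmd("file.pdf", "image"))
```

Passing a list instead of a single string avoids shell quoting problems with file names that contain spaces, and check=True turns a failed pdfimages run into a Python exception.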
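After some searching I found the following script, which works really well with my PDFs. It only handles JPG, but it worked perfectly with my unprotected files. It also does not require any external libraries.
To be clear, the script is by Ned Batchelder, not me.
Python 3 code: extract JPGs from PDFs. Quick and dirty.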
import sys

with open(sys.argv[1], "rb") as file:
    file.seek(0)
    pdf = file.read()

startmark = b"\xff\xd8"
startfix = 0
endmark = b"\xff\xd9"
endfix = 2
i = 0

njpg = 0
while True:
    istream = pdf.find(b"stream", i)
    if istream < 0:
        break
    istart = pdf.find(startmark, istream, istream + 20)
    if istart < 0:
        i = istream + 20
        continue
    iend = pdf.find(b"endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    iend = pdf.find(endmark, iend - 20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")

    istart += startfix
    iend += endfix
    print("JPG %d from %d to %d" % (njpg, istart, iend))
    jpg = pdf[istart:iend]
    with open("jpg%d.jpg" % njpg, "wb") as jpgfile:
        jpgfile.write(jpg)

    njpg += 1
    i = iend
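The same marker scan can be wrapped in a function that works on bytes already in memory, which makes it easy to try on a synthetic sample. This is a sketch of the technique rather than Ned Batchelder's code: the name find_jpegs and the sample bytes are mine, and I clamp the backward search so that it never starts before the JPEG start marker.

```python
def find_jpegs(pdf):
    """Return (start, end) byte offsets of JPEG streams found in `pdf` (bytes)."""
    ranges = []
    i = 0
    while True:
        istream = pdf.find(b"stream", i)
        if istream < 0:
            break
        # The JPEG start marker must sit within a few bytes of the "stream" keyword
        istart = pdf.find(b"\xff\xd8", istream, istream + 20)
        if istart < 0:
            i = istream + 20
            continue
        iend = pdf.find(b"endstream", istart)
        if iend < 0:
            raise ValueError("Didn't find end of stream!")
        # The end marker is expected shortly before "endstream"
        iend = pdf.find(b"\xff\xd9", max(istart, iend - 20))
        if iend < 0:
            raise ValueError("Didn't find end of JPG!")
        iend += 2  # include the two marker bytes
        ranges.append((istart, iend))
        i = iend
    return ranges


sample = (b"1 0 obj\n<< /Filter /DCTDecode >>\nstream\n"
          b"\xff\xd8fake-jpeg-bytes\xff\xd9\nendstream\nendobj\n")
for start, end in find_jpegs(sample):
    print(start, end, sample[start:end][:2])
```

Like the script above, this only finds JPEG data stored as-is; it will miss images that are wrapped in an additional filter or stored inside compressed object streams.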
You can use the PyMuPDF module. This outputs all images as .png files, but it works out of the box and is fast.
import fitz
doc = fitz.open("file.pdf")
for i in range(len(doc)):
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n < 5:       # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:               # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None
You can also use the pdfimages command on Ubuntu.
Install the poppler library with the following commands:
sudo apt install poppler-utils
sudo apt-get install python-poppler
pdfimages file.pdf image
The list of files created is (for example, when the PDF contains two images):
image-000.png
image-001.png
It works! Now you can use subprocess.run to run it from Python.
I like minecart because it is extremely easy to use. The snippet below shows how to extract images from a PDF:
#pip install minecart
import minecart

pdffile = open('Invoices.pdf', 'rb')
doc = minecart.Document(pdffile)

page = doc.get_page(0)  # getting a single page

#iterating through all pages
for page in doc.iter_pages():
    im = page.images[0].as_pil()  # requires pillow
    display(im)
As of February 2019, the solution given by @sylvain (at least on my setup) did not work without a small modification to the xObject[obj]['/Filter'] checks (the sentence is truncated in the source; in the code below each filter is tested with in rather than ==):
import PyPDF2, traceback
from PIL import Image

src = "input.pdf"  # placeholder: path of the PDF to read (not defined in the original snippet)
input1 = PyPDF2.PdfFileReader(open(src, "rb"))
nPages = input1.getNumPages()
print(nPages)

for i in range(nPages):
    print(i)
    page0 = input1.getPage(i)
    try:
        xObject = page0['/Resources']['/XObject'].getObject()
    except:
        xObject = []

    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            try:
                if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                    mode = "RGB"
                elif xObject[obj]['/ColorSpace'] == '/DeviceCMYK':
                    mode = "CMYK"
                    # will cause errors when saving
                else:
                    mode = "P"

                fn = 'p%03d-%s' % (i + 1, obj[1:])
                print('\t' + fn)
                # '/Filter' may be a single name or a list of names, hence "in" instead of "=="
                if '/FlateDecode' in xObject[obj]['/Filter']:
                    img = Image.frombytes(mode, size, data)
                    img.save(fn + ".png")
                elif '/DCTDecode' in xObject[obj]['/Filter']:
                    img = open(fn + ".jpg", "wb")
                    img.write(data)
                    img.close()
                elif '/JPXDecode' in xObject[obj]['/Filter']:
                    img = open(fn + ".jp2", "wb")
                    img.write(data)
                    img.close()
                elif '/LZWDecode' in xObject[obj]['/Filter']:
                    img = open(fn + ".tif", "wb")
                    img.write(data)
                    img.close()
                else:
                    print('Unknown format: %s' % xObject[obj]['/Filter'])
            except:
                traceback.print_exc()
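Since /Filter can be either a single name or an array of names, one way to avoid scattering in checks is to normalise it first. The sketch below uses plain strings and lists as stand-ins for PyPDF2's name and array objects (which, as far as I know, subclass str and list); filter_names, extension_for and the EXTENSIONS table are hypothetical helpers.

```python
EXTENSIONS = {
    "/DCTDecode": ".jpg",
    "/JPXDecode": ".jp2",
    "/FlateDecode": ".png",   # needs decoding and re-wrapping, e.g. with Pillow
}


def filter_names(filter_value):
    """Return the /Filter entry as a list, whether it was one name or an array."""
    if filter_value is None:
        return []
    if isinstance(filter_value, str):
        return [filter_value]
    return [str(name) for name in filter_value]


def extension_for(filter_value):
    """Pick a file extension from the last filter in the chain, or None."""
    names = filter_names(filter_value)
    return EXTENSIONS.get(names[-1]) if names else None


print(extension_for("/DCTDecode"), extension_for(["/FlateDecode"]))
```

The last filter in the array is the innermost encoding, so it is the one that names the image format. When several filters are chained, the raw stream still has the outer encodings applied and cannot simply be written to disk as-is.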
#!/usr/bin/env python3
try:
    from StringIO import StringIO
except ImportError:
    from io import BytesIO as StringIO
from PIL import Image
from PyPDF2 import PdfFileReader, generic
import zlib


def get_color_mode(obj):
    try:
        cspace = obj['/ColorSpace']
    except KeyError:
        return None

    if cspace == '/DeviceRGB':
        return "RGB"
    elif cspace == '/DeviceCMYK':
        return "CMYK"
    elif cspace == '/DeviceGray':
        return "P"

    if isinstance(cspace, generic.ArrayObject) and cspace[0] == '/ICCBased':
        color_map = obj['/ColorSpace'][1].getObject()['/N']
        if color_map == 1:
            return "P"
        elif color_map == 3:
            return "RGB"
        elif color_map == 4:
            return "CMYK"


def get_object_images(x_obj):
    images = []
    for obj_name in x_obj:
        sub_obj = x_obj[obj_name]

        if '/Resources' in sub_obj and '/XObject' in sub_obj['/Resources']:
            images += get_object_images(sub_obj['/Resources']['/XObject'].getObject())

        elif sub_obj['/Subtype'] == '/Image':
            zlib_compressed = '/FlateDecode' in sub_obj.get('/Filter', '')
            if zlib_compressed:
                sub_obj._data = zlib.decompress(sub_obj._data)

            images.append((
                get_color_mode(sub_obj),
                (sub_obj['/Width'], sub_obj['/Height']),
                sub_obj._data
            ))

    return images


def get_pdf_images(pdf_fp):
    images = []
    try:
        pdf_in = PdfFileReader(open(pdf_fp, "rb"))
    except:
        return images

    for p_n in range(pdf_in.numPages):
        page = pdf_in.getPage(p_n)

        try:
            page_x_obj = page['/Resources']['/XObject'].getObject()
        except KeyError:
            continue

        images += get_object_images(page_x_obj)

    return images


if __name__ == "__main__":
    pdf_fp = "test.pdf"

    for image in get_pdf_images(pdf_fp):
        (mode, size, data) = image
        try:
            img = Image.open(StringIO(data))
        except Exception as e:
            print("Failed to read image with PIL: {}".format(e))
            continue
        # Do whatever you want with the image
import fitz
from PIL import Image
import io
filePath = "path/to/file.pdf"
#opens doc using PyMuPDF
doc = fitz.Document(filePath)
#loads the first page
page = doc.loadPage(0)
#[First image on page described thru a list][First attribute on image list: xref n], check PyMuPDF docs under getImageList()
xref = page.getImageList()[0][0]
#gets the image as a dict, check docs under extractImage
baseImage = doc.extractImage(xref)
#gets the raw string image data from the dictionary and wraps it in a BytesIO object before using PIL to open it
image = Image.open(io.BytesIO(baseImage['image']))
#Displays image for good measure
image.show()
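The snippet above only handles the first image on the first page. A sketch of how the same idea might be extended to every image in the document is shown below. It assumes a recent PyMuPDF release, where the snake_case names get_images and extract_image are available; the function names image_filename and extract_all_images and the naming scheme are my own.

```python
def image_filename(page_number, xref, ext):
    """Build an output name such as 'p3-17.jpeg' (the naming scheme is arbitrary)."""
    return "p%d-%d.%s" % (page_number, xref, ext)


def extract_all_images(pdf_path):
    """Write every embedded image to the current directory; return the file names."""
    import fitz  # PyMuPDF; imported here so the helper above works without it

    written = []
    doc = fitz.open(pdf_path)
    for page_index in range(len(doc)):
        for img in doc[page_index].get_images(full=True):
            xref = img[0]
            info = doc.extract_image(xref)  # dict with "image" bytes and "ext"
            name = image_filename(page_index + 1, xref, info["ext"])
            with open(name, "wb") as out:
                out.write(info["image"])
            written.append(name)
    doc.close()
    return written


print(image_filename(1, 17, "jpeg"))
```

Calling extract_all_images("path/to/file.pdf") should leave one file per image reference. An image used on several pages is written once per page here, so deduplicating on xref may be worthwhile. I have not checked how faithfully every storage format is preserved, so treat the "ext" value as what the library reports rather than a guarantee of the original encoding.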
apt-get install poppler-utils
pdfimages -all myfile.pdf ./images_found/
apt-get install jbig2dec
jbig2dec -t png -145.jb2g -145.jb2e
import sys
import PyPDF2
from PIL import Image

pdf = sys.argv[1]
print(pdf)
input1 = PyPDF2.PdfFileReader(open(pdf, "rb"))
for x in range(0, input1.numPages):
    xObject = input1.getPage(x)
    xObject = xObject['/Resources']['/XObject'].getObject()
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            print(size)
            data = xObject[obj]._data
            #print(data)
            print(xObject[obj]['/Filter'])
            if xObject[obj]['/Filter'][0] == '/DCTDecode':
                img_name = str(x) + ".jpg"
                print(img_name)
                img = open(img_name, "wb")
                img.write(data)
                img.close()
    print(str(x) + " is done")
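from pdf2image import convert_from_path, pdfinfo_from_path

file_path = "file path of PDF"
image_path = "output directory"  # placeholder: folder that receives the PNG files (not defined in the original)
info = pdfinfo_from_path(file_path, userpw=None, poppler_path=None)
maxPages = info["Pages"]
image_counter = 0
if maxPages > 10:
    # Convert ten pages at a time; the +1 makes sure the final page is not skipped
    for page in range(1, maxPages + 1, 10):
        pages = convert_from_path(file_path, dpi=300, first_page=page,
                                  last_page=min(page + 10 - 1, maxPages))
        for image in pages:
            image.save(image_path + '/' + str(image_counter) + '.png', 'PNG')
            image_counter += 1
else:
    pages = convert_from_path(file_path, 300)
    for i, j in enumerate(pages):
        j.save(image_path + '/' + str(i) + '.png', 'PNG')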
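The batching arithmetic in the snippet above is easy to get wrong by one page, so it can help to isolate it in a small generator. page_batches is a hypothetical helper, not part of pdf2image; it yields inclusive first and last page numbers that could be passed as first_page and last_page.

```python
def page_batches(total_pages, batch_size=10):
    """Yield inclusive (first_page, last_page) pairs covering pages 1..total_pages."""
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


print(list(page_batches(25)))
```

Note that this approach renders whole pages at a chosen DPI, so unlike the extraction methods elsewhere in this thread it does resample the embedded images.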
from pikepdf import Pdf, PdfImage

filename = "sample-in.pdf"
example = Pdf.open(filename)

for i, page in enumerate(example.pages):
    for j, (name, raw_image) in enumerate(page.images.items()):
        image = PdfImage(raw_image)
        out = image.extract_to(fileprefix=f"{filename}-page{i:03}-img{j:03}")

        # Optional: print info about image
        w = raw_image.stream_dict.Width
        h = raw_image.stream_dict.Height
        f = raw_image.stream_dict.Filter
        size = raw_image.stream_dict.Length

        print(f"Wrote {name} {w}x{h} {f} {size:,}B {image.colorspace} to {out}")
Wrote /Im1 150x150 /DCTDecode 5,952B /ICCBased to sample2.pdf-page000-img000.jpg
Wrote /Im10 32x32 /FlateDecode 36B /ICCBased to sample2.pdf-page000-img001.png
...