Extract images from PDF without resampling, in Python?


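How can I extract all images from a PDF document, at native resolution and format? (Meaning extract TIFF as TIFF, JPEG as JPEG, etc., and without resampling.) Layout is unimportant; I don't care where on the page the source image is located.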


I'm using Python 2.7 but can use 3.x if required.

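Often in a PDF, the image is simply stored as-is. For example, a PDF with a JPG inserted will have a range of bytes somewhere in the middle that, when extracted, is a valid JPG file. You can use this to very simply extract byte ranges from the PDF. I wrote about this some time ago, with sample code.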
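Since an embedded stream is often a complete image file, one quick way to check what a byte range actually contains is to look at its leading magic bytes. The following is only a sketch: the helper name `sniff_image_format` is hypothetical, and the signature table covers just a few common formats.

```python
# Leading byte signatures of a few common image formats.
SIGNATURES = [
    (b"\xff\xd8\xff", "jpg"),                      # JPEG start-of-image marker
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"II*\x00", "tif"),                           # little-endian TIFF
    (b"MM\x00*", "tif"),                           # big-endian TIFF
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "jp2"),    # JPEG 2000 signature box
]


def sniff_image_format(data):
    """Return a file extension for the byte string, or None if unrecognised."""
    for magic, ext in SIGNATURES:
        if data.startswith(magic):
            return ext
    return None
```

A byte range that comes back as None was probably not stored as a standalone image file and needs decoding rather than plain copying.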

Libpoppler comes with a tool called "pdfimages" that does exactly this.

(On Ubuntu systems it's in the poppler-utils package.)

Windows binaries:

I installed ImageMagick on my server and then run command-line calls through Popen:

#!/usr/bin/python

import os
import subprocess
import settings  # project settings module providing MEDIA_ROOT

IMAGE_PATH = os.path.join(settings.MEDIA_ROOT, 'pdf_input')

def extract_images(pdf):
    output = 'temp.png'
    # Pass the arguments as a list so that paths containing spaces survive intact.
    cmd = ['convert', os.path.join(IMAGE_PATH, pdf), os.path.join(IMAGE_PATH, output)]
    # check_call waits for the command to finish and raises on a non-zero exit code.
    subprocess.check_call(cmd)
This will create an image for every page and store them as temp-0.png, temp-1.png, and so on.
This is only "extraction" if your PDF contains only images and no text.

In Python with the PyPDF2 and Pillow libraries it is simple:

import PyPDF2

from PIL import Image

if __name__ == '__main__':
    input1 = PyPDF2.PdfFileReader(open("input.pdf", "rb"))
    page0 = input1.getPage(0)
    xObject = page0['/Resources']['/XObject'].getObject()

    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"

            if xObject[obj]['/Filter'] == '/FlateDecode':
                img = Image.frombytes(mode, size, data)
                img.save(obj[1:] + ".png")
            elif xObject[obj]['/Filter'] == '/DCTDecode':
                img = open(obj[1:] + ".jpg", "wb")
                img.write(data)
                img.close()
            elif xObject[obj]['/Filter'] == '/JPXDecode':
                img = open(obj[1:] + ".jp2", "wb")
                img.write(data)
                img.close()
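Image.frombytes expects exactly width x height x channels bytes, so a length mismatch usually means the guessed mode is wrong (or the stream is not 8 bits per component). A small sketch of that sanity check follows; `expected_raw_size` and `guess_mode` are hypothetical helper names, and 8 bits per component is assumed throughout.

```python
# Bytes per pixel for a few Pillow modes, assuming 8 bits per component.
BYTES_PER_PIXEL = {"L": 1, "P": 1, "RGB": 3, "CMYK": 4}


def expected_raw_size(width, height, mode):
    """Number of bytes Image.frombytes(mode, (width, height), data) needs."""
    return width * height * BYTES_PER_PIXEL[mode]


def guess_mode(width, height, data_len):
    """Return the modes whose raw size matches the decoded stream length."""
    return [mode for mode in BYTES_PER_PIXEL
            if expected_raw_size(width, height, mode) == data_len]
```

Calling `guess_mode(width, height, len(data))` before `Image.frombytes` gives a hint about which mode the decoded data can actually fill.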

In Python with PyPDF2, for the CCITTFaxDecode filter:

import PyPDF2
import struct

"""
Links:
PDF format: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
CCITT Group 4: https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.6-198811-I!!PDF-E&type=items
Extract images from pdf: http://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python
Extract images coded with CCITTFaxDecode in .net: http://stackoverflow.com/questions/2641770/extracting-image-from-pdf-with-ccittfaxdecode-filter
TIFF format and tags: http://www.awaresystems.be/imaging/tiff/faq.html
"""

def tiff_header_for_CCITT(width, height, img_size, CCITT_group=4):
    tiff_header_struct = '…

I added all of those together in PyPDFTK.
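Returning to the CCITTFaxDecode case: the stream in the PDF is bare fax data, so one way to keep it unresampled is to prepend a minimal TIFF header and save the result as a .tiff file. The sketch below is an illustration, not the truncated listing restored. The tag layout follows the baseline TIFF tags, and the function name and the 4-byte terminator are this sketch's own choices.

```python
import struct


def tiff_header_for_ccitt(width, height, img_size, ccitt_group=4):
    """Build a little-endian TIFF header for one strip of CCITT fax data."""
    # Byte order, magic number 42, offset of the IFD, number of tags,
    # eight 12-byte tag entries, then a 4-byte "next IFD" offset of 0.
    fmt = "<2sHLH" + "HHLL" * 8 + "L"
    return struct.pack(
        fmt,
        b"II", 42, 8, 8,
        256, 4, 1, width,                  # ImageWidth
        257, 4, 1, height,                 # ImageLength
        258, 3, 1, 1,                      # BitsPerSample
        259, 3, 1, ccitt_group,            # Compression: 3 = Group 3, 4 = Group 4
        262, 3, 1, 0,                      # PhotometricInterpretation: WhiteIsZero
        273, 4, 1, struct.calcsize(fmt),   # StripOffsets: data follows the header
        278, 4, 1, height,                 # RowsPerStrip
        279, 4, 1, img_size,               # StripByteCounts
        0,                                 # no further IFD
    )
```

Writing `tiff_header_for_ccitt(width, height, len(data)) + data` to a file should then give a viewer something it can open, provided `data` is the undecoded stream. In a PDF, a negative /K value in /DecodeParms indicates Group 4 encoding; other values indicate Group 3.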

My own contribution is to handle /Indexed files, like this:

for obj in xObject:
    if xObject[obj]['/Subtype'] == '/Image':
        size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
        color_space = xObject[obj]['/ColorSpace']
        if isinstance(color_space, pdf.generic.ArrayObject) and color_space[0] == '/Indexed':
            color_space, base, hival, lookup = [v.getObject() for v in color_space] # pg 262
        mode = img_modes[color_space]

        if xObject[obj]['/Filter'] == '/FlateDecode':
            data = xObject[obj].getData()
            img = Image.frombytes(mode, size, data)
            if color_space == '/Indexed':
                img.putpalette(lookup.getData())
                img = img.convert('RGB')
            img.save("{}{:04}.png".format(filename_prefix, i))
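Note that when /Indexed files are found, you can't just compare /ColorSpace to a string, because it comes as an ArrayObject. So we have to check the array and retrieve the indexed palette (lookup in the code) and set it in the PIL Image object; otherwise it stays uninitialized (zero) and the whole image shows as black.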
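What putpalette followed by convert('RGB') achieves can be sketched without Pillow: every pixel byte is an index into the lookup table. This illustration assumes 8 bits per pixel and a /DeviceRGB base colour space (3 bytes per palette entry); `expand_indexed` is a hypothetical name.

```python
def expand_indexed(pixels, lookup, components=3):
    """Map each palette index in `pixels` to its colour bytes from `lookup`."""
    out = bytearray()
    for index in bytearray(pixels):
        start = index * components
        out += lookup[start:start + components]
    return bytes(out)
```

With an all-zero lookup table every index maps to black, which matches the symptom described above when the palette is never set.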

My first instinct was to save them as GIFs (which is an indexed format), but my tests showed that PNGs were smaller and looked the same.


I found these types of images when printing to PDF with the Foxit Reader PDF Printer.

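I started from the code of @sylvain. It has some flaws, such as getData raising NotImplementedError: unsupported filter /DCTDecode, or the code failing to find images in some pages because they sit at a deeper level than the page.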

Here is my code:

import PyPDF2

from PIL import Image

import sys
from os import path
import warnings
warnings.filterwarnings("ignore")

number = 0

def recurse(page, xObject):
    global number

    xObject = xObject['/Resources']['/XObject'].getObject()

    for obj in xObject:

        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj]._data
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"

            imagename = "%s - p. %s - %s"%(abspath[:-4], p, obj[1:])

            if xObject[obj]['/Filter'] == '/FlateDecode':
                img = Image.frombytes(mode, size, data)
                img.save(imagename + ".png")
                number += 1
            elif xObject[obj]['/Filter'] == '/DCTDecode':
                img = open(imagename + ".jpg", "wb")
                img.write(data)
                img.close()
                number += 1
            elif xObject[obj]['/Filter'] == '/JPXDecode':
                img = open(imagename + ".jp2", "wb")
                img.write(data)
                img.close()
                number += 1
        else:
            recurse(page, xObject[obj])



try:
    _, filename, *pages = sys.argv
    *pages, = map(int, pages)
    abspath = path.abspath(filename)
except BaseException:
    print('Usage :\nPDF_extract_images file.pdf page1 page2 page3 …')
    sys.exit()


file = PyPDF2.PdfFileReader(open(filename, "rb"))

for p in pages:    
    page0 = file.getPage(p-1)
    recurse(p, page0)

print('%s extracted images'% number)
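
A much easier solution:

Use the poppler-utils package. To install it, use Homebrew (Homebrew is macOS-specific, but poppler-utils packages for Windows or Linux are available too). The first line below installs poppler-utils using Homebrew. After installation, the second line (run from the command line) extracts images from a PDF file and names them "image*". To run this program from within Python, use the os or subprocess module. The third line is code using the os module; beneath that is an example with subprocess (Python 3.5 or later for the run() function).

brew install poppler

pdfimages file.pdf image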

import os
os.system('pdfimages file.pdf image')
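A hedged sketch of the subprocess variant: the command is built as a list, so file names with spaces are passed through intact, and `-all` asks pdfimages to write JPEG, JPEG 2000, JBIG2 and CCITT images in their native format. The function names are hypothetical, and pdfimages is assumed to be on the PATH.

```python
import shutil
import subprocess


def build_pdfimages_cmd(pdf_path, prefix, native=True):
    """Return the pdfimages argument list for the given PDF and output prefix."""
    cmd = ["pdfimages"]
    if native:
        cmd.append("-all")  # keep embedded images in their native formats
    cmd += [pdf_path, prefix]
    return cmd


def run_pdfimages(pdf_path, prefix, native=True):
    """Run pdfimages and raise if it is missing or exits with an error."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages not found on PATH")
    subprocess.run(build_pdfimages_cmd(pdf_path, prefix, native), check=True)
```

For example, `run_pdfimages("file.pdf", "image")` corresponds to running `pdfimages -all file.pdf image` in a shell.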


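After some searching I found the following script, which works really well with my PDFs. It only handles JPG, but it worked perfectly with my unprotected files. It also does not require any external libraries.

Not to take any credit: the script originates from Ned Batchelder, not me. Python 3 code: extract JPGs from PDFs. Quick and dirty.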

import sys

with open(sys.argv[1],"rb") as file:
    file.seek(0)
    pdf = file.read()

startmark = b"\xff\xd8"
startfix = 0
endmark = b"\xff\xd9"
endfix = 2
i = 0

njpg = 0
while True:
    istream = pdf.find(b"stream", i)
    if istream < 0:
        break
    istart = pdf.find(startmark, istream, istream + 20)
    if istart < 0:
        i = istream + 20
        continue
    iend = pdf.find(b"endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    iend = pdf.find(endmark, iend - 20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")

    istart += startfix
    iend += endfix
    print("JPG %d from %d to %d" % (njpg, istart, iend))
    jpg = pdf[istart:iend]
    with open("jpg%d.jpg" % njpg, "wb") as jpgfile:
        jpgfile.write(jpg)

    njpg += 1
    i = iend

You could use the PyMuPDF module. This outputs all images as .png files, but it works out of the box and is fast.

import fitz
doc = fitz.open("file.pdf")
for i in range(len(doc)):
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n < 5:       # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:               # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None
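One caveat with the `pix.n < 5` test: `pix.n` counts the alpha channel too, so a CMYK pixmap without alpha (n = 4) would be handed to writePNG unconverted. A hedged sketch of a stricter check, written as a plain function so it reads without PyMuPDF installed; it assumes the arguments are taken from `pix.n` and `pix.alpha`, and the function name is hypothetical.

```python
def needs_rgb_conversion(n, alpha):
    """True when the pixmap has more than 3 colour components (e.g. CMYK).

    n     -- total components per pixel, including alpha (pix.n)
    alpha -- 1 if the pixmap has an alpha channel, else 0 (pix.alpha)
    """
    return n - alpha > 3
```

In the loop above this would replace the condition as `if not needs_rgb_conversion(pix.n, pix.alpha):` for the branch that writes the PNG directly.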

You can use the pdfimages command in Ubuntu as well.

Install the poppler library using the commands below.

sudo apt install poppler-utils

sudo apt-get install python-poppler

pdfimages file.pdf image
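The created files are, for example, when the PDF contains two images:

image-000.png
image-001.png

It works! Now you can use subprocess.run to run this from Python.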
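Once the command has finished, the files it wrote can be picked up again by their prefix. A small sketch, assuming the `prefix-NNN.ext` naming shown above; `list_extracted` is a hypothetical helper.

```python
import glob
import os


def list_extracted(prefix, directory="."):
    """Return the sorted paths of files named '<prefix>-*' in `directory`."""
    pattern = os.path.join(directory, glob.escape(prefix) + "-*")
    return sorted(glob.glob(pattern))


# Possible usage after the extraction step (requires pdfimages to be installed):
#   subprocess.run(["pdfimages", "file.pdf", "image"], check=True)
#   for path in list_extracted("image"):
#       print(path)
```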

I prefer minecart as it is extremely easy to use. The snippet below shows how to extract images from a PDF:

#pip install minecart
import minecart

pdffile = open('Invoices.pdf', 'rb')
doc = minecart.Document(pdffile)

page = doc.get_page(0) # getting a single page

#iterating through all pages
for page in doc.iter_pages():
    im = page.images[0].as_pil()  # requires pillow
    display(im)

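As of February 2019, the solution given by @sylvain (at least on my setup) does not work without a small modification: the checks on xObject[obj]['/Filter'] have to use "in" instead of "==":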
import PyPDF2, traceback

from PIL import Image

input1 = PyPDF2.PdfFileReader(open(src, "rb"))
nPages = input1.getNumPages()
print(nPages)

for i in range(nPages) :
    print(i)
    page0 = input1.getPage(i)
    try :
        xObject = page0['/Resources']['/XObject'].getObject()
    except : xObject = []

    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            try :
                if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                    mode = "RGB"
                elif xObject[obj]['/ColorSpace'] == '/DeviceCMYK':
                    mode = "CMYK"
                    # will cause errors when saving
                else:
                    mode = "P"

                fn = 'p%03d-%s' % (i + 1, obj[1:])
                print('\t' + fn)
                if '/FlateDecode' in xObject[obj]['/Filter'] :
                    img = Image.frombytes(mode, size, data)
                    img.save(fn + ".png")
                elif '/DCTDecode' in xObject[obj]['/Filter']:
                    img = open(fn + ".jpg", "wb")
                    img.write(data)
                    img.close()
                elif '/JPXDecode' in xObject[obj]['/Filter'] :
                    img = open(fn + ".jp2", "wb")
                    img.write(data)
                    img.close()
                elif '/LZWDecode' in xObject[obj]['/Filter'] :
                    img = open(fn + ".tif", "wb")
                    img.write(data)
                    img.close()
                else :
                    print('Unknown format: {}'.format(xObject[obj]['/Filter']))
            except :
                traceback.print_exc()
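The reason membership tests are more robust here is that /Filter may be either a single name or an array of names. One way to make that explicit is to normalise the entry to a list first. This is a sketch with hypothetical helper names; it treats the last filter in the chain as the image codec, which only matters when several filters are stacked.

```python
def normalize_filters(value):
    """Return the /Filter entry as a list of filter names."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return [str(v) for v in value]
    return [str(value)]


# Filters whose encoded data is itself a standalone image file.
NATIVE_EXTENSIONS = {"/DCTDecode": "jpg", "/JPXDecode": "jp2"}


def native_extension(value):
    """Extension for the final filter in the chain, or None if it needs decoding.

    Any earlier filters in the chain still have to be undone before the
    bytes can be written out under this extension.
    """
    filters = normalize_filters(value)
    if not filters:
        return None
    return NATIVE_EXTENSIONS.get(filters[-1])
```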
#!/usr/bin/env python3
try:
    from StringIO import StringIO
except ImportError:
    from io import BytesIO as StringIO
from PIL import Image
from PyPDF2 import PdfFileReader, generic
import zlib


def get_color_mode(obj):

    try:
        cspace = obj['/ColorSpace']
    except KeyError:
        return None

    if cspace == '/DeviceRGB':
        return "RGB"
    elif cspace == '/DeviceCMYK':
        return "CMYK"
    elif cspace == '/DeviceGray':
        return "P"

    if isinstance(cspace, generic.ArrayObject) and cspace[0] == '/ICCBased':
        color_map = obj['/ColorSpace'][1].getObject()['/N']
        if color_map == 1:
            return "P"
        elif color_map == 3:
            return "RGB"
        elif color_map == 4:
            return "CMYK"


def get_object_images(x_obj):
    images = []
    for obj_name in x_obj:
        sub_obj = x_obj[obj_name]

        if '/Resources' in sub_obj and '/XObject' in sub_obj['/Resources']:
            images += get_object_images(sub_obj['/Resources']['/XObject'].getObject())

        elif sub_obj['/Subtype'] == '/Image':
            zlib_compressed = '/FlateDecode' in sub_obj.get('/Filter', '')
            if zlib_compressed:
               sub_obj._data = zlib.decompress(sub_obj._data)

            images.append((
                get_color_mode(sub_obj),
                (sub_obj['/Width'], sub_obj['/Height']),
                sub_obj._data
            ))

    return images


def get_pdf_images(pdf_fp):
    images = []
    try:
        pdf_in = PdfFileReader(open(pdf_fp, "rb"))
    except:
        return images

    for p_n in range(pdf_in.numPages):

        page = pdf_in.getPage(p_n)

        try:
            page_x_obj = page['/Resources']['/XObject'].getObject()
        except KeyError:
            continue

        images += get_object_images(page_x_obj)

    return images


if __name__ == "__main__":

    pdf_fp = "test.pdf"

    for image in get_pdf_images(pdf_fp):
        (mode, size, data) = image
        try:
            img = Image.open(StringIO(data))
        except Exception as e:
            print ("Failed to read image with PIL: {}".format(e))
            continue
        # Do whatever you want with the image
import fitz
from PIL import Image
import io

filePath = "path/to/file.pdf"
#opens doc using PyMuPDF
doc = fitz.Document(filePath)

#loads the first page
page = doc.loadPage(0)

#[First image on page described thru a list][First attribute on image list: xref n], check PyMuPDF docs under getImageList()
xref = page.getImageList()[0][0]

#gets the image as a dict, check docs under extractImage 
baseImage = doc.extractImage(xref)

#gets the raw string image data from the dictionary and wraps it in a BytesIO object before using PIL to open it
image = Image.open(io.BytesIO(baseImage['image']))

#Displays image for good measure
image.show()
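The dictionary returned by extractImage carries the raw bytes under "image" and a suggested file extension under "ext", so the image can also be written straight to disk without passing through PIL, which avoids any re-encoding. A minimal sketch, assuming only those two keys; `save_extracted_image` is a hypothetical helper.

```python
import os


def save_extracted_image(image_info, stem, directory="."):
    """Write the raw image bytes to '<directory>/<stem>.<ext>' and return the path."""
    path = os.path.join(directory, "%s.%s" % (stem, image_info["ext"]))
    with open(path, "wb") as handle:
        handle.write(image_info["image"])
    return path
```

With the code above this could be called as `save_extracted_image(baseImage, "page0-%d" % xref)`.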
apt-get install poppler-utils
pdfimages -all myfile.pdf ./images_found/
apt-get install jbig2dec
jbig2dec -t png -145.jb2g -145.jb2e
import sys
import PyPDF2
from PIL import Image
pdf=sys.argv[1]
print(pdf)
input1 = PyPDF2.PdfFileReader(open(pdf, "rb"))
for x in range(0,input1.numPages):
    xObject=input1.getPage(x)
    xObject = xObject['/Resources']['/XObject'].getObject()
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            print(size)
            data = xObject[obj]._data
            #print(data)
            print(xObject[obj]['/Filter'])
            if xObject[obj]['/Filter'][0] == '/DCTDecode':
                img_name=str(x)+".jpg"
                print(img_name)
                img = open(img_name, "wb")
                img.write(data)
                img.close()
    print(str(x)+" is done")
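from pdf2image import convert_from_path, pdfinfo_from_path

# Note: pdf2image renders whole pages at the given dpi; it does not pull out
# the embedded images themselves.
file_path = "file path of PDF"
image_path = "output directory"  # placeholder: not defined in the original snippet

info = pdfinfo_from_path(file_path, userpw=None, poppler_path=None)
maxPages = info["Pages"]
image_counter = 0
# Convert in batches of 10 pages to limit memory use. The range must run to
# maxPages + 1, otherwise the final page is skipped when maxPages % 10 == 1.
for first in range(1, maxPages + 1, 10):
    pages = convert_from_path(file_path, dpi=300, first_page=first,
                              last_page=min(first + 10 - 1, maxPages))
    for page in pages:
        page.save(image_path + '/' + str(image_counter) + '.png', 'PNG')
        image_counter += 1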
from pikepdf import Pdf, PdfImage

filename = "sample-in.pdf"
example = Pdf.open(filename)

for i, page in enumerate(example.pages):
    for j, (name, raw_image) in enumerate(page.images.items()):
        image = PdfImage(raw_image)
        out = image.extract_to(fileprefix=f"{filename}-page{i:03}-img{j:03}")
        # Optional: print info about image
        w = raw_image.stream_dict.Width
        h = raw_image.stream_dict.Height
        f = raw_image.stream_dict.Filter
        size = raw_image.stream_dict.Length

        print(f"Wrote {name} {w}x{h} {f} {size:,}B {image.colorspace} to {out}")
Wrote /Im1 150x150 /DCTDecode 5,952B /ICCBased to sample2.pdf-page000-img000.jpg
Wrote /Im10 32x32 /FlateDecode 36B /ICCBased to sample2.pdf-page000-img001.png
...