C#: I want to get all objects except text objects from a PDF as an image using iTextSharp

c# pdf itext

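I am developing a program that converts PDF to PPTX using iTextSharp. So far I can get all the text objects and image objects together with their positions. What I find hard is getting table objects without their text. In fact, it would be even better if I could get them as pictures. My plan is to merge all objects except text objects into one background image and then place the text objects at the appropriate positions. I tried to find a similar question here but have had no luck so far. If anyone knows how to do this, please answer.

Thanks.

Try implementing IRenderListener: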

using System;
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

internal class ImageExtractor : IRenderListener
{
    private int _currentPage = 1;
    private int _imageCount = 0;
    private readonly string _outputFilePrefix;
    private readonly string _outputFolder;
    private readonly bool _overwriteExistingFiles;

    private ImageExtractor(string outputFilePrefix, string outputFolder, bool overwriteExistingFiles)
    {
        _outputFilePrefix = outputFilePrefix;
        _outputFolder = outputFolder;
        _overwriteExistingFiles = overwriteExistingFiles;
    }

    /// <summary>
    /// Extract all images from a PDF file
    /// </summary>
    /// <param name="pdfPath">Full path and file name of PDF file</param>
    /// <param name="outputFilePrefix">Basic name of exported files. If null then uses same name as PDF file.</param>
    /// <param name="outputFolder">Where to save images. If null or empty then uses same folder as PDF file.</param>
    /// <param name="overwriteExistingFiles">True to overwrite existing image files, false to skip past them</param>
    /// <returns>Count of number of images extracted.</returns>
    public static int ExtractImagesFromFile(string pdfPath, string outputFilePrefix, string outputFolder, bool overwriteExistingFiles)
    {
        // Handle setting of any default values
        outputFilePrefix = outputFilePrefix ?? System.IO.Path.GetFileNameWithoutExtension(pdfPath);
        outputFolder = String.IsNullOrEmpty(outputFolder) ? System.IO.Path.GetDirectoryName(pdfPath) : outputFolder;

        var instance = new ImageExtractor(outputFilePrefix, outputFolder, overwriteExistingFiles);

        using (var pdfReader = new PdfReader(pdfPath))
        {
            if (pdfReader.IsEncrypted())
                throw new ApplicationException(pdfPath + " is encrypted.");

            var pdfParser = new PdfReaderContentParser(pdfReader);

            while (instance._currentPage <= pdfReader.NumberOfPages)
            {
                pdfParser.ProcessContent(instance._currentPage, instance);

                instance._currentPage++;
            }
        }

        return instance._imageCount;
    }

    #region Implementation of IRenderListener

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        var imageObject = renderInfo.GetImage();

        // Skip this callback if no image object is available.
        if (imageObject == null)
            return;

        // Number each file and append the detected file type (for example "jpg" or "png"),
        // so that every image on every page gets its own file.
        var imageFileName = _outputFilePrefix + _imageCount + "." + imageObject.GetFileType();
        var imagePath = System.IO.Path.Combine(_outputFolder, imageFileName);

        if (_overwriteExistingFiles || !File.Exists(imagePath))
        {
            // Write the image bytes to a new file.
            var imageRawBytes = imageObject.GetImageAsBytes();
            File.WriteAllBytes(imagePath, imageRawBytes);
        }

        _imageCount++;
    }

    #endregion // Implementation of IRenderListener

}

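You say

"What I have done so far is get all the text objects, image objects, and their positions."

but you do not explain how. I assume you used a matching IRenderListener implementation.

As you found out yourself, however, IRenderListener only extracts images and text. The main objects missing are paths and their usage.

To extract those, you should also implement IExtRenderListener, which extends IRenderListener but additionally retrieves information about paths. To understand the callback methods, first look at how path-related instructions work in a PDF: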

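1. First there are instructions that build the actual path. These instructions essentially
   - move to some position,
   - add a line from the previous position to some position,
   - add a Bézier curve from the previous position to some position using some control points, or
   - add an upright rectangle at some position using some width and height information.
2. Then there is an optional instruction to intersect the current clip path with the path just built.
3. Finally, there is a drawing instruction for any combination of filling the inside of the path and stroking along the path, i.e. for doing both, either one, or neither.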

This corresponds to the callbacks your IExtRenderListener implementation receives:
/**
 * Called when the current path is being modified. E.g. new segment is being added,
 * new subpath is being started etc.
 *
 * @param renderInfo Contains information about the path segment being added to the current path.
 */
void ModifyPath(PathConstructionRenderInfo renderInfo);

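is called once or more to build the actual path. The PathConstructionRenderInfo contains the actual instruction type in its Operation property (compare it with the PathConstructionRenderInfo constant members MOVETO, LINETO, etc. to determine the operation type) and the required coordinates/dimensions in its SegmentData property. The Ctm property additionally returns the affine transformation currently set to be applied to all drawing operations.

Then ClipPath(int rule) is called if the current clip path should be intersected with the constructed path.

Finally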
/**
 * Called when the current path should be rendered.
 *
 * @param renderInfo Contains information about the current path which should be rendered.
 * @return The path which can be used as a new clipping path.
 */
Path RenderPath(PathPaintingRenderInfo renderInfo); 

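is called. The PathPaintingRenderInfo contains the drawing operation in its Operation property (any combination of the PathPaintingRenderInfo constants STROKE and FILL), the rule for determining what "inside the path" means in its Rule property (NONZERO_WINDING_RULE or EVEN_ODD_RULE), and some further drawing details in the Ctm, LineWidth, LineCapStyle, LineJoinStyle, MiterLimit, and LineDashPattern properties.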
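The three kinds of callbacks described above can be sketched as a small listener that only logs what it receives. This is a minimal sketch, not the answerer's code: the class name PathLogger, its Log field, and the Collect helper are hypothetical, and it assumes an iTextSharp 5.5.x version that ships IExtRenderListener. Turning the logged paths into a background image would still require a separate drawing step that is not shown here.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper: records every path-related callback as one log line.
internal class PathLogger : IExtRenderListener
{
    public readonly List<string> Log = new List<string>();

    // Text and images are handled elsewhere, so these callbacks stay empty.
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    // Called once per path-construction instruction (move, line, curve, rectangle, close).
    public void ModifyPath(PathConstructionRenderInfo renderInfo)
    {
        int op = renderInfo.Operation;
        string name =
            op == PathConstructionRenderInfo.MOVETO ? "move" :
            op == PathConstructionRenderInfo.LINETO ? "line" :
            op == PathConstructionRenderInfo.RECT ? "rect" :
            op == PathConstructionRenderInfo.CLOSE ? "close" : "curve";

        // SegmentData holds the coordinates/dimensions; it can be null
        // for instructions without operands, so guard against that.
        IList<float> data = renderInfo.SegmentData;
        Log.Add(name + (data == null ? "" : " " + string.Join(" ", data)));
    }

    // Called when the clip path is intersected with the path just built.
    public void ClipPath(int rule)
    {
        Log.Add(rule == PathPaintingRenderInfo.EVEN_ODD_RULE
            ? "clip (even-odd)"
            : "clip (nonzero)");
    }

    // Called when the path is painted; Operation combines the STROKE and FILL flags.
    public Path RenderPath(PathPaintingRenderInfo renderInfo)
    {
        bool stroke = (renderInfo.Operation & PathPaintingRenderInfo.STROKE) != 0;
        bool fill = (renderInfo.Operation & PathPaintingRenderInfo.FILL) != 0;
        Log.Add("paint stroke=" + stroke + " fill=" + fill
            + " lineWidth=" + renderInfo.LineWidth);

        // Assumption: this sketch does not track clip paths, so it returns null.
        return null;
    }

    // Hypothetical entry point: runs the logger over every page of a PDF.
    public static List<string> Collect(string pdfPath)
    {
        var logger = new PathLogger();
        var reader = new PdfReader(pdfPath);
        try
        {
            var parser = new PdfReaderContentParser(reader);
            for (int page = 1; page <= reader.NumberOfPages; page++)
                parser.ProcessContent(page, logger);
        }
        finally
        {
            reader.Close();
        }
        return logger.Log;
    }
}
```

Calling PathLogger.Collect on a PDF that contains a table should show the "rect" and "line" entries that make up the table borders, which is the information that a plain IRenderListener never reports.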

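Yes, I have tried IRenderListener. That approach only extracts images and text. It does not return any information about tables. There are no table-related functions.

There is nothing like a table object in a PDF (unless the PDF is properly tagged, and even then it is only a logical table object, not a graphical one). There are only text chunks (or whatever content you see in the table) and possibly some graphical objects such as lines or colored rectangles. So it is unclear what you want.

mkl, thank you for your answer. I hope I can get your help on this question again. I agree that there should be no table object, but interestingly, when I get all the images, I do not see an image of the table. I used IRenderListener.

Expect to implement IExtRenderListener, which extends IRenderListener but has additional callbacks for vector g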