iText: extracting images within a paragraph


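I am building an application in which I need to parse a system-generated PDF and populate my application's database columns with the parsed information. Unfortunately, the PDF structure I am dealing with has a column named comments that contains both text and images. I have found ways to read images and text from the PDF separately, but my ultimate goal is to put a placeholder such as {2} into the parsed content at the position of each image. Whenever my parser (application code) comes across such a line, the system will render the appropriate image in that area; the image is also stored in my application, in a separate table. Please help me solve this problem.

Thanks in advance.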

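As already mentioned in the comments, the solution is essentially to use a custom text extraction strategy that inserts a "[2]" text chunk at the coordinates of the image.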
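To illustrate why a marker chunk placed at the image coordinates ends up at the right spot in the extracted text, here is a deliberately simplified model of location-based ordering. It is only a sketch: `Chunk` and `render` are hypothetical stand-ins, not iText types, and the real `LocationTextExtractionStrategy` uses a more elaborate comparison (text orientation, distance along the line, spacing heuristics). The assumption here is that chunks are sorted top to bottom and then left to right.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkOrderingSketch {

    /** Hypothetical stand-in for a positioned text chunk; not an iText class. */
    public static final class Chunk {
        final String text;
        final double x;
        final double y;

        public Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Sorts chunks top to bottom (larger y first), then left to right, and joins them. */
    public static String render(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort((a, b) -> a.y != b.y ? Double.compare(b.y, a.y) : Double.compare(a.x, b.x));

        StringBuilder result = new StringBuilder();
        Chunk previous = null;
        for (Chunk chunk : sorted) {
            if (previous != null) {
                // Same y means same line in this simplified model; otherwise start a new line
                result.append(previous.y == chunk.y ? ' ' : '\n');
            }
            result.append(chunk.text);
            previous = chunk;
        }
        return result.toString();
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("text below the image", 50, 100));
        chunks.add(new Chunk("[image-0.png]", 50, 200)); // marker at the image position
        chunks.add(new Chunk("text above the image", 50, 300));
        System.out.println(render(chunks));
    }
}
```

In this model the marker lands between the two text lines purely because of its y coordinate, which is the same effect the custom strategy relies on.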

Code

For example, you can extend LocationTextExtractionStrategy like this:

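import java.io.File;
import java.io.IOException;
import java.lang.reflect.Field;
import java.nio.file.Files;
import java.util.List;

import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.PdfImageObject;
import com.itextpdf.text.pdf.parser.Vector;

class SimpleMixedExtractionStrategy extends LocationTextExtractionStrategy
{
    // outputPath: directory the images are written to; name: prefix for the image file names
    SimpleMixedExtractionStrategy(File outputPath, String name)
    {
        this.outputPath = outputPath;
        this.name = name;
    }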

    @Override
    public void renderImage(final ImageRenderInfo renderInfo)
    {
        try
        {
            PdfImageObject image = renderInfo.getImage();
            if (image == null) return;
            int number = counter++;
            final String filename = String.format("%s-%s.%s", name, number, image.getFileType());
            Files.write(new File(outputPath, filename).toPath(), image.getImageAsBytes());

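            // Map the unit line through the image matrix: this is the bottom edge of the image on the page
            LineSegment segment = UNIT_LINE.transformBy(renderInfo.getImageCTM());
            // Marker chunk carrying the image file name, positioned along that edge
            TextChunk location = new TextChunk("[" + filename + "]", segment.getStartPoint(), segment.getEndPoint(), 0f);

            // The chunk list of the base class is private, so it is accessed via reflection
            Field field = LocationTextExtractionStrategy.class.getDeclaredField("locationalResult");
            field.setAccessible(true);
            @SuppressWarnings("unchecked")
            List<TextChunk> locationalResult = (List<TextChunk>) field.get(this);
            locationalResult.add(location);
        }
        catch (IOException | NoSuchFieldException | SecurityException | IllegalArgumentException | IllegalAccessException e)
        {
            e.printStackTrace();
        }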
    }

    final File outputPath;
    final String name; 
    int counter = 0;

    final static LineSegment UNIT_LINE = new LineSegment(new Vector(0, 0, 1) , new Vector(1, 0, 1));
}

@Test
public void testSimpleMixedExtraction() throws IOException
{
    // Load the test PDF from the classpath; try-with-resources closes the stream
    try (InputStream resourceStream = getClass().getResourceAsStream("book-of-vaadin-page14.pdf"))
    {
        PdfReader reader = new PdfReader(resourceStream);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        SimpleMixedExtractionStrategy listener = new SimpleMixedExtractionStrategy(OUTPUT_PATH, "book-of-vaadin-page14");
        // Process page 1: images are written to OUTPUT_PATH and markers are inserted into the text
        parser.processContent(1, listener);
        Files.write(new File(OUTPUT_PATH, "book-of-vaadin-page14.txt").toPath(), listener.getResultantText().getBytes());
    }
}

For example, for my test file (which contains page 14 of the Book of Vaadin), you get this text:

Getting Started with Vaadin
• A version of Book of Vaadin that you can browse in the Eclipse Help system.
You can install the plugin as follows:
1. Start Eclipse.
2. Select Help   Software Updates....
3. Select the Available Software tab.
4. Add the Vaadin plugin update site by clicking Add Site....
[book-of-vaadin-page14-0.png]
Enter the URL of the Vaadin Update Site: http://vaadin.com/eclipse and click OK. The
Vaadin site should now appear in the Software Updates window.
5. Select all the Vaadin plugins in the tree.
[book-of-vaadin-page14-1.png]
Finally, click Install.
Detailed and up-to-date installation instructions for the Eclipse plugin can be found at http://vaad-
in.com/eclipse.
Updating the Vaadin Plugin
If you have automatic updates enabled in Eclipse (see Window   Preferences   Install/Update
  Automatic Updates), the Vaadin plugin will be updated automatically along with other plugins.
Otherwise, you can update the Vaadin plugin (there are actually multiple plugins) manually as
follows:
1. Select Help   Software Updates..., the Software Updates and Add-ons window will
open.
2. Select the Installed Software tab.
14 Vaadin Plugin for Eclipse

and the two images book-of-vaadin-page14-0.png and book-of-vaadin-page14-1.png in OUTPUT_PATH.
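Since the question asks for placeholders like {2} that point to separately stored images, the file-name markers in the extracted text could be mapped to application-specific identifiers in a post-processing step. The following is a minimal sketch under stated assumptions: `MarkerResolverSketch`, the marker pattern, and the lookup map (file name to image id) are all hypothetical and not part of iText or of the answer's code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MarkerResolverSketch {

    // Assumed marker shape: a bracketed file name with an extension, e.g. [book-of-vaadin-page14-0.png]
    private static final Pattern MARKER = Pattern.compile("\\[([^\\[\\]]+\\.[A-Za-z0-9]+)\\]");

    /** Replaces each known [filename] marker with {id}; unknown markers are left unchanged. */
    public static String resolve(String text, Map<String, String> imageIdsByFilename) {
        Matcher matcher = MARKER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String id = imageIdsByFilename.get(matcher.group(1));
            String replacement = (id == null) ? matcher.group() : "{" + id + "}";
            // quoteReplacement keeps '$' and '\' in the replacement literal
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> ids = new HashMap<>();
        ids.put("book-of-vaadin-page14-0.png", "2"); // hypothetical database id
        String text = "clicking Add Site....\n[book-of-vaadin-page14-0.png]\nEnter the URL";
        System.out.println(resolve(text, ids));
    }
}
```

Ordinary bracketed text in the PDF could collide with this marker pattern, so a real application might prefer a marker syntax that cannot occur in its documents.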

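Improvements

As already mentioned in the comments, this solution works for simple situations in which the image has text above and below it but none to its left or right.

If there is also text to the left and/or right, the problem is that the code above computes the LineSegment as the bottom edge of the image, whereas text extraction strategies usually work with the baseline of the text, which lies above that bottom edge.

In that case, you first have to decide where in the text you want the marker to appear. Once that is decided, you can adapt the source above accordingly.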
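One way to think about that decision is as a choice of height within the image: 0 for the bottom edge, 1 for the top edge, anything in between for a line through the image. The sketch below shows the arithmetic with plain arrays rather than iText types. It assumes the image matrix is given as the six PDF matrix entries [a b c d e f], which map the unit square to the image area on the page; `MarkerLineSketch` and its methods are hypothetical names.

```java
public class MarkerLineSketch {

    /** Applies the PDF matrix [a b c d e f] to (x, y): x' = a*x + c*y + e, y' = b*x + d*y + f. */
    public static double[] transform(double[] ctm, double x, double y) {
        return new double[] {
            ctm[0] * x + ctm[2] * y + ctm[4],
            ctm[1] * x + ctm[3] * y + ctm[5]
        };
    }

    /**
     * Returns {startX, startY, endX, endY} of the line running across the image
     * at relative height t (0 = bottom edge, 0.5 = middle, 1 = top edge).
     */
    public static double[] markerLine(double[] ctm, double t) {
        double[] start = transform(ctm, 0, t);
        double[] end = transform(ctm, 1, t);
        return new double[] { start[0], start[1], end[0], end[1] };
    }

    public static void main(String[] args) {
        // Example matrix: an image 200 units wide and 100 units high, placed at (72, 500)
        double[] ctm = { 200, 0, 0, 100, 72, 500 };
        double[] bottom = markerLine(ctm, 0);
        double[] middle = markerLine(ctm, 0.5);
        System.out.println("bottom edge y = " + bottom[1] + ", middle line y = " + middle[1]);
    }
}
```

In terms of the strategy above, this presumably corresponds to transforming a line from (0, t) to (1, t) with the image matrix instead of the fixed UNIT_LINE; which t matches the surrounding text baseline depends on how the images are placed in the document.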

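As you have not shown any code, it is hard to say what you need to modify. Essentially, use a custom text extraction strategy to insert a "[2]" text chunk at the image coordinates.

@mkl Sorry, we have not started on the code yet; we are still analyzing whether this can be done with iText. As you suggested, I looked into text extraction strategies. My requirement is this: the comments section will read something like "The graph area covers 325 km…" and will contain an image in the PDF. With such a text extraction strategy, can I produce "The graph area covers 325 km {2}…", where 2 points to the unique location where the image is stored (simply a database or the file system)?

This sounds like something that can be achieved with some extra programming (writing a subclass of the render interface).

@BrunoLowagie I think so, too. One only has to take care to correctly associate the image with the baseline of the surrounding text (if the image is drawn inline). But if the layout is simply text above the image above more text, this should be easy.

Yes, deciding which coordinate to consider when inserting the (X) will be a design decision. One could use the bottom y coordinate, the top y coordinate, something in the middle… That is up to the person implementing the application, based on the nature of the images.

Thanks for your answer, this is what I wanted for my application. I tried this code, but can you tell me what UNIT_LINE means in the solution above? I thought it was a method, but I could not find it in the iText library.

What does UNIT_LINE mean: oh, sorry, I forgot to copy it. It is a constant line from (0,0) to (1,0), defined as a constant in SimpleMixedExtractionStrategy. I will edit my answer right away.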