Java: Comparing the text of two PDF files with PDFBox fails even though both files have the same text


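I use PDFBox in Selenium automation as a utility for testing exports. We use PDFBox to compare the actually exported PDF file with the expected PDF file and pass or fail the test accordingly. This has worked quite smoothly. Recently, however, I came across an actual exported file that looks the same as the expected file (as far as the data is concerned) but fails when compared with PDFBox.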

Below is the generic utility I use to compare PDF files:

private static void arePDFFilesEqual(File pdfFile1, File pdfFile2) throws IOException
{
    LOG.info("Comparing PDF files ("+pdfFile1+","+pdfFile2+")");
    PDDocument pdf1 = PDDocument.load(pdfFile1);
    PDDocument pdf2 = PDDocument.load(pdfFile2);
    PDPageTree pdf1pages = pdf1.getDocumentCatalog().getPages();
    PDPageTree pdf2pages = pdf2.getDocumentCatalog().getPages();
    try
    {
        if (pdf1pages.getCount() != pdf2pages.getCount())
        {
            String message = "Number of pages in the files ("+pdfFile1+","+pdfFile2+") do not match. pdfFile1 has "+pdf1pages.getCount()+" no pages, while pdf2pages has "+pdf2pages.getCount()+" no of pages";
            LOG.debug(message);
            throw new TestException(message);
        }
        PDFTextStripper pdfStripper = new PDFTextStripper();
        LOG.debug("pdfStripper is :- " + pdfStripper);
        LOG.debug("pdf1pages.size() is :- " + pdf1pages.getCount());
        for (int i = 0; i < pdf1pages.getCount(); i++)
        {
            pdfStripper.setStartPage(i + 1);
            pdfStripper.setEndPage(i + 1);
            String pdf1PageText = pdfStripper.getText(pdf1);
            String pdf2PageText = pdfStripper.getText(pdf2);
            if (!pdf1PageText.equals(pdf2PageText))
            {
                String message = "Contents of the files ("+pdfFile1+","+pdfFile2+") do not match on Page no: " + (i + 1)+" pdf1PageText is : "+pdf1PageText+" , while pdf2PageText is : "+pdf2PageText;
                LOG.debug(message);
                System.out.println("fff");
                LOG.debug("pdf1PageText is " + pdf1PageText);
                LOG.debug("pdf2PageText is " + pdf2PageText);
                String difference = StringUtils.difference(pdf1PageText, pdf2PageText);
                LOG.debug("difference is "+difference);
                throw new TestException(message+" [[ Difference is ]] "+difference);
            }
        }
        LOG.info("Returning True , as PDF Files ("+pdfFile1+","+pdfFile2+") get matched");
    } finally {
        pdf1.close();
        pdf2.close();
    }
}



Eclipse shows this difference in the console.

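I can see that it fails because of symbols such as curly braces {}, the hash #, and the exclamation mark !, but I don't know how to fix this.
Can anyone tell me how to fix it?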
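"Recently, however, I came across an actual exported file that looks the same as the expected file (as far as the data is concerned) but fails when compared with PDFBox."
This can happen, and it should not surprise you. After all, your test does not compare the appearance of the pages; it compares the results of text extraction.
The appearance of text on a page depends on the drawing instructions for the relevant glyphs in the respective font file (embedded, in the case of your files), whereas the text extraction result for the same text depends on the ToUnicode table or the Encoding value of the PDF font information structure for that font file.
In fact, while the text in the expected and the actual document uses the same glyphs of the respective fonts, the ToUnicode tables for one font in the expected and the actual document claim that some glyphs represent different Unicode code points.
The font in question contains three glyphs.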

The ToUnicode map of this font in the expected document contains the mappings

<0000> <0000> <0000>
<0001> <0002> [<F125> <F128> ]

which map the three glyphs to U+0000, U+F125, and U+F128, respectively.
The ToUnicode map of this font in the actual document contains the mappings

<0000> <0000> <0000>
<0001> <0002> [<F126> <F129> ]

which map the three glyphs to U+0000, U+F126, and U+F129, respectively.
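To make the effect concrete, here is a toy model of the situation. It is only a sketch: the class `ToUnicodeToy`, its two maps, and the `extract` method are made up for illustration and are not how PDFBox implements text extraction. The map contents mirror the two ToUnicode maps discussed in this answer.

```java
import java.util.Map;

// Toy model only: illustrates why identical glyph codes can yield different text.
final class ToUnicodeToy {
    // Glyph code -> Unicode text, as in the expected document's ToUnicode map.
    static final Map<Integer, String> EXPECTED =
            Map.of(0x0000, "\u0000", 0x0001, "\uF125", 0x0002, "\uF128");

    // Glyph code -> Unicode text, as in the actual document's ToUnicode map.
    static final Map<Integer, String> ACTUAL =
            Map.of(0x0000, "\u0000", 0x0001, "\uF126", 0x0002, "\uF129");

    private ToUnicodeToy() {}

    // Translates a sequence of glyph codes to text using the given map.
    // Unmapped codes become U+FFFD (an arbitrary choice for this sketch).
    static String extract(int[] glyphCodes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : glyphCodes) {
            sb.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }
}
```

Calling `extract(new int[] {1, 2}, ToUnicodeToy.EXPECTED)` and `extract(new int[] {1, 2}, ToUnicodeToy.ACTUAL)` on the same glyph codes gives two different strings, which is the kind of mismatch the test reports even though both pages draw the same glyphs.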
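Thus, your test correctly detected a difference between the expected and the actual document, so its failure is correct. You do not have to fix anything on your side; the software that produced the actual document is at fault.
(One might argue that these differences lie within the Unicode private use area and do not matter. In that case you would have to update your test to ignore differences between characters in the Unicode private use area. But you should have been told that before you started creating the test.)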
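If ignoring private use characters is acceptable for your test, one way to do so is to drop them from both extracted strings before comparing. The following is a minimal sketch under that assumption; `PrivateUseFilter` and `stripPrivateUse` are hypothetical names and not part of PDFBox. It relies on `Character.getType` reporting the `PRIVATE_USE` category for such code points.

```java
// Sketch: removes private use code points so that they no longer affect a comparison.
final class PrivateUseFilter {
    private PrivateUseFilter() {}

    // Returns a copy of s without code points of the Unicode category "private use".
    static String stripPrivateUse(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints()
         .filter(cp -> Character.getType(cp) != Character.PRIVATE_USE)
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }
}
```

In the comparison loop, the check would then become `stripPrivateUse(pdf1PageText).equals(stripPrivateUse(pdf2PageText))`. Note that this also hides real differences: a page where a private use symbol is missing entirely would compare as equal.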
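This is a difficult problem, because similar or even identical Unicode characters may have different byte representations depending on the font, the encoding, and other factors in the PDF generation process.
If you can safely assume that the relevant text segments are represented by 8-bit characters, I can think of one possible solution: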
String stripUnicode(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (char c : s.toCharArray()) {
        if (c <= 0xFF) {
            sb.append(c);
        }
    }
    return sb.toString();
}

...

String pdf1PageText = pdfStripper.getText(pdf1);
String pdf2PageText = pdfStripper.getText(pdf2);
if (!stripUnicode(pdf1PageText).equals(stripUnicode(pdf2PageText)))
...

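Thank you for such useful information. I can see that the Unicode values of the symbols used in my PDF are different. By the way, I did not know about the ToUnicode table/mapping contained in a PDF; thanks for letting me know. Moreover, can you tell me how you found the Unicode difference?

"Can you tell me how you found the Unicode difference?" - I did not merely log the differing text outputs; I also wrote them to files in a good enough encoding (UTF-8, but any full Unicode encoding would do) and compared them.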
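The inspection approach mentioned in the last comment could look roughly like the sketch below. It is an assumption about one way to do it, not the commenter's actual code; `CodePointDump`, `writeUtf8`, and `toCodePoints` are hypothetical names. The first helper writes extracted text to a UTF-8 file so that an external diff tool can compare it, and the second renders a string as a list of code points so that private use characters become visible in a log.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Collectors;

// Sketch: helpers for inspecting extracted text at the code point level.
final class CodePointDump {
    private CodePointDump() {}

    // Writes the text to the target file encoded as UTF-8.
    static void writeUtf8(Path target, String text) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    // Renders each code point as U+XXXX, separated by single spaces.
    static String toCodePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }
}
```

Logging `toCodePoints(pdf1PageText)` and `toCodePoints(pdf2PageText)` side by side shows differences such as U+F125 versus U+F126 that are easy to miss when a console renders both characters with the same placeholder.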