How to extract XML from an XFA PDF document in Java using iText 7 (or something else)?

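Using Java and iText 7, I am trying to extract the XML data from an XFA PDF form so that I can parse (and possibly modify) it, but all I manage to get is some basic generic data that is the same for whichever XFA file I use.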

I know this must be possible, because the iText RUPS tool does it, but I have been going round in circles for days.

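import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import javax.xml.transform.TransformerException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import com.itextpdf.forms.PdfAcroForm;
import com.itextpdf.forms.xfa.XfaForm;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfPage;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor;

public class Parse {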

    private PdfDocument pdf;
    private PdfAcroForm form;
    private XfaForm xfa;
    private Document domDocument;
    private Map<Integer, String> data;
    private int numberOfPages;
    private String pdfText = ""; // start empty so += does not prepend "null"

    public void openPdf(String src, String dest) throws IOException, TransformerException {

        PdfReader reader = new PdfReader(src);
        reader.setUnethicalReading(true);
        pdf = new PdfDocument(reader, new PdfWriter(dest));
        form = PdfAcroForm.getAcroForm(pdf, true);

        data = new HashMap<Integer, String>();
        numberOfPages = pdf.getNumberOfPages();
        PdfPage currentPage;
        String textFromPage;

        for (int page = 1; page <= numberOfPages; page++) {
            System.out.println("Reading page: " + page + " -----------------");
            currentPage = pdf.getPage(page);
            textFromPage = PdfTextExtractor.getTextFromPage(currentPage);
            data.put(page, textFromPage);
            pdfText += page + ":" + "\n" + textFromPage + "\n";
        }


        xfa = form.getXfaForm();
        domDocument = xfa.getDomDocument();
        Map<String, Node> map = xfa.extractXFANodes(domDocument);

        System.out.println("The template node = " + map.get("template").toString() + "\n");
        System.out.println("Dom document = " + domDocument.toString() + "\n");
        System.out.println("In map form = " + map.toString() + "\n");   
        System.out.println("pdfText = " + pdfText + "\n");

        Node node = xfa.getDatasetsNode();
        NodeList list = node.getChildNodes();

        for (int i = 0; i < list.getLength(); i++) {
            System.out.println("Get Child Nodes Output = " + list.item(i) + "\n");
        }

    }
}

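You have a pure XFA file. This means that the only PDF content stored in the file is a page with a "Please wait..." message. That page is what gets shown in PDF viewers that don't know how to render XFA.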

It is also the content you get when you extract content from the page using:

currentPage = pdf.getPage(page);
textFromPage = PdfTextExtractor.getTextFromPage(currentPage);
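That is not what you should do when faced with a pure XFA file, because all the relevant content lives in the XML stream stored inside the PDF file.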

You already have the first part right:

xfa = form.getXfaForm();
domDocument = xfa.getDomDocument();
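The XFA stream can be found in the /AcroForm entry. I know that is awkward, but that is how PDF was designed. It wasn't our choice, and XFA is deprecated in PDF 2.0, so XFA is dying out. The problem will go away once XFA is finally dead and buried.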
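To make the structure of that XML stream concrete, here is a minimal sketch using only the JDK. It parses a hypothetical, heavily shortened XDP document (real ones are far larger, and the namespace versions shown are only illustrative) and lists its top-level packets by local name, which is roughly the kind of name-to-node map the question prints. The class name `XdpPackets` and the sample content are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XdpPackets {

    // Hypothetical sample: three packets wrapped in an xdp:xdp root element.
    static final String SAMPLE_XDP =
        "<xdp:xdp xmlns:xdp=\"http://ns.adobe.com/xdp/\">"
      + "<config xmlns=\"http://www.xfa.org/schema/xci/3.0/\"/>"
      + "<template xmlns=\"http://www.xfa.org/schema/xfa-template/3.3/\"/>"
      + "<xfa:datasets xmlns:xfa=\"http://www.xfa.org/schema/xfa-data/1.0/\">"
      + "<xfa:data><form1><name>Jane</name></form1></xfa:data>"
      + "</xfa:datasets>"
      + "</xdp:xdp>";

    // Parses an XML string into a namespace-aware DOM document.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Maps each element child of the root to its local name, in document order.
    static Map<String, Node> packets(Document doc) {
        Map<String, Node> result = new LinkedHashMap<>();
        Node child = doc.getDocumentElement().getFirstChild();
        while (child != null) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                result.put(child.getLocalName(), child);
            }
            child = child.getNextSibling();
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(packets(parse(SAMPLE_XDP)).keySet());
    }
}
```

The form data a user typed in sits under the datasets packet; the template packet describes the form layout, which is why the template looks much the same before and after filling.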

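That being said, you have an instance of org.w3c.dom.Document, and you want to get the XML file stored in that object. You don't need iText for that; serializing a DOM document with a Transformer is standard Java.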

I tested that approach on an XFA file with the following snippet:

public static void main(String[] args) throws IOException, TransformerException {
    PdfDocument pdf = new PdfDocument(new PdfReader(SRC));
    PdfAcroForm form = PdfAcroForm.getAcroForm(pdf, true);
    XfaForm xfa = form.getXfaForm();
    Document doc = xfa.getDomDocument();
    DOMSource domSource = new DOMSource(doc);
    StringWriter writer = new StringWriter();
    StreamResult result = new StreamResult(writer);
    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer transformer = tf.newTransformer();
    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    transformer.setOutputProperty(OutputKeys.INDENT, "yes");
    transformer.transform(domSource, result);
    writer.flush();
    System.out.println(writer.toString());
}
The output on screen was the XDP XML file containing all the XFA information I expected.
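The same Transformer technique works on any org.w3c.dom.Node, not just the whole document. That matters here because calling toString() on a DOM node, as the question does, typically prints only a short summary of the node rather than its XML. Below is a minimal JDK-only sketch; the class name `NodeSerializer` and the sample XML are assumptions made for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeSerializer {

    // Serializes a single DOM node (and its subtree) to an XML string.
    static String nodeToString(Node node) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // A fragment is not a full document, so leave out the XML declaration.
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(node), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize node", e);
        }
    }

    // Small helper so the example runs without a PDF: parses XML from a string.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<datasets><data><form1><name>Jane</name></form1></data></datasets>");
        Node data = doc.getDocumentElement().getFirstChild();
        System.out.println(nodeToString(data));
    }
}
```

With iText, the node passed to `nodeToString` could be the one returned by xfa.getDatasetsNode(), which the question already obtains.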


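Note that I would be very careful about replacing the XFA XML file. It is best not to tamper with the XFA structure. Instead, create an XML file that contains only the data, built according to the appropriate schema, and fill out the form as described in the FAQ.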
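As a rough sketch of what "an XML file that contains only the data" can look like, the JDK-only example below builds a tiny data document and exposes it as a stream. The element names (`form1`, `name`, `city`) and the class name `DataXmlBuilder` are hypothetical; the real names must match the schema of your own form. To my knowledge, iText 7's XfaForm has fillXfaForm overloads that accept such a stream, followed by a call to write the form back to the document, but check this against the iText version you use.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DataXmlBuilder {

    // Builds a data-only XML document; element names are placeholders for a real schema.
    static String buildDataXml(String name, String city) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("form1");
            doc.appendChild(root);
            appendField(doc, root, "name", name);
            appendField(doc, root, "city", city);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not build data XML", e);
        }
    }

    // Adds <tag>value</tag> under parent; the serializer escapes special characters.
    private static void appendField(Document doc, Element parent, String tag, String value) {
        Element field = doc.createElement(tag);
        field.setTextContent(value);
        parent.appendChild(field);
    }

    // Wraps the XML as a UTF-8 stream, the form in which a filling API would read it.
    static InputStream asStream(String xml) {
        return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(buildDataXml("Jane", "Ghent"));
    }
}
```

Building the data through the DOM rather than by string concatenation means characters such as & and < in field values are escaped for you.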

This is exactly what I was looking for! It works great! Thank you very much.