Converting HTML that contains tags inside <pre> to PDF with Flying Saucer and iText


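I am using the Flying Saucer library to convert HTML to PDF. It works fine for most HTML files.

For some HTML files, however, which contain other tags inside a `<pre>` tag, the generated PDF shows those tags as literal text.

If I remove the `<pre>` tag, the formatting of the data is lost.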
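
To narrow the symptom down, it can help to check whether the markup inside `<pre>` reaches the renderer as real elements or as escaped text such as `&lt;b&gt;`, which is printed literally. The following is only a minimal diagnostic sketch: the class name `PreTagCheck`, the method name, and the sample strings are made up for illustration, and it uses just the JDK's DOM parser rather than Flying Saucer or iText. It does not establish that escaping is the cause here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical diagnostic helper; not part of Flying Saucer, iText, jsoup, or JTidy. */
final class PreTagCheck {

    private PreTagCheck() {}

    /** Counts the element nodes nested anywhere inside the <pre> elements of an XHTML string. */
    static int countElementsInsidePre(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document dom = builder.parse(new InputSource(new StringReader(xhtml)));
            NodeList pres = dom.getElementsByTagName("pre");
            int count = 0;
            for (int i = 0; i < pres.getLength(); i++) {
                // "*" matches every descendant element of this <pre>.
                count += ((Element) pres.item(i)).getElementsByTagName("*").getLength();
            }
            return count;
        } catch (Exception e) {
            throw new IllegalStateException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String realMarkup = "<html><body><pre>line 1\n  <b>bold</b> line 2</pre></body></html>";
        String escapedMarkup = "<html><body><pre>line 1\n  &lt;b&gt;bold&lt;/b&gt; line 2</pre></body></html>";
        System.out.println(countElementsInsidePre(realMarkup));    // prints 1
        System.out.println(countElementsInsidePre(escapedMarkup)); // prints 0
    }
}
```

A count of 0 for a `<pre>` that visibly contains tags would suggest that the tags were escaped somewhere in the cleaning pipeline before rendering; a non-zero count would point at how the renderer handles elements inside `<pre>` instead.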

My code is:

    // Note: Document below is org.jsoup.nodes.Document; Element, Node and NodeList are org.w3c.dom types.
    org.w3c.dom.Document document = null;
    try {
        // Parse and sanitize the HTML with jsoup.
        Document doc = Jsoup.parse(new File(htmlFile), "UTF-8", "");
        Whitelist wl = new RelaxedPlusDataBase64Images();
        Cleaner cleaner = new Cleaner(wl);
        doc = cleaner.clean(doc);

        // Configure JTidy to produce XHTML.
        Tidy tidy = new Tidy();
        tidy.setShowWarnings(false);
        tidy.setXmlTags(false);
        tidy.setInputEncoding("UTF-8");
        tidy.setOutputEncoding("UTF-8");
        tidy.setPrintBodyOnly(true);
        tidy.setXHTML(true);
        tidy.setMakeClean(true);
        tidy.setAsciiChars(true);

        // If any <pre> contains markup, unwrap all <pre> elements (this is the step that loses the formatting).
        if (doc.select("pre").html().contains("</")) {
            doc.select("pre").unwrap();
        }

        // Convert to a W3C DOM and remove <head>, if there is one.
        Reader reader = new StringReader(doc.html());
        document = tidy.parseDOM(reader, null);
        Element element = (Element) document.getElementsByTagName("head").item(0);
        if (element != null) {
            element.getParentNode().removeChild(element);
        }

        // Rewrite src="cid:name@domain" to src="name" on every <img>; skip images without a src.
        NodeList elements = document.getElementsByTagName("img");
        for (int i = 0; i < elements.getLength(); i++) {
            Node src = elements.item(i).getAttributes().getNamedItem("src");
            String value = (src == null) ? null : src.getNodeValue();
            if (value != null && value.startsWith("cid:") && value.contains("@")) {
                value = value.substring(value.indexOf("cid:") + 4, value.indexOf("@"));
                src.setNodeValue(value);
                System.out.println(value);
            }
        }

        document.normalize();

        System.out.println(getNiceLyFormattedXMLDocument(document));
    } catch (Exception e) {
        System.out.println(e);
    }
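
The `cid:` handling in the loop above can be looked at in isolation. This is a small sketch of that string logic only; the class name `CidSrc`, the method name `extractContentId`, and the sample values are hypothetical and chosen for illustration.

```java
/** Hypothetical helper that mirrors the cid: rewriting done in the <img> loop. */
final class CidSrc {

    private CidSrc() {}

    /**
     * Returns the part between "cid:" and the first "@" when src looks like
     * "cid:name@domain"; otherwise returns src unchanged (including null).
     */
    static String extractContentId(String src) {
        if (src == null || !src.startsWith("cid:")) {
            return src;
        }
        int at = src.indexOf('@');
        if (at < 0) {
            return src; // no "@", so leave the value untouched
        }
        return src.substring("cid:".length(), at);
    }

    public static void main(String[] args) {
        System.out.println(extractContentId("cid:image001.png@01D0ABCD.12345678")); // prints image001.png
        System.out.println(extractContentId("images/logo.png"));                    // prints images/logo.png
    }
}
```

Values that do not match the `cid:name@domain` shape pass through unchanged, which is the same behavior as the condition in the loop.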

Using iText XMLWorker:

    try {
        org.w3c.dom.Document doc = CleanHtml.cleanNTidyHTML("a.html");
        String k = CleanHtml.getNiceLyFormattedXMLDocument(doc);
        OutputStream file = new FileOutputStream(new File("test.pdf"));
        Document document = new Document();
        PdfWriter writer = PdfWriter.getInstance(document, file);
        document.open();
        ByteArrayInputStream is = new ByteArrayInputStream(k.getBytes());
        XMLWorkerHelper.getInstance().parseXHtml(writer, document, is);
        document.close();
        file.close();
    } catch (Exception e) {
        e.printStackTrace();
    }

    public static String getNiceLyFormattedXMLDocument(org.w3c.dom.Document doc) throws IOException, TransformerException {
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer transformer = tf.newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        // transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");

        Writer stringWriter = new StringWriter();
        StreamResult streamResult = new StreamResult(stringWriter);
        transformer.transform(new DOMSource(doc), streamResult);
        String result = stringWriter.toString();

        return result;
    }

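Your question is wrong. In the title, you claim there is a problem with iText. However, you are using Flying Saucer. Flying Saucer is not affiliated with iText in any way!

Thanks for the edit. I have also used the XML Worker provided by iText to convert the HTML to a PDF file, but the same problem occurs there as well. The code snippet above has been updated.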