Java: Creating new custom COSBase objects with PDFBox?


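Can we create a new custom PDF operator (such as Operator{BDC}) and COSBase objects (such as COSName{P} COSName{Prop1}, where Prop1 in turn references another object), and add them to the root structure of the PDF?

I have read the list of parser tokens from an existing PDF document, and I want to tag the PDF. In this process I will first manipulate the token list using newly created COSBase objects, and finally add them to the root tree structure. So how can I create a COSBase object here? I am using the following code to extract the tokens from the PDF: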

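import java.io.File;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;

// PDFBox 2.x API; inputPdfFile and outputPdfFile are String paths defined elsewhere.
PDDocument document = PDDocument.load(new File(inputPdfFile));
for (PDPage page : document.getPages()) {
    // Parse the page's content stream into a flat list of tokens.
    PDFStreamParser parser = new PDFStreamParser(page);
    parser.parse();
    List<Object> tokens = parser.getTokens();

    // Collect the tokens of this page only, so pages do not get mixed up.
    List<Object> newTokens = new ArrayList<>();
    for (Object token : tokens) {
        System.out.println(token);
        if (token instanceof Operator) {
            Operator op = (Operator) token;
            // Tagging tokens would be inserted here, depending on op.getName().
        }
        newTokens.add(token);
    }

    // Write the tokens back as the page's new content stream.
    PDStream newContents = new PDStream(document);
    OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE);
    ContentStreamWriter writer = new ContentStreamWriter(out);
    writer.writeTokens(newTokens);
    out.close();
    page.setContents(newContents);
}
document.save(outputPdfFile);
document.close();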


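The code above creates a new PDF with all formatting and images intact. The newTokens list contains all the existing COSBase objects, so I want to manipulate it by adding some tagging COSBase objects. When I then save the new document, it should be tagged without my having to deal with any decoding, encoding, font, or image handling.

First of all, will this idea work? If so, please help me with some code to create custom COSBase objects. I am very new to Java.

Depending on your document's format, you can insert marked content: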

// The code below adds "/P <</MCID 0>> BDC" to the token list.
// currentMarkedContentDictionary, structureElement, mcid, page, currentSection
// and existing_token are declared elsewhere in the surrounding code.

newTokens.add(COSName.getPDFName("P"));
currentMarkedContentDictionary = new COSDictionary();
currentMarkedContentDictionary.setInt(COSName.MCID, mcid);
mcid++;
newTokens.add(currentMarkedContentDictionary);
newTokens.add(Operator.getOperator("BDC"));

// After adding the MCID, append your existing tokens (TJ, TD, Td, T*, ...).
newTokens.add(existing_token);
// Close the marked-content sequence with EMC.
newTokens.add(Operator.getOperator("EMC"));
// Add the marked content to the root tree structure.
structureElement = new PDStructureElement(StandardStructureTypes.P, currentSection);
structureElement.setPage(page);
PDMarkedContent markedContent = new PDMarkedContent(COSName.P, currentMarkedContentDictionary);
structureElement.appendKid(markedContent);
currentSection.appendKid(structureElement);
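The token manipulation itself can be tried out without PDFBox. The following is only a minimal sketch of the idea, under stated assumptions: plain strings stand in for PDFBox's operand and operator tokens, the class and method names (MarkedContentSketch, wrapTextObjects) are hypothetical, and every BT ... ET text object is treated as one paragraph, which a real document may not satisfy. With PDFBox, the three inserted strings would be the COSName, COSDictionary and Operator objects shown in the answer's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MarkedContentSketch {

    /**
     * Wraps every BT ... ET text object in "/P <</MCID n>> BDC ... EMC".
     * Strings are stand-ins for PDFBox tokens (an assumption of this sketch).
     */
    public static List<String> wrapTextObjects(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int mcid = 0;
        for (String token : tokens) {
            if (token.equals("BT")) {
                // Open a marked-content sequence before the text object starts.
                out.add("/P");
                out.add("<</MCID " + mcid + ">>");
                out.add("BDC");
                mcid++;
            }
            // Every existing token is copied through unchanged.
            out.add(token);
            if (token.equals("ET")) {
                // Close the sequence after the text object ends.
                out.add("EMC");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("BT", "/F1", "12", "Tf", "(Hello)", "Tj", "ET");
        System.out.println(String.join(" ", wrapTextObjects(tokens)));
    }
}
```

Because BDC is placed before BT and EMC after ET, the marked-content sequence encloses the whole text object rather than starting inside it and ending outside. The existing operands, including encoded string operands, are passed through untouched, so this step does not need to decode any text.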


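Thanks to @Tilman Hausherr.

I think the best strategy is to take an existing minimal tagged PDF, analyze it with PDFDebugger, and try to reproduce it with PDFBox. I don't think we need a new class. Maybe this helps? Or this:

Hi @TilmanHausherr, I only want to tag untagged PDF documents that already exist. If I use PDFDebugger and recreate a new PDF with "currentContentStream", we have to handle all the operators (Tj, TJ, Td, TD, Tm, Tw, Tc, cm, Do, p, r, w, ...; their structure changes from one PDF to another) to get the same PDF styling. How can I tag an existing PDF efficiently? How do I handle the different font types (Type0, Type1, ...)? Some of the text is completely encoded. In the PDF below they used a Type0 font, and each Tj I read is encoded, so ContentStream.showText("XMFJAJKFIANRKGIADFKEWIAFJA") does not work here. How can I tag this document?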