C# iTextSharp: Extract PDF Portfolio to Text


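I have run into some PDF portfolios that do not seem to be in the common format I have dealt with in the past: they only have "Kids", they do not have an "EF", only "Names" and "Limits". PdfName.EF always returns null. How can I extract the PDFs when the format looks like this? The full code is below. The PDF is very large, 1+ GB, and contains 2400+ PDFs.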

static void ExtractAttachments(string file_name, string folderName)
    {
        PdfDictionary documentNames = null;
        PdfDictionary embeddedFiles = null;
        PdfDictionary fileArray = null;
        PdfDictionary file = null;
        PRStream stream = null;
        PdfArray filespecs = null;
        using (PdfReader reader = new PdfReader(folderName + "//" + file_name))
        {
            PdfDictionary catalog = reader.Catalog;

            documentNames = (PdfDictionary)PdfReader.GetPdfObject(catalog.Get(PdfName.NAMES));

            if (documentNames != null)
            {
                embeddedFiles = (PdfDictionary)PdfReader.GetPdfObject(documentNames.Get(PdfName.EMBEDDEDFILES));
                if (embeddedFiles != null)
                {
                    filespecs = embeddedFiles.GetAsArray(PdfName.NAMES);

                    if (filespecs != null)
                    {
                        for (int i = 0; i < filespecs.Size; i++)
                        {
                            i++;
                            fileArray = filespecs.GetAsDict(i);
                            file = fileArray.GetAsDict(PdfName.EF);

                            foreach (PdfName key in file.Keys)
                            {
                                stream = (PRStream)PdfReader.GetPdfObject(file.GetAsIndirectObject(key));
                                string attachedFileName = fileArray.GetAsString(key).ToString();
                                byte[] attachedFileBytes = PdfReader.GetStreamBytes(stream);

                                System.IO.File.WriteAllBytes(System.IO.Path.Combine(folderName, attachedFileName), attachedFileBytes);
                            }
                        }
                    }
                    else
                    {
                        filespecs = embeddedFiles.GetAsArray(PdfName.KIDS);
                        if (filespecs != null)
                        {
                            for (int i = 0; i < filespecs.Size; i++)
                            {
                                filespecs.GetAsString(i);
                                PdfDictionary filespec = filespecs.GetAsDict(i);
                                //NO EF only NAMES AND LIMITS PROBLEM HERE.
                                PdfDictionary refs = filespec.GetAsDict(PdfName.EF);
                                if (refs != null)
                                {
                                    foreach (PdfName key in refs.Keys)
                                    {
                                        stream = (PRStream)PdfReader.GetPdfObject(refs.GetAsIndirectObject(key));

                                        string FileName = filespec.GetAsString(key).ToString().ToUpper();
                                        using (FileStream fs = new FileStream(
                                          "D:\\WillsPDFPortfolio\\" + filespec.GetAsString(key).ToString(), FileMode.OpenOrCreate
                                        ))
                                        {
                                            byte[] attachment = PdfReader.GetStreamBytes(stream);
                                            fs.Write(attachment, 0, attachment.Length);
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }


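Having looked at your sample PDF, it turns out that it is completely according to specification. Actually, your initial description, 'they only have "Kids", they do not have an "EF", only "Names" and "Limits"', already pointed in that direction, but as you also described it as "not a common format", I asked for a sample first.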

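A look into the specification

Each entry in the document's name dictionary (catalog.Get(PdfName.NAMES)) designates the root of a name tree. (ISO 32000-1, section 7.7.4)

For example, the EmbeddedFiles entry (documentNames.Get(PdfName.EMBEDDEDFILES)) maps to a name tree mapping name strings to file specifications for embedded file streams. (ibidem)

A name tree shall always have exactly one root node, which shall contain a single entry: either Kids or Names but not both. If the root node has a Names entry, it shall be the only node in the tree. If it has a Kids entry, each of the remaining nodes shall be either an intermediate node, which shall contain a Limits entry and a Kids entry, or a leaf node, which shall contain a Limits entry and a Names entry. (ISO 32000-1, section 7.9.6)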
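
Thus, the structure you describe, 'they only have "Kids", they do not have an "EF", only "Names" and "Limits"', is according to specification: the root node and any intermediate nodes hold only Kids (plus Limits below the root), and the file specifications with their EF entries sit inside the Names arrays of the leaf nodes.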
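
Let's continue with the specification:

The Names array shall be an array of the form [key1 value1 key2 value2 … keyn valuen], where each keyi shall be a string and the corresponding valuei shall be the object associated with that key. The keys shall be sorted in lexical order, as described below. (ibidem)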
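
The key/value layout quoted above can be read pairwise. The following is only a minimal sketch assuming iTextSharp 5.x; the class name NameArrayHelper and the method ReadPairs are hypothetical names introduced for illustration and are not part of iTextSharp.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;

static class NameArrayHelper // hypothetical helper, not an iTextSharp type
{
    // Yields (name, file specification) pairs from a Names array
    // laid out as [key1 value1 key2 value2 ...].
    public static IEnumerable<KeyValuePair<string, PdfDictionary>> ReadPairs(PdfArray names)
    {
        // Step by two: even indices hold the keys, odd indices the values.
        for (int i = 0; i + 1 < names.Size; i += 2)
        {
            PdfString key = names.GetAsString(i);
            // GetAsDict resolves an indirect reference to the dictionary.
            PdfDictionary value = names.GetAsDict(i + 1);
            if (key != null && value != null)
            {
                yield return new KeyValuePair<string, PdfDictionary>(
                    key.ToUnicodeString(), value);
            }
        }
    }
}
```

Stepping the index by two has the same effect as the extra i++ at the top of the loop body in the question's first branch, but it makes the pairing explicit and also exposes the key string.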
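
What to do

Thus, all you have to do is process the Names entries found in the kid nodes just like you process a Names entry found in the top-level node.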
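
One way to do this is a recursive walk over the name tree. The following is a sketch under stated assumptions rather than a definitive implementation: it assumes iTextSharp 5.x, the names PortfolioExtractor, ProcessNode and SaveEmbeddedFile are hypothetical, it only reads the F stream of each EF dictionary, and it has no guard against a malformed tree whose Kids reference each other in a cycle.

```csharp
using System.IO;
using iTextSharp.text.pdf;

static class PortfolioExtractor // hypothetical name, for illustration only
{
    public static void ExtractAttachments(string pdfPath, string outputFolder)
    {
        using (PdfReader reader = new PdfReader(pdfPath))
        {
            PdfDictionary names = reader.Catalog.GetAsDict(PdfName.NAMES);
            if (names == null) return;
            PdfDictionary root = names.GetAsDict(PdfName.EMBEDDEDFILES);
            if (root == null) return;

            Directory.CreateDirectory(outputFolder);
            ProcessNode(root, outputFolder);
        }
    }

    // Handles the root, intermediate nodes and leaf nodes alike:
    // a Names entry is read pairwise, a Kids entry is descended into.
    static void ProcessNode(PdfDictionary node, string outputFolder)
    {
        PdfArray names = node.GetAsArray(PdfName.NAMES);
        if (names != null)
        {
            // [key1 value1 key2 value2 ...]: the values sit at odd indices.
            for (int i = 0; i + 1 < names.Size; i += 2)
            {
                PdfDictionary filespec = names.GetAsDict(i + 1);
                if (filespec != null) SaveEmbeddedFile(filespec, outputFolder);
            }
        }

        PdfArray kids = node.GetAsArray(PdfName.KIDS);
        if (kids != null)
        {
            for (int i = 0; i < kids.Size; i++)
            {
                PdfDictionary kid = kids.GetAsDict(i);
                if (kid != null) ProcessNode(kid, outputFolder);
            }
        }
    }

    static void SaveEmbeddedFile(PdfDictionary filespec, string outputFolder)
    {
        PdfDictionary ef = filespec.GetAsDict(PdfName.EF);
        if (ef == null) return;

        // Prefer the Unicode file name (UF) and fall back to F.
        PdfString name = filespec.GetAsString(PdfName.UF)
                         ?? filespec.GetAsString(PdfName.F);
        PRStream stream = PdfReader.GetPdfObject(ef.Get(PdfName.F)) as PRStream;
        if (name == null || stream == null) return;

        // Strip any directory part so the file lands inside outputFolder.
        string safeName = Path.GetFileName(name.ToUnicodeString());
        File.WriteAllBytes(Path.Combine(outputFolder, safeName),
                           PdfReader.GetStreamBytes(stream));
    }
}
```

A call such as PortfolioExtractor.ExtractAttachments(@"D:\in\portfolio.pdf", @"D:\out") (both paths are placeholders) would then cover both layouts from the question, because a root node with a Names entry and a root node with a Kids entry go through the same method. Attachments that share a file name overwrite each other in this sketch.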