How to read the sections of a PDF (title, abstract, references) with PDFBox in C#?


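I am trying to read a PDF file and its sections, but I cannot find a suitable algorithm or library.

I want to separate the parts of the file (title, abstract, references) and return their contents.

Is there any reference for solving this problem?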

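Unfortunately the file the OP provided as a representative example is not tagged. Thus, there is no direct information indicating whether a given piece of text belongs to the title, the abstract, the references, or any other section. Consequently, there is no sure way to recognize these sections, only heuristics, i.e. educated guesses with a more or less large error rate.

For the sample document provided by the OP, though, the sections can actually be recognized by simply inspecting the font of the first letter of each line.

The classes below constitute a simple framework for extracting semantic text sections that can be recognized from the characteristics of each line alone, together with an example usage that recognizes the sections in the OP's sample file by inspecting only the font of the first character of each line.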
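As a rough illustration of this heuristic, the sketch below classifies a line by the font name of its first glyph. The Glyph class is a hypothetical stand-in for PDFBox's TextPosition, and the font names used in main are made-up examples. Substring matching is used on the assumption that embedded subset fonts carry a prefix such as ABCDEF+ in front of the base font name.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

final class FirstGlyphFontSketch
{
    /** Hypothetical stand-in for a positioned glyph: its text and the name of its font. */
    static final class Glyph
    {
        final String unicode;
        final String fontName;

        Glyph(String unicode, String fontName)
        {
            this.unicode = unicode;
            this.fontName = fontName;
        }
    }

    /** A line is a list of words, a word a list of glyphs; only the very first glyph is inspected. */
    static Predicate<List<List<Glyph>>> firstGlyphFontContains(String fontNamePart)
    {
        return line -> !line.isEmpty()
                && !line.get(0).isEmpty()
                && line.get(0).get(0).fontName.contains(fontNamePart);
    }

    /** Test helper: builds a one-word line in which every glyph uses the same font. */
    static List<List<Glyph>> lineOf(String text, String fontName)
    {
        Glyph[] glyphs = new Glyph[text.length()];
        for (int i = 0; i < text.length(); i++)
            glyphs[i] = new Glyph(String.valueOf(text.charAt(i)), fontName);
        return Collections.singletonList(Arrays.asList(glyphs));
    }

    public static void main(String[] args)
    {
        Predicate<List<List<Glyph>>> isTitle = firstGlyphFontContains("CMBX12");
        System.out.println(isTitle.test(lineOf("Title", "ABCDEF+CMBX12"))); // prints true
        System.out.println(isTitle.test(lineOf("Body", "ABCDEF+CMR10")));   // prints false
    }
}
```

The emptiness checks are a defensive addition; a predicate that indexes the first word and glyph directly would fail on a line without any glyphs.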

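A simple text section extraction framework

As I have only used the Java version of PDFBox, and the OP stated that a Java solution would also be acceptable, the framework is implemented in Java. It is based on the current development version 2.1.0-SNAPSHOT of PDFBox.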

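PDFTextSectionStripper

This class forms the center of the framework. It is derived from the PDFBox class PDFTextStripper and extends it by recognizing text sections configured by a list of TextSectionDefinition instances. After the PDFTextStripper method getText has been called, the recognized sections are available as a list of TextSection instances, see below.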

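import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class PDFTextSectionStripper extends PDFTextStripper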
{
    //
    // constructor
    //
    public PDFTextSectionStripper(List<TextSectionDefinition> sectionDefinitions) throws IOException
    {
        super();
        
        this.sectionDefinitions = sectionDefinitions;
    }

    //
    // Section retrieval
    //
    /**
     * @return an unmodifiable list of text sections recognized during {@link #getText(PDDocument)}.
     */
    public List<TextSection> getSections()
    {
        return Collections.unmodifiableList(sections);
    }

    //
    // PDFTextStripper overrides
    //
    @Override
    protected void writeLineSeparator() throws IOException
    {
        super.writeLineSeparator();

        if (!currentLine.isEmpty())
        {
            boolean matched = false;
            if (!(currentHeader.isEmpty() && currentBody.isEmpty()))
            {
                TextSectionDefinition definition = sectionDefinitions.get(currentSectionDefinition);
                switch (definition.multiLine)
                {
                case multiLine:
                    if (definition.matchPredicate.test(currentLine))
                    {
                        currentBody.add(new ArrayList<>(currentLine));
                        matched = true;
                    }
                    break;
                case multiLineHeader:
                case multiLineIntro:
                    boolean followUpMatch = false;
                    for (int i = definition.multiple ? currentSectionDefinition : currentSectionDefinition + 1;
                            i < sectionDefinitions.size(); i++)
                    {
                        TextSectionDefinition followUpDefinition = sectionDefinitions.get(i);
                        if (followUpDefinition.matchPredicate.test(currentLine))
                        {
                            followUpMatch = true;
                            break;
                        }
                    }
                    if (!followUpMatch)
                    {
                        currentBody.add(new ArrayList<>(currentLine));
                        matched = true;
                    }
                    break;
                case singleLine:
                    System.out.println("Internal error: There can be no current header or body as long as the current definition is single line only");
                }

                if (!matched)
                {
                    sections.add(new TextSection(definition, currentHeader, currentBody));
                    currentHeader.clear();
                    currentBody.clear();
                    if (!definition.multiple)
                        currentSectionDefinition++;
                }
            }

            if (!matched)
            {
                while (currentSectionDefinition < sectionDefinitions.size())
                {
                    TextSectionDefinition definition = sectionDefinitions.get(currentSectionDefinition);
                    if (definition.matchPredicate.test(currentLine))
                    {
                        matched = true;
                        switch (definition.multiLine)
                        {
                        case singleLine:
                            sections.add(new TextSection(definition, currentLine, Collections.emptyList()));
                            if (!definition.multiple)
                                currentSectionDefinition++;
                            break;
                        case multiLineHeader:
                            currentHeader.addAll(new ArrayList<>(currentLine));
                            break;
                        case multiLine:
                        case multiLineIntro:
                            currentBody.add(new ArrayList<>(currentLine));
                            break;
                        }
                        break;
                    }

                    currentSectionDefinition++;
                }
            }

            if (!matched)
            {
                System.out.println("Could not match line.");
            }
        }
        currentLine.clear();
    }

    @Override
    protected void endDocument(PDDocument document) throws IOException
    {
        super.endDocument(document);

        if (!(currentHeader.isEmpty() && currentBody.isEmpty()))
        {
            TextSectionDefinition definition = sectionDefinitions.get(currentSectionDefinition);
            sections.add(new TextSection(definition, currentHeader, currentBody));
            currentHeader.clear();
            currentBody.clear();
        }
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        super.writeString(text, textPositions);

        currentLine.add(textPositions);
    }
    
    //
    // member variables
    //
    final List<TextSectionDefinition> sectionDefinitions;

    int currentSectionDefinition = 0;
    final List<TextSection> sections = new ArrayList<>();
    final List<List<TextPosition>> currentLine = new ArrayList<>();

    final List<List<TextPosition>> currentHeader = new ArrayList<>();
    final List<List<List<TextPosition>>> currentBody = new ArrayList<>();
}
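The TextSectionDefinition class that configures the stripper is not reproduced in this text. Judging only from how the stripper and the example usage access it (the fields name, matchPredicate, multiLine and multiple, and the four MultiLine constants), it could look roughly like the following sketch. The type parameter T is an assumption made here to keep the sketch free of PDFBox dependencies; with the framework above it would be TextPosition. The comments on the enum constants are inferred from the stripper logic, not taken from the original source.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

final class TextSectionDefinitionSketch<T>
{
    enum MultiLine
    {
        singleLine,       // one matching line forms a complete section without a body
        multiLine,        // consecutive lines that all match the predicate form the body
        multiLineHeader,  // the matching line is the header; later lines are body until a following definition matches
        multiLineIntro    // like multiLineHeader, but the matching line itself is the first body line
    }

    final String name;                              // label of the section, e.g. "Abstract"
    final Predicate<List<List<T>>> matchPredicate;  // tests a line given as words of glyphs
    final MultiLine multiLine;                      // how follow-up lines are attached
    final boolean multiple;                         // whether the definition may match repeatedly

    TextSectionDefinitionSketch(String name, Predicate<List<List<T>>> matchPredicate,
            MultiLine multiLine, boolean multiple)
    {
        this.name = name;
        this.matchPredicate = matchPredicate;
        this.multiLine = multiLine;
        this.multiple = multiple;
    }

    public static void main(String[] args)
    {
        // Glyphs are plain strings here purely for demonstration.
        TextSectionDefinitionSketch<String> definition = new TextSectionDefinitionSketch<String>(
                "Abstract", line -> line.get(0).get(0).equals("A"), MultiLine.multiLineIntro, false);
        List<List<String>> line = Collections.singletonList(Arrays.asList("A", "b", "s"));
        System.out.println(definition.name + " matches: " + definition.matchPredicate.test(line));
    }
}
```

Keeping the fields final and package-visible mirrors the way the stripper reads them directly (definition.multiLine, definition.matchPredicate, definition.multiple) instead of going through getters.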


TextSection

This class represents a text section recognized by this framework.

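import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.text.TextPosition;

public class TextSection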
{
    public TextSection(TextSectionDefinition definition, List<List<TextPosition>> header, List<List<List<TextPosition>>> body)
    {
        this.definition = definition;
        this.header = new ArrayList<>(header);
        this.body = new ArrayList<>(body);
    }

    @Override
    public String toString()
    {
        StringBuilder stringBuilder = new StringBuilder();
        stringBuilder.append(definition.name).append(": ");
        if (!header.isEmpty())
            stringBuilder.append(toString(header));
        stringBuilder.append('\n');
        for (List<List<TextPosition>> bodyLine : body)
        {
            stringBuilder.append("    ").append(toString(bodyLine)).append('\n');
        }
        return stringBuilder.toString();
    }

    String toString(List<List<TextPosition>> words)
    {
        StringBuilder stringBuilder = new StringBuilder();
        boolean first = true;
        for (List<TextPosition> word : words)
        {
            if (first)
                first = false;
            else
                stringBuilder.append(' ');
            for (TextPosition textPosition : word)
            {
                stringBuilder.append(textPosition.getUnicode());
            }
        }
        // cf. https://stackoverflow.com/a/7171932/1729265
        return Normalizer.normalize(stringBuilder, Form.NFKC);
    }

    final TextSectionDefinition definition;
    final List<List<TextPosition>> header;
    final List<List<List<TextPosition>>> body;
}

The result for the OP's sample file (test method testWang05a) is:

Titel: How to Break MD5 and Other Hash Functions

Authors: 
    Xiaoyun Wang and Hongbo Yu

Institutions: 
    Shandong University, Jinan 250100, China,

Addresses: 
    xywang@sdu.edu.cn, yhb@mail.sdu.edu.cn

Abstract: 
    Abstract. MD5 is one of the most widely used cryptographic hash func-
    tions nowadays. It was designed in 1992 as an improvement of MD4, and
    ...

Section: 1 Introduction
    People know that digital signatures are very important in information security.
    The security of digital signatures depends on the cryptographic strength of the
    ...

Section: 2 Description of MD5
    In order to conveniently describe the general structure of MD5, we first recall
    the iteration process for hash functions.
    ...

Section: 3 Differential Attack for Hash Functions
    3.1 The Modular Differential and the XOR Differential
    The most important analysis method for hash functions is differential attack
    ...

Section: 4 Differential Attack on MD5
    4.1 Notation
    Before presenting our attack, we first introduce some notation to simplify the
    ...

Section: 5 Summary
    In this paper we described a powerful attack against hash functions, and in
    particular showed that finding a collision of MD5 is easily feasible.
    ...

Section: Acknowledgements
    It is a pleasure to acknowledge Dengguo Feng for the conversations that led to
    this research on MD5. We would like to thank Eli Biham, Andrew C. Yao, and
    ...

Section: References
    1. E. Biham, A. Shamir. Differential Cryptanalysis of the Data Encryption Standard,
    Springer-Verlag, 1993.
    ...

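For a more generic recognition of text sections you obviously cannot count on these specific TeX fonts being used for specific text sections. Instead you may have to look at font sizes (remember not to use the plain font size attribute but to scale it according to the transformation and text matrices!), alignment, and so on. You may also have to scan the document once beforehand to determine the common text sizes.

For multiple documents published in the same journal, though, the recognition predicates may actually be as simple as in this example, because in that context authors usually have to comply with very specific layout and formatting rules.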
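To make the remark about font sizes more concrete: the size a glyph is actually drawn at depends on the nominal font size and on the matrices in effect, and the typical body text size can be estimated in a first pass over all glyphs. The sketch below shows both steps on plain numbers. The method names and the rounding to half points are assumptions chosen for illustration; in practice the inputs would come from the text positions PDFBox reports.

```java
import java.util.HashMap;
import java.util.Map;

final class FontSizeSketch
{
    /**
     * Effective glyph height: the nominal font size scaled by the length of the image
     * of the vertical unit vector under the combined 2D matrix [a b; c d], which is (c, d).
     */
    static double effectiveSize(double nominalSize, double c, double d)
    {
        return nominalSize * Math.hypot(c, d);
    }

    /** Most frequent size after rounding to half points; NaN if no sizes are given. */
    static double dominantSize(double[] sizes)
    {
        Map<Double, Integer> counts = new HashMap<>();
        double best = Double.NaN;
        int bestCount = 0;
        for (double size : sizes)
        {
            double key = Math.round(size * 2) / 2.0;
            int count = counts.merge(key, 1, Integer::sum);
            if (count > bestCount)
            {
                bestCount = count;
                best = key;
            }
        }
        return best;
    }

    public static void main(String[] args)
    {
        // A font of nominal size 1 drawn with a matrix that scales by 12 is effectively 12 pt.
        System.out.println(effectiveSize(1, 0, 12));
        // Slightly differing sizes around 10 pt are counted together as the body size.
        System.out.println(dominantSize(new double[] { 9.96, 10, 10.02, 12, 14.3 }));
    }
}
```

Lines whose effective size is clearly above the dominant size are then candidates for headings, which is again only a heuristic.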

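If you mean extracting tables, PDFBox cannot do that unless you know exactly where everything is located. Maybe Tabula can help you; it is built on top of PDFBox.

What kind of PDFs do you have in mind? I ask because the task becomes harder the larger the set of PDFs you have to consider.

@mkl, I use PDFBox to read papers and to manage a library of papers.

Please share a representative set of PDFs so that the patterns for recognizing the sections you are looking for can be analyzed.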

For the OP's sample document, the section definitions and the extraction call look like this:

List<TextSectionDefinition> sectionDefinitions = Arrays.asList(
        new TextSectionDefinition("Titel", x->x.get(0).get(0).getFont().getName().contains("CMBX12"), MultiLine.singleLine, false),
        new TextSectionDefinition("Authors", x->x.get(0).get(0).getFont().getName().contains("CMR10"), MultiLine.multiLine, false),
        new TextSectionDefinition("Institutions", x->x.get(0).get(0).getFont().getName().contains("CMR9"), MultiLine.multiLine, false),
        new TextSectionDefinition("Addresses", x->x.get(0).get(0).getFont().getName().contains("CMTT9"), MultiLine.multiLine, false),
        new TextSectionDefinition("Abstract", x->x.get(0).get(0).getFont().getName().contains("CMBX9"), MultiLine.multiLineIntro, false),
        new TextSectionDefinition("Section", x->x.get(0).get(0).getFont().getName().contains("CMBX12"), MultiLine.multiLineHeader, true)
        );

// 'resource' is assumed to be an InputStream (or File) providing the sample PDF
PDFTextSectionStripper stripper = new PDFTextSectionStripper(sectionDefinitions);
try (PDDocument document = PDDocument.load(resource))
{
    stripper.getText(document);
}

System.out.println("Sections:");
List<String> texts = new ArrayList<>();
for (TextSection textSection : stripper.getSections())
{
    String text = textSection.toString();
    System.out.println(text);
    texts.add(text);
}
Files.write(new File(RESULT_FOLDER, "Wang05a.txt").toPath(), texts);
