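C#: How to read PDF sections (title, abstract, references) using PDFBox?

Tags: c#, pdfbox

I am trying to read PDF files together with their sections, but I cannot find the right algorithm or library. I want to separate the parts of the file (title, abstract, references) and return their content.

Is there a reference for solving this problem?

Unfortunately, the file the OP provided as a representative example is not tagged. There is therefore no direct information on whether a given piece of text belongs to the title, the abstract, the references, or any other part. Consequently, there is no sure way to recognize these parts, only heuristics, i.e. educated guesses with a more or less high error rate.

For the sample document provided by the OP, though, the parts can actually be recognized by simply checking the font of the first letter of each line.

The classes below constitute a simple framework for extracting semantic text sections that can be recognized from the characteristics of each line alone. The framework is then used to recognize the sections in the OP's sample file by inspecting only the font of the first character of each line.

A simple text section extraction framework

As I have only used the Java version of PDFBox, and the OP stated that a Java solution would be fine too, the framework is implemented in Java. It is based on the current development version 2.1.0-SNAPSHOT of PDFBox.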
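PDFTextSectionStripper

This class is the center of the framework. It derives from the PDFBox `PDFTextStripper` and extends it by recognizing text sections configured through a list of `TextSectionDefinition` instances. After the `PDFTextStripper` method `getText` has been called, the recognized sections are available as a list of `TextSection` instances, described below.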
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.pdfbox.text.TextPosition;

    public class PDFTextSectionStripper extends PDFTextStripper
    {
        //
        // constructor
        //
        public PDFTextSectionStripper(List<TextSectionDefinition> sectionDefinitions) throws IOException
        {
            super();
            this.sectionDefinitions = sectionDefinitions;
        }

        //
        // Section retrieval
        //
        /**
         * @return an unmodifiable list of text sections recognized during {@link #getText(PDDocument)}.
         */
        public List<TextSection> getSections()
        {
            return Collections.unmodifiableList(sections);
        }

        //
        // PDFTextStripper overrides
        //
        @Override
        protected void writeLineSeparator() throws IOException
        {
            super.writeLineSeparator();

            if (!currentLine.isEmpty())
            {
                boolean matched = false;

                // A section is open: check whether the current line continues it.
                if (!(currentHeader.isEmpty() && currentBody.isEmpty()))
                {
                    TextSectionDefinition definition = sectionDefinitions.get(currentSectionDefinition);

                    switch (definition.multiLine)
                    {
                    case multiLine:
                        if (definition.matchPredicate.test(currentLine))
                        {
                            currentBody.add(new ArrayList<>(currentLine));
                            matched = true;
                        }
                        break;
                    case multiLineHeader:
                    case multiLineIntro:
                        // The line belongs to the body unless a follow-up definition matches it.
                        boolean followUpMatch = false;
                        for (int i = definition.multiple ? currentSectionDefinition : currentSectionDefinition + 1;
                                i < sectionDefinitions.size(); i++)
                        {
                            TextSectionDefinition followUpDefinition = sectionDefinitions.get(i);
                            if (followUpDefinition.matchPredicate.test(currentLine))
                            {
                                followUpMatch = true;
                                break;
                            }
                        }

                        if (!followUpMatch)
                        {
                            currentBody.add(new ArrayList<>(currentLine));
                            matched = true;
                        }
                        break;
                    case singleLine:
                        System.out.println("Internal error: There can be no current header or body as long as the current definition is single line only");
                        break;
                    }

                    // The open section ends here: store it and reset the buffers.
                    if (!matched)
                    {
                        sections.add(new TextSection(definition, currentHeader, currentBody));
                        currentHeader.clear();
                        currentBody.clear();
                        if (!definition.multiple)
                            currentSectionDefinition++;
                    }
                }

                // No open section took the line: look for the next definition that starts here.
                if (!matched)
                {
                    while (currentSectionDefinition < sectionDefinitions.size())
                    {
                        TextSectionDefinition definition = sectionDefinitions.get(currentSectionDefinition);
                        if (definition.matchPredicate.test(currentLine))
                        {
                            matched = true;
                            switch (definition.multiLine)
                            {
                            case singleLine:
                                sections.add(new TextSection(definition, currentLine, Collections.emptyList()));
                                if (!definition.multiple)
                                    currentSectionDefinition++;
                                break;
                            case multiLineHeader:
                                currentHeader.addAll(new ArrayList<>(currentLine));
                                break;
                            case multiLine:
                            case multiLineIntro:
                                currentBody.add(new ArrayList<>(currentLine));
                                break;
                            }
                            break;
                        }

                        currentSectionDefinition++;
                    }
                }

                if (!matched)
                {
                    System.out.println("Could not match line.");
                }
            }
            currentLine.clear();
        }

        @Override
        protected void endDocument(PDDocument document) throws IOException
        {
            super.endDocument(document);

            // Flush the section that is still open at the end of the document.
            if (!(currentHeader.isEmpty() && currentBody.isEmpty()))
            {
                TextSectionDefinition definition = sectionDefinitions.get(currentSectionDefinition);
                sections.add(new TextSection(definition, currentHeader, currentBody));
                currentHeader.clear();
                currentBody.clear();
            }
        }

        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            super.writeString(text, textPositions);

            // Collect the words of the current line; each word is a list of text positions.
            currentLine.add(textPositions);
        }

        //
        // member variables
        //
        final List<TextSectionDefinition> sectionDefinitions;

        int currentSectionDefinition = 0;
        final List<TextSection> sections = new ArrayList<>();
        final List<List<TextPosition>> currentLine = new ArrayList<>();

        final List<List<TextPosition>> currentHeader = new ArrayList<>();
        final List<List<List<TextPosition>>> currentBody = new ArrayList<>();
    }
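The class `TextSectionDefinition` is not reproduced in this copy of the answer. Judging only from how it is used above, it carries a `name`, a `matchPredicate` over a line, a `multiLine` mode (`singleLine`, `multiLine`, `multiLineHeader`, `multiLineIntro`), and a `multiple` flag. To make the matching rules easier to follow, here is a minimal sketch of the same idea without PDFBox. It is not the answer's code: `SketchLine`, `SketchMode`, `SketchDefinition`, and `SectionSketch` are hypothetical names, and each line is reduced to the font of its first character plus its text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Stand-in for a PDF text line: the font of its first character and its text. */
final class SketchLine {
    final String font;
    final String text;

    SketchLine(String font, String text) {
        this.font = font;
        this.text = text;
    }
}

/** The four modes named in the answer's code. */
enum SketchMode { singleLine, multiLine, multiLineHeader, multiLineIntro }

/** Assumed shape of a definition: name, line predicate, mode, "may repeat" flag. */
final class SketchDefinition {
    final String name;
    final Predicate<SketchLine> matches;
    final SketchMode mode;
    final boolean multiple;

    SketchDefinition(String name, Predicate<SketchLine> matches, SketchMode mode, boolean multiple) {
        this.name = name;
        this.matches = matches;
        this.mode = mode;
        this.multiple = multiple;
    }
}

final class SectionSketch {
    /** Returns one "name|header|body lines" string per recognized section. */
    static List<String> split(List<SketchDefinition> defs, List<SketchLine> lines) {
        List<String> sections = new ArrayList<>();
        List<String> body = new ArrayList<>();
        String header = "";
        boolean open = false; // true while a multi-line section is being collected
        int current = 0;      // index of the definition currently expected

        for (SketchLine line : lines) {
            boolean matched = false;
            if (open) {
                SketchDefinition def = defs.get(current);
                if (def.mode == SketchMode.multiLine) {
                    // every line of such a section must satisfy the predicate itself
                    matched = def.matches.test(line);
                } else {
                    // header/intro sections run until a follow-up definition matches
                    matched = true;
                    for (int i = def.multiple ? current : current + 1; i < defs.size(); i++) {
                        if (defs.get(i).matches.test(line)) {
                            matched = false;
                            break;
                        }
                    }
                }
                if (matched) {
                    body.add(line.text);
                } else {
                    // the open section ends here: store it and reset the buffers
                    sections.add(def.name + "|" + header + "|" + String.join(" / ", body));
                    header = "";
                    body.clear();
                    open = false;
                    if (!def.multiple) current++;
                }
            }
            // look for the next definition that starts at this line
            while (!matched && current < defs.size()) {
                SketchDefinition def = defs.get(current);
                if (def.matches.test(line)) {
                    matched = true;
                    if (def.mode == SketchMode.singleLine) {
                        sections.add(def.name + "|" + line.text + "|");
                        if (!def.multiple) current++;
                    } else if (def.mode == SketchMode.multiLineHeader) {
                        header = line.text;
                        open = true;
                    } else {
                        body.add(line.text);
                        open = true;
                    }
                } else {
                    current++;
                }
            }
        }
        if (open) {
            // flush the section still open at the end of the input
            sections.add(defs.get(current).name + "|" + header + "|" + String.join(" / ", body));
        }
        return sections;
    }
}
```

The definitions are consumed in order, so a definition with `multiple` set to `true` only makes sense where repeated sections are expected, such as the numbered chapters of a paper. Lines that match no remaining definition are silently dropped in this sketch.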
TextSection

This class represents a text section recognized by this framework.
    import java.text.Normalizer;
    import java.text.Normalizer.Form;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.pdfbox.text.TextPosition;

    public class TextSection
    {
        public TextSection(TextSectionDefinition definition, List<List<TextPosition>> header, List<List<List<TextPosition>>> body)
        {
            this.definition = definition;
            // Copy the lists because the stripper clears and reuses its buffers.
            this.header = new ArrayList<>(header);
            this.body = new ArrayList<>(body);
        }

        @Override
        public String toString()
        {
            StringBuilder stringBuilder = new StringBuilder();
            stringBuilder.append(definition.name).append(": ");
            if (!header.isEmpty())
                stringBuilder.append(toString(header));
            stringBuilder.append('\n');
            for (List<List<TextPosition>> bodyLine : body)
            {
                stringBuilder.append("    ").append(toString(bodyLine)).append('\n');
            }
            return stringBuilder.toString();
        }

        // Joins the words of one line with single spaces.
        String toString(List<List<TextPosition>> words)
        {
            StringBuilder stringBuilder = new StringBuilder();
            boolean first = true;
            for (List<TextPosition> word : words)
            {
                if (first)
                    first = false;
                else
                    stringBuilder.append(' ');
                for (TextPosition textPosition : word)
                {
                    stringBuilder.append(textPosition.getUnicode());
                }
            }
            // cf. https://stackoverflow.com/a/7171932/1729265
            return Normalizer.normalize(stringBuilder, Form.NFKC);
        }

        final TextSectionDefinition definition;
        final List<List<TextPosition>> header;
        final List<List<List<TextPosition>>> body;
    }
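The `Normalizer.normalize(..., Form.NFKC)` call in `TextSection.toString` is worth a brief illustration. Text extracted from a PDF can contain compatibility characters such as the single-glyph ligature "ﬁ" (U+FB01), and NFKC replaces them with their plain-letter equivalents. A minimal standalone sketch, where `LigatureSketch` is a hypothetical name used only for this illustration:

```java
import java.text.Normalizer;

final class LigatureSketch {
    /** Applies NFKC so that compatibility characters such as ligatures become plain letters. */
    static String clean(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }
}
```

For example, `LigatureSketch.clean("\uFB01nding")` returns `"finding"`, which makes the extracted text searchable with ordinary strings.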
(Test method `testWang05a`; its code is listed further below.)

The result is:
Titel: How to Break MD5 and Other Hash Functions
Authors:
Xiaoyun Wang and Hongbo Yu
Institutions:
Shandong University, Jinan 250100, China,
Addresses:
xywang@sdu.edu.cn, yhb@mail.sdu.edu.cn
Abstract:
Abstract. MD5 is one of the most widely used cryptographic hash func-
tions nowadays. It was designed in 1992 as an improvement of MD4, and
...
Section: 1 Introduction
People know that digital signatures are very important in information security.
The security of digital signatures depends on the cryptographic strength of the
...
Section: 2 Description of MD5
In order to conveniently describe the general structure of MD5, we first recall
the iteration process for hash functions.
...
Section: 3 Differential Attack for Hash Functions
3.1 The Modular Differential and the XOR Differential
The most important analysis method for hash functions is differential attack
...
Section: 4 Differential Attack on MD5
4.1 Notation
Before presenting our attack, we first introduce some notation to simplify the
...
Section: 5 Summary
In this paper we described a powerful attack against hash functions, and in
particular showed that finding a collision of MD5 is easily feasible.
...
Section: Acknowledgements
It is a pleasure to acknowledge Dengguo Feng for the conversations that led to
this research on MD5. We would like to thank Eli Biham, Andrew C. Yao, and
...
Section: References
1. E. Biham, A. Shamir. Differential Cryptanalysis of the Data Encryption Standard,
Springer-Verlag, 1993.
...
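For a more generic recognition of text sections, you obviously cannot count on these specific TeX fonts being used for specific text sections. Instead, you will probably have to look at font sizes (remember not to use the plain font size attribute, but to scale it according to the transformation and text matrices!), alignment, and so on. You will probably have to scan the document first to determine the common text sizes and similar characteristics.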
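The "scan first, then classify" idea can be sketched without PDFBox as follows. This is only an illustration under stated assumptions: `FontSizeSketch`, its method names, and the threshold `factor` are hypothetical, and the input is assumed to hold one effective (already scaled) font size per line. In PDFBox, `TextPosition.getFontSizeInPt()` may be a suitable source for such values, but that choice is my assumption and not something the answer states.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FontSizeSketch {
    /** Returns the most frequent size after rounding to one decimal place (0.0 for no input). */
    static double dominantSize(List<Double> sizes) {
        Map<Long, Integer> counts = new HashMap<>();
        long best = 0;
        int bestCount = 0;
        for (double size : sizes) {
            long key = Math.round(size * 10); // bucket sizes in steps of 0.1
            int count = counts.merge(key, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = key;
            }
        }
        return best / 10.0;
    }

    /** Returns the indexes of lines at least `factor` times larger than the dominant size. */
    static List<Integer> headingCandidates(List<Double> sizes, double factor) {
        double bodySize = dominantSize(sizes);
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < sizes.size(); i++) {
            if (sizes.get(i) >= bodySize * factor) {
                result.add(i);
            }
        }
        return result;
    }
}
```

The dominant size is taken as the body text size, so anything clearly larger becomes a heading candidate. Such a rule is still a heuristic: a suitable `factor` depends on the layout, and size alone does not separate, say, a title from a chapter heading.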
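For multiple documents published in the same journal, however, the recognition predicates may actually be as simple as the font-name checks used for the OP's sample document, because in that case authors usually have to follow very specific layout and formatting rules.

Comments

If you mean extracting tables, PDFBox cannot do that unless you know exactly where everything is. Maybe tabula can help you; it is built on top of PDFBox.

Which kind of PDFs do you have in mind? I ask because the task becomes harder the larger the set of PDFs you have to consider.

@mkl, I use PDFBox to read papers and to manage a library of scanned papers.

Please share a representative set of PDFs so that they can be analyzed for patterns that identify the sections you are looking for.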
The code of the test method `testWang05a`, which defines the sections for the OP's sample document by the font of the first character of each line:

    // `resource` (the sample PDF, e.g. as an InputStream) and RESULT_FOLDER
    // are defined elsewhere in the test class.
    List<TextSectionDefinition> sectionDefinitions = Arrays.asList(
            new TextSectionDefinition("Titel", x->x.get(0).get(0).getFont().getName().contains("CMBX12"), MultiLine.singleLine, false),
            new TextSectionDefinition("Authors", x->x.get(0).get(0).getFont().getName().contains("CMR10"), MultiLine.multiLine, false),
            new TextSectionDefinition("Institutions", x->x.get(0).get(0).getFont().getName().contains("CMR9"), MultiLine.multiLine, false),
            new TextSectionDefinition("Addresses", x->x.get(0).get(0).getFont().getName().contains("CMTT9"), MultiLine.multiLine, false),
            new TextSectionDefinition("Abstract", x->x.get(0).get(0).getFont().getName().contains("CMBX9"), MultiLine.multiLineIntro, false),
            new TextSectionDefinition("Section", x->x.get(0).get(0).getFont().getName().contains("CMBX12"), MultiLine.multiLineHeader, true)
            );

    PDDocument document = PDDocument.load(resource);
    PDFTextSectionStripper stripper = new PDFTextSectionStripper(sectionDefinitions);
    stripper.getText(document);

    // Print every recognized section and also write them to a text file.
    System.out.println("Sections:");
    List<String> texts = new ArrayList<>();
    for (TextSection textSection : stripper.getSections())
    {
        String text = textSection.toString();
        System.out.println(text);
        texts.add(text);
    }
    Files.write(new File(RESULT_FOLDER, "Wang05a.txt").toPath(), texts);