Getting the font of each line using PDFBox


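Is there a way to get the font of each line of a PDF file using PDFBox? I have tried the following, but it only lists all the fonts used on the page. It does not show which line or text is displayed in which font.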

List<PDPage> pages = doc.getDocumentCatalog().getAllPages();
for (PDPage page : pages)
{
    Map<String, PDFont> pageFonts = page.getResources().getFonts();
    for (String key : pageFonts.keySet())
    {
        System.out.println(key + " - " + pageFonts.get(key));
        System.out.println(pageFonts.get(key).getBaseFont());
    }
}

Any input is welcome. Thanks.

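Whenever you try to extract text (plain or with styling information) from a PDF using PDFBox, you should generally start with the PDFTextStripper class or one of its relatives. This class already does all the heavy lifting involved in parsing PDF content for you.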

You use the plain PDFTextStripper class like this:

PDDocument document = ...;
PDFTextStripper stripper = new PDFTextStripper();
// set stripper start and end page or bookmark attributes unless you want all the text
String text = stripper.getText(document);
This returns only the plain text, for example the text of an R40 form, without any font information.

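On the other hand, you can override its method writeString(String, List) and process more information than the mere text. To add the name of the font used wherever the font changes, you can use something like this: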

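// Uses the PDFBox 1.8.x API (getBaseFont, getCharacter).
PDFTextStripper stripper = new PDFTextStripper() {
    // Name of the font seen most recently; kept across calls so a tag is
    // only emitted when the font actually changes.
    String prevBaseFont = "";

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        StringBuilder builder = new StringBuilder();

        for (TextPosition position : textPositions)
        {
            String baseFont = position.getFont().getBaseFont();
            if (baseFont != null && !baseFont.equals(prevBaseFont))
            {
                // Font changed: insert a marker such as [FontName] before the text.
                builder.append('[').append(baseFont).append(']');
                prevBaseFont = baseFont;
            }
            builder.append(position.getCharacter());
        }

        // Hand the annotated text to the stripper's normal output.
        writeString(builder.toString());
    }
};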
If you do not want the font information merged into the text, simply build separate structures in your method override instead.
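One way to keep the font information apart from the text is to collect runs of consecutive characters that share a font. The sketch below is plain Java with no PDFBox dependency. FontRuns, Run and group are hypothetical names, and it assumes that inside a writeString override you would fill the two input lists from each TextPosition (its text and its font name).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical helper, not part of PDFBox: groups characters into font runs.
final class FontRuns {

    /** One stretch of consecutive text that uses a single font. */
    static final class Run {
        public final String fontName;
        public final String text;

        Run(String fontName, String text) {
            this.fontName = fontName;
            this.text = text;
        }
    }

    /**
     * Groups glyphs into runs; fontNames.get(i) is assumed to be the font
     * name of glyphs.get(i). A new run starts whenever the font name changes.
     */
    static List<Run> group(List<String> glyphs, List<String> fontNames) {
        if (glyphs.size() != fontNames.size()) {
            throw new IllegalArgumentException("glyphs and fontNames differ in length");
        }
        List<Run> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentFont = null;
        for (int i = 0; i < glyphs.size(); i++) {
            String font = fontNames.get(i);
            // Close the running text when the font differs from the previous glyph's.
            if (i > 0 && !Objects.equals(font, currentFont)) {
                runs.add(new Run(currentFont, current.toString()));
                current.setLength(0);
            }
            currentFont = font;
            current.append(glyphs.get(i));
        }
        // Flush the last run, if there was any input at all.
        if (!glyphs.isEmpty()) {
            runs.add(new Run(currentFont, current.toString()));
        }
        return runs;
    }

    public static void main(String[] args) {
        List<Run> runs = group(
                Arrays.asList("H", "i", "!", "o", "k"),
                Arrays.asList("A", "A", "A", "B", "B"));
        for (Run run : runs) {
            System.out.println(run.fontName + ": " + run.text);
        }
    }
}
```

Keeping the runs in a list rather than in the output string leaves the extracted text untouched and lets you decide later how to report the fonts, for example per line or per page.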


TextPosition offers much more information about the text it represents. It is worth a closer look.

To add to mkl's answer, if you are using PDFBox 2.0.8:

  • Use position.getFont().getName() instead of position.getFont().getBaseFont()
  • Use position.getUnicode() instead of position.getCharacter()

More information about these classes can be found in their online Javadocs.

For an R40 form, the writeString override from mkl's answer produces output like this:

[DHSLTQ+IRModena-Bold]Claim for repayment of tax deducted 
from savings and investments
How to fill in this form
[OIALXD+IRModena-Regular]Please fill in this form with details of your income for the
above tax year. The enclosed Notes will help you (but there is
not a note for every box on the form). If you need more help
with anything on this form, please phone us on the number
shown above.
If you are not a UK resident, do not use this form – please 
contact us.
[DHSLTQ+IRModena-Bold]Please do not send us any personal records, or tax
certificates or vouchers with your form. We will contact 
you if we need these.
[OIALXD+IRModena-Regular]Please allow four weeks before contacting us about your
repayment. We will pay you as quickly as possible.
Use black ink and capital letters
Cross out any mistakes and write the
correct information below
...
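
The prefixes such as DHSLTQ+ in the output above look like the subset tags that PDF producers put in front of the names of embedded font subsets (six uppercase letters followed by a plus sign). If you want to compare or group fonts by their plain names, a small helper can strip that prefix. This is a minimal sketch; SubsetTag and stripSubsetTag are hypothetical names, not part of PDFBox.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class SubsetTag {

    // Six uppercase ASCII letters and a '+' at the very start of the name.
    private static final Pattern TAG = Pattern.compile("^[A-Z]{6}\\+");

    /** Returns the font name without a leading subset tag; null stays null. */
    static String stripSubsetTag(String fontName) {
        if (fontName == null) {
            return null;
        }
        return TAG.matcher(fontName).replaceFirst("");
    }

    public static void main(String[] args) {
        // prints "IRModena-Bold"
        System.out.println(stripSubsetTag("DHSLTQ+IRModena-Bold"));
    }
}
```

Names without such a prefix are returned unchanged, so the helper can be applied to every font name the stripper reports.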