Parsing PDF text position with a declared graphics state clipping path
Please forgive the length of this post, but whoever answers it will no doubt need all of the information provided.

I have successfully implemented PDF parsing in a project. While tracking down a problem with exact text positioning in one particular PDF, I found that my calculations were wrong. Here are the relevant fragments of the PDF (produced with qpdf --qdf):
Relevant PDF fragments
%% Page 1
8 0 obj
<<
/Contents [
10 0 R
12 0 R
]
/MediaBox [ 0 0 612 792 ]
/Type /Page
>>
endobj
cm: modify ctm
            | 0.94062   0        0 |
      ctm = | 0         0.94062  0 |
            | 26.16627  0        1 |

            | 0.941   0      0 |   | 0.24    0       0 |   | 0.226  0       0 |
      ctm = | 0       0.941  0 | * | 0       0.24    0 | = | 0      0.226   0 |
            | 26.167  0      1 |   | 90.756  740.44  1 |   | 97.04  740.44  1 |
m: begin path at 0,0
l: line to 595,0
l: line to 595,842
l: line to 0,842
l: line to 0,0
h: close path
No op since path already closed
n: intersect with clipping path
GS clipping path wasn't set, so GS clipping path is the closed path that
defines rect [0 0 595 842]
The incoming path is totally contained in current GS clipping path,
so the GS clipping path becomes the incoming path, which defines the rect
[16.84 16.84 561.32 808.32]
q: push GS
re: create closed path
Path defines rect [16.84 16.84 561.32 808.32]
n: intersect with clipping path
GS clipping path wasn't set, so GS clipping path is the closed path that
defines rect [0 0 595 842]
The incoming path is totally contained in current GS clipping path,
so the GS clipping path becomes the incoming path, which defines the rect
[16.84 16.84 561.32 808.32]
q: push GS
cm: modify ctm
            | 0.94062   0        0 |
      ctm = | 0         0.94062  0 |
            | 26.16627  0        1 |

            | 0.941   0      0 |   | 0.24    0       0 |   | 0.226  0       0 |
      ctm = | 0       0.941  0 | * | 0       0.24    0 | = | 0      0.226   0 |
            | 26.167  0      1 |   | 90.756  740.44  1 |   | 97.04  740.44  1 |
BT: begin text object
Tm: set text matrix and text line matrix
                 | 133  0    0 |
      tm = lm =  | 0    133  0 |
                 | 0    0    1 |
Tf: set text font and font size
Set font to TT1, which has ascent=750, descent=-250, and glyph width T=667
(I will only need the width of T below). Font size is set to 1.
This has the effect of setting the first component in calculating the text
rendering matrix to
| 1 0 0 |
| 0 1 0 |
| 0 -0.25 1 |
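Tc: set character spacing
TJ: show text

PDF rect calculation for T

The target text for this calculation is the first text shown, the letter T (which is why only that glyph's width is given above). On executing the text-showing operator TJ, we first compute Trm as specified in section 9.4.4 of the PDF specification (PDF 32000-1:2008):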
a = Tfs * Th = 1 * 1 = 1
d = Tfs = 1
ty = Trise = -250/1000 = -0.25
            | a  0   0 |
      Trm = | 0  d   0 | * Tm * ctm
            | 0  ty  1 |

            | 1  0      0 |   | 133  0    0 |   | 0.226  0       0 |
      Trm = | 0  1      0 | * | 0    133  0 | * | 0      0.226   0 |
            | 0  -0.25  1 |   | 0    0    1 |   | 97.04  740.44  1 |

            | 30.06  0       0 |
          = | 0      30.06   0 |
            | 97.04  732.93  1 |
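As a sanity check on the arithmetic above, here is a minimal Python sketch that multiplies the three matrices in the order shown. The helper names `matmul` and `text_rendering_matrix` are made up for illustration, and the ctm is taken exactly as written above (rounded to three decimals), so this only reproduces the numbers in this post and makes no claim about whether that ctm is the right one.

```python
def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists (row-vector convention)."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def text_rendering_matrix(font_size, h_scale, rise, tm, ctm):
    """Trm = [Tfs*Th 0 0; 0 Tfs 0; 0 Trise 1] x Tm x CTM (section 9.4.4)."""
    params = [[font_size * h_scale, 0, 0],
              [0, font_size, 0],
              [0, rise, 1]]
    return matmul(matmul(params, tm), ctm)


# Values copied from the post: Tm scales by 133, font size 1, ty = -0.25.
tm = [[133, 0, 0], [0, 133, 0], [0, 0, 1]]
ctm = [[0.226, 0, 0], [0, 0.226, 0], [97.04, 740.44, 1]]

trm = text_rendering_matrix(1, 1, -0.25, tm, ctm)
for row in trm:
    print([round(v, 2) for v in row])
```

The scale and translation entries should agree with the hand-computed Trm to two decimal places.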
We also compute the glyph width (horizontal displacement) and glyph height according to section 9.4.4:
width = width of 'T' / 1000 = 0.667
height = (ascent - descent) / 1000 = (750 + 250) / 1000 = 1
The text rect for the letter T is:
      textRect for 'T' = [ 0, descent/1000, width, height ]
                       = [ 0, -0.25, 0.667, 1 ]
T is rendered at the text rect transformed by ctm. Computing the resulting PDF rect for T:
      PDF rect: 97.04 732.93 20.03 30.02
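I created highlight annotations for T in both Acrobat and Preview and used them to determine the PDF position of T as computed by each program:
      Acrobat: 111.54 718.99 20.03 30.02
      Preview: 111.54 718.99 20.03 30.02
Compared with Acrobat and Preview, I am off by dx = -14.5 and dy = 13.31; that is, my rect is too far to the left of and too high above T's actual position.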
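For PDFs that do not change the media box or declare a graphics state clipping path, all of my calculations are accurate. I know there must be some connection with either the differing media box declarations in PDF objects 8 0 and 9 0 or, more likely, the clipping path set by the m, l, h, n, and re operators, which results in a clipping rect of
      [ 16.84 16.84 561.32 808.32 ]
versus the media box
      [ 0 0 595 842 ]
I cannot find anything in the PDF specification indicating that positions should be adjusted because of the graphics state clipping path (again, assuming that it is the culprit).

Argh. What am I missing?

Here is an error: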
cm: modify ctm
            | 0.94062   0        0 |
      ctm = | 0         0.94062  0 |
            | 26.16627  0        1 |

            | 0.941   0      0 |   | 0.24    0       0 |   | 0.226  0       0 |
      ctm = | 0       0.941  0 | * | 0       0.24    0 | = | 0      0.226   0 |
            | 26.167  0      1 |   | 90.756  740.44  1 |   | 97.04  740.44  1 |
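This multiplies the change onto the existing transformation from the right, but it must be multiplied from the left.

Egads. How stupid of me. I now realize that my earlier parsing had never run into two cm operations like this, and that the clipping path was a red herring! A simple switch in the order of matrix concatenation and all is well. Thanks.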
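For anyone who runs into the same thing, here is a minimal Python sketch of the two concatenation orders, using the operands of the two cm operations quoted in the question. The helpers `matmul` and `cm_matrix` are illustrative names, and the sketch assumes the ctm is the identity before the first cm, so the first cm matrix is the existing ctm when the second one arrives.

```python
def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists (row-vector convention)."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def cm_matrix(a, b, c, d, e, f):
    """Build the 3x3 matrix for the six operands of a cm operator."""
    return [[a, b, 0], [c, d, 0], [e, f, 1]]


# Operands taken from the question; the ctm is assumed to start as identity.
existing_ctm = cm_matrix(0.94062, 0, 0, 0.94062, 26.16627, 0)
new_cm = cm_matrix(0.24, 0, 0, 0.24, 90.756, 740.44)

# Order used in the question: existing ctm on the left, new matrix on the right.
wrong_ctm = matmul(existing_ctm, new_cm)

# Corrected order: the new matrix is multiplied onto the existing ctm from the left.
right_ctm = matmul(new_cm, existing_ctm)

print("wrong x offset:", round(wrong_ctm[2][0], 2))
print("right x offset:", round(right_ctm[2][0], 2))
```

With the corrected order the horizontal offset lands within about 0.01 of the 111.54 that Acrobat and Preview report, while the original order gives the 97.04 from the question. Both orders give the same 0.226 scale here, which is why only the translation was off.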