C#: How can I avoid wrong values in TextChunk when using LocationTextExtractionStrategy from iTextSharp?
I have been working with the iTextSharp library for years, using an extension of LocationTextExtractionStrategy to extract text from PDF files. It gives me all the words and their positions. But now, in a new PDF (generated with iText 1.4.3), I get chunks that come from the same line with the following values, as you can see in the image example:
Text: S startLocation x:122 y:110.64 z:1 endLocation x:126.8 y:125.04 z:1
Text: e startLocation x:126.8 y:110.64 z:1 endLocation x:131.6 y:125.04 z:1
Text: x startLocation x:131.6 y:110.64 z:1 endLocation x:136.4 y:125.04 z:1
Text: L startLocation x:122 y:135.3 z:1 endLocation x:126.8 y:226.5 z:1
Text: a startLocation x:126.8 y:135.3 z:1 endLocation x:131.6 y:226.5 z:1
Text: s startLocation x:131.6 y:135.3 z:1 endLocation x:136.4 y:226.5 z:1
Text: t startLocation x:136.4 y:135.3 z:1 endLocation x:141.2 y:226.5 z:1
Text: n startLocation x:141.2 y:135.3 z:1 endLocation x:146 y:226.5 z:1
Text: a startLocation x:146 y:135.3 z:1 endLocation x:150.8 y:226.5 z:1
Text: m startLocation x:150.8 y:135.3 z:1 endLocation x:155.6 y:226.5 z:1
Text: e startLocation x:155.6 y:135.3 z:1 endLocation x:160.4 y:226.5 z:1
When the TextChunk objects are generated, I get:
S|distParallelStart 143.5421|distParallelEnd 158.7211| distPerpendicular 81 | orientationMagnitude 1249|orientationVector 0,3162279, 0,9486833, 0
e|distParallelStart 145.06 |distParallelEnd 160.239 | distPerpendicular 85 | orientationMagnitude 1249|orientationVector 0,3162279, 0,9486833, 0
x|distParallelStart 146.5779|distParallelEnd 161.7569| distPerpendicular 90 | orientationMagnitude 1249|orientationVector 0,3162279, 0,9486833, 0
L|distParallelStart 141.5252|distParallelEnd 232.8514| distPerpendicular 115| orientationMagnitude 1518|orientationVector 0,05255886, 0,9986178, 0
a|distParallelStart 141.7775|distParallelEnd 233.1037| distPerpendicular 120| orientationMagnitude 1518|orientationVector 0,05255886, 0,9986178, 0
s|distParallelStart 142.0297|distParallelEnd 233.356 | distPerpendicular 124| orientationMagnitude 1518|orientationVector 0,05255886, 0,9986178, 0
t|distParallelStart 142.282 |distParallelEnd 233.6083| distPerpendicular 129| orientationMagnitude 1518|orientationVector 0,05255886, 0,9986178, 0
n|distParallelStart 142.5343|distParallelEnd 233.8605| distPerpendicular 134| orientationMagnitude 1518|orientationVector 0,05255886, 0,9986178, 0
a|distParallelStart 142.7866|distParallelEnd 234.1128| distPerpendicular 139| orientationMagnitude 1518|orientationVector 0,05255886, 0,9986178, 0
m|distParallelStart 143.0389|distParallelEnd 234.3651| distPerpendicular 143| orientationMagnitude 1518|orientationVector 0,05255886, 0,9986178, 0
e|distParallelStart 143.2912|distParallelEnd 234.6174| distPerpendicular 148| orientationMagnitude 1518|orientationVector 0,05255886, 0,9986178, 0
The code that decides whether two chunks are on the same line returns false (because distPerpendicular differs):
virtual public bool SameLine(TextChunk a){
    if (orientationMagnitude != a.orientationMagnitude) return false;
    if (distPerpendicular != a.distPerpendicular) return false;
    return true;
}
The perpendicular distance is computed in the TextChunk constructor:
public TextChunk(String str, Vector startLocation, Vector endLocation, float charSpaceWidth) {
    this.text = str;
    this.startLocation = startLocation;
    this.endLocation = endLocation;
    this.charSpaceWidth = charSpaceWidth;

    Vector oVector = endLocation.Subtract(startLocation);
    if (oVector.Length == 0) {
        oVector = new Vector(1, 0, 0);
    }
    orientationVector = oVector.Normalize();
    orientationMagnitude = (int)(Math.Atan2(orientationVector[Vector.I2], orientationVector[Vector.I1])*1000);

    // see http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html
    // the two vectors we are crossing are in the same plane, so the result will be purely
    // in the z-axis (out of plane) direction, so we just take the I3 component of the result
    Vector origin = new Vector(0,0,1);
    distPerpendicular = (int)(startLocation.Subtract(origin)).Cross(orientationVector)[Vector.I3];

    distParallelStart = orientationVector.Dot(startLocation);
    distParallelEnd = orientationVector.Dot(endLocation);
}
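To make the effect of the slanted baselines concrete, here is a minimal, self-contained sketch that repeats the arithmetic of the constructor above for the "S" and "e" chunks from the listing. It does not use iTextSharp: `ChunkSketch` and `ChunkSketchDemo` are hypothetical names introduced only for illustration, and the sketch uses `double` where the library uses `float`. The listing in the question appears to show rounded values, so the truncated integers computed here may differ from it by one.

```csharp
using System;

// Hypothetical stand-in for TextChunk: keeps only the two values SameLine compares.
public sealed class ChunkSketch
{
    public readonly int OrientationMagnitude;
    public readonly int DistPerpendicular;

    public ChunkSketch(double startX, double startY, double endX, double endY)
    {
        double dx = endX - startX;
        double dy = endY - startY;
        double length = Math.Sqrt(dx * dx + dy * dy);
        if (length == 0) { dx = 1; dy = 0; length = 1; } // same fallback as the constructor above

        // Normalized orientation vector (ox, oy, 0).
        double ox = dx / length;
        double oy = dy / length;

        // Baseline angle in milliradians, truncated by the int cast.
        OrientationMagnitude = (int)(Math.Atan2(oy, ox) * 1000);

        // z component of (start - origin) x orientation. With origin (0,0,1) and
        // start z = 1, the z parts cancel, leaving startX*oy - startY*ox.
        DistPerpendicular = (int)(startX * oy - startY * ox);
    }

    public bool SameLine(ChunkSketch other)
    {
        return OrientationMagnitude == other.OrientationMagnitude
            && DistPerpendicular == other.DistPerpendicular;
    }
}

public static class ChunkSketchDemo
{
    public static void Run()
    {
        // Values from the listing: end y (125.04) differs from start y (110.64).
        var s = new ChunkSketch(122, 110.64, 126.8, 125.04);
        var e = new ChunkSketch(126.8, 110.64, 131.6, 125.04);
        Console.WriteLine("slanted baselines, same line: " + s.SameLine(e));

        // Same start points, but with a horizontal baseline (end y == start y).
        var sFlat = new ChunkSketch(122, 110.64, 126.8, 110.64);
        var eFlat = new ChunkSketch(126.8, 110.64, 131.6, 110.64);
        Console.WriteLine("horizontal baselines, same line: " + sFlat.SameLine(eFlat));
    }
}
```

With a slanted orientation vector, the perpendicular distance depends on the x position of each character, so neighbouring characters get different `DistPerpendicular` values. With a horizontal baseline it depends only on y, and the two chunks compare as being on the same line.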
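If I call locationalResult.Sort(), the chunks get mixed up with other chunks of the document, because the data appears to be out of order. In other PDFs, where extraction works, the chunks have the orientation vector (1,0,0). The difference here is that startLocation and endLocation do not have the same y component; the end point seems to be somewhat higher.
Can anyone tell me what is going wrong? How can I correct these values so that all characters of the same line are recognized as one line?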
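The document is in landscape orientation, and the chunks have the same X component while Y changes. Only swapping the X and Y coordinates worked: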
Function GetCharacterRenderInfos() As List(Of CustomTextRenderInfo)
    Dim baseList As IList(Of TextRenderInfo) = Me.BaseInfo.GetCharacterRenderInfos()
    Dim caracteres() As Char = Me.GetText().ToCharArray()
    Dim vStart As Vector = Me.BaseLine.GetStartPoint()
    Dim vEnd As Vector = Me.BaseLine.GetEndPoint()
    Dim x As Single = vStart(Vector.I1)
    Dim y As Single = vStart(Vector.I2)
    Dim z As Single = vStart(Vector.I3)
    Dim y2 As Single = vEnd(Vector.I2)
    If (x.Equals(vEnd(Vector.I1))) Then 'This case: the baseline is vertical
        x = vStart(Vector.I2)
        y = 2000 - vStart(Vector.I1) 'Because the rightmost column must be on top
        y2 = 2000 - vEnd(Vector.I1)
    End If
    If x < 0 And y > 0 Then
        x = 0
    End If
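Maybe there is another solution, but this works for me. Thanks again.

Can you provide the PDF in question? The PDF itself is really needed to analyze this problem.

I have uploaded a page from the PDF in place of the picture. I could not post the PDF earlier.

I just applied the LocationTextExtractionStrategy from the current iTextSharp version and everything was extracted fine. So your LocationTextExtractionStrategy extension probably introduces the error somewhere. By the way, iTextSharp is currently hosted on GitHub; the SourceForge repository is getting outdated.

It looks like your strategy breaks the original chunks down into single-character chunks, and this process goes wrong; it does not seem to take page rotation into account.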
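The check described in the comments (running the unmodified strategy before debugging the extension) might look like the following sketch. It assumes iTextSharp 5.x; `StockExtraction`, `Dump` and the `path` parameter are hypothetical names used only for illustration. Printing the page rotation alongside the text is a convenience for spotting rotated pages, not something the comments prescribe.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class StockExtraction
{
    // Extracts every page with the stock LocationTextExtractionStrategy
    // so the result can be compared with the output of a custom extension.
    public static void Dump(string path)
    {
        PdfReader reader = new PdfReader(path);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // A non-zero rotation hints that coordinates may need special handling.
                Console.WriteLine("Page {0}, rotation {1}", page, reader.GetPageRotation(page));

                // Use a fresh strategy per page so chunks from different pages do not mix.
                string text = PdfTextExtractor.GetTextFromPage(
                    reader, page, new LocationTextExtractionStrategy());
                Console.WriteLine(text);
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

If the stock strategy returns the lines correctly while the extension does not, the difference most likely lies in how the extension splits chunks into characters.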