Android Arabic PDF text extractor
Is there any PDF text extractor API that can extract Arabic text from a PDF? I am using the itextpdf API. It extracts English text fine, but it does not extract Arabic text.
This is my code for extracting text from the PDF:
private String extractPDF(String path) throws IOException {
    String parsedText = "";
    PdfReader reader = new PdfReader(path);
    int n = reader.getNumberOfPages();
    for (int page = 0; page < n; page++) {
        // Extract the content of each page (iText page numbers are 1-based)
        parsedText = parsedText + PdfTextExtractor.getTextFromPage(reader, page + 1).trim() + "\n";
    }
    reader.close();
    return parsedText;
}
This is the input PDF:
Update:
I am now able to extract the Arabic text, but the order of the lines is not preserved. This is my code:
private String extractPDF(String name) throws IOException {
    PdfReader reader = new PdfReader(name);
    StringBuilder text = new StringBuilder();
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        String data = PdfTextExtractor.getTextFromPage(reader, i, new SimpleTextExtractionStrategy());
        // Reorder the extracted page text with the BidiText routine (start level 1 = right-to-left)
        text.append(Bidi.BidiText(data, 1).getText());
    }
    reader.close();
    return text.toString();
}
Your sample PDF does not contain any text at all. It only contains embedded bitmap images of text.
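"Text extraction from PDFs" (and likewise "text extractor APIs", PdfTextExtractor classes, and so on) usually means finding the text drawing instructions in the PDF and determining their text content from their string arguments and the font encoding definitions. A PDF viewer executes these same instructions to display the text, using a font program that is either embedded in the PDF or available on the system at hand.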
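To make the role of the encoding information concrete, here is a minimal sketch of the decoding step. It is not iText's implementation. It assumes single-byte character codes, and the `toUnicode` table and the `GlyphCodeDecoder` class are hypothetical stand-ins for whatever encoding data the font in the PDF provides.

```java
import java.util.HashMap;
import java.util.Map;

public class GlyphCodeDecoder {

    // Maps each single-byte character code of a text drawing operand to Unicode.
    // Codes missing from the (hypothetical) table become U+FFFD, which is roughly
    // what "insufficient encoding information" looks like in extracted text.
    static String decode(byte[] operand, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (byte b : operand) {
            int code = b & 0xFF; // treat the byte as unsigned
            String mapped = toUnicode.get(code);
            sb.append(mapped != null ? mapped : "\uFFFD");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Illustrative table only; real tables come from the font in the PDF.
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(1, "\u0633"); // ARABIC LETTER SEEN
        toUnicode.put(2, "\u0644"); // ARABIC LETTER LAM
        System.out.println(decode(new byte[] {1, 2, 3}, toUnicode));
    }
}
```

If the table is missing or incomplete, no amount of post-processing can recover the text, which is why the encoding information matters as much as the drawing instructions themselves.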
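In your case there are no such text drawing instructions, only bitmap drawing instructions and the bitmaps themselves, so extracting text from the document returns an empty string.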
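To retrieve the text displayed in your document, you have to look for an OCR (optical character recognition) solution. If the OCR solution does not support PDF directly but only bitmap formats, a PDF library such as iText can help you extract the embedded bitmap images and forward them to the OCR solution.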
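One way to decide which pages to hand to OCR is to check which pages produced no text at all. The sketch below is only an illustration: `pagesWithoutText` is a hypothetical helper, and it assumes you have already collected the per-page extraction results (for example from `PdfTextExtractor.getTextFromPage`) into a list.

```java
import java.util.ArrayList;
import java.util.List;

public class OcrCandidates {

    // Returns the 1-based numbers of the pages whose extracted text is null or blank.
    // Such pages probably contain only bitmaps and are candidates for OCR.
    static List<Integer> pagesWithoutText(List<String> pageTexts) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            String text = pageTexts.get(i);
            if (text == null || text.trim().isEmpty()) {
                result.add(i + 1);
            }
        }
        return result;
    }
}
```

A blank result does not prove that a page is a scanned image, so treat the returned list as a hint rather than a definitive classification.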
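If you also have PDF documents that display Arabic text using text drawing instructions with sufficient encoding information (rather than bitmaps), you may need to post-process iText's text extraction output with a method such as the Convert method that Amedee pointed to in a comment on your question. (Yes, it is written in C#, but porting it to Java is easy.)
Comments:
Does PDFBox extract Arabic text?
Asking for software recommendations is off-topic on Stack Overflow. Please try a more suitable site, but make sure your question is on topic there before posting.
This question is a duplicate of "Arabic text extraction using iText". Because it was flagged as off-topic for the wrong reason (asking for a recommendation), please vote to reopen it so that it can be closed again for the right reason.
@AmedeeVanGasse I followed the link and wrote the same class in Java, but it still does not recognize the Arabic text to extract. Do you know any way to extract Arabic text from a PDF?
Thank you very much, sir. It is working for me now, but it displays the Arabic text reversed. The order of the PDF text is: "…"
That problem is solved, but now I face another one: if the file has multiple lines, the output starts with the last line and continues up to the first.
Which text retrieval technique are you using now, when these problems occur?
I use a StringBuilder to append the text. I have edited the code.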
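The reversed line order described in the comments is what you would expect if a whole page, line breaks included, is reordered as a single right-to-left string. A possible workaround, sketched here under that assumption, is to reorder each line separately. The `reorderPerLine` helper is hypothetical, and the plain string reversal in `main` is only a stand-in for a real reordering call such as `Bidi.BidiText(line, 1).getText()`.

```java
import java.util.function.UnaryOperator;

public class PerLineReorder {

    // Applies reorderLine to every line on its own, so the lines keep their order.
    static String reorderPerLine(String pageText, UnaryOperator<String> reorderLine) {
        String[] lines = pageText.split("\n", -1); // -1 keeps trailing empty lines
        StringBuilder out = new StringBuilder(pageText.length());
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                out.append('\n');
            }
            out.append(reorderLine.apply(lines[i]));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the real bidi routine: reverse the characters.
        UnaryOperator<String> reverse = s -> new StringBuilder(s).reverse().toString();
        String page = "cba\nfed";
        System.out.println(reverse.apply(page));           // whole page at once: lines swap places
        System.out.println(reorderPerLine(page, reverse)); // per line: line order is preserved
    }
}
```

With the stand-in, the whole-page call turns "cba\nfed" into "def\nabc", while the per-line call yields "abc\ndef".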
public static BidiResult BidiText(String str, int startLevel)
{
boolean isLtr = true;
int strLength = str.length();
if (strLength == 0)
{
return new BidiResult(str, false);
}
// get types, fill arrays
char[] chars = new char[strLength];
String[] types = new String[strLength];
String[] oldtypes = new String[strLength];
int numBidi = 0;
for (int i = 0; i < strLength; ++i)
{
chars[i] = str.charAt(i);
char charCode = str.charAt(i);
String charType = "L";
if (charCode <= 0x00ff)
{
charType = BaseTypes[charCode];
}
else if (0x0590 <= charCode && charCode <= 0x05f4)
{
charType = "R";
}
else if (0x0600 <= charCode && charCode <= 0x06ff)
{
charType = ArabicTypes[charCode & 0xff];
}
else if (0x0700 <= charCode && charCode <= 0x08AC)
{
charType = "AL";
}
if (charType.equals("R") || charType.equals("AL") || charType.equals("AN"))
{
numBidi++;
}
oldtypes[i] = types[i] = charType;
}
if (numBidi == 0)
{
return new BidiResult(str, true);
}
if (startLevel == -1)
{
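// Treat the text as left-to-right when under 30% of its characters are right-to-left.
// (The integer division strLength / numBidi is never below 1, so that test never passed.)
if ((double) numBidi / strLength < 0.3)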
{
startLevel = 0;
}
else
{
isLtr = false;
startLevel = 1;
}
}
int[] levels = new int[strLength];
for (int i = 0; i < strLength; ++i)
{
levels[i] = startLevel;
}
String e = IsOdd(startLevel) ? "R" : "L";
String sor = e;
String eor = sor;
String lastType = sor;
// W1: a nonspacing mark (NSM) takes the type of the preceding character (sor at the start).
for (int i = 0; i < strLength; ++i)
{
if (types[i].equals("NSM"))
{
types[i] = lastType;
}
else
{
lastType = types[i];
}
}
// W2: a European number (EN) becomes an Arabic number (AN) when the last strong type seen is AL.
lastType = sor;
for (int i = 0; i < strLength; ++i)
{
String t = types[i];
if (t.equals("EN"))
{
types[i] = (lastType.equals("AL")) ? "AN" : "EN";
}
else if (t.equals("R") || t.equals("L") || t.equals("AL"))
{
lastType = t;
}
}
// W3: change every AL to R.
for (int i = 0; i < strLength; ++i)
{
String t = types[i];
if (t.equals("AL"))
{
types[i] = "R";
}
}
// W4: a single ES between two ENs becomes EN; a single CS between two numbers of the same type takes that type.
for (int i = 1; i < strLength - 1; ++i)
{
if (types[i].equals("ES") && types[i - 1].equals("EN") && types[i + 1].equals("EN"))
{
types[i] = "EN";
}
if (types[i].equals("CS") && (types[i - 1].equals("EN") || types[i - 1].equals("AN")) && types[i + 1].equals(types[i - 1]))
{
types[i] = types[i - 1];
}
}
// W5: a run of European terminators (ET) adjacent to an EN becomes EN.
for (int i = 0; i < strLength; ++i)
{
if (types[i].equals("EN"))
{
// do before
for (int j = i - 1; j >= 0; --j)
{
if (!types[j].equals("ET"))
{
break;
}
types[j] = "EN";
}
// do after
for (int j = i + 1; j < strLength; ++j)
{
if (!types[j].equals("ET"))
{
break;
}
types[j] = "EN";
}
}
}
// W6: remaining separators and terminators (and whitespace here) become other neutrals (ON).
for (int i = 0; i < strLength; ++i)
{
String t = types[i];
if (t.equals("WS") || t.equals("ES") || t.equals("ET") || t.equals("CS"))
{
types[i] = "ON";
}
}
// W7: an EN becomes L when the last strong type seen is L.
lastType = sor;
for (int i = 0; i < strLength; ++i)
{
String t = types[i];
if (t.equals("EN"))
{
types[i] = (lastType.equals("L")) ? "L" : "EN";
}
else if (t.equals("R") || t.equals("L"))
{
lastType = t;
}
}
// N1: a run of neutrals takes the direction of the surrounding text when both sides agree (numbers count as R).
for (int i = 0; i < strLength; ++i)
{
if (types[i].equals("ON"))
{
int end = FindUnequal(types, i + 1, "ON");
String before = sor;
if (i > 0)
{
before = types[i - 1];
}
String after = eor;
if (end + 1 < strLength)
{
after = types[end + 1];
}
if (!before.equals("L"))
{
before = "R";
}
if (!after.equals("L"))
{
after = "R";
}
if (before.equals(after))
{
SetValues(types, i, end, before);
}
i = end - 1; // reset to end (-1 so next iteration is ok)
}
}
// N2: any remaining neutrals take the embedding direction.
for (int i = 0; i < strLength; ++i)
{
if (types[i].equals("ON"))
{
types[i] = e;
}
}
// I1, I2: resolve the implicit embedding levels from the resolved types.
for (int i = 0; i < strLength; ++i)
{
String t = types[i];
if (IsEven(levels[i]))
{
if (t.equals("R"))
{
levels[i] += 1;
}
else if (t.equals("AN") || t.equals("EN"))
{
levels[i] += 2;
}
}
else
{
if (t.equals("L") || t.equals("AN") || t.equals("EN"))
{
levels[i] += 1;
}
}
}
// L2: from the highest level down to the lowest odd level, reverse every contiguous run at that level or higher.
int highestLevel = -1;
int lowestOddLevel = 99;
int ii = levels.length;
for (int i = 0; i < ii; ++i)
{
int level = levels[i];
if (highestLevel < level)
{
highestLevel = level;
}
if (lowestOddLevel > level && IsOdd(level))
{
lowestOddLevel = level;
}
}
for (int level = highestLevel; level >= lowestOddLevel; --level)
{
int start = -1;
ii = levels.length;
for (int i = 0; i < ii; ++i)
{
if (levels[i] < level)
{
if (start >= 0)
{
chars = ReverseValues(chars, start, i);
start = -1;
}
}
else if (start < 0)
{
start = i;
}
}
if (start >= 0)
{
chars = ReverseValues(chars, start, levels.length);
}
}
// Build the result, dropping the '<' and '>' characters.
StringBuilder result = new StringBuilder(chars.length);
for (char ch : chars)
{
if (ch != '<' && ch != '>')
{
result.append(ch);
}
}
return new BidiResult(result.toString(), isLtr);
}
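The routine above implements a subset of the Unicode Bidirectional Algorithm by hand. As a cross-check, the same kind of reordering can be sketched with the JDK's own `java.text.Bidi` class (not to be confused with the custom `Bidi` class that holds `BidiText`). The `JdkBidiDemo` class and its `toVisualOrder` helper are illustrative names, and the sketch works on individual `char` values, so it does not treat surrogate pairs or mirrored characters specially.

```java
import java.text.Bidi;

public class JdkBidiDemo {

    // Reorders one line of text according to the embedding levels computed by java.text.Bidi.
    static String toVisualOrder(String line) {
        if (line.isEmpty()) {
            return line;
        }
        // Base direction is taken from the first strong character, defaulting to left-to-right.
        Bidi bidi = new Bidi(line, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        int n = line.length();
        byte[] levels = new byte[n];
        Character[] chars = new Character[n];
        for (int i = 0; i < n; i++) {
            levels[i] = (byte) bidi.getLevelAt(i);
            chars[i] = line.charAt(i);
        }
        // Reverses the runs in place, from the highest level down to the lowest odd level.
        Bidi.reorderVisually(levels, 0, chars, 0, n);
        StringBuilder sb = new StringBuilder(n);
        for (Character c : chars) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Latin text stays as it is; the Arabic run is reversed.
        System.out.println(toVisualOrder("abc \u0633\u0644\u0627\u0645"));
    }
}
```

Like `BidiText`, this should be applied to one line at a time. Whether reversing the runs this way restores the reading order of text extracted from a particular PDF depends on the order in which that PDF draws its glyphs, so check the output against a known sample.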