Perl Image::OCR::Tesseract module on Windows
Does anyone know of an elegant way to install the "Image::OCR::Tesseract" module on Windows? The module cannot be installed through CPAN on Windows because it depends on a *NIX-only module named "LEOCHARRE::CLI". That module does not appear to be needed to run "Image::OCR::Tesseract" itself.

I got the module working by first manually installing the dependencies listed in makefile.pl (except "LEOCHARRE::CLI") and then moving the module file into the correct directory structure under "C:\Perl\site\lib\Image\OCR". The last step was to modify the part of the code that calls the ImageMagick and Tesseract executables from the command line, so that the program names are wrapped in quotes when the module invokes them.
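This works, but I would feel better if the module could be installed from a repository through PPM or CPAN on production systems running Windows.

Never mind, I figured it out, although I can't decide which solution is better.

Getting the installer to run on Windows through the traditional "perl makefile.pl, make, make test, make install" routine requires editing the makefile.pl script to include the missing Windows-installable module (Devel::AssertOS::MSWin32) and patching AssertEXE.pm to use "File::Which" instead of the built-in shell "which" command, which Windows lacks. Even then, "Image::OCR::Tesseract" still has to be patched so that it puts quotes around the program names when it runs "convert" and "tesseract" from the command line.

Given the number of steps needed to get the installer working on Windows, and the fact that the module does not build any binary component to link against, I think the best way to install and use the Tesseract module on Windows is to first install the following binary packages:

ImageMagick
Tesseract

Next, find your Perl module directory; on my system it is "C:\Perl\site\lib". Create a folder named "Image" there if you don't already have one. Open the Image folder and create a folder named "OCR" inside it. Open the OCR folder. At this point your path should be "C:\Perl\site\lib\Image\OCR". Create a new text file named "Tesseract.pm" and copy the following content into it: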
package Image::OCR::Tesseract;
use strict;
use Carp;
use Cwd;
use String::ShellQuote 'shell_quote';
use Exporter;
use vars qw(@EXPORT_OK @ISA $VERSION $DEBUG $WHICH_TESSERACT $WHICH_CONVERT %EXPORT_TAGS @TRASH);
@ISA = qw(Exporter);
@EXPORT_OK = qw(get_ocr get_hocr _tesseract convert_8bpp_tif tesseract);
$VERSION = sprintf "%d.%02d", q$Revision: 1.24 $ =~ /(\d+)/g;
%EXPORT_TAGS = ( all => \@EXPORT_OK );
BEGIN {
    use File::Which 'which';
    $WHICH_TESSERACT = which('tesseract');
    $WHICH_CONVERT   = which('convert');
    # Check before quoting: a quoted empty path ('""') would otherwise count as true.
    $WHICH_TESSERACT or die("Is tesseract installed? Cannot find bin path to tesseract.");
    $WHICH_CONVERT or die("Is convert installed? Cannot find bin path to convert.");
    # On Windows, quote the paths so install locations containing spaces work.
    if ( $^O =~ m/MSWin/ ) {
        $WHICH_TESSERACT = '"' . $WHICH_TESSERACT . '"';
        $WHICH_CONVERT   = '"' . $WHICH_CONVERT . '"';
    }
}
END {
    scalar @TRASH or return;
    if ( $DEBUG ) {
        print STDERR "Debug on, these are trash files:\n" . join("\n", @TRASH);
    }
    else {
        unlink @TRASH;
    }
}
sub DEBUG { Carp::cluck("Image::OCR::Tesseract::DEBUG() deprecated") }
sub get_hocr {
    my ($abs_image, $abs_tmp_dir, $lang) = @_;
    -f $abs_image or croak("$abs_image is not a file on disk");
    my $hocr = "hocr";
    if ( defined $abs_tmp_dir ) {
        -d $abs_tmp_dir or die("tmp dir arg $abs_tmp_dir not a dir on disk.");
        # match the filename after the last path separator (/ or, on Windows, \)
        $abs_image =~ /([^\/\\]+)$/ or die("cant match filename in path arg '$abs_image'");
        my $abs_copy = "$abs_tmp_dir/$1";
        # TODO, what if source and dest are same, i want it to die
        require File::Copy;
        File::Copy::copy($abs_image, $abs_copy)
            or die("cant make copy of $abs_image to $abs_copy, $!");
        # change the image to get ocr from to be the copy
        $abs_image = $abs_copy;
        # since it's a copy. erase that on exit
        push @TRASH, $abs_image;
    }
    my $tmp_tif = convert_8bpp_tif($abs_image);
    push @TRASH, $tmp_tif; # for later delete
    _tesseract($tmp_tif, $lang, $hocr) || '';
}
sub get_ocr {
    my ($abs_image, $abs_tmp_dir, $lang) = @_;
    -f $abs_image or croak("$abs_image is not a file on disk");
    if ( defined $abs_tmp_dir ) {
        -d $abs_tmp_dir or die("tmp dir arg $abs_tmp_dir not a dir on disk.");
        # match the filename after the last path separator (/ or, on Windows, \)
        $abs_image =~ /([^\/\\]+)$/ or die("cant match filename in path arg '$abs_image'");
        my $abs_copy = "$abs_tmp_dir/$1";
        # TODO, what if source and dest are same, i want it to die
        require File::Copy;
        File::Copy::copy($abs_image, $abs_copy)
            or die("cant make copy of $abs_image to $abs_copy, $!");
        # change the image to get ocr from to be the copy
        $abs_image = $abs_copy;
        # since it's a copy. erase that on exit
        push @TRASH, $abs_image;
    }
    my $tmp_tif = convert_8bpp_tif($abs_image);
    push @TRASH, $tmp_tif; # for later delete
    _tesseract($tmp_tif, $lang) || '';
}
sub convert_8bpp_tif {
    my ($abs_img, $abs_out) = (shift, shift);
    defined $abs_img or die('missing image arg');
    $abs_out ||= $abs_img . '.tmp.' . time() . (int rand(9000)) . '.tif';
    my @arg = ( $WHICH_CONVERT, $abs_img, '-compress', 'none', '+matte', $abs_out );
    #die (join(" ", @arg));
    system(@arg) == 0 or die("convert $abs_img error.. $?");
    $DEBUG and warn("made $abs_out 8bpp tiff.");
    $abs_out;
}
# people expect tesseract to automatically convert
*tesseract = \&_tesseract;
sub _tesseract {
    my ($abs_image, $lang, $hocr) = @_;
    defined $abs_image or croak('missing image path arg');
    $abs_image =~ /\.tif+$/i or warn("Are you sure '$abs_image' is a tif image? This operation may fail.");
    #my @arg = (
    #   $WHICH_TESSERACT, shell_quote($abs_image), shell_quote($abs_image),
    #   (defined $lang and ('-l', $lang) ), '2>/dev/null'
    #);
    # discard stderr portably: File::Spec->devnull is 'nul' on Windows, '/dev/null' on *NIX
    require File::Spec;
    my $cmd =
        ( sprintf '%s %s %s',
            $WHICH_TESSERACT,
            shell_quote($abs_image),
            shell_quote($abs_image)
        ) .
        ( defined $lang ? " -l $lang" : '' ) .
        ( defined $hocr ? " hocr" : '' ) .
        " 2>" . File::Spec->devnull();
    $DEBUG and warn "command: $cmd";
    system($cmd); # hard to check ==0
    # tesseract writes its output next to the image, as <image>.txt or <image>.html
    my $txt = $abs_image . ( $hocr ? ".html" : ".txt" );
    unless ( -f $txt ) {
        Carp::cluck("no text output for image '$abs_image'. (No text file '$txt' found on disk)");
        return;
    }
    $DEBUG and warn "Found text file '$txt'";
    my $content = ( _slurp($txt) || '' );
    $DEBUG and warn("content length of text in '$txt' from image '$abs_image' is " . length $content);
    push @TRASH, $txt;
    $content;
}
sub _slurp {
    my $abs = shift;
    # lexical filehandle instead of the bareword FILE
    open(my $fh, '<', $abs) or die("can't open file for reading '$abs', $!");
    local $/;
    my $txt = <$fh>;
    close $fh;
    $txt;
}
1;
__END__
#sub _force_imgtype {
# my $img = shift;
# my $type = shift;
# my $delete_original = shift;
# $delete_original ||=0;
#
#
# if($img=~/\.$type$/i){
# return $img;
# }
#
# my $img_out= $img;
# $img_out=~s/\.\w{1,5}$/\.$type/ or die("cant get file ext for $img");
#
#
#
#}
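That's it. Messy, I know, but there is no good way around the fact that the module's installer really needs to be updated to support Windows (and other) systems, even though the module code itself runs with almost no modification. In fact, if Tesseract and ImageMagick are installed to paths without spaces, the "Image::OCR::Tesseract" module code needs no changes at all, but this small tweak lets the supporting executables be installed anywhere, including their default locations.

You should probably report your findings upstream.

Once I've worked out a few minor kinks, I'll try contacting the author directly in the next few days. He doesn't seem very responsive on CPAN, but I found his home page earlier today.

Further update: I tried contacting the module author shortly after posting this question, but got no response. Sadly, I think we have to consider this Perl module abandoned.

Rick: can you recommend a good, actively maintained Perl module? I'm starting a project in which I have to read text from images.

Even in its current "abandoned" state, Image::OCR::Tesseract still appears to be the best option (for Perl). Really, since the command-line Tesseract executable does most of the work for you, writing a module of this type isn't hard. All a new set of code needs to do is handle converting the image to TIFF, pass the right arguments to Tesseract, and then collect and return the output.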
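As a rough illustration of that last comment, here is a minimal sketch of such a wrapper. It is not the module's code: the names build_ocr_commands and ocr_image are hypothetical, and it assumes that "convert" and "tesseract" can be found on the PATH. Passing the arguments to system() as a list, rather than as one command string, generally avoids having to quote paths by hand.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: build both command lines as argument lists.
# Assumes 'convert' (ImageMagick) and 'tesseract' are on the PATH.
sub build_ocr_commands {
    my ($image, $workdir, $lang) = @_;
    my $tif  = File::Spec->catfile($workdir, 'page.tif');
    my $base = File::Spec->catfile($workdir, 'page');   # tesseract appends .txt
    my @convert   = ('convert', $image, '-compress', 'none', $tif);
    my @tesseract = ('tesseract', $tif, $base);
    push @tesseract, '-l', $lang if defined $lang;
    return (\@convert, \@tesseract, "$base.txt");
}

# Hypothetical entry point: convert to TIFF, run tesseract, return the text.
sub ocr_image {
    my ($image, $lang) = @_;
    -f $image or die "'$image' is not a file on disk\n";
    my $workdir = tempdir(CLEANUP => 1);   # temporary files are removed at exit
    my ($convert, $tesseract, $txt) = build_ocr_commands($image, $workdir, $lang);
    system(@$convert) == 0   or die "convert failed: $?\n";
    system(@$tesseract) == 0 or die "tesseract failed: $?\n";
    open(my $fh, '<', $txt) or die "cannot read '$txt': $!\n";
    local $/;                              # slurp the whole file
    my $text = <$fh>;
    close $fh;
    return defined $text ? $text : '';
}
```

With both programs installed, a call such as ocr_image('scan.jpg', 'eng') would be expected to return the recognized text; the language argument is optional.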
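# Test script: run OCR on an image using the Tesseract.pm module installed above.
use strict;
use warnings;
use Image::OCR::Tesseract;
my $image = 'SomeImageFileThatContainsText.jpg';
my $text = Image::OCR::Tesseract::get_ocr($image);
print "Text...\n";
print $text."\n";
print "Normal Exit\n";
exit;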
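The manual installation steps in the answer (finding the Perl module directory and creating Image\OCR beneath it) could also be scripted. The following is only a sketch under assumptions: the function names module_dir and prepare_module_dir are hypothetical, and it assumes that the site library directory reported by the core Config module is the one you want, which on the answerer's system would presumably be "C:\Perl\site\lib".

```perl
use strict;
use warnings;
use Config;
use File::Path qw(make_path);
use File::Spec;

# Hypothetical helper: directory where Tesseract.pm lives under a library root.
sub module_dir {
    my ($lib_root) = @_;
    return File::Spec->catdir($lib_root, 'Image', 'OCR');
}

# Hypothetical helper: create Image/OCR under the library root (default: this
# perl's site library) and return the full path where Tesseract.pm should go.
sub prepare_module_dir {
    my ($lib_root) = @_;
    $lib_root = $Config{sitelib} unless defined $lib_root;
    my $dir = module_dir($lib_root);
    make_path($dir) unless -d $dir;
    -d $dir or die "could not create '$dir'\n";
    return File::Spec->catfile($dir, 'Tesseract.pm');
}
```

Calling prepare_module_dir() with no argument would print-ready return the target path for Tesseract.pm; writing to the site library may require administrator rights.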