Stemming English text in Perl


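I am trying to stem an English text. I have read a lot of forums, but I have not seen a clear example. I want to use the Porter stemmer, as provided by Text::English. This is how far I have got: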

use Lingua::StopWords qw(getStopWords);
my $stopwords = getStopWords('en');
use Text::English;

@stopwords = grep { $stopwords->{$_} } (keys %$stopwords);

chdir("c:/Test Facility/input");
@files = <*>;

foreach $file (@files)
{
    open (input, $file);

    while (<input>)
    {
        open (output, ">>c:/testfacility/normalized/" . $file);
        chomp;
        for my $w (@stopwords)
        {
            s/\b\Q$w\E\b//ig;
        }
        $_ =~ s/<[^>]*>//g;
        $_ =~ s/[[:punct:]]//g;
        ## What should I write here to apply Porter stemming using Text::English? ##
        print output "$_\n";
    }
}

close (input);
close (output);

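Run the following code like this:

perl stemmer.pl /usr/lib/jvm/java-6-sun-1.6.0.26/jre/LICENSE

It produces output similar to:

operat system distributor licens java version sun microsystems inc sun will licens java platform standard edit develop kit jdk

Note that, in addition to stopwords, strings of length 1 and numeric values are also removed.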

#!/usr/bin/env perl
use common::sense;

use Encode;
use Lingua::Stem::Snowball;
use Lingua::StopWords qw(getStopWords);
use Scalar::Util qw(looks_like_number);

my $stemmer = Lingua::Stem::Snowball->new(
    encoding    => 'UTF-8',
    lang        => 'en',
);

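# Build a lookup hash: every lowercased stopword maps to a true value.
# (A bare list of words assigned to a hash would pair them up as key/value.)
my %stopwords = map {
    (lc($_) => 1)
} keys %{getStopWords(en => 'UTF-8')};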

local $, = ' ';
say map {
    sub {
        my @w =
            map {
                encode_utf8 $_
            } grep {
                length >= 2
                and not looks_like_number($_)
                and not exists $stopwords{lc($_)}
            } split
                /[\W_]+/x,
                shift;

        $stemmer->stem_in_place(\@w);

        map {
            lc decode_utf8 $_
        } @w
    }->($_);
} <>;
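
The token filter in the script above can be tried on its own, without installing any CPAN modules. The following is a minimal sketch under stated assumptions: the stopword list is a tiny hypothetical stand-in for the one returned by Lingua::StopWords, the helper name filter_tokens is made up for illustration, and no stemming is performed.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical mini stopword list, for illustration only;
# the script above loads the full English list from Lingua::StopWords.
my %stopwords = map { ($_ => 1) } qw(a an and for is of the to);

# Split a line on runs of non-word characters or underscores, then keep
# tokens that are at least 2 characters long, do not look like numbers,
# and are not stopwords. Survivors are lowercased; nothing is stemmed.
sub filter_tokens {
    my ($line) = @_;
    return map  { lc $_ }
           grep {
               length($_) >= 2
               and not looks_like_number($_)
               and not exists $stopwords{lc $_}
           }
           split /[\W_]+/, $line;
}

print join(' ', filter_tokens('Sun is willing to license the JDK 6 for a fee')), "\n";
```

With this stand-in list the script prints "sun willing license jdk fee". Passing the surviving tokens to a stemmer, as the full script does with stem_in_place, would then reduce forms such as "willing" and "license" to their stems.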
