How can I split multiple joined words?

string nlp


I have an array of about 1000 entries, with examples below:

wickedweather
liquidweather
driveourtrucks
gocompact
slimprojector

I would like to be able to split these into their respective words, as:

wicked weather
liquid weather
drive our trucks
go compact
slim projector

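I was hoping a regular expression would do the trick. But since there is no boundary to stop on, and no capitalization of any kind to key on, I think some sort of reference to a dictionary may be necessary.

I suppose it could be done by hand, but why, when it can be done with code! =) But this has stumped me. Any ideas?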

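I think you're right that this isn't really a job for a regular expression. I would approach it using the dictionary idea: look for the longest prefix that is a word in the dictionary. When you find it, chop it off and do the same with the remainder of the string.

The above method is subject to ambiguity. For example, "drivereallyfast" would first find "driver" and then have trouble with "eallyfast". So you would also have to do some backtracking if you ran into this situation. Or, since you don't have that many strings to split, just split by hand the ones that fail the automated split.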
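The longest-prefix idea with backtracking described above can be sketched in Python roughly as follows. This is only an illustration, not code from the answer: the function name `split_longest_first` and the toy `WORDS` set are assumptions made for the example, and a real run would load a full word list instead.

```python
def split_longest_first(text, dictionary):
    """Return one split of text into dictionary words, or None if there is none.

    The longest matching prefix is tried first; if the remainder cannot be
    split, the loop falls back to shorter prefixes (backtracking).
    """
    if not text:
        return []
    for end in range(len(text), 0, -1):
        prefix = text[:end]
        if prefix in dictionary:
            rest = split_longest_first(text[end:], dictionary)
            if rest is not None:
                return [prefix] + rest
    return None


# Toy dictionary, assumed for illustration only.
WORDS = {"drive", "driver", "really", "fast", "our", "trucks"}

# "driver" is tried first, fails on "eallyfast", then "drive" succeeds.
print(split_longest_first("drivereallyfast", WORDS))  # ['drive', 'really', 'fast']
print(split_longest_first("driveourtrucks", WORDS))   # ['drive', 'our', 'trucks']
```

Without memoization the recursion can be exponential in the worst case, which should not matter for about 1000 short strings.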

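Well, the problem itself is not solvable with just a regular expression. A solution (probably not the best) would be to get a dictionary and do a regular expression match for each word in the dictionary against each word in the list, adding a space whenever the match succeeds. Certainly this would not be terribly quick, but it would be easy to program and faster than doing it by hand.

A dictionary-based solution would be required. This might be simplified somewhat if you have a limited dictionary of words that can occur; otherwise, words that form the prefix of other words are going to be a problem.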

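I may get downmodded for this, but have the secretary do it.
You'll spend more time on a dictionary solution than it would take to process the list manually. Further, you won't have 100% confidence in the solution, so you'll still have to give it manual attention anyway.

Can a human do it?

farsidebag
far sidebag
farside bag
far side bag

Not only do you have to use a dictionary, you might have to use a statistical approach to figure out what's most likely (or, god forbid, an actual HMM for your human language of choice...)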
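To make the ambiguity concrete, here is a small Python sketch that enumerates every dictionary split of "farsidebag". The generator name `all_splits` and the five-word `WORDS` set are hypothetical and chosen only so that the three readings listed above appear.

```python
def all_splits(text, dictionary):
    """Yield every way to split text into words found in dictionary."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in dictionary:
            for rest in all_splits(text[end:], dictionary):
                yield [prefix] + rest


# Toy dictionary, assumed for illustration only.
WORDS = {"far", "side", "bag", "farside", "sidebag"}

for words in all_splits("farsidebag", WORDS):
    print(" ".join(words))
# far side bag
# far sidebag
# farside bag
```

A dictionary alone cannot say which of the three is intended, which is why some ranking by likelihood is needed on top.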
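As for how to do statistics that might be helpful, I turn you to Dr. Peter Norvig, who addresses a different but related problem of spell-checking in 21 lines of code
(he does cheat a little by folding every for loop into a single line... but still).

Update: This got stuck in my head, so I had to birth it today. This code does a similar split to the one described by Robert Gamble, but then it orders the results based on word frequency in the provided dictionary file (which is now expected to be some text representative of your domain or of English in general. I used Norvig's big.txt and appended a dictionary to it, to cover missing words).
A combination of two words will most of the time beat a combination of three words, unless the frequency difference is enormous.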

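I posted this code with some minor changes on my blog,
and also wrote a little about the underflow bug in this code. I was tempted to just quietly fix it, but figured this may help some folks who haven't seen the log trick before.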
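As a rough illustration of the underflow problem and the log trick mentioned above (this is not the blog's code, which is not reproduced here): multiplying many small relative frequencies can underflow to 0.0, while summing their logarithms stays finite and preserves the ranking. The helper name `log_score` and all counts below are invented for the example.

```python
import math


def log_score(words, counts, total):
    """Sum of log relative frequencies; values closer to 0 are more likely."""
    return sum(math.log(counts[w] / total) for w in words)


# Invented counts for illustration; a real run would count words in a corpus.
counts = {"wicked": 30, "weather": 100, "we": 3000, "at": 5000,
          "her": 4000, "wick": 3, "ed": 5}
total = 1_000_000  # assumed corpus size

candidates = [
    ["wick", "ed", "weather"],
    ["wicked", "we", "at", "her"],
    ["wicked", "weather"],
]
ranked = sorted(candidates, key=lambda ws: log_score(ws, counts, total),
                reverse=True)
for words in ranked:
    print(" ".join(words), round(log_score(words, counts, total), 2))

# A plain product of 70 probabilities of 1e-5 underflows to zero,
# but the equivalent sum of logs is an ordinary finite number.
print(math.prod([1e-5] * 70))
print(sum(math.log(1e-5) for _ in range(70)))
```

With these made-up counts the two-word split ranks first, which matches the observation that fewer words usually win unless the frequency difference is very large.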

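Output on your words, plus a few of my own. Notice what happens with "orcore":

    perl splitwords.pl big.txt words
    answerveal: 2 possibilities
     - answer veal
     - answer ve al

    wickedweather: 4 possibilities
     - wicked weather
     - wicked we at her
     - wick ed weather
     - wick ed we at her

    liquidweather: 6 possibilities
     - liquid weather
     - liquid we at her
     - li quid weather
     - li quid we at her
     - li qu id weather
     - li qu id we at her

    driveourtrucks: 1 possibilities
     - drive our trucks

    gocompact: 1 possibilities
     - go compact

    slimprojector: 2 possibilities
     - slim projector
     - slim project or

    orcore: 3 possibilities
     - or core
     - or co re
     - orc ore

Code: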

    #!/usr/bin/env perl

    use strict;
    use warnings;

    sub find_matches($);
    sub find_matches_rec($\@\@);
    sub find_word_seq_score(@);
    sub get_word_stats($);
    sub print_results($@);
    sub Usage();

    our(%DICT,$TOTAL);
    {
      my( $dict_file, $word_file ) = @ARGV;
      ($dict_file && $word_file) or die(Usage);

      {
        my $DICT;
        ($DICT, $TOTAL) = get_word_stats($dict_file);
        %DICT = %$DICT;
      }

      {
        open( my $WORDS, '<', $word_file ) or die "unable to open $word_file\n";

        foreach my $word (<$WORDS>) {
          chomp $word;
          my $arr = find_matches($word);

          local $_;
          # Schwartzian Transform
          my @sorted_arr =
            map  { $_->[0] }
            sort { $b->[1] <=> $a->[1] }
            map  { [ $_, find_word_seq_score(@$_) ] }
            @$arr;

          print_results( $word, @sorted_arr );
        }

        close $WORDS;
      }
    }

    sub find_matches($){
      my( $string ) = @_;

      my @found_parses;
      my @words;
      find_matches_rec( $string, @words, @found_parses );

      return @found_parses if wantarray;
      return \@found_parses;
    }

    sub find_matches_rec($\@\@){
      my( $string, $words_sofar, $found_parses ) = @_;
      my $length = length $string;

      # Nothing left to split: record the words collected so far.
      unless( $length ){
        push @$found_parses, $words_sofar;

        return @$found_parses if wantarray;
        return $found_parses;
      }

      foreach my $i ( 2..$length ){
        my $prefix = substr($string, 0, $i);
        my $suffix = substr($string, $i, $length-$i);

        if( exists $DICT{$prefix} ){
          my @words = ( @$words_sofar, $prefix );
          find_matches_rec( $suffix, @words, @$found_parses );
        }
      }

      return @$found_parses if wantarray;
      return $found_parses;
    }

    ## Just a simple joint probability
    ## assumes independence between words, which is obviously untrue
    ## that's why this is broken out -- feel free to add better brains
    sub find_word_seq_score(@){
      my( @words ) = @_;
      local $_;

      my $score = 1;
      foreach ( @words ){
        $score = $score * $DICT{$_} / $TOTAL;
      }

      return $score;
    }

    sub get_word_stats($){
      my ($filename) = @_;

      open(my $DICT, '<', $filename) or die "unable to open $filename\n";

      local $/= undef;
      local $_;
      my %dict;
      my $total = 0;

      while ( <$DICT> ){
        foreach ( split(/\b/, $_) ) {
          $dict{$_} += 1;
          $total++;
        }
      }

      close $DICT;

      return (\%dict, $total);
    }

    sub print_results($@){
      #( 'word', [qw'test one'], [qw'test two'], ... )
      my ($word, @combos) = @_;
      local $_;
      my $possible = scalar @combos;

      print "$word: $possible possibilities\n";
      foreach (@combos) {
        print ' - ', join(' ', @$_), "\n";
      }
      print "\n";
    }

    sub Usage(){
      return "$0 /path/to/dictionary /path/to/your_words";
    }

The best tool for the job here is recursion, not regular expressions. The basic idea is to start from the beginning of the string looking for a word, then take the remainder of the string and look for another word, and so on until the end of the string is reached. A recursive solution is natural since backtracking needs to happen when a given remainder of the string cannot be broken into a set of words. The solution below uses a dictionary to determine what is a word and prints out solutions as it finds them (some strings can be broken into multiple possible sets of words, for example wickedweather could be parsed as "wicked we at her"). If you just want one set of words, you will need to determine the rules for selecting the best set, perhaps by selecting the solution with the fewest words or by setting a minimum word length.

    #!/usr/bin/perl

    use strict;

    my $WORD_FILE = '/usr/share/dict/words'; #Change as needed
    my %words; # Hash of words in dictionary

    # Open dictionary, load words into hash
    open(WORDS, $WORD_FILE) or die "Failed to open dictionary: $!\n";
    while (<WORDS>) {
      chomp;
      $words{lc($_)} = 1;
    }
    close(WORDS);

    # Read one line at a time from stdin, break into words
    while (<>) {
      chomp;
      my @words;
      find_words(lc($_));
    }

    sub find_words {
      # Print every way $string can be parsed into whole words
      my $string = shift;
      my @words = @_;
      my $length = length $string;

      foreach my $i ( 1 .. $length ) {
        my $word = substr $string, 0, $i;
        my $remainder = substr $string, $i, $length - $i;
        # Some dictionaries contain each letter as a word
        next if ($i == 1 && ($word ne "a" && $word ne "i"));

        if (defined($words{$word})) {
          push @words, $word;
          if ($remainder eq "") {
            print join(' ', @words), "\n";
            return;
          } else {
            find_words($remainder, @words);
          }
          pop @words;
        }
      }

      return;
    }

Python session using viterbi_segment:

    >>> viterbi_segment('wickedweather')
    (['wicked', 'weather'], 5.1518198982768158e-10)
    >>> ' '.join(viterbi_segment('itseasyformetosplitlongruntogetherblocks')[0])
    'its easy for me to split long run together blocks'

Output from spiral:

    # spiral mStartCData nonnegativedecimaltype getUtf8Octets GPSmodule savefileas nbrOfbugs
    mStartCData: ['m', 'Start', 'C', 'Data']
    nonnegativedecimaltype: ['nonnegative', 'decimal', 'type']
    getUtf8Octets: ['get', 'Utf8', 'Octets']
    GPSmodule: ['GPS', 'module']
    savefileas: ['save', 'file', 'as']
    nbrOfbugs: ['nbr', 'Of', 'bugs']

    # spiral wickedweather liquidweather driveourtrucks gocompact slimprojector
    wickedweather: ['wicked', 'weather']
    liquidweather: ['liquid', 'weather']
    driveourtrucks: ['driveourtrucks']
    gocompact: ['go', 'compact']
    slimprojector: ['slim', 'projector']

Using mlmorph:

    from mlmorph import Analyser
    analyser = Analyser()
    analyser.analyse("കേരളത്തിന്റെ")
    [('കേരളം<np><genitive>', 179)]

Using wordsegment from the command line:

    $ echo thisisatest | python -m wordsegment
    this is a test

Using wordninja:

    >>> import wordninja
    >>> wordninja.split('bettergood')
    ['better', 'good']

JavaScript:

    function spinalCase(str) {
      let lowercase = str.trim()
      let regEx = /\W+|(?=[A-Z])|_/g
      let result = lowercase.split(regEx).join("-").toLowerCase()
      return result;
    }

    spinalCase("AllThe-small Things");

Java:

    static List<String> wordBreak(
        String input,
        Set<String> dictionary
    ) {
      List<List<String>> result = new ArrayList<>();
      List<String> r = new ArrayList<>();

      helper(input, dictionary, result, "", 0, new Stack<>());

      for (List<String> strings : result) {
        String s = String.join(" ", strings);
        r.add(s);
      }

      return r;
    }

    static void helper(
        final String input,
        final Set<String> dictionary,
        final List<List<String>> result,
        String state,
        int index,
        Stack<String> stack
    ) {
      if (index == input.length()) {
        // add the last word
        stack.push(state);

        for (String s : stack) {
          if (!dictionary.contains(s)) {
            return;
          }
        }

        result.add((List<String>) stack.clone());
        return;
      }

      if (dictionary.contains(state)) {
        // bifurcate
        stack.push(state);
        helper(input, dictionary, result, "" + input.charAt(index), index + 1, stack);

        String pop = stack.pop();
        String s = stack.pop();

        helper(input, dictionary, result, s + pop.charAt(0), index + 1, stack);
      } else {
        helper(input, dictionary, result, state + input.charAt(index), index + 1, stack);
      }

      return;
    }
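The `viterbi_segment` session in this thread shows only results, not the function itself. As a hedged sketch of the general technique (dynamic programming over split points, keeping the most probable split of each prefix), something like the following could be used. The name `viterbi_split`, the `max_word_len` parameter, and the probability table are all assumptions for this example and are not the code that produced the output shown earlier.

```python
import math


def viterbi_split(text, word_prob, max_word_len=20):
    """Return (words, log_probability) for the most probable split of text.

    best[i] is the best log-probability of any split of text[:i];
    back[i] is the start index of the last word in that split.
    Returns (None, -inf) when no split into known words exists.
    """
    n = len(text)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in word_prob and best[start] > -math.inf:
                score = best[start] + math.log(word_prob[word])
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        return None, -math.inf
    # Walk the back-pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1], best[n]


# Invented unigram probabilities, for illustration only.
PROBS = {"wicked": 3e-5, "weather": 1e-4, "we": 3e-3, "at": 5e-3,
         "her": 4e-3, "wick": 3e-6, "ed": 5e-6}

print(viterbi_split("wickedweather", PROBS)[0])  # ['wicked', 'weather']
```

Unlike the recursive enumerations, this visits each (start, end) pair once, so it stays fast on long inputs, at the cost of returning only the single best split.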