What is the fastest way to delete lines from a file that do not match a second file?

ruby perl bash python-2.7

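I have two files, wordlist.txt and text.txt.

The first file, wordlist.txt, contains a huge list of words in Chinese, Japanese, and Korean, e.g.:

你
你们
我

The second file, text.txt, contains long passages, e.g.:

你们要去哪里？
卡拉OK好不好？

I want to create a new word list (wordsfound.txt), but it should contain only those lines from wordlist.txt that are found at least once in text.txt. For the example above, the output file should show this:

你
你们

"我" does not appear in this list because it is never found in text.txt.

I want to find a very fast way to create this list, which contains only the lines from the first file that are found in the second.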

I know a simple way in BASH, using grep, to check each line in wordlist.txt and see whether it is in text.txt:

a=1
while read -r line
do
    c=`grep -c -- "$line" text.txt`
    if [ "$c" -ge 1 ]
    then
        echo "$line" >> wordsfound.txt
        echo "Found" $a
    else
        echo "Not found" $a
    fi
    a=`expr $a + 1`
done < wordlist.txt

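Since "我" is never found in text.txt, it stands to reason that "我们" never appears either. A faster script could check "我" first and, on finding that it is absent, skip every subsequent word in wordlist.txt that also contains "我". With about 8,000 unique characters in wordlist.txt, the script should not need to check nearly so many lines.

What is the fastest way to create a list containing only those words from the first file that are also found in the second?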
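
The shortcut described in the question (a word cannot occur if one of its characters never occurs) can be sketched in a few lines. This is only a minimal Python illustration, not a benchmarked solution; the function name `filter_words` and the in-memory sample data are assumptions made for the example.

```python
def filter_words(words, text):
    """Return the words that occur in text, skipping impossible ones early."""
    # Every distinct character that appears anywhere in the text.
    text_chars = set(text)
    found = []
    for word in words:
        if not word:
            continue  # ignore empty lines in the word list
        # If any character of the word never appears in the text,
        # the word cannot appear either, so skip the substring search.
        if not text_chars.issuperset(word):
            continue
        if word in text:
            found.append(word)
    return found


# Sample data taken from the question, plus "我们" to show the shortcut.
words = ["你", "你们", "我", "我们"]
text = "你们要去哪里？\n卡拉OK好不好？"
print(filter_words(words, text))  # ['你', '你们']
```

For real files, `words` would come from reading wordlist.txt line by line and `text` from reading text.txt as UTF-8. Whether the character pre-check pays off depends on how many words contain characters that are absent from the text.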
The simplest way, using a bash script:
new file newlist.txt
for each word in wordlist.txt:
    check if word is in text.txt (I would use grep, if you're willing to use bash)
    if yes:
        append it to newlist.txt (probably echo word >> newlist.txt)
    if no:
        next word

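First preprocess with "tr" and "sort" to format it as one word per line and remove duplicate lines.
Then do this:
cat wordlist.txt | while read i; do grep -E "^$i$" text.txt; done
That gives you the list of words you want…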
Just use comm:

comm -1 wordlist.txt text.txt
This might work for you:
 tr '[:punct:]' ' ' < text.txt | tr -s ' ' '\n' |sort -u | grep -f - wordlist.txt

Use grep with fixed-string (-F) semantics; this will be the fastest:
sort -u wordlist.txt > wordlist-unique.txt
grep -F -f wordlist-unique.txt text.txt

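I'm surprised that there are already four answers but nobody has posted this yet. People just don't know their toolbox anymore. Try this:
cat wordlist.txt | while read line
do
    if [[ `grep -wc -- "$line" text.txt` -gt 0 ]]
    then
        echo "$line"
    fi
done
Whatever you do, if you use grep you must use -w to match whole words. Otherwise, if you have foo in wordlist.txt and foobar in text.txt, you will get a false match.
If the files are very large and this loop takes too long to run, you can convert text.txt to a word list (easy with AWK) and use comm to find the words that are in both lists.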
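
The whole-word caveat and the "turn the text into a word list, then intersect" idea can be illustrated together. The following is a rough Python sketch under the assumption that words in the text are separated by spaces or punctuation, which does not hold for unsegmented CJK text; `words_in_text` is a name invented for this example.

```python
import re


def words_in_text(words, text):
    """Return the words that occur in text as whole tokens."""
    # Split the text into runs of word characters. This assumes that
    # spaces or punctuation separate the words.
    tokens = set(re.findall(r"\w+", text))
    # Set membership is a whole-token comparison, so "foo" does not
    # match inside "foobar".
    return [w for w in words if w in tokens]


print(words_in_text(["foo", "bar"], "foobar and bar"))  # ['bar']
```

Because the lookup is a set membership test, the cost per word does not grow with the size of the text once the token set is built.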
Pretty sure this is not the fastest solution, but at least it works (I hope).
This solution requires ruby 1.9, and the text files should be UTF-8.
#encoding: utf-8
#Get test data
$wordlist = File.readlines('wordlist.txt', :encoding => 'utf-8').map{|x| x.strip}
$txt = File.read('text.txt', :encoding => 'utf-8')

new_wordlist = []
$wordlist.each{|word|
  new_wordlist << word if $txt.include?(word)
}

#Save the result
File.open('wordlist_new.txt', 'w:utf-8'){|f|
  f << new_wordlist.join("\n")
}

Counting the occurrences of each word also works, but include? is faster than counting because it can stop after the first hit. Another variant first generates the list of all character combinations from the text, up to the length of the longest word, and looks each word up in that list.
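
One idea that comes up in this thread is to generate every character sequence in the text, up to the length of the longest word, and then look the words up. A minimal Python sketch of that idea follows; the function names are illustrative only, and the approach assumes the resulting set fits in memory (roughly text length times longest word length entries).

```python
def substrings_up_to(text, max_len):
    """Collect every substring of text with length 1..max_len."""
    grams = set()
    for i in range(len(text)):
        for n in range(1, max_len + 1):
            if i + n > len(text):
                break  # the substring would run past the end of the text
            grams.add(text[i:i + n])
    return grams


def filter_by_substrings(words, text):
    """Return the words that occur somewhere in text."""
    words = [w for w in words if w]  # drop empty lines
    if not words:
        return []
    max_len = max(len(w) for w in words)
    grams = substrings_up_to(text, max_len)
    # Each lookup is now a hash lookup instead of a scan of the text.
    return [w for w in words if w in grams]


print(filter_by_substrings(["你", "你们", "我"], "你们要去哪里？\n卡拉OK好不好？"))
# ['你', '你们']
```

This trades memory for speed: building the set is a single pass over the text, after which each of the words costs one lookup.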
I would probably use Perl:
use strict;

my @aWordList = ();

open(WORDLIST, "< wordlist.txt") || die("Can't open wordlist.txt");

while(my $sWord = <WORDLIST>)
{
   chomp($sWord);
   push(@aWordList, $sWord);
}

close(WORDLIST);

open(TEXT, "< text.txt") || die("Can't open text.txt");

while(my $sText = <TEXT>)
{
   foreach my $sWord (@aWordList)
   {
      if($sText =~ /\Q$sWord\E/)   # \Q..\E: treat the word as literal text, not as a regex
      {
          print("$sWord\n");
      }
   }
}


close(TEXT);

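This won't be too slow, but if you can let us know the size of the files you are dealing with, I can try to write something cleverer using hash tables.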
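First, a TXR Lisp solution:
(TXR reads UTF-8 and does all string manipulation in Unicode, so testing with ASCII characters is valid.)
Using lazy lists means, for instance, that we do not store the entire list of 300,000 words. Although we use the Lisp mapcar function, the list is generated on the fly, and because we do not keep a reference to the head of the list, it is eligible for garbage collection.
Unfortunately, we have to hold the text corpus in memory, because the hash table associates lines.
If that is a problem, the solution can be inverted: scan all the words, then process the text corpus lazily, tagging the words that occur, and finally eliminate the rest. I will post such a solution as well.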
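
The inverted strategy described here (hold the words in memory, read the text lazily, tag the words that occur, drop the rest) might look roughly like this in Python. It is a sketch of the general idea, not a translation of the TXR code; `find_words_streaming` is a hypothetical name.

```python
import io


def find_words_streaming(words, lines):
    """Return the words seen in any line, consuming the lines lazily."""
    # Deduplicate while keeping the word list order; skip empty entries.
    wanted = list(dict.fromkeys(w for w in words if w))
    remaining = set(wanted)  # words not yet seen in the text
    for line in lines:
        # Tag every still-missing word that occurs in this line.
        hits = {w for w in remaining if w in line}
        remaining -= hits
        if not remaining:
            break  # every word has been found; stop reading
    # Eliminate the words that were never tagged.
    return [w for w in wanted if w not in remaining]


# Any iterable of lines works; an open file object behaves the same way.
text = io.StringIO("你们要去哪里？\n卡拉OK好不好？\n")
print(find_words_streaming(["你", "你们", "我"], text))  # ['你', '你们']
```

Only one line of the text is held at a time, at the cost of testing every still-missing word against every line, so this favors low memory use over raw speed.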
This solution uses perl, keeps your original notation, and uses the optimization you suggested:
#!/usr/bin/perl
@list=split("\n",`sort < ./wordlist.txt | uniq`);
$size=scalar(@list);
for ($i=0;$i<$size;++$i) { $list[$i]=quotemeta($list[$i]);}
for ($i=0;$i<$size;++$i) {
    my $j = $i+1;
    while ($list[$j]=~/^$list[$i]/) {
            ++$j;
    }
    $skip[$i]=($j-$i-1);
}
open IN,"<./text.txt" or die;
@text = (<IN>);
close IN;
foreach $c(@text) {
    for ($i=0;$i<$size;++$i) {
            if ($c=~/$list[$i]/) {
                    $found{$list[$i]}=1;
                    # no "last" here: other words may match the same line too
            }
            else {
                    $i+=$skip[$i];
            }
    }
}
open OUT,">wordsfound.txt" or die;
while ( my ($key, $value) = each(%found) ) {
        print OUT "$key\n";
}
close OUT;
exit;

Use parallel processing to speed things up.
1) Run sort & uniq on wordlist.txt, then split it into several files (X).
Do some tests; X equal to the number of cores in your computer.
 split -d -l wordlist.txt

2) Use xargs -P X -n 1 script.sh x00 > output-x00.txt
to process the files in parallel:
 find ./splitted_files_dir -type f -name "x*" -print| xargs -P 20 -n 1 -I SPLITTED_FILE script.sh SPLITTED_FILE

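3) cat output* > output.txt to concatenate the output files.
This will speed up the processing considerably, and you get to use tools you understand, which lowers the maintenance "cost".
The script is almost the same as the one you originally used:
script.sh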
FILE=$1                                       # one chunk of the split word list
OUTPUTFILE="output-$(basename "$FILE").txt"
TEXT="text.txt"                               # assumption: each chunk is checked against text.txt
a=1
while read -r line
do
    c=`grep -c -- "$line" "$TEXT"`
    if [ "$c" -ge 1 ]
    then
        echo "$line" >> "$OUTPUTFILE"
        echo "Found" $a
    else
        echo "Not found" $a
    fi
    a=`expr $a + 1`
done < "$FILE"

Sample run of the TXR Lisp solution:
$ txr words.tl words.txt text.txt
water
fire
earth
the

$ cat words.txt
water
fire
earth
the
it

$ cat text.txt
Long ago people
believed that the four
elements were
just
water
fire
earth

perl findwords.pl --wordlist=/path/to/wordlist --text=/path/to/text > wordsfound.txt

use strict;
use warnings;
use utf8::all;

use Getopt::Long;

my $wordlist = '/usr/share/dict/words';
my $text     = 'war_and_peace.txt';

GetOptions(
    "wordlist=s" => \$wordlist,
    "text=s"    => \$text,
);

open my $text_fh, '<', $text
    or die "Cannot open '$text' for reading: $!";

my %is_in_text;
while ( my $line = <$text_fh> ) {
    chomp($line);

    # you will want to customize this line
    my @words = grep { $_ } split /[[:punct:][:space:]]/ => $line;
    next unless @words;

    # This beasty uses the 'x' builtin in list context to assign
    # the value of 1 to all keys (the words)
    @is_in_text{@words} = (1) x @words;
}

open my $wordlist_fh, '<', $wordlist
    or die "Cannot open '$wordlist' for reading: $!";

while ( my $word = <$wordlist_fh> ) {
    chomp($word);
    if ( $is_in_text{$word} ) {
        print "$word\n";
    }
}

• [ovid] $ wc -w war_and_peace.txt 
565450 war_and_peace.txt
• [ovid] $ time perl findwords.pl > wordsfound.txt 

real    0m1.081s
user    0m1.076s
sys 0m0.000s
• [ovid] $ wc -w wordsfound.txt 
15277 wordsfound.txt