Efficiently remove lines from File A that contain a string from File B

python perl bash unix

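FileA contains lines, and FileB contains words.

How can I efficiently remove the lines from FileA that contain words found in FileB?

I tried the following approaches, but I'm not even sure whether they work, because they take too long to run.

Tried grep: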

grep -v -f <(awk '{print $1}' FileB.txt) FileA.txt > out

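Tried Python:

import sys

# The word list is read from the first argument; output goes to the second.
with open(sys.argv[1], 'r') as f:
    bad_words = f.read().splitlines()

with open('FileA') as master_lines, open(sys.argv[2], 'w') as out:
    for line in master_lines:
        # Substring test against every word: one scan per word per line.
        if not any(bad_word in line for bad_word in bad_words):
            out.write(line)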

FileA:

abadan refinery is one of the largest in the world.
a bad apple spoils the barrel.
abaiara is a city in the south region of brazil.
a ban has been imposed on the use of faxes

FileB:

abadan
abaiara

Desired output:

a bad apple spoils the barrel.
a ban has been imposed on the use of faxes

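The command you are using looks fine, so it may be time to try a good scripting language. Try running the following perl script and see whether it reports back any faster.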

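#!/usr/bin/perl

use strict;
use warnings;

# NOTE: here fileA is the word list (lookup) and fileB is the file to filter.
# The question's sample data uses the opposite naming, so swap the two names if needed.
open my $LOOKUP, "<", "fileA" or die "Cannot open lookup file: $!";
open my $MASTER, "<", "fileB" or die "Cannot open Master file: $!";
open my $OUTPUT, ">", "out" or die "Cannot create Output file: $!";

# Load every lookup word into a hash for constant-time membership tests.
my %words;
while (my $word = <$LOOKUP>) {
    chomp($word);
    ++$words{$word};
}

# Skip any line that contains a lookup word as a whitespace-separated token.
LOOP_FILE_B: while (my $line = <$MASTER>) {
    for my $token (split ' ', $line) {
        if (exists $words{$token}) {
            next LOOP_FILE_B;
        }
    }
    print $OUTPUT $line;
}

close $OUTPUT or die "Cannot close Output file: $!";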

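I refuse to believe that Python can't at least match Perl on this one. Here is my quick attempt at a more efficient Python solution. I am using sets to optimize the search part of the problem. The & operator returns a new set containing the elements common to both sets; the code below uses isdisjoint, which answers the same question without building that set.

This solution takes 12 seconds to run on my machine with a file A of 3M lines and a file B of 200k words; the perl takes 9 seconds. The biggest slowdown seems to be re.split, which appears to be faster than string.split in this case.

If you have any suggestions for improving the speed, please comment on this answer.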
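import re

# Whitespace splitter compiled once; the raw string avoids an invalid-escape warning.
splitter = re.compile(r"\s")

with open('Downloads/fileB.txt') as fileb:
    # Set of words to filter on; blank lines are skipped so '' never matches.
    bad_words = set(line.strip() for line in fileb if line.strip())

with open('Downloads/fileA.txt') as filea, open('output.txt', 'w') as output:
    for line in filea:
        line_words = set(splitter.split(line))
        # Keep the line only if it shares no word with bad_words.
        if bad_words.isdisjoint(line_words):
            output.write(line)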
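
As a quick sanity check, the same set-based filter can be run on the sample data from the question, held in memory instead of read from files. This is only a sketch: the helper name filter_lines is made up for illustration, and it uses plain str.split() rather than the compiled regex.

```python
def filter_lines(lines, bad_words):
    """Return the lines that share no whitespace-separated token with bad_words."""
    bad = set(bad_words)
    # isdisjoint accepts any iterable, so the token list needs no set() wrapper.
    return [line for line in lines if bad.isdisjoint(line.split())]


# Sample data copied from the question.
file_a = [
    "abadan refinery is one of the largest in the world.",
    "a bad apple spoils the barrel.",
    "abaiara is a city in the south region of brazil.",
    "a ban has been imposed on the use of faxes",
]
file_b = ["abadan", "abaiara"]

for line in filter_lines(file_a, file_b):
    print(line)
```

On this sample the two printed lines are the ones listed as the desired output in the question.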

我不相信Python在这方面至少比不上Perl。这是我在Python中快速尝试更有效地解决这个问题的一个版本。我用它来优化这个问题的搜索部分。&运算符返回一个新集合，其中包含两个集合共有的元素
这个解决方案需要12秒才能在我的机器上运行，一个文件a有3M行，文件B有200k个单词，perl需要9秒。最大的减速似乎是re.split，在本例中它似乎比string.split快
如果您对提高速度有任何建议，请对此答案进行评论
import re

filea = open('Downloads/fileA.txt')
fileb = open('Downloads/fileB.txt')

output = open('output.txt', 'w')
bad_words = set(line.strip() for line in fileb)

splitter = re.compile("\s")
for line in filea:
    line_words = set(splitter.split(line))
    if bad_words.isdisjoint(line_words):
        output.write(line)

output.close()

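Using grep, with the patterns read from a file (-f), treated as fixed strings (-F), and matched as whole words only (-w):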
grep -v -Fwf fileB fileA
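
The -w flag matters because the Python attempt in the question tests `bad_word in line`, which is a substring test. Below is a small sketch of the difference, using a made-up word and line that are not from the question's data. The regular expression only approximates what grep -w does.

```python
import re

# Hypothetical example values, chosen so the two tests disagree.
line = "the abandoned refinery"
word = "ban"

# Substring test, as in the question's loop: matches inside "abandoned".
substring_hit = word in line

# Whole-word test: the match must not be adjacent to other word
# characters (letters, digits, underscore).
pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
whole_word_hit = re.search(pattern, line) is not None

print(substring_hit)   # True
print(whole_word_hit)  # False
```

Unlike the whitespace-splitting scripts, a whole-word match treats punctuation as a boundary, so a word such as `world` would also match the token `world.`.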

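How big are your files?

FileA, with the lines, has 3M lines; FileB, with the keywords, has about 200k.

@user1899415 Can you post sample data from both files? Also, make sure your files are not in Windows format. You can use the dos2unix utility to convert them.

@jaypal Edited to include sample data and desired output.

Loading file B into a Python set or dict would give you much faster lookups, so I would expect better results if you do that. By the way, you mixed up FileA and FileB in your example.

Wow, this is amazing, it ran and finished in about 10 seconds.

@user1899415 I have made changes to the script. The first version only looked at words matching at the start of the line; based on your sample data, I had misunderstood your requirement. The updated version looks through the entire line for words that match the word file. Please rerun your test, since your initial output may be incorrect.

Sure, if I still have battery. If my laptop dies, I'll do it tomorrow. The perl solution is eating up a lot of resources (still running) ;) ... If you want to recreate the data, download the data (100,000 lines or so) and duplicate it using cat and >>. Grab a word list and truncate it to 200k with head -c200k. That should give you data of the same size as what I tested with.

@jaypal I make no guarantees about the correctness of the Python solution. It works for the sample data posted by the original user. I have not looked at it too closely for the larger data set. Obviously, it also does not take punctuation into account.

On my machine, using regular split instead of the regex object is 4 times faster: line_words = set(line.split("\s")). I am using python 2.7.3.

@jaypal Nice work... the updated perl script takes 9 seconds. I tried updating the python solution to use generators, but that only saved a second. The biggest time sink in the Python solution is splitting the strings. Does anyone have ideas on how to optimize that part?

@jaypal No worries! Your solution is still in the lead with 9 seconds. I read the manual for Python's set type; with isdisjoint there is no need to call len every time. That brought my solution down to 12 seconds. I'll work on this more later... I refuse to let Perl win, even if only by 3 seconds :)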
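
One caveat on the timing comparison in the comments: `line.split("\s")` splits on the literal two characters backslash and s rather than on whitespace, so it may be fast only because it does no real splitting. The sketch below shows the difference on a line from the sample data. Calling split() with no argument is the usual way to split on runs of whitespace.

```python
import re

line = "a bad apple spoils the barrel.\n"

# "\\s" is the same two-character string as "\s": a backslash followed by "s".
# The line contains no such sequence, so nothing is split.
literal = line.split("\\s")

# No-argument split: splits on runs of whitespace and drops empty strings.
plain = line.split()

# Regex split on single whitespace characters, as in the Python answer;
# the trailing newline produces an empty string at the end.
regex = re.compile(r"\s").split(line)

print(literal)  # ['a bad apple spoils the barrel.\n']
print(plain)    # ['a', 'bad', 'apple', 'spoils', 'the', 'barrel.']
print(regex)    # ['a', 'bad', 'apple', 'spoils', 'the', 'barrel.', '']
```

The token `barrel.` keeps its trailing period in both working variants, which is the punctuation limitation mentioned in the comments.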