Converting a Perl script to Python: dedupe 2 files based on hash keys

python perl hash

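I'm new to Python and wondering whether anyone would be kind enough to convert a fairly simple Perl script example to Python.

The script takes 2 files and, by comparing hash keys, outputs only the unique lines from the second file. It also writes the duplicate lines to a file. I've found this deduplication method to be very fast in Perl, and I'd like to see how Python compares.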

#! /usr/bin/perl

## Compare file1 and file2 and output only the unique lines from file2.

## Opening file1.txt and store the data in a hash.
open my $file1, '<', "file1.txt" or die $!;
while ( <$file1> ) {
    my $name = $_;
    $file1hash{$name}=$_;
}
## Opening file2.txt and store the data in a hash.
open my $file2, '<', "file2.txt" or die $!;

while  ( <$file2> ) {
    $name = $_;
    $file2hash{$name}=$_;
}

open my $dfh, '>', "duplicate.txt";

## Compare the keys and remove the duplicate one in the file2 hash
foreach ( keys %file1hash ) {
    if ( exists ( $file2hash{$_} ))
    {
    print $dfh $file2hash{$_};
    delete $file2hash{$_};
    }
}

open my $ofh, '>', "file2_clean.txt";
print  $ofh values(%file2hash) ;

If you don't care about order, you can use sets in Python:

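# Read each file into a set of lines (order is not preserved).
with open("file1") as f1, open("file2") as f2:
    file1 = set(f1.readlines())
    file2 = set(f2.readlines())
intersection = file1 & file2      # lines common to both files
non_intersection = file2 - file1  # lines in file2 but not in file1
# Lines keep their trailing newline, so suppress print's own.
for item in intersection:
    print(item, end="")
for item in non_intersection:
    print(item, end="")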

Other approaches include using the difflib and filecmp libraries.
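
As a rough illustration of the difflib route, here is a minimal sketch. The helper name `added_lines` is hypothetical, and it assumes both files fit in memory as lists of lines. Note that difflib compares sequences in order, so on reordered input its result can differ from the set-based membership test.

```python
import difflib


def added_lines(old_lines, new_lines):
    """Return the lines that difflib.ndiff marks as present only in new_lines."""
    # ndiff prefixes each output line with a two-character code:
    # "- " only in old, "+ " only in new, "  " common, "? " hint lines.
    return [
        entry[2:]
        for entry in difflib.ndiff(old_lines, new_lines)
        if entry.startswith("+ ")
    ]


if __name__ == "__main__":
    old = ["a\n", "b\n"]
    new = ["b\n", "c\n"]
    for line in added_lines(old, new):
        print(line, end="")
```

In practice `old` and `new` would come from `open(...).readlines()` on the two files.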

Another approach, using only list comparison:

# lines in file2 that also appear in file1
data1 = [line.rstrip() for line in open("file1")]
for line in open("file2"):
    line = line.rstrip()
    if line in data1:
        print(line)

# lines in file2 that are not in file1: use "not in"
data1 = [line.rstrip() for line in open("file1")]
for line in open("file2"):
    line = line.rstrip()
    if line not in data1:
        print(line)

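If the files are very large, here is a slightly different solution that is friendlier on memory. It builds a set only for the original file, since there seems to be no need to hold all of file2 in memory at once.

Note that if you want trailing whitespace and line-ending characters to count in the comparison, you can replace the second line.rstrip() with just line, and simplify the second line to:

    file1set = set(file1)

Also, as of Python 3.1, the with statement accepts multiple items, so the three with statements can be combined into one.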
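
A minimal sketch of the one-set approach described above; it is an illustration, not the answer's original code. The function name `split_file2` and its path parameters are hypothetical. It strips trailing whitespace before comparing and uses the combined `with` form.

```python
def split_file2(file1_path, file2_path, dup_path, clean_path):
    """Stream file2, writing each line to dup_path or clean_path."""
    # Only file1 is held in memory, as a set of right-stripped lines.
    with open(file1_path) as file1:
        file1set = set(line.rstrip() for line in file1)

    # file2 is read one line at a time and never stored as a whole.
    with open(file2_path) as file2, \
            open(dup_path, "w") as dup, \
            open(clean_path, "w") as clean:
        for line in file2:
            if line.rstrip() in file1set:
                dup.write(line)    # line also occurs in file1
            else:
                clean.write(line)  # line is unique to file2


if __name__ == "__main__":
    split_file2("file1.txt", "file2.txt", "duplicate.txt", "file2_clean.txt")
```

Unlike the hash-based Perl script, this keeps file2's original line order and does not collapse lines that repeat within file2 itself.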

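Yet another variant (only syntactic changes relative to the other proposals; there is more than one way to do it in Python too).

Side note: we should also provide another Perl version, as the one proposed is not very Perl-ish... Below is the Perl equivalent of my Python version. It doesn't look much like the original. What I want to point out is that, in the proposed answers, the issue is as much algorithmic and language-independent as it is Perl versus Python.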

use strict;

open my $file1, '<', "file1.txt" or die $!;
my %file1hash = map { $_ => 1 } <$file1>;

open my $file2, '<', "file2.txt" or die $!;
my %file2hash = map { $_ => 1 } <$file2>;

for (["duplicate.txt", [grep $file1hash{$_}, keys(%file2hash)]],
     ["file2_clean.txt", [grep !$file1hash{$_}, keys(%file2hash)]]){
    my ($name, $results) = @$_;
    open my $fh, ">$name" or die $!;
    print $fh @$results;
}
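
For comparison, a Python sketch that mirrors the Perl version above: build one set per file, then write the intersection and the difference. This is an illustration under assumptions, not the answer's original Python code. The function name `write_duplicates_and_clean` and the `out_dir` parameter are hypothetical.

```python
import os


def write_duplicates_and_clean(file1_path, file2_path, out_dir="."):
    """Mirror the Perl map/grep version using two Python sets."""
    with open(file1_path) as f1, open(file2_path) as f2:
        file1_lines = set(f1)  # like %file1hash: one key per distinct line
        file2_lines = set(f2)  # like %file2hash

    results = {
        "duplicate.txt": file2_lines & file1_lines,    # in both files
        "file2_clean.txt": file2_lines - file1_lines,  # only in file2
    }
    for name, lines in results.items():
        with open(os.path.join(out_dir, name), "w") as fh:
            # Output order is arbitrary, as with Perl hash keys.
            fh.writelines(lines)
    return results


if __name__ == "__main__":
    write_duplicates_and_clean("file1.txt", "file2.txt")
```

Lines are compared with their line endings intact, so a final line without a trailing newline will not match the same text followed by one; the Perl version behaves the same way.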

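Better to use readlines() rather than read().split(). The unique lines in file2 are file2 - file1 (set difference). Using | produces the set union, with all lines from both files as one set.

Thanks for the help, but this script doesn't keep the full lines from each file intact. It splits lines on whitespace and outputs individual words.

I've changed it to readlines(). Use difflib (and check out filecmp if you're interested) to do this kind of work in the order you want. It's easier, and the modules have other options you may be interested in.

Please post your Python efforts so we can see what you know and where you're having problems.

Thanks. I tested this with two large files and it works like the Perl script. Very fast. Comparing the 2 files took 3 seconds: 150,000 records in file 1 and 200,000 records in file 2.

Looking at your two scripts, the Python looks cleaner. How does the Perl version perform on the test files?

My Perl version could easily be optimized (obviously, it runs the same loop over file2hash twice). But I bet most of the time is spent on IO, so there shouldn't be much difference between the versions.

For the love of $DEITY, please check the return values of system calls (like open). Or "use autodie" at the top to have it done automatically. Also, please "use strict" and declare your variables. I appreciate that you want to keep the code short, but consider that inexperienced programmers may copy and paste it. As for optimization, you can certainly lose the double loop, but you can also drop the extra {} blocks from map and grep, and print the whole @$results array at once instead of looping.

I like your Python version, but I want to point out that, so far, the differences between these versions are not just syntactic. All versions but one (including the Perl versions) build dicts/hashes for both files. Your solution nicely avoids reading all of file2 into memory before creating the set (as the readlines solution does). The one-set solution only needs enough memory to hold the dict of the original file. For files with a large number of records (200,000 may be small in this case), such performance characteristics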