Python: Extract the same lines from two files while ignoring lowercase/uppercase

python bash perl unix awk

The goal is to extract the lines that are the same in two files, ignoring lowercase/uppercase as well as punctuation.

I have two files:

source.txt

Foo bar
blah blah black sheep
Hello World
Kick the, bucket

foo bar
blah sheep black
Hello world
kick the bucket ,

processed.txt

Foo bar
blah blah black sheep
Hello World
Kick the, bucket

foo bar
blah sheep black
Hello world
kick the bucket ,

Desired output (from source.txt):

This is what I have been doing:

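from string import punctuation

with open('source.txt', 'r') as f1, open('processed.txt', 'r') as f2:
  # Walk both files in parallel, one pair of lines at a time.
  for i, j in zip(f1, f2):
    # Lowercase, drop punctuation, and collapse runs of whitespace.
    lower_depunct_f1 = " ".join("".join([ch.lower() for ch in i if ch not in punctuation]).split())
    lower_depunct_f2 = " ".join("".join([ch.lower() for ch in j if ch not in punctuation]).split())
    if lower_depunct_f1 == lower_depunct_f2:
      print(i.rstrip('\n'))  # matching pair: print the original source line
    else:
      print()                # no match: print an empty line as a placeholder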

Is there a way to do this with bash tools? perl, shell, awk, sed?

This is easier to do with awk:

awk 'FNR==NR {s=toupper($0); gsub(/[[:blank:][:punct:]]+/, "", s); a[s]++;next}
   {s=toupper($0); gsub(/[[:blank:][:punct:]]+/, "", s); print (s in a)?$0:""}' file2 file1
Foo bar

Hello World
Kick the, bucket
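
For comparison, the lookup idea behind the awk one-liner can be sketched in Python. This is only an illustration under stated assumptions: the names normalize_key and matching_lines are hypothetical, the sample data is made up, and string.punctuation plus space and tab is assumed to be a close enough stand-in for awk's [[:blank:][:punct:]] on ASCII input.

```python
import string

# Characters to drop: ASCII punctuation plus space and tab. This is an
# assumption meant to approximate awk's [[:blank:][:punct:]] on ASCII text.
_DROP = str.maketrans("", "", string.punctuation + " \t")


def normalize_key(line):
    """Uppercase the line and strip blanks and punctuation."""
    return line.rstrip("\n").upper().translate(_DROP)


def matching_lines(lookup_lines, candidate_lines):
    """Return each candidate line if its key occurs in lookup_lines, else ''."""
    # First pass: remember the normalized form of every lookup line.
    seen = {normalize_key(line) for line in lookup_lines}
    # Second pass: keep the original text of candidates whose key was seen.
    return [
        line.rstrip("\n") if normalize_key(line) in seen else ""
        for line in candidate_lines
    ]


if __name__ == "__main__":
    # Illustrative data only; these are not the asker's files.
    processed = ["foo bar\n", "HELLO, world\n"]
    source = ["Foo bar\n", "blah blah black sheep\n", "Hello World\n"]
    for out in matching_lines(processed, source):
        print(out)
```

As in the awk version, the first argument plays the role of file2 (the lookup table) and the second the role of file1 (the lines that get printed). A non-matching line becomes an empty string so the output keeps the same number of lines as the input.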

The Perl solution is very similar to the Python one:

open my $S1, '<', 'source.txt'    or die $!;
open my $S2, '<', 'processed.txt' or die $!;
while (defined(my $s1 = <$S1>) and defined (my $s2 = <$S2>)) {
    s/[[:punct:]]//g for $s1, $s2;
    $_ = lc for $s1, $s2;
    print $s1 eq $s2 ? $s1 : "\n";
}

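A Bash solution, quite similar to the Perl one, with the same deviation in the result (because the space after "kick the bucket" is not removed):

#!/bin/bash
shopt -s nocasematch
exec 3< source.txt    # Open source.txt and assign fd 3 to it.
exec 4< processed.txt
while read ... &-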
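
The trailing-space difference mentioned for the Perl and Bash versions can be illustrated with a short Python sketch. The helper names strip_punct_only and strip_punct_and_squeeze are hypothetical; the two sample strings are the last lines of each block in source.txt.

```python
import string

_PUNCT = str.maketrans("", "", string.punctuation)


def strip_punct_only(line):
    """Lowercase and remove ASCII punctuation, leaving whitespace untouched."""
    return line.lower().translate(_PUNCT)


def strip_punct_and_squeeze(line):
    """Same as above, but also collapse whitespace runs and trim both ends."""
    return " ".join(strip_punct_only(line).split())


a = "Kick the, bucket"
b = "kick the bucket ,"

# Removing only the punctuation leaves a trailing space in b.
print(repr(strip_punct_only(a)))
print(repr(strip_punct_only(b)))
print(strip_punct_only(a) == strip_punct_only(b))

# Squeezing whitespace as well makes the two lines compare equal.
print(strip_punct_and_squeeze(a) == strip_punct_and_squeeze(b))
```

Removing the comma from "kick the bucket ," leaves a space at the end of the line, so a plain string comparison fails unless whitespace is normalized too. The asker's Python code and the awk answer both handle this; the Perl and Bash versions, as written, do not.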

Check whether this solution helps you:

use strict;
use warnings;

my $f1 = $ARGV[0];
open FILE1, "<", $f1 or die $!;
my $f2 = $ARGV[1];
open FILE2, "<", $f2 or die $!;


open OUTFILE, ">", "cmp.txt" or die $!;

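my %seen;
# Remember every line of the first file, lowercased and without punctuation.
while (<FILE1>) {
    s/[[:punct:]]//g;
    $seen{lc($_)} = 1;
}

# Print each line of the second file whose normalized form was seen.
# (The extra read of FILE2 inside this loop was dropped: it skipped every other line.)
while (<FILE2>) {
    s/[[:punct:]]//g;
    if ($seen{lc($_)}) {
        print OUTFILE $_;
    }
}
close OUTFILE;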

Does it load the whole file into memory? When I test it on 1.5 million lines, nothing is printed =(

It only loads the lines from the first file into memory and then compares them against the second file. Also, you did not mention in the question that the data was that large.

Oh, it worked after a few seconds. No worries, as long as it runs reasonably fast, that is fine.

Yes, it did work, but I have not had a good internet connection over the past few days. I will revisit this question soon...
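
The memory behaviour described in these comments can be sketched in Python: only the normalized keys of the first file are held in a set, and the second file is streamed one line at a time, so matches are produced as soon as they are found. The names key, load_keys, and stream_matches are hypothetical, and the io.StringIO objects are stand-ins for real files.

```python
import io
import string

_PUNCT = str.maketrans("", "", string.punctuation)


def key(line):
    """Lowercase, drop ASCII punctuation, and squeeze whitespace."""
    return " ".join(line.lower().translate(_PUNCT).split())


def load_keys(first):
    """Read the first file once and keep only its normalized keys."""
    return {key(line) for line in first}


def stream_matches(keys, second):
    """Lazily yield lines of the second file whose key was seen."""
    for line in second:  # only one line of the second file is held at a time
        if key(line) in keys:
            yield line


if __name__ == "__main__":
    # In-memory stand-ins for the two files (illustrative data only).
    first = io.StringIO("Foo bar\nHello World\n")
    second = io.StringIO("foo bar\nblah sheep black\nHello world\n")
    for line in stream_matches(load_keys(first), second):
        print(line, end="")
```

With this layout the memory cost grows with the number of distinct normalized lines in the first file, not with the size of the second file. For real files, open(...) handles could be passed in place of the io.StringIO objects.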