Python: find a word in a large file and copy the lines containing it


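I have two files, File A and File B. File A contains one word per line, and File B contains sentences. I need to read each word from File A, search File B for the lines that start with that word, and copy each such line in full to File C. Both File A and File B are sorted.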

For example:

File_A:

he
I
there

File_B:

he was at least equally intrigued by hers.
I guess he's going to use it in his business.
I don't know if he's angry or not.
there were five dogs.
there is fly in my soup.
we don't know what he is doing.

File_C:

he was at least equally intrigued by hers.
I guess he's going to use it in his business.
I don't know if he's angry or not.
there were five dogs.
there is fly in my soup.

I tried a shell script, but it is a brute-force approach, so it takes a very long time. Both File A and File B are large files.

This is the code I tried:

#! /bin/bash

for first in `cat File_A`
do
    while read line
    do
        first_col=$(echo $line|head -n1 | awk '{print $1;}')
        if [[ "$first" == "$first_col" ]]
        then
            echo $line >> File_C
        fi
    done <File_B
done
Please see the following Perl code, written based on your shell script:

use strict;
use warnings;
use feature 'say';

my $file_a = 'File_A';
my $file_b = 'File_B';
my $file_c = 'File_C';

# read File_A into array @data_a
open my $fh_a, '<', $file_a
    or die "Couldn't open $file_a $!";

my @data_a = <$fh_a>;

close $fh_a;

# read File_B into array @data_b
open my $fh_b, '<', $file_b
    or die "Couldn't open $file_b $!";

my @data_b = <$fh_b>;

close $fh_b;

chomp @data_a;      # snip eol
chomp @data_b;      # snip eol

# store found result into File_C
open my $fh_c, '>', $file_c
    or die "Couldn't open $file_c $!";

for my $word_a (@data_a) {
    for my $line_b (@data_b) {
        say $fh_c $line_b if $line_b =~ /^\Q$word_a\E\b/;
    }
}

close $fh_c;
Input file

he was at least equally intrigued by hers.
I guess he's going to use it in his business.
I don't know if he's angry or not.
there were five dogs.
there is fly in my soup.
we don't know what he is doing.
Result file

he was at least equally intrigued by hers.
I guess he's going to use it in his business.
I don't know if he's angry or not.
there were five dogs.
there is fly in my soup.
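
Since the question title asks for Python, here is a minimal Python sketch of the same task. The nested loops above scan all of File_B once for every word in File_A; this sketch instead loads the words into a set and reads File_B a single time. The function names starts_with_listed_word and copy_matching_lines are made up for illustration, and the sketch assumes that the word list fits in memory and that "starts with the word" means the first whitespace-separated token equals the word exactly (so a line starting with "he's" does not count for "he", unlike the \b test in the Perl code).

```python
def starts_with_listed_word(words, lines):
    """Yield each line whose first whitespace-separated token is one of words."""
    # Build the lookup set once; membership tests are O(1) on average.
    wanted = {w.strip() for w in words if w.strip()}
    for line in lines:
        tokens = line.split(None, 1)  # first token and the rest of the line
        if tokens and tokens[0] in wanted:
            yield line


def copy_matching_lines(words_path, text_path, out_path):
    """Stream text_path once and write the matching lines to out_path."""
    with open(words_path, encoding="utf-8") as words_file, \
         open(text_path, encoding="utf-8") as text_file, \
         open(out_path, "w", encoding="utf-8") as out_file:
        # File objects iterate line by line, so the large file is never
        # held in memory as a whole.
        out_file.writelines(starts_with_listed_word(words_file, text_file))


if __name__ == "__main__":
    copy_matching_lines("File_A", "File_B", "File_C")
```

Lines are written in File_B order rather than grouped by word, which gives the same result for the sample data because File_B is already sorted.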

Something similar in Perl:

#!/usr/bin/perl

use strict;
use warnings;

# Open File_A
open my $fh_a, '<', 'File_A' or die $!;

# Read words from File_A and remove newlines
chomp(my @words = <$fh_a>);

# Create a regex matching the words from File_A
# at the start of a line
my $word_re = '^(' . join('|', @words) . ')\b';
$word_re = qr($word_re);

# Open files B and C
open my $fh_b, '<', 'File_B' or die $!;
open my $fh_c, '>', 'File_C' or die $!;

# Read File_B a line at a time and write to
# File_C any lines that match our regex.
while (<$fh_b>) {
  print $fh_c $_ if /$word_re/;
}
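
For comparison, the same single-regex idea could look roughly like this in Python. This is only a sketch: build_prefix_pattern and filter_lines are hypothetical helper names, and re.escape is added here (the Perl code above does not escape the words) so that a word containing characters such as "." or "+" is matched literally.

```python
import re


def build_prefix_pattern(words):
    """Compile one regex matching any of words followed by a word boundary."""
    cleaned = [w.strip() for w in words if w.strip()]
    if not cleaned:
        raise ValueError("no words to search for")
    # re.escape keeps regex metacharacters in a word from being interpreted.
    alternatives = "|".join(re.escape(w) for w in cleaned)
    return re.compile(r"(?:" + alternatives + r")\b")


def filter_lines(pattern, lines):
    """Keep the lines that begin with one of the words."""
    # pattern.match only matches at the start of the string, so no '^' is needed.
    return [line for line in lines if pattern.match(line)]


if __name__ == "__main__":
    with open("File_A", encoding="utf-8") as words_file:
        pattern = build_prefix_pattern(words_file)
    with open("File_B", encoding="utf-8") as text_file, \
         open("File_C", "w", encoding="utf-8") as out_file:
        for line in text_file:
            if pattern.match(line):
                out_file.write(line)
```

As in the Perl version, \b treats an apostrophe as a word boundary, so a line starting with "he's" still matches the word "he".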
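If you run into a specific problem while solving this yourself, you can ask about it here. You should also first decide which programming language you want to use.
Please show your effort: include your code in the question, even if it doesn't work.
@MichaelButscher I have tagged the programming languages.
@DYZ Here is the code I wrote: #!/bin/bash for first in `cat File_A` do while read line do first_col=$(echo $line|head -n1 | awk '{print $1;}') if [[ "$first" == "$first_col" ]] then echo $line >> File_C fi done <File_B done
Please make it part of the question. Do you expect us to read an unformatted shell script in a comment?
"search File B for lines that start with the word"
@DYZ Doh. Fixed.
@Shawn Thanks for your answer, but it makes some mistakes. It must match the word exactly. For example, if File A contains the word "that", it brings in all the sentences starting with "that", "thall" and "that's". I only want it to bring in "that".
@Amoll An apostrophe counts as a word break. You would have to add a trailing whitespace check to the regex created from the source words instead of using -w. Edge cases like that should be included in the sample data.
"we don't know what he is doing." This line should not be in the output according to the OP's requirements. /\b$word_a\b/ should be /^$word_a\b/. The question asks for "lines in File B that start with that word".
@Dave Cross Sorry, I missed the word "start" in the OP's post (corrected).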
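
The apostrophe edge case raised in the comments can be illustrated with a small Python sketch. It compares a word-boundary check with the suggested trailing whitespace check, written here as the negative lookahead (?!\S), which requires whitespace or the end of the string right after the word. The helper name classify and the sample strings are made up for illustration.

```python
import re

WORD = "that"

# \b succeeds before an apostrophe, so "that's" counts as a match for "that".
boundary = re.compile(re.escape(WORD) + r"\b")

# (?!\S) fails if any non-whitespace character follows the word.
strict = re.compile(re.escape(WORD) + r"(?!\S)")


def classify(line):
    """Return (boundary_match, strict_match) for the start of line."""
    return bool(boundary.match(line)), bool(strict.match(line))


if __name__ == "__main__":
    for sample in ["that is fine.", "that's fine.", "thall be fine."]:
        print(sample, classify(sample))
```

With these samples, only the strict pattern rejects the line beginning with "that's"; neither pattern matches the line beginning with "thall".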