AWK - compare two files using a key and print a summary (missing, identical, different)
I have two files like the ones below. Both are sorted on the 1st field and then on the 2nd. One ID can have multiple rows.
File A
3337312|6dc1d4397108002245c770fa66ee4d7767dcc23e|1
3337313|cb1c00eeccb25ea5a069da63a1b0c2565379ff9c|1
3337318|61a813730578c552b62de5618e1d66b1eb74b4f8|1
3337319|6af3b98f25a6a9b9d887486aefddfb53947bbf1c|1
3337320|1e3126f41f848509efad0b3415b003704377778c|1
File B
3337312|6dc1d4397108002245c770fa66ee4d7767dcc23e|1
3337315|780055f13efffcb4bee115c6cf546af85ac6c0a7|1
3337316|19535297b9913b6bca1796b68505498d5e81b5ed|1
3337318|61a813730578c552b62de5618e1d66b1eb74b4f8|1
3337319|6af3b98f25a6a9b9d887486aefddfb53947bbf1c|1
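The first field is the key. There are 3 fields, pipe-delimited. The files are about 1 GB each.
What I want is to return a result set that looks like this: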
3333 rows in File A
4444 rows in File B
1234 rows are identical
2345 rows are different (aka the 2nd/3rd field are different but the key matches)
111 rows in File A not in File B
222 rows in File B not in File A
Below is SQL code that does it, which is my fallback:
--CREATE TABLE aws_hash_compare (the_filename VARCHAR(100) NOT NULL, switch_id BIGINT, hash_value CHAR(40),the_count TINYINT)
--CREATE UNIQUE CLUSTERED INDEX ucidx__awshashcompare__the_filename__switch_id ON aws_hash_compare(the_filename, switch_id)
DECLARE @mSsql_filename sysname = 'FileA'
DECLARE @mYsql_filename sysname = 'FileB'
SELECT COUNT(*) AS MSSQL FROM aws_hash_compare
WHERE the_filename = @mSsql_filename
SELECT COUNT(*) AS MYSQL FROM aws_hash_compare
WHERE the_filename = @mYsql_filename
SELECT COUNT(*) AS switch_id_match FROM aws_hash_compare mysql
INNER JOIN aws_hash_compare mssql
ON mysql.the_filename = @mYsql_filename
AND mssql.the_filename = @mSsql_filename
AND mysql.switch_id = mssql.switch_id
SELECT COUNT(*) AS complete_match FROM aws_hash_compare mysql
INNER JOIN aws_hash_compare mssql
ON mysql.the_filename = @mYsql_filename
AND mssql.the_filename = @mSsql_filename
AND mysql.switch_id = mssql.switch_id
AND mssql.hash_value = mysql.hash_value
AND mssql.the_count = mysql.the_count
SELECT COUNT(*) AS hash_differences FROM aws_hash_compare mysql
INNER JOIN aws_hash_compare mssql
ON mysql.the_filename = @mYsql_filename
AND mssql.the_filename = @mSsql_filename
AND mysql.switch_id = mssql.switch_id
AND (mssql.hash_value <> mysql.hash_value OR mssql.the_count <> mysql.the_count)
SELECT COUNT(*) AS missing_from_MSSQL FROM aws_hash_compare mysql WHERE the_filename = @mYsql_filename
AND NOT EXISTS (SELECT 1 FROM aws_hash_compare mssql WHERE the_filename = @mSsql_filename
AND mssql.switch_id = mysql.switch_id)
SELECT COUNT(*) AS missing_from_MYSQL FROM aws_hash_compare mssql WHERE the_filename = @mSsql_filename
AND NOT EXISTS (SELECT 1 FROM aws_hash_compare mysql WHERE the_filename = @mYsql_filename
AND mssql.switch_id = mysql.switch_id)
The following awk may help you with the same:
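awk -F"|" '
# First file: remember every full line and every key.
FNR==NR{
  a[$0]=$0;
  b[$1];
  next
}
# First line of the second file: NR-1 is the line count of the first file.
FNR==1{
  file1_count=(NR-1) " rows in " ARGV[1]
}
# The key exists in the first file but the full line does not: a changed row.
($1 in b) && !($0 in a){
  first_field_matching++
}
# Identical line in both files.
($0 in a){
  common++;
  delete a[$0];
  next
}
# Every other line of the second file, changed rows included.
{
  found_in_B_not_in_A++
}
END{
  # Lines left in a[] never matched exactly, so changed rows are counted here too.
  found_in_A_not_in_B=length(a);
  print file1_count
  print FNR " rows in " ARGV[2]
  print common " rows are identical"
  print first_field_matching " rows are different (aka the 2nd/3rd field are different but the key matches)"
  print found_in_A_not_in_B " rows in File A not in File B"
  print found_in_B_not_in_A " rows in File B not in File A"
}
' file_A file_B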
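Let's say the following are file_A and file_B. (I made a minor change to the Input_file(s) you provided, to verify the case where $1 is the same but the other fields differ.)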
cat file_A
3337312|6dc1d4397108002245c770fa66ee4d7767dcc23e|1
3337313|cb1c00eeccb25ea5a069da63a1b0c2565379ff9c|1
3337318|61a813730578c552b62de5618e1d66b1eb74b4f8|1
3337319|786af3b98f25a6a9b9d887486aefddfb53947bbf1c|1
3337320|1e3126f41f848509efad0b3415b003704377778c|1
cat file_B
3337312|6dc1d4397108002245c770fa66ee4d7767dcc23e|1
3337315|780055f13efffcb4bee115c6cf546af85ac6c0a7|1
3337316|19535297b9913b6bca1796b68505498d5e81b5ed|1
3337318|61a813730578c552b62de5618e1d66b1eb74b4f8|1
3337319|6af3b98f25a6a9b9d887486aefddfb53947bbf1c|1
Now, when we run the above code, the output is as follows:
5 rows in file_A
5 rows in file_B
2 rows are identical
1 rows are different (aka the 2nd/3rd field are different but the key matches)
3 rows in File A not in File B
3 rows in File B not in File A
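The counts above overlap: the changed row (key 3337319) is reported as different and also as missing on both sides. If the four categories should add up to each file's row count, a variant keyed on the first field is one option. This is only a minimal sketch, not the answer's script: the `summarize` function name is made up for illustration, and it assumes each key appears at most once per file and that file A fits in memory.

```shell
#!/bin/sh
# Hypothetical helper: summarize two pipe-delimited files keyed on field 1.
# Assumption: each key appears at most once per file.
summarize() {
  awk -F'|' '
    # First file: store the whole line under its key.
    FNR == NR { a[$1] = $0; na++; next }
    # Second file: classify each line against the stored key.
    {
      nb++
      if (!($1 in a))       only_b++
      else if (a[$1] == $0) same++
      else                  diff++
      delete a[$1]
    }
    END {
      # Keys still in a[] were never seen in the second file.
      for (k in a) only_a++
      printf "%d rows in File A\n", na
      printf "%d rows in File B\n", nb
      printf "%d rows are identical\n", same
      printf "%d rows are different (key matches)\n", diff
      printf "%d rows in File A not in File B\n", only_a
      printf "%d rows in File B not in File A\n", only_b
    }
  ' "$1" "$2"
}

# Demo on the sample data from this answer.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cat > "$tmp/file_A" <<'EOF'
3337312|6dc1d4397108002245c770fa66ee4d7767dcc23e|1
3337313|cb1c00eeccb25ea5a069da63a1b0c2565379ff9c|1
3337318|61a813730578c552b62de5618e1d66b1eb74b4f8|1
3337319|786af3b98f25a6a9b9d887486aefddfb53947bbf1c|1
3337320|1e3126f41f848509efad0b3415b003704377778c|1
EOF
cat > "$tmp/file_B" <<'EOF'
3337312|6dc1d4397108002245c770fa66ee4d7767dcc23e|1
3337315|780055f13efffcb4bee115c6cf546af85ac6c0a7|1
3337316|19535297b9913b6bca1796b68505498d5e81b5ed|1
3337318|61a813730578c552b62de5618e1d66b1eb74b4f8|1
3337319|6af3b98f25a6a9b9d887486aefddfb53947bbf1c|1
EOF
summarize "$tmp/file_A" "$tmp/file_B"
```

On this sample it reports 2 identical rows, 1 different row, and 2 rows missing on each side, so 2 + 1 + 2 matches the 5 rows in each file.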
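Here is a version that uses comm to compare the files and then awk to produce the results. It may be slower, but it will probably use less memory. comm requires its input files to be sorted. I assume a key appears only once in each file.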
comm filea fileb | awk -F'\t' '
BEGIN { na = nb = identical = common = 0 }
# Column 1: line found only in file A.
$1 {
    split($1, f, /[|]/)
    if (f[1] in b) {common++; delete b[f[1]]} else {a[f[1]]}
    na++
}
# Column 2: line found only in file B.
$2 {
    split($2, f, /[|]/)
    if (f[1] in a) {common++; delete a[f[1]]} else {b[f[1]]}
    nb++
}
# Column 3: line present in both files.
$3 {
    identical++
    na++
    nb++
}
END {
    printf "%d rows in file A\n", na
    printf "%d rows in file B\n", nb
    printf "%d rows are identical\n", identical
    printf "%d rows are different but share a key\n", common
    printf "%d rows in file A only\n", length(a)
    printf "%d rows in file B only\n", length(b)
}
'
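Because comm depends on sorted input, it can be worth checking the sort order first. The sketch below is an illustration only: `checked_comm` is a made-up wrapper name, and it assumes byte-order (C locale) sorting is what both files use. It relies on `sort -c`, which exits non-zero when a file is out of order, and then counts the lines in each of comm's three tab-separated columns.

```shell
#!/bin/sh
# Hypothetical wrapper: run comm only if both inputs are sorted (C locale).
checked_comm() {
  for f in "$1" "$2"; do
    # sort -c only checks the order; it fails on the first out-of-order line.
    LC_ALL=C sort -c "$f" || return 1
  done
  LC_ALL=C comm "$1" "$2"
}

# Demo on the sample data from the question.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cat > "$tmp/filea" <<'EOF'
3337312|6dc1d4397108002245c770fa66ee4d7767dcc23e|1
3337313|cb1c00eeccb25ea5a069da63a1b0c2565379ff9c|1
3337318|61a813730578c552b62de5618e1d66b1eb74b4f8|1
3337319|6af3b98f25a6a9b9d887486aefddfb53947bbf1c|1
3337320|1e3126f41f848509efad0b3415b003704377778c|1
EOF
cat > "$tmp/fileb" <<'EOF'
3337312|6dc1d4397108002245c770fa66ee4d7767dcc23e|1
3337315|780055f13efffcb4bee115c6cf546af85ac6c0a7|1
3337316|19535297b9913b6bca1796b68505498d5e81b5ed|1
3337318|61a813730578c552b62de5618e1d66b1eb74b4f8|1
3337319|6af3b98f25a6a9b9d887486aefddfb53947bbf1c|1
EOF

# Column 1 = only in A, column 2 = only in B, column 3 = in both.
checked_comm "$tmp/filea" "$tmp/fileb" | awk -F'\t' '
  $1 != "" { a++ }
  $2 != "" { b++ }
  $3 != "" { both++ }
  END { printf "%d only in A, %d only in B, %d in both\n", a, b, both }
'
```

In a plain sh pipeline the exit status of `checked_comm` is not propagated, so a script that must stop on unsorted input should run the check before starting the pipeline.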
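@mbourgon, I'm curious how long this takes on the 1 GB files and how much memory it consumes. If you can, please let me know. Also how the speed compares with SQL.
@mbourgon, you're welcome. I'm not an expert in SQL and tried this in awk, so it would be good to learn the results. Cheers, keep learning and sharing.
I think the number of rows in A not in B, and vice versa, should be 2.
@mbourgon, consider key=3337319: it is because 1 row is different, so the row in file A is not in B and the row in file B is not in A. If a row is different, it should be counted as not in B, right?
File A has 5 rows: 5 != 2 identical rows + 1 row with the same key as in file B + 3 rows not in file B.
Ran it on one of the tests and got the same results as the SQL and the other awk script. Sweet! However, we may have multiple instances of a key in one file. That's why I included the count as the 3rd field, in case there are multiple hashes for different rows with the same switch id. I'm currently waiting for someone to get the full files to compare. Once I have those, I'll share the time and memory needed to run both versions. Thanks again!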
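On the memory question: since both files are already sorted on the key, a merge-style pass that holds one line of each file at a time is another option. The following is a rough sketch and not one of the posted answers. The `merge_summary` name is invented, and it assumes the keys are numeric, sorted in ascending numeric order, and unique within each file. Duplicate keys, which the last comment mentions, would need extra handling.

```shell
#!/bin/sh
# Hypothetical helper: single-pass merge comparison of two key-sorted files.
# Assumptions: numeric keys, ascending order, at most one row per key per file.
merge_summary() {
  awk -F'|' -v fb="$2" '
    # Read the next line of file B into "line" and its key into "kb".
    function read_b() {
      if ((getline line < fb) > 0) { split(line, f, "|"); kb = f[1]; has_b = 1; nb++ }
      else has_b = 0
    }
    BEGIN { read_b() }
    {
      na++
      # B rows with a smaller key have no partner in A.
      while (has_b && kb + 0 < $1 + 0) { only_b++; read_b() }
      if (has_b && kb + 0 == $1 + 0) {
        if (line == $0) same++; else diff++
        read_b()
      } else only_a++
    }
    END {
      # Whatever is left in B has no partner in A.
      while (has_b) { only_b++; read_b() }
      printf "%d rows in File A\n", na
      printf "%d rows in File B\n", nb
      printf "%d rows are identical\n", same
      printf "%d rows are different (key matches)\n", diff
      printf "%d rows in File A not in File B\n", only_a
      printf "%d rows in File B not in File A\n", only_b
    }
  ' "$1"
}

# Demo on the sample data from the first answer.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cat > "$tmp/file_A" <<'EOF'
3337312|6dc1d4397108002245c770fa66ee4d7767dcc23e|1
3337313|cb1c00eeccb25ea5a069da63a1b0c2565379ff9c|1
3337318|61a813730578c552b62de5618e1d66b1eb74b4f8|1
3337319|786af3b98f25a6a9b9d887486aefddfb53947bbf1c|1
3337320|1e3126f41f848509efad0b3415b003704377778c|1
EOF
cat > "$tmp/file_B" <<'EOF'
3337312|6dc1d4397108002245c770fa66ee4d7767dcc23e|1
3337315|780055f13efffcb4bee115c6cf546af85ac6c0a7|1
3337316|19535297b9913b6bca1796b68505498d5e81b5ed|1
3337318|61a813730578c552b62de5618e1d66b1eb74b4f8|1
3337319|6af3b98f25a6a9b9d887486aefddfb53947bbf1c|1
EOF
merge_summary "$tmp/file_A" "$tmp/file_B"
```

The trade-off is that this version depends on the sort order being exactly as assumed, whereas the array-based scripts only need file A to fit in memory.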