Linux: compare columns in two different files and generate three outputs
I have multiple paired files named xxx_1.txt and xxx_2.txt, yyy_1.txt and yyy_2.txt, and so on. They are single-column files in the following format:
xxx_1.txt:
#CHROM_POSREFALT
MSHR1153_annotated_1_9107CA
MSHR1153_annotated_1_9197CT
MSHR1153_annotated_1_9303TC
MSHR1153_annotated_1_10635GA
MSHR1153_annotated_1_10836AG
MSHR1153_annotated_1_11108AG
MSHR1153_annotated_1_11121GA
MSHR1153_annotated_1_11123CT
MSHR1153_annotated_1_11131CT
MSHR1153_annotated_1_11155AG
MSHR1153_annotated_1_11166CT
MSHR1153_annotated_1_11186TC
MSHR1153_annotated_1_11233TG
MSHR1153_annotated_1_11274GT
MSHR1153_annotated_1_11472CG
MSHR1153_annotated_1_11814GA
MSHR1153_annotated_1_11815CT
xxx_2.txt:
LocationMSHR1153_annotatedMSHR0491_Australasia
MSHR1153_annotated_1_56TC
MSHR1153_annotated_1_226AG
MSHR1153_annotated_1_670AG
MSHR1153_annotated_1_817CT
MSHR1153_annotated_1_1147TC
MSHR1153_annotated_1_1660TC
MSHR1153_annotated_1_2488AG
MSHR1153_annotated_1_2571GA
MSHR1153_annotated_1_2572TC
MSHR1153_annotated_1_2698TC
MSHR1153_annotated_1_2718TG
MSHR1153_annotated_1_3018TC
MSHR1153_annotated_1_3424TC
MSHR1153_annotated_1_3912CT
MSHR1153_annotated_1_4013GA
MSHR1153_annotated_1_4087GC
MSHR1153_annotated_1_4878CT
MSHR1153_annotated_1_5896GA
MSHR1153_annotated_1_7833TG
MSHR1153_annotated_1_7941CT
MSHR1153_annotated_1_8033GA
MSHR1153_annotated_1_8888AC
MSHR1153_annotated_1_9107CA
MSHR1153_annotated_1_9197CT
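The real files are much longer than this. My goal is to compare every line and produce several outputs so that I can create a Venn diagram later. So I need one file that lists all the lines the two files have in common, one file that lists everything specific to xxx_1, and another file that lists everything specific to xxx_2.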
So far I have come up with the following:
awk ' FNR==NR { position[$1]=$1; next} {if ( $1 in position ) {print $1 > "foundinboth"} else {print $1 > "uniquetofile1"}} ' FILE2 FILE1
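The problem is that I have more than 300 paired files to process, and with this command I would have to edit the file names by hand every time. It also does not generate all of the output files in one go. Is there a way to loop over the files and change everything automatically? The files come in pairs, so only the suffix at the end differs: "_1" and "_2". I need it to loop through each pair and produce everything I need at the same time.
Would you please try the following: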
for f in *_1.txt; do                        # find files such as "xxx_1.txt"
    basename=${f%_*}                        # extract "xxx" portion
    if [[ -f ${basename}_2.txt ]]; then     # make sure "xxx_2.txt" exists
        file1="${basename}_1.txt"           # assign bash variable file1
        file2="${basename}_2.txt"           # assign bash variable file2
        both="${basename}_foundinboth.txt"
        uniq1="${basename}_uniquetofile1.txt"
        uniq2="${basename}_uniquetofile2.txt"
        awk -v both="$both" -v uniq1="$uniq1" -v uniq2="$uniq2" '
        # pass the variables to AWK with -v option
        FNR==NR { b[$1]=$1; next }
        {
            if ($1 in b) {
                print $1 > both
                seen[$1]++                  # mark if the line is found in file1
            } else {
                print $1 > uniq1
            }
        }
        END {
            for (i in b) {
                if (! seen[i]) {            # the line is not found in file1
                    print i > uniq2         # then it is unique to file2
                }
            }
        }' "$file2" "$file1"
    fi
done
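Note that the lines in *_uniquetofile2.txt do not keep their original order, because awk's for (i in b) loop visits array keys in an unspecified order.
If you need the original order, please try sorting the file yourself or let me know.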
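If the order of the lines unique to the second file matters, one option is to record each key's position while reading that file and replay the keys in that order at the end. The following is only a minimal sketch under that assumption, not the answerer's own code: the function names split_pair and split_all_pairs and the array named order are hypothetical, and the output files are emptied up front, so a file with no matching lines ends up empty rather than absent.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split_pair PREFIX reads PREFIX_1.txt and PREFIX_2.txt
# and writes the three output files next to them.
split_pair() {
    local prefix=$1
    local file1="${prefix}_1.txt" file2="${prefix}_2.txt"
    local both="${prefix}_foundinboth.txt"
    local uniq1="${prefix}_uniquetofile1.txt"
    local uniq2="${prefix}_uniquetofile2.txt"

    # Empty the outputs first so a rerun never keeps stale results.
    : > "$both"
    : > "$uniq1"
    : > "$uniq2"

    awk -v both="$both" -v uniq1="$uniq1" -v uniq2="$uniq2" '
        # First file on the command line (file2): remember every key
        # and the position at which it first appeared.
        FNR == NR {
            if (!($1 in b)) {
                b[$1] = 1
                order[++n] = $1
            }
            next
        }
        # Second file on the command line (file1): classify each line.
        {
            if ($1 in b) {
                print $1 > both
                seen[$1] = 1
            } else {
                print $1 > uniq1
            }
        }
        END {
            # Replay the file2 keys in input order, skipping shared ones.
            for (k = 1; k <= n; k++)
                if (!(order[k] in seen))
                    print order[k] > uniq2
        }' "$file2" "$file1"
}

# Hypothetical driver: process every complete pair in the current directory.
split_all_pairs() {
    local f prefix
    for f in *_1.txt; do
        prefix=${f%_1.txt}
        if [[ -f ${prefix}_2.txt ]]; then
            split_pair "$prefix"
        fi
    done
}
```

After sourcing these definitions, calling split_all_pairs in the directory that holds the paired files should produce the same three outputs per pair as the loop above, with the lines unique to the second file kept in input order. As in the original, header lines are treated like any other line, and the FNR==NR test assumes the second file of each pair is not empty.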