Python: how to prepend the first-column value of each row to the header of every column in that row marked by a specific string or character?
Tags: python, regex, bash, awk, sed
I have a large chunk of data (one file) in which each line has a different number of tab-separated columns: line 1 has 2 columns, line 2 has 5 columns, line 3 has m+1 columns, and so on. Every line consists of a ">accessionID" column followed by one or more "matchnumber_i_XXX" columns. I want to prepend column 1 of each line to the header of every column marked "matchnumber" on that same line and print the result in FASTA format, with output like this:
>NP_12345.1matchnumber_1
RKHKK
>NP_56789.2matchnumber_1
HGRR
>NP_56789.2matchnumber_2
KQRHH
>NP_56789.2matchnumber_3
RVRK
>NP_56789.2matchnumber_4
HTHH
>XP_543421.1matchnumber_1
RQRH
....
>XP_543421.1matchnumber_m
RVRR
...
Can anyone help me? Thanks in advance.
Note: when the file has a single line, for example when "a.txt" contains only this one line:
>NP_56789.2 matchnumber_1_HGRR matchnumber_2_KQRHH matchnumber_3_RVRK matchnumber_4_HTHH
I can parse the data with a pipeline of awk and sed commands:
cat a.txt |awk -v OFS="\t" '{print $1$2,$1$3,$1$4,$1$5}' | sed 's/\t/\n/g' | sed 's/_/ /g' | sed 's/NP /NP_/g' | sed 's/matchnumber /matchnumber_/g' | sed 's/ /\n/g' > a.fasta
The resulting a.fasta looks like this:
>NP_56789.2matchnumber_1
HGRR
>NP_56789.2matchnumber_2
KQRHH
>NP_56789.2matchnumber_3
RVRK
>NP_56789.2matchnumber_4
HTHH
I don't know how to solve this when a.txt has multiple lines of data.
Perl to the rescue:
$ cat james.txt
>NP_12345.1 matchnumber_1_RKHKK
>NP_56789.2 matchnumber_1_HGRR matchnumber_2_KQRHH matchnumber_3_RVRK matchnumber_4_HTHH
>XP_543421.1 matchnumber_1_RQRH matchnumber_2_QQQQ
$ perl -lne ' /(^\S+) (.+)/;$pre=$1;$mat=$2;while($mat=~/(match.+?_\d+)_(\S+)/g) { print "$pre $1\n$2" } ' james.txt
>NP_12345.1 matchnumber_1
RKHKK
>NP_56789.2 matchnumber_1
HGRR
>NP_56789.2 matchnumber_2
KQRHH
>NP_56789.2 matchnumber_3
RVRK
>NP_56789.2 matchnumber_4
HTHH
>XP_543421.1 matchnumber_1
RQRH
>XP_543421.1 matchnumber_2
QQQQ
$
Explanation
perl -lne
# -l  strips the trailing newline from each input line and appends one to every print
# -n  loops over the input line by line without printing each line automatically
# -e  runs the code given on the command line
' /(^\S+) (.+)/;
Splits the line after the first word: (^\S+) matches the first column and, because of the capturing (), stores it in $1;
the second group (.+) stores the rest of the line in $2.
$pre=$1;$mat=$2;
Assigns $1 to $pre and $2 to $mat.
while($mat=~/(match.+?_\d+)_(\S+)/g)
$mat now holds everything from the second column to the end of the line.
(match.+?_\d+) => captures the label, e.g. "matchnumber_1", in $1
(\S+) => captures the sequence, e.g. "HGRR", in $2
/g => a line can contain many matches, so the match is repeated 'g'lobally and the while loop
visits each one in turn. Without /g each test would start again from the beginning of $mat and only ever find the first match.
{ print "$pre $1\n$2" }
Prints $pre, a space, $1, a newline and $2. These $1 and $2 are set by the match in the while condition;
do not confuse them with the earlier $1 and $2 that were saved into $pre and $mat.
On each iteration $1 and $2 hold the next match, which is then printed.
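For readers more comfortable with Python, the same repeated-match idea can be sketched with re.finditer, which plays the role of the /g loop. This is only an illustrative sketch: the function name to_fasta is hypothetical, the pattern is copied from the Perl one-liner, and the columns are assumed to be separated by whitespace.

```python
import re

# Pattern copied from the Perl answer: a label ending in _<digits>,
# then an underscore, then the sequence (non-whitespace characters).
FIELD_RE = re.compile(r"(match.+?_\d+)_(\S+)")


def to_fasta(line):
    """Return the FASTA lines for one input row (hypothetical helper)."""
    parts = line.split(None, 1)      # first column, rest of the line
    if len(parts) < 2:               # blank line or no match columns
        return []
    prefix, rest = parts
    out = []
    for m in FIELD_RE.finditer(rest):   # like while(... =~ /.../g) in Perl
        out.append(f"{prefix} {m.group(1)}")
        out.append(m.group(2))
    return out


if __name__ == "__main__":
    sample = ">NP_56789.2 matchnumber_1_HGRR matchnumber_2_KQRHH"
    print("\n".join(to_fasta(sample)))
```

As in the Perl output above, the header keeps a space between the accession ID and the label; drop the space in the f-string if the header should be joined directly.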
$ cat jfile
>NP_12345.1 matchnumber_1_RKHKK
>NP_56789.2 matchnumber_1_HGRR matchnumber_2_KQRHH matchnumber_3_RVRK matchnumber_4_HTHH
$ awk -F"\t" '{for(i=2;i
Another Perl one-liner:
perl -anE '($c1,@r)=split/\s+/,$_;for(@r){($c,$v)=$_=~/^(.+)_(.+)$/;say "$c1 $c\n$v"}' file.txt
>NP_12345.1 matchnumber_1
RKHKK
>NP_56789.2 matchnumber_1
HGRR
>NP_56789.2 matchnumber_2
KQRHH
>NP_56789.2 matchnumber_3
RVRK
>NP_56789.2 matchnumber_4
HTHH
>XP_543421.1 matchnumber_1
RQRH
>XP_543421.1 matchnumber_2
RQRH
>XP_543421.1 matchnumber_3
RQRH
Explanation:
($c1,@r)=split/\s+/,$_; # split the line into the first column and the remaining columns
for(@r){ # for each column other than the first
($c,$v)=$_=~/^(.+)_(.+)$/; # capture the text before the last underscore and the text after it
say "$c1 $c\n$v" # print column 1, the label, a line break, then the value
}
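The reason ^(.+)_(.+)$ splits on the last underscore is that the first (.+) is greedy: it takes as much as it can and only backs off far enough to leave one underscore and a non-empty tail. A small Python sketch of the same pattern shows this; split_field and row_to_fasta are hypothetical names chosen for illustration.

```python
import re

# Same pattern as the Perl one-liner; the first group is greedy,
# so the underscore it stops at is the last one in the field.
LAST_UNDERSCORE = re.compile(r"^(.+)_(.+)$")


def split_field(field):
    """Split 'matchnumber_1_HGRR' into ('matchnumber_1', 'HGRR')."""
    m = LAST_UNDERSCORE.match(field)
    if m is None:
        raise ValueError(f"no underscore-separated value in {field!r}")
    return m.group(1), m.group(2)


def row_to_fasta(line):
    """Return the FASTA lines for one whitespace-separated row."""
    words = line.split()
    if not words:
        return []
    out = []
    for field in words[1:]:          # every column except the first
        label, value = split_field(field)
        out.append(f"{words[0]} {label}")
        out.append(value)
    return out


if __name__ == "__main__":
    print(split_field("matchnumber_1_HGRR"))
    print("\n".join(row_to_fasta(">NP_12345.1 matchnumber_1_RKHKK")))
```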
Python is not well suited to one-liners, but it makes parsing the file easy:
parse.py:
import fileinput

for line in fileinput.input():        # process stdin or the files given as arguments
    words = line.split()              # split the line on whitespace
    for w in words[1:]:               # process every word after the first
        ix = w.rindex('_')            # find the last _ in the word
        print(words[0] + w[:ix])      # print the header line
        print(w[ix+1:])               # then the sequence line
You can then use:
cat file | python parse.py
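Or pass the file name as an argument:
python parse.py file
The following (GNU awk, which provides gensub) may work for you: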
awk '{for(i=2;i<=NF;i++){print $1 gensub(/_([^_]+)$/,"\n\\1",1,$i)}}' file
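gensub is specific to GNU awk, so as a rough cross-check the same substitution can be written with Python's re.sub: replace the final "_<text>" of each field with a newline followed by that text, then glue the first column in front. The function names below are hypothetical and the sketch assumes whitespace-separated columns.

```python
import re


def split_last_underscore(field):
    """Turn 'matchnumber_1_HGRR' into 'matchnumber_1\\nHGRR'."""
    # [^_]+$ can only match after the last underscore in the field.
    return re.sub(r"_([^_]+)$", r"\n\1", field, count=1)


def convert(line):
    """Return the FASTA text for one row, header joined without a space."""
    words = line.split()
    return "\n".join(words[0] + split_last_underscore(f) for f in words[1:])


if __name__ == "__main__":
    print(convert(">NP_56789.2 matchnumber_1_HGRR matchnumber_2_KQRHH"))
```

Like the awk command, this joins the accession ID and the label without a space, which matches the header format requested in the question.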
This might work for you (GNU sed):
sed -r ':a;h;/^(\S+)\s+(\S+)_(\S+)\s*(.*)/{s//\1\2\n\3/p;x;s//\1 \4/;ta};d' file
Make a copy of the current line. Use pattern matching to turn the first and second fields in the pattern space into the required format and print the result. Swap to the copy and shorten it by removing the second field and any following whitespace. Repeat until the pattern match fails.
Glad it works quickly! I will add an explanation to the answer.
You can shorten "$c1,@r)=split/\s++;m/^(+++)$/" to "$c1$1\n$2" for(@r)
@stack0114106: True.
I used python parse.py file > output to get the result. Thank you very much.
When the data is more complex, for example:
>NP_12345.1 matchnumber_1_starto=17~21_rkhk
>NP_56789.2 matchnumber_1_starto=26~29_hgr matchnumber_2_starto=98~102_KQRHH matchnumber_3_starto=108~112_rk matchnumber_4_starto=123~126_HTHH
I rewrote your code as: awk -F"\t" '{for(i=2;i
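For the more complex rows quoted in the last comment, splitting each field at its last underscore still separates the sequence from everything before it. The sketch below assumes that the extra "starto=..." part should stay in the header, which the thread does not actually specify; the function name records is hypothetical.

```python
import sys


def records(line):
    """Yield (header, sequence) pairs for one whitespace-separated row."""
    words = line.split()
    for field in words[1:]:
        # rpartition splits at the LAST underscore only, so labels that
        # contain extra underscores or '=' and '~' characters stay intact.
        label, sep, seq = field.rpartition("_")
        if not sep:                  # field without any underscore: skip it
            continue
        yield words[0] + label, seq


if __name__ == "__main__":
    # Reads rows from standard input and writes FASTA to standard output.
    for row in sys.stdin:
        for header, seq in records(row):
            print(header)
            print(seq)
```

Saved under a hypothetical name such as parse_complex.py, it could be run as python parse_complex.py < file > output.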