Using awk/sed to extract information from lines with a specific pattern


I have a file like this:

A   10  20  bob.1   ID=bob.1;Parent=bob;conf=XF;Note=bob_v1
A   20  30  bob.2   ID=bob.2;Parent=bob;Note=bob_v1;conf=XF

Using the command line below, I extract the value of conf into a separate column:

sed -Ei 's/(.*conf=)([^;]*)(;.*)/\1\2\3\t\2/g' my_file

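However, this only works when the conf value is followed by a ; and fails otherwise. How can I modify the script so that it extracts the pattern in both cases, as shown below, and puts a tab when the value is empty?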

A   10  20  bob.1   ID=bob.1;Parent=bob;conf=XF;Note=bob_v1  XF
A   20  30  bob.2   ID=bob.2;Parent=bob;Note=bob_v1;conf=XF  XF

I used this link as a reference:

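You can actually drop the ; that follows the second group in the pattern.

[^;]* is a negated bracket expression: it matches zero or more characters that are not ;. The match therefore already stops at the next ; (or at the end of the line), so the ; does not need to appear in the pattern itself.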
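As a minimal sketch of that change, assuming GNU sed (for -E and for \t in the replacement) and assuming the columns are tab-separated, the pattern without the trailing ; handles both sample lines. It reads from a pipe instead of using -i, so no file is edited in place:

```shell
# Sample input from the question; tab-separated columns are an assumption.
input=$(printf 'A\t10\t20\tbob.1\tID=bob.1;Parent=bob;conf=XF;Note=bob_v1\nA\t20\t30\tbob.2\tID=bob.2;Parent=bob;Note=bob_v1;conf=XF\n')

# \2 captures everything after "conf=" up to the next ";" or the end of the line.
# \3 takes whatever is left on the line, which may be empty.
result=$(printf '%s\n' "$input" | sed -E 's/(.*conf=)([^;]*)(.*)/\1\2\3\t\2/')

printf '%s\n' "$result"
```

Both lines should come out with a tab and XF appended, whether or not conf is the last attribute.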


Could you please try the following awk? The code is annotated with explanatory comments:

awk '                                        ## Start of the awk program.
match($0,/conf=[^;]*/){                      ## match() finds "conf=" plus everything up to the next semicolon or the end of the line.
   print $0,substr($0,RSTART+5,RLENGTH-5)    ## Print the line, then the matched text without its 5-character "conf=" prefix (start RSTART+5, length RLENGTH-5).
   next                                      ## Skip the remaining statements for this line.
}                                            ## End of the match() block.
1                                            ## Always-true pattern: lines without a conf= match are printed unchanged.
'  Input_file                                ## Name of the input file.

The output will be as follows:

A   10  20  bob.1   ID=bob.1;Parent=bob;conf=XF;Note=bob_v1 XF
A   20  30  bob.2   ID=bob.2;Parent=bob;Note=bob_v1;conf=XF XF
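The question also asks for a tab when there is nothing to extract. One possible reading, sketched here as an assumption rather than as part of the answer above, is to always append a tab and leave the value empty when conf= is absent. The third input line is hypothetical and was added only to exercise that case:

```shell
# Two lines from the question plus a hypothetical third line with no conf= attribute.
input=$(printf 'A\t10\t20\tbob.1\tID=bob.1;Parent=bob;conf=XF;Note=bob_v1\nA\t20\t30\tbob.2\tID=bob.2;Parent=bob;Note=bob_v1;conf=XF\nA\t30\t40\tbob.3\tID=bob.3;Parent=bob\n')

result=$(printf '%s\n' "$input" | awk '
BEGIN { OFS = "\t" }
{
    val = ""
    # RSTART and RLENGTH describe the matched "conf=..." text; skip the 5 characters of "conf=".
    if (match($0, /conf=[^;]*/))
        val = substr($0, RSTART + 5, RLENGTH - 5)
    # OFS puts a tab between the line and the value, which may be empty.
    print $0, val
}')

printf '%s\n' "$result"
```

With this variant every output line gains exactly one extra tab-separated column, so downstream tools see a consistent column count.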


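Whenever your input data contains name=value pairs, I find it simplest, most robust, and most flexible to build an array that represents that mapping, f[name]=value, so that you can access the values by name. Which variant you need depends on what you mean by "in case in empty to put tab".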
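A minimal sketch of the f[name]=value idea, assuming tab-separated input with the attributes in the last field, and assuming that an absent conf should yield a tab followed by an empty value. The array name pairs, the variable eq, and the third input line are illustrative and not taken from the original answer:

```shell
# Two lines from the question plus a hypothetical third line with no conf= attribute.
input=$(printf 'A\t10\t20\tbob.1\tID=bob.1;Parent=bob;conf=XF;Note=bob_v1\nA\t20\t30\tbob.2\tID=bob.2;Parent=bob;Note=bob_v1;conf=XF\nA\t30\t40\tbob.3\tID=bob.3;Parent=bob\n')

result=$(printf '%s\n' "$input" | awk '
BEGIN { FS = OFS = "\t" }
{
    # Clear the map so values from the previous line do not leak into this one.
    split("", f)
    # Split the attribute field into name=value pairs.
    n = split($NF, pairs, ";")
    for (i = 1; i <= n; i++) {
        # Split each pair at the first "=" only, so values may contain "=".
        eq = index(pairs[i], "=")
        if (eq > 0)
            f[substr(pairs[i], 1, eq - 1)] = substr(pairs[i], eq + 1)
    }
    # Any attribute can now be looked up by name; a missing name gives an empty string.
    print $0, f["conf"]
}')

printf '%s\n' "$result"
```

Because the whole attribute field ends up in f, printing another attribute such as f["Parent"] needs no new regular expression.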

You can try a Perl one-liner:

$ perl -lne ' /conf=(\w+)/ and $_.=" $1"; print ' conf.txt
A   10  20  bob.1   ID=bob.1;Parent=bob;conf=XF;Note=bob_v1 XF
A   20  30  bob.2   ID=bob.2;Parent=bob;Note=bob_v1;conf=XF XF
$

Or even shorter (note that this variant prints only the lines that contain conf=):

$ perl -lne ' /conf=(\w+)/ and print "$_ $1" ' conf.txt
A   10  20  bob.1   ID=bob.1;Parent=bob;conf=XF;Note=bob_v1 XF
A   20  30  bob.2   ID=bob.2;Parent=bob;Note=bob_v1;conf=XF XF


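We should not require the ; in \3, because it is already handled by the excluded-character list in \2.

If we need to cope with separators other than ;, we add them to the character list in \2. Such characters could be \t or a space: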

sed -Ei 's/(.*conf=)([^;\t ]*)(.*)/\1\2\3\t\2/' my_file


This is more or less a direct copy of an answer to a question related to this one:

BEGIN { OFS = FS = "\t" }

function get_attrib_by_name(key,  n,attrib,kv,i) {
    # Split the attribute field on semi-colons.
    n = split($5, attrib, ";")

    # Loop over the attributes and split each on "=".
    # When we've found the one we're looking for (by key name in "key"),
    # return the corresponding value.
    for (i = 1; i <= n; ++i) {
        split(attrib[i], kv, "=")
        if (kv[1] == key) {
            return kv[2]
        }
    }
}

# Using the above function.
{
    name = get_attrib_by_name("conf")
    print $0, name
}


When you say "in case in empty to put tab", do you mean that there should be a tab instead of XF in the output above, or that the XFs above should be preceded by a tab and in the empty case it would be just a tab followed by nothing, or do you mean something else? Include that case in your example input/output.

Running the awk script above, saved as script.awk, on the input file:

$ awk -f script.awk file.gff
A       10      20      bob.1   ID=bob.1;Parent=bob;conf=XF;Note=bob_v1 XF
A       20      30      bob.2   ID=bob.2;Parent=bob;Note=bob_v1;conf=XF XF