Linux: looping through a list of lines in a file with sed

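I am having trouble looping through the list of lines in File A, using each line to find a match in File B, and then printing out multiple lines from File B.

This is what File A looks like: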

Nitab4.5_0000062g0520.1

Nitab4.5_0000436g0070.1

Nitab4.5_0000375g0110.1

This is what File B looks like:

>Nitab4.5_0000062g0520.1 Zinc finger, CCHC-type, Fibronectin-binding A, N-terminal, Domain of unknown function DUF814, Protein of unknown function DUF3441
MVKVRMNTADVAAEVKCLRRLIGMRCSNVYDLSPKTYVFKLMNSSGVTESGESEKVLLLM
ESGVRLHTTDYLRDKSNTPSGFTLKLRKHIRTRRLEDVRQLGYDRIVLFQFGLGANAHYV
ILELYAQGNILLTDSDFMVMTLLRSHRDDDKGLAIMSRHRYPVEICRVFKRTTTEKLQAA
LMSSAETDKNEGVEDNEQGNDGSDALQQKQGNRKNIKATDSTKKMIDGVRAKSPTLKVVL
GEALGYGPALSEHIILDAGLVPNAKIGKGFELEGEMLHSLIEAVKQFEDWLEDVILGEKV
PEGYILMQQKALSKKDSSMCNNGASEKMYDEFCPLLLNQFKSRDFMKFEAFNAALDEFYS
KIESQRSEQQQKAKESTAMQKLNKIRTDQENRVVTLKQEVEHCIKTAELIEYNLEDVDAA
ILAVRVALANGMSWEDLARMVKEEKRSGNPVAGLIDKLHLERNCMTLLLSNNLDEMDDDE
KTQPVDKVEVDLALSAHANARRWYEMKKRQESKQEKTVTAHEKAFKAAERKTRLQLSQEK
TVAVISHMRKVHWFEKFNWFVSSENYLVISGRDAQQNEMIVKRYMSKGDLYVHAELHGAS
STVIKNHKPEMPIPPLTLNQAGCFTVCQSQAWDSKIVTSAWWVYPNQVSKTAPTGEYLTV
GSFMIRGKKNFLPPHPLIMGFGILFRLDESSLGFHLNERRVRGEEEGLNDAEQSDPSLAI
PDSDSEEELSMETSVDKDITDVPNDRSSVAGTSYEVQSNSLLSISDDKVTNSHNSSVKVN
SINNDGLSDSLGIMATSGTSQLEDLIDRALEIGSSTASTKNHGVPPLLGSAGQQDNEEKK
VTQREKPYITKAERRKLKKGSDSTEGAPARQEKQSEKNQKAQKQCDEDVNNSKSGGGKVI
RGQKGKLKKIKEKYADQDEEERRIRMALLASAGKVEKVDQTIQSEKVDAEPDKGAKATTG
PEDASKICYKCKKVGHLSRDCQENSDESLQSTANGGDGHSLTSAGNAANDRDRIVMEEED
IHEIGEEEKEKLNDVDYLTGNPLPNDILLYAVPVCGPYNALQSYKYRVKLVPGTVKKGKA
AKTAMNLFSHMPEATSREKELMKACTDPELVAAVKGNVKITSAGLTQLKQKQKKSKKSNK
AES

>Nitab4.5_0000375g0110.1 Tetratricopeptide-like helical, NSF attachment protein, Tetratricopeptide repeat, Malate dehydrogenase, active site, Tetratricopeptide repeat-containing domain
MGDQIARGEEFEKKAEKKLSGWGLFGSKHDDAADLFDKAANCFKLAKSWDQAGAVYVKVA
NCYLKLDSKHEAAGAYANAAHCYKKTNTREAISCLEQAVHMFLDIGRLNMSARYYKEIAE
LYEQEQNLEQAIIYYEKAADLFQSEDVTTSANQCKQKIAQFSAELEKYQRAIEIFEEIAR
HSVNNNLLKYGVRGHLLNAGICQLCKGDVVAINNALERYQELDPTFSGTRECKLLVDLAA
AIDEEDVAKFTGSVKEYDSMTKLDALRTTLLLRVKEALKAKELEEDDLT

>Nitab4.5_0000062g0530.1 DNA polymerase, palm domain, DNA-directed DNA polymerase, family B, conserved site, DNA-directed DNA polymerase, family B, multifunctional domain, DNA-directed DNA polymerase, family B
MARVTGVPISFLLARGQSIKVLSQLLRKARQRNLVIPNVKQAGSEQGTYEGATVLEARAG
FYEKPIATLDFASLYPSIMMAYNLCYCTLVTPEEFHKLNLCEVDVNKTPSGEMFVKSDLQ
KGILPEILEELLAARKRAKADLKEAKDPLVKAVLDGRQLALKISANSVYGFTGATVGQLP
CLEISSSVTSYGRQMIEKTKKLVEDKFTVLKGYEHNAEVIYGDTDSVMVQFGVPTVEEAM
KLGREAADHISETFIKPLRLEFEKIYYPYLLISKKRYAGLLWTNPDKHDKMDAKGELLAT

>Nitab4.5_0005502g0010.1 CDC6, C-terminal domain, P-loop containing nucleoside triphosphate hydrolase, Cell division protein Cdc6/18, Winged helix-turn-helix DNA-binding domain
MPTIPVRRSPRISGGSKVAGVQTVSRNEIGVSTPSKIRSDSTTEDNVTSTLTPSPMI
SPCKWKSPRCRCVNDSPLNARGDKTINLSKSPVKRLSFLEKPIWNPRDMEQLNA
VKEALHVSRAPSNLVCRQVEQNRVLEFCKQAVKIEKAGSLYVCGCPGTGKSLSMEKVKEV
LVNWADESGFQAPDILSVNCTSLSNTSDIFGKMLDKIQPRRKLNCSTAPLQYLQKMFSEK
QQPAGTKMLIVADLDYLITKVVLHELFMLTTSPFSRFILIGIANAIDLADRFLPKL
QSMNCEYFPSCKPAVITFCAYSKDQIISILQQRFEKVASASGDMRKALWVCRLVNIARL
[...]
ALSKAYRSPVVDTIQSLPQHQQIILCSAVKLFRGKKDATIGELNISYLDVCKSTLIPPV
GIMELSCMRVLGDQGILKVGKAREKLSRVTLKVDEADITFALQA

>Nitab4.5_0005502g0020.1
MVIEQCDDEGVQPYEQLMDGQNYSQAQTCHDGQSNDFNNSADTEQNDSDSGTIDVQI
NSRNQFIGKEGRKLASFLGIVATPELTPLQCKKWD

>Nitab4.5_0005502g0030.1
[...]
IELTKLNGKQNEEMSSMKPELLWMRKVMCKIAPNELYMSKNINEISIGQVTQKKFV
LKH

>Nitab4.5_0005502g0040.1 Ribosomal protein L10/acidic P0, Ribosomal protein L10/L12
MAVKVTKAEKVNYDKKLCKLLDTYQQILIVGADNVGSNQLQMIRKGLRGDSIVLMGKNT
MMKRSIRIHAKTGNNAFLIPCLVGNVGLIFTRGDLKEVSDEVSDEVSKYKVGAPARVGLVA
PIDVVPPGNTGLDPSQTSFFQVLNIPTKINKGTVEITIPVEIIKGEKVGSSESALLSK
LGIKPFSYGLIVQFVYDSGVFSPEVLDLTEDDLIAKFAAGLSNVGLSMLLSYPTLAI
PHMFINGYKNVLSFAATEYSFPQAEKVKEYLKDPSKFATAIAAPVATKPAVKPAVATAKEE
KKEEPAEDDFVGGLFD

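If a gene ID Nitab4.5_xxxxx from File A is found in File B, I would like to print the description line that starts with >NitabXXXX together with the amino acid sequence (in capital letters) that follows it. In File B the amino acid sequence is split over multiple lines.

This is the code I have come up with so far: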

while IFS= read -r Gene_ID; do sed -n '/$Gene_ID/,/>Nitab4.5/p' File B | sed '$d'; done < File A 

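The code works with a single, hard-coded gene ID and no loop. After adding the loop I cannot get it to work. I am new to Linux and sed. I hope someone can point out the mistake and help me correct the code. Thanks.

Your question is a bit confusing, but is this simple command what you need?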

grep -f FILE_A -A 1 FILE_B
The options do the following:

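-f FILE
    Obtain patterns from FILE, one per line. The empty file contains zero patterns, and therefore matches nothing.

-A NUM
    Print NUM lines of trailing context after matching lines. Places a line containing a group separator (described under --group-separator) between contiguous groups of matches.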
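The effect of these two options can be tried on a small throwaway data set. The following is a minimal sketch, not part of the original answer: the file contents are made up, and the extra steps (removing empty lines from the pattern file and adding -F so that the dots in the IDs are taken literally) are assumptions about what is wanted here. It also shows the limitation of -A 1 for this task: only one sequence line is printed after each header.

```shell
# Made-up sample data standing in for FILE_A and FILE_B.
tmpdir=$(mktemp -d)
printf 'id.1\n\nid.3\n' > "$tmpdir/FILE_A"
printf '>id.1 one\nAAA\nCCC\n\n>id.2 two\nGGG\n\n>id.3 three\nTTT\n' > "$tmpdir/FILE_B"

# Drop empty lines from the pattern list first: an empty pattern matches every line.
grep -v '^$' "$tmpdir/FILE_A" > "$tmpdir/patterns"

# -F treats each pattern as a fixed string, so the dots in the IDs are literal.
# -A 1 prints only ONE line after each match, so the second sequence line (CCC) is lost.
matches=$(grep -F -f "$tmpdir/patterns" -A 1 "$tmpdir/FILE_B")
printf '%s\n' "$matches"

rm -r "$tmpdir"
```

Because the number of sequence lines differs from record to record, a fixed -A count cannot capture whole records; that is what the range-based sed and record-based awk approaches address.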


First, let's try printing the third entry in FileB. I am calling it FileB rather than "File B", because a space in a file name is a big nuisance.

sed -n '/Nitab4.5_0000062g0530.1/,/>Nitab4.5/p' FileB
>Nitab4.5_0000062g0530.1 DNA polymerase, palm domain, DNA-directed DNA polymerase, family B, conserved site, DNA-directed DNA polymerase, family B, multifunctional domain, DNA-directed DNA polymerase, family B
MARVTGVPISFLLARGQSIKVLSQLLRKARQRNLVIPNVKQAGSEQGTYEGATVLEARAG
FYEKPIATLDFASLYPSIMMAYNLCYCTLVTPEEFHKLNLCEVDVNKTPSGEMFVKSDLQ
KGILPEILEELLAARKRAKADLKEAKDPLVKAVLDGRQLALKISANSVYGFTGATVGQLP
CLEISSSVTSYGRQMIEKTKKLVEDKFTVLKGYEHNAEVIYGDTDSVMVQFGVPTVEEAM
KLGREAADHISETFIKPLRLEFEKIYYPYLLISKKRYAGLLWTNPDKHDKMDAKGELLAT

>Nitab4.5_0005502g0010.1 CDC6, C-terminal domain, P-loop containing nucleoside triphosphate hydrolase, Cell division protein Cdc6/18, Winged helix-turn-helix DNA-binding domain
It picks up the first line of the next entry. So instead of terminating at >Nitab4.5, we terminate at a blank line:

sed -n '/Nitab4.5_0000062g0530.1/,/^$/p' FileB
>Nitab4.5_0000062g0530.1 DNA polymerase, palm domain, DNA-directed DNA polymerase, family B, conserved site, DNA-directed DNA polymerase, family B, multifunctional domain, DNA-directed DNA polymerase, family B
MARVTGVPISFLLARGQSIKVLSQLLRKARQRNLVIPNVKQAGSEQGTYEGATVLEARAG
FYEKPIATLDFASLYPSIMMAYNLCYCTLVTPEEFHKLNLCEVDVNKTPSGEMFVKSDLQ
KGILPEILEELLAARKRAKADLKEAKDPLVKAVLDGRQLALKISANSVYGFTGATVGQLP
CLEISSSVTSYGRQMIEKTKKLVEDKFTVLKGYEHNAEVIYGDTDSVMVQFGVPTVEEAM
KLGREAADHISETFIKPLRLEFEKIYYPYLLISKKRYAGLLWTNPDKHDKMDAKGELLAT
Now do the same thing with a variable:

line=Nitab4.5_0000062g0530.1; sed -n '/$line/,/^$/p' FileB
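We get nothing, because inside single quotes the shell passes $line to sed unexpanded, and sed has its own idea of what that means. To make the shell expand the variable before passing it to sed, use double quotes: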

line=Nitab4.5_0000062g0530.1; sed -n "/$line/,/^$/p" FileB
>Nitab4.5_0000062g0530.1 DNA polymerase, palm domain, DNA-directed DNA polymerase, family B, conserved site, DNA-directed DNA polymerase, family B, multifunctional domain, DNA-directed DNA polymerase, family B
MARVTGVPISFLLARGQSIKVLSQLLRKARQRNLVIPNVKQAGSEQGTYEGATVLEARAG
FYEKPIATLDFASLYPSIMMAYNLCYCTLVTPEEFHKLNLCEVDVNKTPSGEMFVKSDLQ
KGILPEILEELLAARKRAKADLKEAKDPLVKAVLDGRQLALKISANSVYGFTGATVGQLP
CLEISSSVTSYGRQMIEKTKKLVEDKFTVLKGYEHNAEVIYGDTDSVMVQFGVPTVEEAM
KLGREAADHISETFIKPLRLEFEKIYYPYLLISKKRYAGLLWTNPDKHDKMDAKGELLAT
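The quoting difference can be checked on a throwaway file. This is a minimal sketch and not part of the original answer; the file name demo.fa and its contents are made up for illustration. Inside the double quotes the end-of-line $ is written as \$, which is not strictly required here but keeps the shell from ever reading it as the start of a variable name.

```shell
# Hypothetical demo data: two records separated by a blank line.
tmpdir=$(mktemp -d)
printf '>id1 first\nAAA\nCCC\n\n>id2 second\nGGG\n' > "$tmpdir/demo.fa"

line=id1

# Single quotes: sed receives the literal text $line, which does not occur in the file.
single=$(sed -n '/$line/,/^$/p' "$tmpdir/demo.fa")

# Double quotes: the shell expands $line to id1 before sed runs.
double=$(sed -n "/$line/,/^\$/p" "$tmpdir/demo.fa")

printf 'single quotes: [%s]\n' "$single"
printf 'double quotes: [%s]\n' "$double"

rm -r "$tmpdir"
```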
If this is satisfactory, we can start on the loop. Always start with something simple:

while read line; do echo $line; done < FileA
Nitab4.5_0000062g0520.1

Nitab4.5_0000436g0070.1

Nitab4.5_0000375g0110.1
Now we put it all together:

sed '/^$/d' FileA | while read line; do sed -n "/$line/,/^$/p" FileB; done 
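The same loop can be written a little more defensively. What follows is a sketch rather than the answerer's code: the function name print_records and the sample data are hypothetical. It reads each ID with IFS= read -r, skips empty lines inside the loop, escapes the dots in the ID (an unescaped . matches any character in a sed pattern), and anchors the match to the > at the start of a header line. It relies on sed -E for extended regular expressions, which GNU and BSD sed both accept.

```shell
# print_records IDS FASTA: hypothetical helper, not from the original answer.
# For every non-empty line in IDS, print the matching record of FASTA,
# from its ">ID" header line through the blank line that ends the record.
print_records() {
    while IFS= read -r id; do
        [ -n "$id" ] || continue                              # skip blank lines
        pattern=$(printf '%s\n' "$id" | sed 's/[.]/\\./g')    # make dots literal
        # The ID must be followed by a blank or by the end of the header line.
        sed -n -E "/^>${pattern}([[:blank:]]|\$)/,/^\$/p" "$2"
    done < "$1"
}

# Made-up sample data standing in for FileA and FileB.
tmpdir=$(mktemp -d)
printf 'id.1\n\nid.3\n' > "$tmpdir/FileA"
printf '>id.1 one\nAAA\nCCC\n\n>idx1 two\nGGG\n\n>id.3 three\nTTT\n' > "$tmpdir/FileB"

result=$(print_records "$tmpdir/FileA" "$tmpdir/FileB")
printf '%s\n' "$result"

rm -r "$tmpdir"
```

With the dots escaped, the record >idx1 is not selected, which it would be if id.1 were used as a pattern unchanged. Note that this still scans FileB once per ID, so for long ID lists a single-pass tool such as awk is likely to be faster.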

Thank you for updating your input file. If awk is an option for you, please try the following:

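awk '
    # Treat each blank-line-separated block as one record, each line as one field.
    BEGIN {RS=ORS="\n\n"; FS="\n"}
    # First file (FileA): remember every gene ID as a key of the array nitab.
    NR==FNR {
        for (i=1; i<=NF; i++) nitab[$i]
        next
    }
    # Second file (FileB): $1 is the description line of the record.
    {
        if (match($1, /^>[^[:blank:]]+/)) {
            # Take the matched token without the leading ">" so that it can
            # be compared with the IDs from FileA, which have no ">".
            str = substr($1, 2, RLENGTH - 1)
            if (str in nitab) print
        }
    }
' FileA FileB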
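[Explanation]

The BEGIN block sets the input and output record separators to a double newline and the field separator to a newline. This lets each paragraph, made up of a description line and its amino acid lines, be processed as one record.
The condition NR==FNR is true only while the first file in the argument list (FileA) is being read. This idiom can be used to switch processing depending on the input file.
The loop for (i=1; i<=NF; i++) stores every line of FileA as a key of the array nitab.
The call match($1, /^>[^[:blank:]]+/) picks out the >NitabXXX token at the start of a record in FileB, which corresponds to a line of FileA. The matched substring is then assigned to the variable str.
If str matches any entry of the array nitab, the record is printed.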
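A multi-character RS such as "\n\n" is handled as a regular expression by gawk and mawk, but POSIX leaves the result unspecified when RS has more than one character. As a hedged alternative, the sketch below uses awk's standard paragraph mode (an empty RS) for FileB only. The sample data is made up, and the approach assumes that records in FileB are separated by blank lines and that each header starts with > followed directly by the ID.

```shell
# Made-up sample data standing in for FileA and FileB.
tmpdir=$(mktemp -d)
printf 'id.1\n\nid.3\n' > "$tmpdir/FileA"
printf '>id.1 one\nAAA\nCCC\n\n>id.2 two\nGGG\n\n>id.3 three\nTTT\n' > "$tmpdir/FileB"

# FileA is read line by line with the default RS; "RS=" on the command line
# switches to paragraph mode just before FileB is opened.
selected=$(awk '
    NR == FNR { if (NF) ids[$1]; next }          # remember every non-empty ID
    (substr($1, 2) in ids) { print; print "" }   # $1 is ">ID"; drop the ">"
' "$tmpdir/FileA" RS= "$tmpdir/FileB")
printf '%s\n' "$selected"

rm -r "$tmpdir"
```

In paragraph mode a newline always separates fields, so with the default FS the first field of each record is the >ID token of the description line, and the whole record is printed unchanged when that ID was listed in FileA.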
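According to your explanation the description line should start with >Nitab, but the File B you provided does not have the > character at the start of the line. Which is correct? Also, I cannot find any match for the NitabXXXX strings between File A and File B.

In my File B the description lines do start with the > symbol. I copied and pasted part of File B into the question, which I guess is why no match could be found. I don't know why the > symbol does not show up in the question. I have re-edited the question, and the entries in File B now match the first and third lines of File A. Thanks for your help! Sorry for the confusion.

In File B, each gene is on one line, right?

In File B, the description line of each gene, which starts with Nitab4.5, is on a single line. The amino acid sequence that follows, however, spans several lines, and the number of lines differs from gene to gene.

@jing, the > is rendered incorrectly because it is at the start of those lines. Please put 4 spaces in front of the lines and then check the formatting.

Thanks for your help! I tried this code and it partly works. The problems are: 1) the amino acid sequence (in capital letters) spans several lines, and 2) the code prints the description line plus one following line of amino acid sequence, and only for the match with the last line in File A.

Thanks very much! Really appreciate it!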
>Nitab4.5_0000062g0520.1 Zinc finger, CCHC-type, Fibronectin-binding A, N-terminal, Domain of unknown function DUF814, Protein of unknown function DUF3441
MVKVRMNTADVAAEVKCLRRLIGMRCSNVYDLSPKTYVFKLMNSSGVTESGESEKVLLLM
ESGVRLHTTDYLRDKSNTPSGFTLKLRKHIRTRRLEDVRQLGYDRIVLFQFGLGANAHYV
ILELYAQGNILLTDSDFMVMTLLRSHRDDDKGLAIMSRHRYPVEICRVFKRTTTEKLQAA
LMSSAETDKNEGVEDNEQGNDGSDALQQKQGNRKNIKATDSTKKMIDGVRAKSPTLKVVL
GEALGYGPALSEHIILDAGLVPNAKIGKGFELEGEMLHSLIEAVKQFEDWLEDVILGEKV
PEGYILMQQKALSKKDSSMCNNGASEKMYDEFCPLLLNQFKSRDFMKFEAFNAALDEFYS
KIESQRSEQQQKAKESTAMQKLNKIRTDQENRVVTLKQEVEHCIKTAELIEYNLEDVDAA
ILAVRVALANGMSWEDLARMVKEEKRSGNPVAGLIDKLHLERNCMTLLLSNNLDEMDDDE
KTQPVDKVEVDLALSAHANARRWYEMKKRQESKQEKTVTAHEKAFKAAERKTRLQLSQEK
TVAVISHMRKVHWFEKFNWFVSSENYLVISGRDAQQNEMIVKRYMSKGDLYVHAELHGAS
STVIKNHKPEMPIPPLTLNQAGCFTVCQSQAWDSKIVTSAWWVYPNQVSKTAPTGEYLTV
GSFMIRGKKNFLPPHPLIMGFGILFRLDESSLGFHLNERRVRGEEEGLNDAEQSDPSLAI
PDSDSEEELSMETSVDKDITDVPNDRSSVAGTSYEVQSNSLLSISDDKVTNSHNSSVKVN
SINNDGLSDSLGIMATSGTSQLEDLIDRALEIGSSTASTKNHGVPPLLGSAGQQDNEEKK
VTQREKPYITKAERRKLKKGSDSTEGAPARQEKQSEKNQKAQKQCDEDVNNSKSGGGKVI
RGQKGKLKKIKEKYADQDEEERRIRMALLASAGKVEKVDQTIQSEKVDAEPDKGAKATTG
PEDASKICYKCKKVGHLSRDCQENSDESLQSTANGGDGHSLTSAGNAANDRDRIVMEEED
IHEIGEEEKEKLNDVDYLTGNPLPNDILLYAVPVCGPYNALQSYKYRVKLVPGTVKKGKA
AKTAMNLFSHMPEATSREKELMKACTDPELVAAVKGNVKITSAGLTQLKQKQKKSKKSNK
AES

>Nitab4.5_0000375g0110.1 Tetratricopeptide-like helical, NSF attachment protein, Tetratricopeptide repeat, Malate dehydrogenase, active site, Tetratricopeptide repeat-containing domain
MGDQIARGEEFEKKAEKKLSGWGLFGSKHDDAADLFDKAANCFKLAKSWDQAGAVYVKVA
NCYLKLDSKHEAAGAYANAAHCYKKTNTREAISCLEQAVHMFLDIGRLNMSARYYKEIAE
LYEQEQNLEQAIIYYEKAADLFQSEDVTTSANQCKQKIAQFSAELEKYQRAIEIFEEIAR
HSVNNNLLKYGVRGHLLNAGICQLCKGDVVAINNALERYQELDPTFSGTRECKLLVDLAA
AIDEEDVAKFTGSVKEYDSMTKLDALRTTLLLRVKEALKAKELEEDDLT