Awk: Remove specific parts of a file based on another file in bash
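I am looking for a bash command that removes a record from File2 when the ID that follows ">" in its header is also listed in File1.
Here is an example: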
File1:
CUI02270
CUI02272
CUI02271
CUI02290
CUI02289
CUI022799
File2:
>CUI02270 |hypothetical protein pCPXV0248[Cowpox virus]
MGTVFVPYLLVKLALRVLVISNGYCHVPLKYIVLMIAHRVLLSSIVESTTLDIPDLRSTM
ELILLTASRLKFNLYRPNL
>CUI02271 |CPXV043 protein[Cowpox virus]
MLAFCYSLPNVGDVLKGKVYENGYALYIDLFDYPHSEAILAESVQMHMNRYFKYRDKLVG
KTVKVKVIRVDYTKGYIDVNYKRMCKHQ
>CUI02272 |hypothetical protein pCPXV0245[Cowpox virus]
MFTHPFVIDIYISFCIINSNHFNFYSFPYQFIPIFKISIHMHLNTLCQDSFRVRIVKKIN
V
>CUI02273 |CPXV044 protein[Cowpox virus]
MNPDNTIAVITETIPIGMQFDKVYLSTFNVWREILSNTTKTLDISSFYWSLLDEVGTNFG
TTILNEIVQLPKRGVRVRVAVNKSNKPLKDVETLQMAGVEVRYIDITNILGGVLHTKFWI
SDNTHIYLGSANMDWRSLTQVKELGIAIFNNRNLAADLTQIFEVYWYLGVNNLPYNWKNF
YPAYYNTDHPLSMNVSGVPHSVFIASAPQQLCTMERTNDLTALLSCIGNASKFVYVSVMN
FIPIIYSKAGNILFWPYIEDELRRTAIDRKVSVKLLISCWQRSSFIMRNFLRSIAMLKSK
NIDIEVKLFIVPDTDPPIPYSRVNHAKYMVTDKTAYIGTSNWTGNYFTDTCGTSINITPD
DGLGLRQQLEDIFMRDWNSKYSYELYDTSPTKRCRLLKNMKQCTNDIYSDEIQPEKEIPE
YSLE
>CUI02274 |CPXV045 protein[Cowpox virus]
MSANCMFNLDNDYIYCKYWKPITYPKALVFISHGAGEHSGRYDELAENISSLGILVFSHD
HIGHGRSNGEKMMIDDFGTYVRDVVQHVVTIKSTYPGVPVFLLGHSMGATISILAAYENP
NLFTAMILMSPLVNAEAVPRLNLLAAKLMGAITPNAPVGKLCPESVSRDMDEVYKYQYDP
LVNHEKIKAGFASQVLKATNKVRKIIPKINTPSLILQGTNNEISDVSGAYYFMQHANCNR
EIKIYEGAKHHLHKETDEVKKSVMKEIETWIFNRVK
>CUI022799 |CPXV046 protein[Cowpox virus]
MATKSDYEDAVFYFVDDDEICSRDSIIDLIDEYITWRNHVIVFNKDITSCGRLYKELMKF
DDAAIRYYGIDKINEIVEAMSEGDHYINLTEVHDQESLFATIGICAKITEHWGYKKISES
RFQSLGNITDLMTDDNINILILFLEKKLN
>CUI02276 |hypothetical protein pCPXV0240[Cowpox virus]
MDFCKIDVVVSFAHSLDNLINFINTIVPYSSIIELHQFLVESSTTGNIFVKHYNMISPRD
IFIY
I should then get a new File3 such as:
>CUI02273 |CPXV044 protein[Cowpox virus]
MNPDNTIAVITETIPIGMQFDKVYLSTFNVWREILSNTTKTLDISSFYWSLLDEVGTNFG
TTILNEIVQLPKRGVRVRVAVNKSNKPLKDVETLQMAGVEVRYIDITNILGGVLHTKFWI
SDNTHIYLGSANMDWRSLTQVKELGIAIFNNRNLAADLTQIFEVYWYLGVNNLPYNWKNF
YPAYYNTDHPLSMNVSGVPHSVFIASAPQQLCTMERTNDLTALLSCIGNASKFVYVSVMN
FIPIIYSKAGNILFWPYIEDELRRTAIDRKVSVKLLISCWQRSSFIMRNFLRSIAMLKSK
NIDIEVKLFIVPDTDPPIPYSRVNHAKYMVTDKTAYIGTSNWTGNYFTDTCGTSINITPD
DGLGLRQQLEDIFMRDWNSKYSYELYDTSPTKRCRLLKNMKQCTNDIYSDEIQPEKEIPE
YSLE
>CUI02274 |CPXV045 protein[Cowpox virus]
MSANCMFNLDNDYIYCKYWKPITYPKALVFISHGAGEHSGRYDELAENISSLGILVFSHD
HIGHGRSNGEKMMIDDFGTYVRDVVQHVVTIKSTYPGVPVFLLGHSMGATISILAAYENP
NLFTAMILMSPLVNAEAVPRLNLLAAKLMGAITPNAPVGKLCPESVSRDMDEVYKYQYDP
LVNHEKIKAGFASQVLKATNKVRKIIPKINTPSLILQGTNNEISDVSGAYYFMQHANCNR
EIKIYEGAKHHLHKETDEVKKSVMKEIETWIFNRVK
>CUI02276 |hypothetical protein pCPXV0240[Cowpox virus]
MDFCKIDVVVSFAHSLDNLINFINTIVPYSSIIELHQFLVESSTTGNIFVKHYNMISPRD
IFIY
where
CUI02270; CUI02272; CUI02271; CUI022799
have been removed because they are present in both File1 and File2.
Does anyone have an idea?
Thank you for your help.

You are working with FASTA files, which are easy to process with awk:
$ awk '(NR==FNR){list[$1];next}
/^>/{key=$0;sub(/^> */,"",key);sub(/ *[|].*$/,"",key);f=1}
(key in list) {f=0}
f' file1 file2
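This works as follows:
(NR==FNR){list[$1];next}
: while reading the first file (NR==FNR), store each entry as an index of the array list and move on to the next record.
/^>/{key=$0;sub(/^> */,"",key);sub(/ *[|].*$/,"",key);f=1}
: when a sequence header is encountered, extract the key located between ">" and the first "|". The two sub calls strip everything irrelevant, and the flag f is initialised to 1. This flag indicates whether the sequence should be printed.
(key in list) {f=0}
: check whether the key of the current sequence is in the list. If so, set the flag f to 0, because we do not want to print that sequence.
f
: if f is 1, the current line is printed (a pattern without an action defaults to print).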
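To see the steps above in action, here is a minimal, self-contained sketch. The file names (demo_ids.txt, demo.fasta, demo_filtered.fasta) and the tiny records in them are made up for illustration and are not part of the question. It also shows that matching is exact: listing ID3 does not remove ID30.

```shell
#!/bin/sh
# Hypothetical demo input: the IDs to drop (plays the role of File1).
printf '%s\n' 'ID1' 'ID3' > demo_ids.txt

# Hypothetical demo FASTA (plays the role of File2).
printf '%s\n' \
  '>ID1 |protein one' 'AAAA' 'CC' \
  '>ID2 |protein two' 'GGGG' \
  '>ID3 |protein three' 'TTTT' \
  '>ID30 |protein thirty' 'MMMM' > demo.fasta

# Same awk program as above; the output is redirected to a third file.
awk '(NR==FNR){list[$1];next}                 # first file: remember each ID
     /^>/{key=$0;sub(/^> */,"",key);sub(/ *[|].*$/,"",key);f=1}  # header: extract ID, assume "keep"
     (key in list){f=0}                       # ID is listed: stop printing this record
     f' demo_ids.txt demo.fasta > demo_filtered.fasta

cat demo_filtered.fasta
```

Only the ID2 and ID30 records should remain in demo_filtered.fasta. Note that the NR==FNR idiom assumes the first file is not empty; with an empty ID list, the lines of the second file would be treated as IDs.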
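Sorry, this is not how StackOverflow works. Questions of the form "I want to do X, please give me tips and/or sample code" are considered off-topic. Please visit and read, especially read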