Regex: how to move a variable-length column between a vertical bar "|" and "["?
My file has 4000k lines and I need to reformat it, so I am trying Notepad++ (or awk). Every line has this structure:
acc|GENBANK|ABJ91977.1|GENBANK|DQ876324|pol proteinTABULATOR[Human immunodeficiency virus 1]TABULATOR
The characters between the fourth vertical bar | and the first [ are of variable length. I am only looking for hints or pointers. I tried printing with awk, but because one part has a variable length I get different results, and I cannot even select by column.
I want to obtain one file with this structure:
acc|GENBANK|ABJ91977.1|GENBANK|DQ876324,acc|GENBANK|ABJ91977.1|GENBANK|DQ876324,pol protein
and another file with this structure:
acc|GENBANK|ABJ91977.1|GENBANK|DQ876324TABULATOR
Tab characters are written above as the word TABULATOR.
Here is a way to handle the first file:
- Ctrl+H
- Find what: (^[^|]+(?:\|[^|]+){4})\|(.+?)\h+\[.+$
- Replace with: $1,$1,$2
- CHECK Wrap around
- CHECK Regular expression
- UNCHECK . matches newline
- Replace all
Explanation:
(           # group 1
^           # beginning of line
[^|]+       # 1 or more non pipe
(?:         # start non capture group
\|          # a pipe
[^|]+       # 1 or more non pipe
){4}        # end group, must appear 4 times
)           # end group 1
\|          # a pipe
(.+?)       # group 2, 1 or more any character but newline, not greedy
\h+         # 1 or more horizontal spaces (space or tabulation)
\[          # 1 opening square bracket
.+          # 1 or more any character but newline
$           # end of line
Replacement:
$1          # content of group 1
,           # a comma
$1          # content of group 1
,           # a comma
$2          # content of group 2
Result for given example:
acc|GENBANK|ABJ91977.1|GENBANK|DQ876324,acc|GENBANK|ABJ91977.1|GENBANK|DQ876324,pol protein
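For anyone who would rather script the first-file replacement than run it in Notepad++, here is a minimal Python sketch of the same pattern. It is only an illustration under a few assumptions: Python's `re` module has no `\h`, so `[ \t]+` stands in for it; the helper name `to_first_format` is made up; and the `MKV` at the end of the sample line is a placeholder for whatever follows the second tab in the real file.

```python
import re

# Same idea as the Notepad++ pattern; [ \t]+ replaces \h+ because
# Python's re module does not support \h.
FIRST_FILE = re.compile(r"^([^|]+(?:\|[^|]+){4})\|(.+?)[ \t]+\[.+$")


def to_first_format(line: str) -> str:
    """Return 'ids,ids,description' for one line (hypothetical helper).

    Lines that do not match the pattern are returned unchanged.
    """
    return FIRST_FILE.sub(r"\1,\1,\2", line)


# "MKV" is an invented stand-in for the text after the second tab.
sample = (
    "acc|GENBANK|ABJ91977.1|GENBANK|DQ876324|pol protein"
    "\t[Human immunodeficiency virus 1]\tMKV"
)
print(to_first_format(sample))
# acc|GENBANK|ABJ91977.1|GENBANK|DQ876324,acc|GENBANK|ABJ91977.1|GENBANK|DQ876324,pol protein
```

Because the description is captured lazily up to the whitespace before the first `[`, its length does not matter, which is the part that fixed-column selection could not handle.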
For the second file:
- Ctrl+H
- Find what: (^[^|]+(?:\|[^|]+){4})\|.+?\h+\[.+?\](.+)$
- Replace with: $1$2
- CHECK Wrap around
- CHECK Regular expression
- UNCHECK . matches newline
- Replace all
Explanation:
(           # group 1
^           # beginning of line
[^|]+       # 1 or more non pipe
(?:         # start non capture group
\|          # a pipe
[^|]+       # 1 or more non pipe
){4}        # end group, must appear 4 times
)           # end group 1
\|          # a pipe
.+?         # 1 or more any character but newline, not greedy
\h+         # 1 or more horizontal spaces (space or tabulation)
\[          # 1 opening square bracket
.+?         # 1 or more any character but newline, not greedy
\]          # a closing square bracket
(.+)        # group 2, 1 or more any character but newline
$           # end of line
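The second-file replacement can be sketched in Python in the same way. The assumptions are the same as before: `[ \t]+` replaces `\h+`, which Python's `re` module lacks; `to_second_format` is a made-up helper name; and `MKV` is only a placeholder for the real text after the second tab.

```python
import re

# Keeps the five leading fields (group 1) and everything after the
# closing bracket (group 2); the description and the bracketed name
# are dropped. [ \t]+ replaces \h+, which Python's re does not support.
SECOND_FILE = re.compile(r"^([^|]+(?:\|[^|]+){4})\|.+?[ \t]+\[.+?\](.+)$")


def to_second_format(line: str) -> str:
    """Return 'ids<rest of line>' for one line (hypothetical helper).

    Lines that do not match the pattern are returned unchanged.
    """
    return SECOND_FILE.sub(r"\1\2", line)


# "MKV" is an invented stand-in for the text after the second tab.
sample = (
    "acc|GENBANK|ABJ91977.1|GENBANK|DQ876324|pol protein"
    "\t[Human immunodeficiency virus 1]\tMKV"
)
print(repr(to_second_format(sample)))
# 'acc|GENBANK|ABJ91977.1|GENBANK|DQ876324\tMKV'
```

Note that group 2 needs at least one character after the `]`, so a line that ends right at the closing bracket does not match and is left as it is.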
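What does TABULATOR mean? Is it a tab character? Do you really want to duplicate column 5?
TABULATOR is the tab character. Yes, the duplication is needed in the output: the next pipeline uses this output file -_-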