Regex: how to move a variable-length column that sits between a vertical bar "|" and a "["?


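My file has 4000k lines and I need to reformat it, so I am trying Notepad++ (or awk). Every line has this structure:

acc|GENBANK|ABJ91977.1|GENBANK|DQ876324|pol protein<TAB>[Human immunodeficiency virus 1]<TAB>

The text between the fifth vertical bar | and the first [ is of variable length. I am only looking for a hint or a pointer. I tried printing with awk, but because one part has variable length I get inconsistent results. I cannot select by column either.

I would like to get one file with this structure:

acc|GENBANK|ABJ91977.1|GENBANK|DQ876324,acc|GENBANK|ABJ91977.1|GENBANK|DQ876324,pol protein

and another file with this structure:

acc|GENBANK|ABJ91977.1|GENBANK|DQ876324<TAB>

Tab characters are shown as <TAB>.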

Here is a way to do it for the first file:

  • Ctrl+H
  • Find what:
    (^[^|]+(?:\|[^|]+){4})\|(.+?)\h+\[.+$
  • Replace with:
    $1,$1,$2
  • check Wrap around
  • check Regular expression
  • UNCHECK
    . matches newline
  • Replace all
Explanation:

(               # group 1
  ^             # beginning of line
  [^|]+         # 1 or more non pipe
  (?:           # start non capture group
    \|          # a pipe
    [^|]+       # 1 or more non pipe
  ){4}          # end group, must appear 4 times
)               # end group 1
\|              # a pipe
(.+?)           # group 2, 1 or more any character but newline, not greedy
\h+             # 1 or more horizontal spaces (space or tabulation)
\[              # 1 opening square bracket
.+              # 1 or more any character but newline
$               # end of line

Replacement:

$1              # content of group 1
,               # a comma
$1              # content of group 1
,               # a comma
$2              # content of group 2

Result for given example:

acc|GENBANK|ABJ91977.1|GENBANK|DQ876324,acc|GENBANK|ABJ91977.1|GENBANK|DQ876324,pol protein
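
For anyone who would rather script this than run it in Notepad++, the first-file substitution can be sketched in Python. This is only an illustration under a few assumptions: the function name `to_first_format` is made up for the example, the sample line is the one from the question with real tab characters, and `[ \t]+` is swapped in for `\h+` because Python's `re` module does not support `\h`.

```python
import re

# Same pattern as the Notepad++ search, with [ \t]+ standing in for \h+
# (Python's re module has no \h escape).
FIRST_FILE = re.compile(r"^([^|]+(?:\|[^|]+){4})\|(.+?)[ \t]+\[.+$")


def to_first_format(line: str) -> str:
    """Return 'prefix,prefix,description'; lines that do not match are returned unchanged."""
    # \1 is the first five pipe-separated fields, \2 is the variable-length description.
    return FIRST_FILE.sub(r"\1,\1,\2", line)


# Sample line from the question, with real tab characters.
sample = "acc|GENBANK|ABJ91977.1|GENBANK|DQ876324|pol protein\t[Human immunodeficiency virus 1]\t"
print(to_first_format(sample))
```

With a file of this size, the function could be applied line by line while reading, so the whole file never has to be held in memory.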


For the second file:

  • Ctrl+H
  • Find what:
    (^[^|]+(?:\|[^|]+){4})\|.+?\h+\[.+?\](.+)$
  • Replace with:
    $1$2
  • check Wrap around
  • check Regular expression
  • UNCHECK
    . matches newline
  • Replace all
Explanation:

(               # group 1
  ^             # beginning of line
  [^|]+         # 1 or more non pipe
  (?:           # start non capture group
    \|          # a pipe
    [^|]+       # 1 or more non pipe
  ){4}          # end group, must appear 4 times
)               # end group 1
\|              # a pipe
.+?             # 1 or more any character but newline, not greedy
\h+             # 1 or more horizontal spaces (space or tabulation)
\[              # 1 opening square bracket
.+?             # 1 or more any character but newline, not greedy
\]              # a closing square bracket
(.+)            # group 2, 1 or more any character but newline
$               # end of line
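
The second-file substitution can be sketched the same way in Python. As before, this is a hedged illustration rather than the answer's own method: `to_second_format` is a hypothetical name, the sample is the question's line with real tabs, and `[ \t]+` replaces `\h+` because Python's `re` module lacks `\h`.

```python
import re

# Same pattern as the second Notepad++ search, with [ \t]+ in place of \h+.
SECOND_FILE = re.compile(r"^([^|]+(?:\|[^|]+){4})\|.+?[ \t]+\[.+?\](.+)$")


def to_second_format(line: str) -> str:
    """Keep the first five fields plus whatever follows the closing bracket."""
    # \1 is the first five pipe-separated fields, \2 is everything after "]".
    return SECOND_FILE.sub(r"\1\2", line)


# Sample line from the question, with real tab characters.
sample = "acc|GENBANK|ABJ91977.1|GENBANK|DQ876324|pol protein\t[Human immunodeficiency virus 1]\t"
result = to_second_format(sample)
print(repr(result))
```

Because group 2 is `(.+)`, at least one character must follow the closing bracket. A line that ends right after `]` would not match and would be returned unchanged.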

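What does TAB mean? Is it a tab character? Do you really want to duplicate the 5 columns?
TAB is a tab character. Yes, the duplication is needed in the output. The next pipeline uses this output file -_-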