Regex phrase between article (a/an/the) and number (1-4 digits)

regex bash sed awk grep

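I've tried dozens of regex permutations to solve this, but haven't had any luck.

I need to loop through dozens of files and extract specific phrases that sit between "the/a/an" and a number of 1 to 4 digits, ignoring punctuation such as {}()[].

Example:

The quick brown fox {15} jumps over the lazy dog [20] in a certain way 4 that is definitely not appropriate for all of the viewers (0012).

Should return:

the quick brown fox 15
the lazy dog 20
a certain way 4
the viewers 0012

Removing the punctuation isn't a problem:

sed 's/[][{}()]//g'

Any suggestions?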

grep -ioP "(a|an|the).*?\d{1,4}" files

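-o prints only the matched text, with each match on its own line. -P enables Perl-compatible regular expressions, which provide the reluctant quantifier .*? and \d, and which need no backslashes for grouping and alternation. You can of course pipe this output to sed, as you suggested above.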
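As written, (a|an|the) can also match inside a word, for example the "a" in "that". A minimal sketch of the same pipeline with \b word boundaries added, assuming a GNU grep built with PCRE support; the function name extract_pcre and the boundaries are illustrative additions, not part of the answer:

```shell
# Hypothetical helper: PCRE match from an article to the next 1-4 digits,
# then strip the brackets with the sed command from the question.
extract_pcre() {
  # \b keeps "a" from matching inside words such as "that".
  grep -oiP '\b(?:a|an|the)\b.*?\d{1,4}' "$@" |
    sed 's/[][(){}]//g'
}
```

Called as extract_pcre file1 file2 ..., or with no arguments to read standard input. Case is left untouched, so a leading "The" stays capitalized.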

In GNU awk you can split the input into records that end in a number, optionally surrounded by punctuation:

$ cat file
The quick brown fox {15} jumps over the lazy dog [20] in a certain way 4 that is definitely not appropriate for all of the viewers (0012).


$ gawk -v RS='[[:punct:]]*[[:digit:]]+[[:punct:]]*' 'RT{print $0 RT}' file
The quick brown fox {15}
 jumps over the lazy dog [20]
 in a certain way 4
 that is definitely not appropriate for all of the viewers (0012).

Then you just print the part of the record you want, followed by the record terminator with its punctuation removed:

$ gawk -v RS='[[:punct:]]*[[:digit:]]+[[:punct:]]*' 'RT{print gensub(/.*\y(the|a|an)\y/,"\\1","") gensub(/[[:punct:]]/,"","g",RT)}' file
The quick brown fox 15
the lazy dog 20
a certain way 4
the viewers 0012

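I just noticed that your example converts the output to all lowercase. Just insert $0=tolower($0) before the print (which also takes care of making the article comparison case-insensitive).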
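A sketch of what that change might look like, assuming gawk is installed. The function name extract_gawk is hypothetical, and the third gensub argument is written as 1 here instead of the empty string used earlier:

```shell
# Hypothetical helper: same record splitting as above, with the record
# lowercased before the article is matched.
extract_gawk() {
  gawk -v RS='[[:punct:]]*[[:digit:]]+[[:punct:]]*' '
    RT {
      $0 = tolower($0)   # lowercase the record so "The" matches "the"
      # keep the text from the last article on, then append the digits of RT
      print gensub(/.*\y(the|a|an)\y/, "\\1", 1) gensub(/[[:punct:]]/, "", "g", RT)
    }
  ' "$@"
}
```

Run as extract_gawk file; with the sample file above, every output line should then start with a lowercase article.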
An exercise in pure Bash and its regex matching:
while IFS= read -r line ; do
  line=" $line"                                 # add leading space as word boundary

  while [ -n "$line" ] ; do
    # article, non-digit text, 1-4 digits, then the rest of the line (may be empty)
    [[ "$line" =~ [[:space:]]((an|a|the|An|A|The)([[:space:]]+[^[:digit:]]+)([[:digit:]]{1,4}))(.*)$ ]]

    match="${BASH_REMATCH[2]}${BASH_REMATCH[3]}${BASH_REMATCH[4]}"
    match=${match//[()\[\]\{\}]/}               # remove parentheses, brackets and braces
    [ -n "$match" ] && echo "'$match'"          # print if not empty

    line="${BASH_REMATCH[5]}"                   # the postmatch; empty if nothing matched
  done
done < "$infile"                                # $infile: path of the file to scan

This might work for you (GNU sed). The command is the sed -r one-liner at the end of the thread.
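I prefer using \b(?:a|an|the)\b[^\d.]+\d{1,4}\b.

The client is on OSX, so unfortunately the -P option isn't available. This and @Qtax put me on the right track: grep -iEo "\b(?:a|an|the)\b[^\d.]+\d{1,4}\b". Thanks for the detailed explanation! I really appreciate it.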
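Note that (?:...) and \d are PCRE syntax and are not part of POSIX extended regular expressions, so whether grep -E accepts them depends on the implementation. A hedged sketch of an ERE-only equivalent follows; the function name extract_ere is hypothetical, \b is itself an extension that GNU grep supports and other greps may not, and the lowercasing step is added to match the output asked for in the question:

```shell
# Hypothetical helper: ERE-only variant of the pattern from the comment.
extract_ere() {
  # A plain group replaces (?:...), and [0-9] replaces \d.
  grep -iEo '\b(a|an|the)\b[^0-9.]+[0-9]{1,4}\b' "$@" |
    sed 's/[][(){}]//g' |          # strip brackets, braces and parentheses
    tr '[:upper:]' '[:lower:]'     # lowercase the result
}
```

Because [^0-9.]+ cannot cross a period, a phrase that spans a sentence boundary is not matched, which is the behavior the commenter's pattern was aiming for.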
Output of the Bash loop above:

'The quick brown fox 15'
'the lazy dog 20'
'a certain way 4'
'the viewers 0012'

sed -r '/\b(the|an|a)\b/I!d;s//\n&/;s/[^\n]*\n//;s/\{([0-9]{1,4})\}|\(([0-9]{1,4})\)|\[([0-9]{1,4})\]|\b([0-9]{1,4})\b/\1\2\3\4\n/;P;D' file
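The one-liner is dense, so here is the same script unpacked with one command per line and a comment on each step. This is a reading aid under the assumption of GNU sed (the I modifier, \b, and \n are GNU extensions); the wrapper name extract_with_sed is hypothetical:

```shell
# Hypothetical wrapper around the sed answer, one command per line.
extract_with_sed() {
  sed -r '
    # Delete the pattern space if no article is left (I = case-insensitive).
    /\b(the|an|a)\b/I!d
    # Empty regex reuses the last one: put a newline before the first article
    s//\n&/
    # and discard everything up to and including that newline.
    s/[^\n]*\n//
    # Replace the first {n}, (n), [n] or bare n (1-4 digits) with n and a newline.
    s/\{([0-9]{1,4})\}|\(([0-9]{1,4})\)|\[([0-9]{1,4})\]|\b([0-9]{1,4})\b/\1\2\3\4\n/
    # Print up to the newline, delete that part, and restart on the remainder.
    P
    D
  ' "$@"
}
```

The P;D pair is what turns a single input line into several output lines: each pass prints one phrase and then reruns the script on whatever text follows it.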