How to use sed/grep to extract text between two words?

string bash sed grep

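I am trying to output a string that contains everything between two words of a string:

Input:

"Here is a String"

Output:

"is a"

Using:

sed -n '/Here/,/String/p'

includes the endpoints, but I don't want to include them.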

You can use \1 to refer back to a captured group:

sed -e 's/Here\(.*\)String/\1/'

The contents inside the brackets are stored as \1.

You can strip strings in Bash alone:
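
A minimal sketch of the Bash-only approach, using parameter expansion instead of an external tool. The variable name `foo` and the choice to fold the adjacent spaces into the markers are illustrative assumptions:

```shell
foo="Here is a String"
foo="${foo#*Here }"     # remove the shortest prefix ending in "Here "
foo="${foo% String*}"   # remove the shortest suffix starting with " String"
printf '%s\n' "$foo"
```

Nothing is forked here, so this tends to be cheap inside loops, but it only works on a value already held in a variable.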

If you have a GNU grep that includes PCRE support, you can use zero-width assertions:

$ echo "Here is a String" | grep -Po '(?<=(Here )).*(?= String)'
is a

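This might work for you (GNU sed):

This will present each occurrence of text between two markers (in this instance Here and String) on a new line and preserve newlines within the text.

GNU grep can also support positive and negative look-ahead and look-behind. For your case, the command would be: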
echo "Here is a string" | grep -o -P '(?<=Here).*(?=string)'
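
Because `.*` is greedy, a pattern like the one above runs to the last `string` on the line. A sketch, assuming GNU grep built with PCRE support: the lazy quantifier `.*?` stops at the nearest end marker, and `-o` prints each match on its own line. The function name `extract_all` and the sample sentence are made up for illustration:

```shell
# Print every span that sits between "Here " and " string", one per line.
extract_all() {
  grep -oP '(?<=Here ).*?(?= string)'
}

printf '%s\n' "Here is a string, and Here comes another string" | extract_all
```

With this input the two spans printed are `is a` and `comes another`.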

If you have a long file with many multi-line occurrences, it is useful to number the lines first:
cat -n file | sed -n '/Here/,/String/p'

Through GNU awk:
$ echo "Here is a string" | awk -v FS="(Here|string)" '{print $2}'
 is a 
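
The leading and trailing spaces in that output are there because the separators do not include them. A sketch that folds the spaces into the field separator; a multi-character `FS` is treated as a regular expression by POSIX awk, so this variant should not be limited to GNU awk:

```shell
# Fields are split on either "Here " or " string"; the text between is field 2.
echo "Here is a string" | awk -F 'Here | string' '{ print $2 }'
```

This prints `is a` without surrounding spaces.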

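grep with the -P (perl-regexp) parameter supports \K, which discards the previously matched characters. In our case, the previously matched string was Here, so it is discarded from the final output.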
$ echo "Here is a string" | grep -oP 'Here\K.*(?=string)'
 is a 
$ echo "Here is a string" | grep -oP 'Here\K(?:(?!string).)*'
 is a 

If you want the output to be "is a", without the surrounding spaces, you could try the following:
$ echo "Here is a string" | grep -oP 'Here\s*\K.*(?=\s+string)'
is a
$ echo "Here is a string" | grep -oP 'Here\s*\K(?:(?!\s+string).)*'
is a

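All of the above solutions have deficiencies when the last search string is repeated elsewhere in the string. I found it best to write a bash function:
    function str_str {
      local str
      # Drop everything up to and including the first occurrence of $2.
      str="${1#*"$2"}"
      # Drop everything from the first occurrence of $3 onward.
      str="${str%%"$3"*}"
      printf '%s' "$str"
    }

    # test it ...
    mystr="this is a string"
    str_str "$mystr" "this " " string"   # prints: is a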
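
To see why this copes with a repeated end marker: `#` removes the shortest matching prefix (through the first start marker) and `%%` removes the longest matching suffix (from the first end marker onward). A sketch comparing that with the greedy behaviour of the sed/grep answers; the helper names `between_first` and `between_widest` are hypothetical:

```shell
# First start marker up to the first end marker after it.
between_first() {
  local s="${1#*"$2"}"        # shortest prefix through the first start marker
  printf '%s' "${s%%"$3"*}"   # longest suffix, beginning at the first end marker
}

# First start marker up to the last end marker (what a greedy .* gives).
between_widest() {
  local s="${1#*"$2"}"
  printf '%s' "${s%"$3"*}"    # shortest suffix, beginning at the last end marker
}

text="Here is a String and another String"
between_first  "$text" "Here " " String"; echo   # is a
between_widest "$text" "Here " " String"; echo   # is a String and another
```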

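The accepted answer does not remove text that could be before Here or after String. This will:
sed -e 's/.*Here\(.*\)String.*/\1/'

The main difference is the addition of .* immediately before Here and after String.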
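
One caveat with a whole-line substitution like this: lines that do not match pass through unchanged. A sketch that prints only the lines where the substitution succeeded, using sed's `-n` option together with the `p` flag. The two input lines are invented for the example, and the pattern here also folds the adjacent spaces into the markers:

```shell
printf '%s\n' "no markers on this line" "xx Here is a String yy" |
  sed -n 's/.*Here \(.*\) String.*/\1/p'
```

Only `is a` is printed; the first input line is suppressed.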

My stored Claws Mail messages are wrapped as follows, and I am trying to extract the Subject lines:
Subject: [SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular
 link in major cell growth pathway: Findings point to new potential
 therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is
 Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as
 a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway
 identified [Lysosomal amino acid transporter SLC38A9 signals arginine
 sufficiency to mTORC1]]
Message-ID: <20171019190902.18741771@VictoriasJourney.com>

Solution 1 gives:

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular link in major cell growth pathway: Findings point to new potential therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway identified [Lysosomal amino acid transporter SLC38A9 signals arginine sufficiency to mTORC1]]

Solution 2.

sed ':a;N;$!ba;s/\n/ /g' will replace newlines with a space. Chaining that with A2, we get:

sed ':a;N;$!ba;s/\n/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'

which gives

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular  link in major cell growth pathway: Findings point to new potential  therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is  Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as  a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway  identified [Lysosomal amino acid transporter SLC38A9 signals arginine  sufficiency to mTORC1]] 

This variant removes the double spaces:

sed ':a;N;$!ba;s/\n/ /g; s/\s\s*/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'
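
The `:a;N;$!ba` loop reads the whole file into sed's pattern space before joining lines. A sketch of a line-by-line alternative in awk that unfolds only the Subject header, assuming continuation lines begin with a space or tab as in the message above. The function name `unfold_subject` and the shortened sample message are hypothetical:

```shell
unfold_subject() {
  awk '
    # A continuation line: append it to the subject with a single space.
    in_subject && /^[ \t]/ { sub(/^[ \t]+/, ""); subject = subject " " $0; next }
    # Any other line ends the header being collected.
    in_subject             { print subject; in_subject = 0 }
    # Start collecting; "Subject: " is 9 characters long.
    /^Subject: /           { subject = substr($0, 10); in_subject = 1 }
    END                    { if (in_subject) print subject }
  ' "$@"
}

printf '%s\n' \
  'Subject: Key molecular' \
  ' link in major cell growth pathway' \
  'Message-ID: <id@example.com>' | unfold_subject
```

Called with a file name (for example `unfold_subject corpus/01`) it reads that file instead of standard input, and it does not depend on the name of the header that follows Subject.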

To understand the sed command, we have to build it step by step.

Here is your original text:
user@linux:~$ echo "Here is a String"
Here is a String
user@linux:~$ 

Let's try to remove the Here string with the s (substitution) command in sed:
user@linux:~$ echo "Here is a String" | sed 's/Here //'
is a String
user@linux:~$ 

At this point, I believe you would be able to remove String as well:

user@linux:~$ echo "Here is a String" | sed 's/String//'
Here is a
user@linux:~$ 

But this is not your desired output.

To combine two sed commands, use the -e option:
user@linux:~$ echo "Here is a String" | sed -e 's/Here //' -e 's/String//'
is a
user@linux:~$ 

Hope this helps.

You can use two s commands:
$ echo "Here is a String" | sed 's/.*Here//; s/String.*//'
 is a 

This also works:
$ echo "Here is a StringHere is a String" | sed 's/.*Here//; s/String.*//'
 is a

$ echo "Here is a StringHere is a StringHere is a StringHere is a String" | sed 's/.*Here//; s/String.*//'
 is a 
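
With several Here/String pairs on one line, the two commands above return the text of the last pair, because `.*` is greedy and sed has no lazy quantifier. A sketch for picking the first pair instead, assuming GNU sed (it relies on `\n` in the replacement) and an input line that contains no newline of its own: mark the first `Here ` with a newline, delete everything up to that marker, then cut from the first ` String` onward.

```shell
echo "Here is 1 String Here is 2 String" |
  sed 's/Here /\n/; s/.*\n//; s/ String.*//'
```

This prints `is 1`. Lines without a `Here ` marker are not filtered out by this sketch.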

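What if the input is Here is a Here String? Or I Hereby Dub Thee Sir Stringy?

FYI: your command means "print everything between the line that contains the word Here and the line that contains the word String", which is not what you want.

Another common sed FAQ is "how can I extract text between particular lines".

Thanks! What if I wanted to find everything between "one is" and "String" in "Here is a one is a String"? (sed -e 's/one is\(.*\)String/\1/'?

@user1190650 That would work if you want to see the "Here is a" as well. You can test it: echo "Here is a one is a String" | sed -e 's/one is\(.*\)String/\1/'. If you only want the part between "one is" and "String", you need to make the regex match the whole line: sed -e 's/.*one is\(.*\)String.*/\1/'. In sed, s/pattern/replacement/ substitutes "replacement" for "pattern" on each line. It only changes whatever matches "pattern", so if you want it to replace the whole line, you need to make "pattern" match the whole line.

This breaks when the input is Here is a String Here is a String.

It would be really great to see a solution for this case: "Here is a blah blah String Here is 1 a blah blah String Here is 2 a blah blah String". The output should pick up only the first substring between Here and String.

@JayD sed does not support non-greedy matching.

Why is this method so slow? When stripping a large HTML page using this method, it takes about 10 seconds.

@AdamJohns, which method? The PCRE one? PCRE is fairly complex to parse, but 10 seconds seems extreme. If you are concerned, I recommend asking a question that includes example code and seeing what the experts say.

I think it was slow for me because I was holding the source of a very large HTML file in a variable. When I wrote the contents to a file and then parsed the file, the speed increased dramatically.

Note that GNU grep's -P option does not exist in the grep included in *BSD. In FreeBSD, you can install the devel/pcre port, which includes pcregrep, which supports PCRE (and look-ahead/look-behind). Older versions of OS X used GNU grep, but in OS X Mavericks grep is derived from FreeBSD's version, which does not include the -P option.

Hi, how do I extract only distinct content?

This doesn't work: if your ending string "string" occurs multiple times, it will take the last occurrence, not the next one.

In the case of Here is a string a string, both " is a " and " is a string a " are valid answers (ignore the quotes) as per the question requirements. It depends on which of these you want, and the answer can differ accordingly. Anyway, for your requirement, this will work: echo "Here is string a string a string" | grep -o -P '(?

@BND, you need to enable it. echo $"Here is\na string" | grep -zoP '(?

Thanks! This is the only solution that worked in my case (a multi-line text file, rather than a single string with no line breaks). Obviously, to have it without line numbering, the -n option of cat has to be omitted.

...in which case cat can be omitted entirely.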
