How to use sed/grep to extract text between two words?
I am trying to output a string that contains everything between two words of a string:
Input:
"Here is a String"
Output:
"is a"
Using:
sed -n '/Here/,/String/p'
includes the endpoints, but I don't want to include them.
You can use \1 to refer to the captured text:
sed -e 's/Here\(.*\)String/\1/'
The contents inside the parentheses are stored as \1.
You can also strip the strings on their own:
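One way this stripping could look, assuming a POSIX-style shell such as bash and using parameter expansion instead of an external tool. This is only a sketch; the variable name foo is illustrative.

```bash
foo="Here is a String"
foo="${foo#*Here }"      # drop the shortest prefix that ends in "Here "
foo="${foo%% String*}"   # drop the longest suffix that starts with " String"
printf '%s\n' "$foo"     # prints: is a
```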
If you have a GNU grep that includes PCRE support, you can use zero-width assertions:
$ echo "Here is a String" | grep -Po '(?<=(Here )).*(?= String)'
is a
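A GNU sed approach might also work for you: it prints each stretch of text between the two markers (in this instance Here and String) on its own line and preserves any newlines within that text.
GNU grep can also support positive and negative lookahead and lookbehind, that is (?=pattern), (?!pattern), (?<=pattern) and (?<!pattern).
For your case, the command would be: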
echo "Here is a string" | grep -o -P '(?<=Here).*(?=string)'
If you have a long file with many multi-line occurrences, it is useful to print the line numbers first:
cat -n file | sed -n '/Here/,/String/p'
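If the two words sit on separate lines and only the lines strictly between them are wanted, the range address can be replaced by a flag. A minimal sketch with awk (swapped in for sed here); between_lines is a made-up helper name, and the sample input is invented for illustration.

```bash
# Print the lines strictly between a line matching "Here" and the next line
# matching "String"; the marker lines themselves are skipped.
between_lines() {
    awk '/String/ { inside = 0 }   # end marker: stop before printing this line
         inside                    # print the line while the flag is set
         /Here/   { inside = 1 }   # start marker: begin with the following line
        ' "$@"
}

printf '%s\n' 'intro' 'Here' 'line 1' 'line 2' 'String' 'outro' | between_lines
```

With the sample input this prints line 1 and line 2 only. Prefix the pipeline with cat -n, as above, if the line numbers are still wanted.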
Through GNU awk:
$ echo "Here is a string" | awk -v FS="(Here|string)" '{print $2}'
is a
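grep with the -P (perl-regexp) option supports \K, which discards the characters matched so far. In our case, the previously matched string was Here, so it is dropped from the final output.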
$ echo "Here is a string" | grep -oP 'Here\K.*(?=string)'
is a
$ echo "Here is a string" | grep -oP 'Here\K(?:(?!string).)*'
is a
If you want the output to be is a, without the surrounding spaces, you can try the following:
$ echo "Here is a string" | grep -oP 'Here\s*\K.*(?=\s+string)'
is a
$ echo "Here is a string" | grep -oP 'Here\s*\K(?:(?!\s+string).)*'
is a
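The two pattern shapes used above behave differently once the end word occurs more than once. A small sketch, assuming GNU grep with PCRE support; the sample sentence is invented for illustration, and the lazy .*? form is added for comparison.

```bash
s="Here is a string and another string"

# Greedy: .* runs on to the last "string" on the line.
printf '%s\n' "$s" | grep -oP 'Here\K.*(?=string)'
# Tempered: a character is consumed only if "string" does not start there.
printf '%s\n' "$s" | grep -oP 'Here\K(?:(?!string).)*'
# Lazy quantifier: .*? stops at the first position followed by "string".
printf '%s\n' "$s" | grep -oP 'Here\K.*?(?=string)'
```

The first command prints " is a string and another ", while the other two stop at the first occurrence and print " is a ".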
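All of the above solutions have a deficiency when the last search string is repeated elsewhere in the string. I found it best to write a bash function: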
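# Print the part of $1 between the first occurrence of $2 and the next occurrence of $3.
str_str() {
    local str
    str="${1#*"$2"}"      # drop everything up to and including the first start marker
    str="${str%%"$3"*}"   # drop the first end marker and everything after it
    printf '%s' "$str"
}
# test it ...
mystr="this is a string"
str_str "$mystr" "this " " string"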
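The function relies on the difference between the short and long forms of parameter expansion. The sketch below, with an invented sample string and illustrative variable names, shows how # versus ## and % versus %% choose between the first and last occurrence of each marker.

```bash
s="Here is a String, Here is another String"

first_start="${s#*Here }"     # shortest prefix removed: cut after the first "Here "
last_start="${s##*Here }"     # longest prefix removed: cut after the last "Here "

printf '%s\n' "${first_start%% String*}"   # cut at the first " String": is a
printf '%s\n' "${first_start% String*}"    # cut at the last " String": is a String, Here is another
printf '%s\n' "${last_start%% String*}"    # is another
```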
The accepted answer does not remove text that could be before Here or after String. This will:
sed -e 's/.*Here\(.*\)String.*/\1/'
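The main difference is the addition of .* immediately before Here and after String.
My stored Claws Mail messages are wrapped as follows, and I am trying to extract the Subject line: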
Subject: [SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular
link in major cell growth pathway: Findings point to new potential
therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is
Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as
a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway
identified [Lysosomal amino acid transporter SLC38A9 signals arginine
sufficiency to mTORC1]]
Message-ID: <20171019190902.18741771@VictoriasJourney.com>
Solution 1. The grep lookaround approach from another answer here, grep -o -P '(?<=Subject: ).*(?=Message-ID:)', cannot match on its own while the Subject is wrapped over several lines.
Solution 2. sed ':a;N;$!ba;s/\n/ /g' will replace the newlines with spaces.
Chaining solution 2 with solution 1, we get:
sed ':a;N;$!ba;s/\n/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'
which gives
[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular link in major cell growth pathway: Findings point to new potential therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway identified [Lysosomal amino acid transporter SLC38A9 signals arginine sufficiency to mTORC1]]
This variant also removes the double spaces:
sed ':a;N;$!ba;s/\n/ /g; s/\s\s*/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'
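The ':a;N;$!ba' loop and the grep -P lookarounds both depend on GNU tools. A more portable sketch of the same unfolding idea, swapping in tr and a plain sed substitution; unfold_subject is a made-up helper name, and the three-line header is invented sample data rather than the message above.

```bash
# Join all lines, squeeze runs of spaces, then keep what lies between
# "Subject: " and " Message-ID:".
unfold_subject() {
    tr '\n' ' ' |                  # newlines become spaces
        tr -s ' ' |                # collapse repeated spaces into one
        sed -n 's/.*Subject: \(.*\) Message-ID:.*/\1/p'
}

printf 'Subject: one\n two\n three\nMessage-ID: <x>\n' | unfold_subject
```

For the sample header this prints one two three. Because .* is greedy, the sketch assumes each marker occurs only once in the input.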
To understand the sed command, we have to build it step by step.
Here is your original text:
user@linux:~$ echo "Here is a String"
Here is a String
user@linux:~$
Let's try to remove the string Here with the s (substitution) command in sed:
user@linux:~$ echo "Here is a String" | sed 's/Here //'
is a String
user@linux:~$
At this point, I believe you would be able to remove String as well:
user@linux:~$ echo "Here is a String" | sed 's/String//'
Here is a
user@linux:~$
But this is not your desired output.
To combine the two sed commands, use the -e option:
user@linux:~$ echo "Here is a String" | sed -e 's/Here //' -e 's/String//'
is a
user@linux:~$
Hope this helps.
You can use two s commands:
$ echo "Here is a String" | sed 's/.*Here//; s/String.*//'
is a
It also works with:
$ echo "Here is a StringHere is a String" | sed 's/.*Here//; s/String.*//'
is a
$ echo "Here is a StringHere is a StringHere is a StringHere is a String" | sed 's/.*Here//; s/String.*//'
is a
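With repeated pairs, the two s commands above return the text of the last pair, because .*Here is greedy. Reversing their order is one way to get the first pair instead. A sketch with made-up helper names and an invented sample string, assuming the first String comes after a Here:

```bash
# Last pair: delete through the last "Here", then from the first remaining "String".
last_between()  { sed 's/.*Here//; s/String.*//'; }
# First pair: delete from the first "String" onwards, then through the last "Here" left.
first_between() { sed 's/String.*//; s/.*Here//'; }

s="Here is 1 String Here is 2 String"
printf '%s\n' "$s" | last_between    # prints " is 2 "
printf '%s\n' "$s" | first_between   # prints " is 1 "
```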
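What if the input is Here is a Here String? Or I Hereby dub thee Sir Stringy?
FYI, your command means: print everything between the line that contains the word Here and the line that contains the word String, which is not what you want.
The other common sed FAQ is "how can I extract text between particular lines?"
Thanks! What if I wanted to find everything between "one is" and "String" in "Here is a one is a String"? (sed -e 's/one is\(.*\)String/\1/'?)
@user1190650, that would work if you want to see the "Here is a " as well. You can test it: echo "Here is a one is a String" | sed -e 's/one is\(.*\)String/\1/'. If you only want the part between "one is" and "String", you need to make the regex match the whole line: sed -e 's/.*one is\(.*\)String.*/\1/'. In sed, s/pattern/replacement/ substitutes "replacement" for "pattern" on each line. It only changes whatever matches "pattern", so if you want it to replace the whole line, you need to make "pattern" match the whole line.
This breaks when the input is Here is a String Here is a String.
It would be great to see a solution for the case "Here is a blah blah String Here is 1 a blah blah String Here is 2 a blah blah String". The output should pick up only the first substring between Here and String.
@JayD sed does not support non-greedy matching, so another tool is the usual recommendation.
Why is this method so slow? When stripping a large html page with it, it takes about 10 seconds.
@AdamJohns, which method? The PCRE one? PCRE is fairly complex to parse, but 10 seconds seems extreme. If you are concerned, I suggest asking a question that includes example code and seeing what the experts say.
I think it was so slow for me because I was holding the source of a very large html file in a variable. When I wrote the contents to a file and then parsed the file, the speed increased dramatically.
Note that GNU grep's -P option does not exist in the grep included in *BSD. In FreeBSD, you can install the devel/pcre port, which includes pcregrep, which supports PCRE (and look-ahead/look-behind). Older versions of OSX used GNU grep, but in OSX Mavericks grep is derived from FreeBSD's version, which does not include that option.
Hi, how do I extract only distinct content?
This doesn't work: if your ending string "string" occurs multiple times, it takes the last occurrence, not the next occurrence.
In the case of Here is a string a string, both " is a " and " is a string a " are valid answers (ignore the quotes), per the question's requirements. It depends which of these you want, and the answer differs accordingly. Anyway, for your requirement, this will work: echo "Here is a string a string" | grep -o -P '(?…
@BND, you need to enable that explicitly: echo $'Here is\na string' | grep -zoP '(?…
Thank you! This was the only solution that worked in my case (a multi-line text file rather than a single string without line breaks). Obviously, to get it without line numbers, the -n option of cat has to be omitted.
…in which case cat can be omitted entirely; sed can read the file itself.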