Bash: Using sed to remove a period from the end of a string (ZIP code)

bash sed

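I have a file of addresses that I am trying to clean up, and I am using sed to strip out unwanted characters and formatting. In this case, I have a ZIP code followed by a period: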

Mr. John Doe
Exclusively Stuff, 186 
Caravelle Drive, Ponte Vedra FL
33487.

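(Ignore the newlines for now; I am only concerned with the ZIP code and the period.)

I want to remove the period (.) from the ZIP code as the first step in cleaning this up. I tried using a substring in sed as follows (using "|" as the delimiter, which I find easier to read):

Unfortunately, it does not remove the period. It just prints it as part of the substring, as per this post:

A nudge in the right direction would be greatly appreciated.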

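You specified 4 digits with {4}, but there are 5, and in a basic regular expression you must escape { and }, for example:

sed 's|\(^[0-9]\{5\}\).*|\1|g' test.txt

Note that there is also a space after the period, so you may want to trim everything after the five digits. To be safe, you may also want to require that the digits sit at the start of the line, using ^.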
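
The escaping point can be checked directly. The following is a minimal sketch, assuming a POSIX-style sed and a sample line modelled on the question's input (five digits, a period, and a trailing space). The variable names are illustrative only, and the anchor is written before the group rather than inside it.

```shell
#!/bin/sh
# Hypothetical sample line modelled on the question's input.
line='33487. '

# Escaped braces: \{5\} is an interval ("exactly five"), \(...\) captures.
escaped=$(printf '%s\n' "$line" | sed 's|^\([0-9]\{5\}\).*|\1|')

# Unescaped braces: in a basic regular expression {5} is the literal text
# "{5}", so the pattern does not match and the line passes through unchanged.
unescaped=$(printf '%s\n' "$line" | sed 's|^\([0-9]{5}\).*|\1|')

printf '[%s]\n' "$escaped"     # [33487]
printf '[%s]\n' "$unescaped"   # [33487. ]
```

The second command fails silently: sed reports no error, it simply finds nothing to replace, which is consistent with the symptom described in the question.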

In my case, typing info sed, which is more complete than man sed, turns up the following:
'-r'
'--regexp-extended'
     Use extended regular expressions rather than basic regular
     expressions.  Extended regexps are those that 'egrep' accepts; they
     can be clearer because they usually have less backslashes, but are
     a GNU extension and hence scripts that use them are not portable.
     *Note Extended regular expressions: Extended regexps.

Under Appendix A, Extended regular expressions, you can read:
The only difference between basic and extended regular expressions is in
the behavior of a few characters: '?', '+', parentheses, braces ('{}'),
and '|'.  While basic regular expressions require these to be escaped if
you want them to behave as special characters, when using extended
regular expressions you must escape them if you want them _to match a
literal character_.  '|' is special here because '\|' is a GNU extension
- standard basic regular expressions do not provide its functionality.

Examples:
'abc?'
     becomes 'abc\?' when using extended regular expressions.  It
     matches the literal string 'abc?'.

'c\+'
     becomes 'c+' when using extended regular expressions.  It matches
     one or more 'c's.

'a\{3,\}'
     becomes 'a{3,}' when using extended regular expressions.  It
     matches three or more 'a's.

 '\(abc\)\{2,3\}'
     becomes '(abc){2,3}' when using extended regular expressions.  It
     matches either 'abcabc' or 'abcabcabc'.

 '\(abc*\)\1'
     becomes '(abc*)\1' when using extended regular expressions.
     Backreferences must still be escaped when using extended regular
     expressions.
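
To make the quoted rules concrete, here is a small sketch that writes the same substitution in both syntaxes. It assumes a sed that accepts -E for extended regular expressions (current GNU and BSD versions do; -r is the GNU spelling quoted above). The sample strings and variable names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample line: five digits, a period, and a trailing space.
line='33487. '

# Basic syntax: groups and intervals need backslashes to be special.
bre=$(printf '%s\n' "$line" | sed 's/^\([0-9]\{5\}\)\./\1/')

# Extended syntax: the same pattern without those backslashes.
ere=$(printf '%s\n' "$line" | sed -E 's/^([0-9]{5})\./\1/')

# Extended syntax: a backslash now makes '?' a literal character.
literal=$(printf '%s\n' 'abc?' | sed -E 's/abc\?/matched/')

printf '[%s] [%s] [%s]\n' "$bre" "$ere" "$literal"   # [33487 ] [33487 ] [matched]
```

Both substitutions remove only the period, so the trailing space from the input is still present in the first two results.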

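Basic solution: use a range atom to handle your posted input
A simple (but somewhat naive) approach is to look for the following:
start of line
followed by 5 digits (a standard US ZIP code)
followed by zero or more characters (e.g. ZIP+4)
followed by zero or more non-period characters (so as not to match street addresses)
followed by a literal period
and then replace the entire match with the captured part. For example:

With BSD sed, or without extended expressions:
sed 's/^\([[:digit:]]\{5\}[^.]*\)\./\1/'

With GNU sed and extended regular expressions:
sed -r 's/^([[:digit:]]{5}[^.]*)\./\1/'

Either way, given your posted input, you end up with: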
Mr. John Doe
Exclusively Stuff, 186 
Caravelle Drive, Ponte Vedra FL
33487 
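
As a rough end-to-end check, the basic expression can be run over the posted input. This is only a sketch: strip_leading_zip_period is a hypothetical helper name, and the input lines are retyped from the question with the trailing spaces it mentions.

```shell
#!/bin/sh
# Hypothetical helper: removes the first period from any line that starts
# with five digits; every other line passes through unchanged.
strip_leading_zip_period() {
    sed 's/^\([[:digit:]]\{5\}[^.]*\)\./\1/'
}

# The posted input, one argument per line; quoting keeps the trailing spaces.
result=$(printf '%s\n' \
    'Mr. John Doe' \
    'Exclusively Stuff, 186 ' \
    'Caravelle Drive, Ponte Vedra FL' \
    '33487. ' | strip_leading_zip_period)

printf '%s\n' "$result"
```

The period in "Mr." survives because that line does not begin with five digits, so the anchored pattern never matches it.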

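Advanced solution: handle ZIP codes properly
The main caveat is that the solutions above work for your posted example, but they will not match if the ZIP code sits at the end of the last line of the address, where it properly belongs on a piece of mail. That is fine if you have a custom format, but it may cause problems with standardized or corrected addresses such as: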
Mr. John Doe
12345 Exclusively Stuff, 186 
Caravelle Drive, Ponte Vedra FL 33487.

The following works with both your posted input and more typical USPS addresses, but your mileage may vary with other non-standard input:
# More reliable, but much harder to read.
sed -r 's/([[:digit:]]{5}(-[[:digit:]]{4})?[[:space:]]*)\.[[:space:]]*$/\1/'
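
The sketch below exercises that expression against both layouts. It is illustrative only: strip_zip_period is a hypothetical helper name, -E is used in place of -r on the assumption that the local sed accepts it, and the ZIP+4 suffix is invented for the example.

```shell
#!/bin/sh
# Hypothetical helper wrapping the expression above. It matches five digits,
# an optional -NNNN suffix, optional whitespace, then a period at end of line.
strip_zip_period() {
    sed -E 's/([[:digit:]]{5}(-[[:digit:]]{4})?[[:space:]]*)\.[[:space:]]*$/\1/'
}

zip_own_line=$(printf '%s\n' '33487. ' | strip_zip_period)
zip_end_of_line=$(printf '%s\n' 'Caravelle Drive, Ponte Vedra FL 33487.' | strip_zip_period)
zip_plus4=$(printf '%s\n' 'Ponte Vedra FL 33487-1234.' | strip_zip_period)
name_line=$(printf '%s\n' 'Mr. John Doe' | strip_zip_period)

printf '[%s]\n' "$zip_own_line" "$zip_end_of_line" "$zip_plus4" "$name_line"
```

Because the pattern is anchored to the end of the line rather than the start, it leaves "Mr." alone while still handling a ZIP code that follows the state. Unlike the basic solution, it also drops whitespace that trails the period.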

This works - can you explain why I need to escape the brackets? The sed tutorial I was using did not mention it.

@Allan: If you use GNU sed and switch to extended regexps with the -E option, you can use this: sed -E 's|([0-9]{5})\.|\1|' test.txt

Glad it works. I updated my answer to include more information. Hope it helps.

@archimiro - If only I could upvote more than once. Thank you.

@CodeGnome Thanks for the feedback! I did not know the address structure; I live in Colombia, where it is very different, so I answered based only on the OP's question rather than on the expected structure of addresses or ZIP codes. Thanks for your valuable input!