Removing URLs from a file with sed


I am trying to write a sed expression that can remove URLs from a file.

Example:

http://samgovephotography.blogspot.com/ updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:)   

Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N https://hollywoodmomblog.com/?p=2442 Thx to HMB Contributor @kdpartak :)   
I tried the following, but I can't get it to work:

sed 's/[\w \W \s]*http[s]*:\/\/\([\w \W]\)\+[\w \W \s]*/ /g' posFile  
Fixed!

This handles almost all cases, even malformed URLs:

sed 's/[\w \W \s]*http[s]*[a-zA-Z0-9 : \. \/ ; % " \W]*/ /g' positiveTweets | grep "http" | more

The following removes http:// or https:// and everything up to the next whitespace character:

sed -e 's!http\(s\)\{0,1\}://[^[:space:]]*!!g' posFile  
 updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:)   

Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N  Thx to HMB Contributor @kdpartak :)

Edit:

I should have used:

sed -e 's!http[s]\?://\S*!!g' posFile
"[s]\?" is a far more readable way of writing "an optional s" than "\(s\)\{0,1\}".

"\S*" is a more readable version of "any non-whitespace characters" than "[^[:space:]]*".

I must have been using the sed that shipped with my Mac when I wrote this answer (brew install gnu-sed FTW).
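For readers who cannot install GNU sed, the same substitution can be sketched with POSIX character classes only. This is a minimal sketch, not part of the original answer: the function name strip_http_urls is made up for illustration, and it assumes a sed that accepts -E (GNU sed and BSD sed both do).

```shell
# Hypothetical helper: strip http:// and https:// URLs from stdin.
# -E enables extended regular expressions, so "s?" means "an optional s".
# [^[:space:]]* stands in for the GNU-only \S* used above.
strip_http_urls() {
    sed -E 's!https?://[^[:space:]]*!!g'
}

# Usage: the URL disappears, the surrounding spaces stay.
printf '%s\n' 'see https://example.com/a?b=c for details' | strip_http_urls
```

As with the original command, the whitespace on either side of a removed URL is left in place.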



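There are better URL regexes (ones that account for schemes other than HTTP, for example), but given the examples you provided, this will work for you. Why complicate things?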
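If other schemes do matter, one possible generalisation is to match any scheme name followed by "://". The sketch below is an assumption, not something from the answer: the scheme pattern (a letter followed by letters, digits, "+", "." or "-") follows the generic URI syntax, the function name is hypothetical, and bare hosts such as www.example.com are deliberately left alone.

```shell
# Hypothetical helper: strip scheme://... URLs of any scheme from stdin.
# [[:alpha:]][[:alnum:]+.-]* matches a scheme name such as http, https or ftp;
# the "-" is placed last in the bracket expression so it is taken literally.
strip_scheme_urls() {
    sed -E 's![[:alpha:]][[:alnum:]+.-]*://[^[:space:]]*!!g'
}

printf '%s\n' 'get ftp://ftp.ncbi.nlm.nih.gov/ and https://a.b/c now' | strip_scheme_urls
```

Because the scheme pattern is greedy to the left as well, a word glued directly onto a URL (for example "seehttp://...") would be removed along with it.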

The accepted answer provided the approach I used to remove URLs etc. from my files. However, it left "blank" lines behind. Here is a solution:

sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' input_file

perl -i -pe 's/^'`echo "\012"`'${2,}//g' input_file
The GNU sed flags and expressions used are:

-i    Edit in-place
-e    [-e script] --expression=script : basically, add the commands in script
      (expression) to the set of commands to be run while processing the input
 ^    Match start of line
 $    Match end of line
\?    Match zero or one of the preceding regular expression
{2,}  Match 2 or more of preceding regular expression
\S*   Zero or more non-whitespace characters; alternative to: [^[:space:]]*
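However,

sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g'
leaves non-printing characters behind, namely the \n (newline) that ends each emptied line. Standard sed-based approaches for removing "blank" lines, tabs and spaces, e.g.

sed -i 's/^[ \t]*//; s/[ \t]*$//'
do not work here: sed reads its input one line at a time, so a substitution cannot replace the newlines unless you use "branch labels" to process them.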

The solution is to use the following perl expression:

perl -i -pe 's/^'`echo "\012"`'${2,}//g'
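It uses a shell command substitution, `echo "\012"`, to supply the octal value \012 (i.e. newline, \n); 2 or more occurrences of it, {2,} (otherwise we would join all the lines together), are replaced with something else, here // (i.e. nothing).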

[The second reference below provides a wonderful table of these values!]

The perl flags used are:

-p  Places a printing loop around your command,
    so that it acts on each line of standard input

-i  Edit in-place

-e  Allows you to provide the program as an argument,
    rather than in a file
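As an alternative to the perl step, the emptied lines can probably be dropped by sed itself: the d command deletes a whole line, so no newline ever has to be substituted. The following is a sketch under stated assumptions rather than the answer's own method. The function name is hypothetical, the three substitutions are rewritten with POSIX character classes in place of \S, and -E is assumed to be available.

```shell
# Hypothetical helper: remove URLs, then delete any line that is now
# empty or contains only whitespace.
remove_urls_and_blank_lines() {
    sed -E \
        -e 's!https?://[^[:space:]]*!!g' \
        -e 's!www\.[^[:space:]]*!!g' \
        -e 's!ftp:[^[:space:]]*!!g' \
        -e '/^[[:space:]]*$/d'
}

# Usage with a shortened version of the test input:
printf '%s\n' 'Some text ...' 'https://stackoverflow.com/questions/4283344/' \
    'www.google.ca/?q=halifax' 'ftp://ftp.ncbi.nlm.nih.gov/' 'Some more text.' |
    remove_urls_and_blank_lines
```

Note that this also deletes lines that were blank before any URL was removed, so it is only suitable when those lines do not need to be preserved.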
References:

  • perl flags:
  • ASCII control codes:
  • Removing URLs:
  • Branch labels:
  • GNU sed manual:
  • Quick regex guide:

Example:

$ cat url_test_input.txt

Some text ...
https://stackoverflow.com/questions/4283344/sed-to-remove-urls-from-a-file
https://www.google.ca/search?dcr=0&ei=QCsyWtbYF43YjwPpzKyQAQ&q=python+remove++citations&oq=python+remove++citations&gs_l=psy-ab.3...1806.1806.0.2004.1.1.0.0.0.0.61.61.1.1.0....0...1c.1.64.psy-ab..0.0.0....0.-cxpNc6youY
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html
http://datasynce.org/2017/05/sentiment-analysis-on-python-through-textblob/
https://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
http://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
ftp://ftp.ncbi.nlm.nih.gov/
ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/alignment_indices/20100804.alignment.index
Some more text.

$ sed -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' url_test_input.txt > a

$ cat a

Some text ...










Some more text.

$ perl -i -pe 's/^'`echo "\012"`'${2,}//g' a

Some text ...
Some more text.

$ 

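When dealing with URLs, file paths and the like, I prefer to use "|" as the sed delimiter so that I don't have to escape /. Example: sed 's|/path/to/some/file/|/newpath/to/new/file/|g'

@JP19, I like it, I'll give that approach a try.

@Johnsyweb, could you explain your sed expression? Especially the {0,1} notation.

Thanks for the comment about the Mac. I spent 10 minutes testing perfectly valid regexes on my Mac, then I read your answer and tried it on a CentOS machine, where it worked the first time.

For anyone wondering about the !!g' bit in the edited answer: the ! is simply an alternative delimiter for the s command, so the slashes in the pattern do not need escaping. Based on my testing, sed -e 's!http[s]\?://\S*!!g' appears to be equivalent to sed -e 's/http[s]\?:\/\/\S*//g'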
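The delimiter point from the comments can be checked with a small sketch. The sample line and variable names are made up for illustration; the pattern uses [^[:space:]]* so that it does not depend on GNU extensions.

```shell
# The same substitution written with three different delimiters.
line='visit http://example.com/page today'

# "/" as delimiter: every slash in the pattern has to be escaped.
with_slash=$(printf '%s\n' "$line" | sed -e 's/http:\/\/[^[:space:]]*//g')
# "!" as delimiter: no escaping needed.
with_bang=$(printf '%s\n' "$line" | sed -e 's!http://[^[:space:]]*!!g')
# "|" as delimiter: no escaping needed either.
with_pipe=$(printf '%s\n' "$line" | sed -e 's|http://[^[:space:]]*||g')

printf '%s\n' "$with_slash" "$with_bang" "$with_pipe"
```

All three commands print the same line, with the URL removed and both surrounding spaces kept.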