Shell: Format wget spider output to include only successful URLs

shell awk sed scripting

I am running wget from a bash shell script as follows:

input=$1

#iterate input text file line by line and run following on each line:

wget -a links.log -nv --spider line_n_url


The problem is that the output contains many 404 errors, and even the URLs that do exist are formatted like this:

2017-10-10 11:35:46 URL: http://someurl.com/somefile.ext 200 OK

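Is there a way to format the output written by wget, or to sort through it easily?

Another problem is that there are three possible types for .ext, which makes matching more difficult.

What I want is each existing URL on its own line, with no timestamp, "URL:" or "200 OK":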

http://someurl.com/somefile.ext
http://someurl.com/somefile2.ex2
http://someurl.com/somefile3.exp

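Thanks.

As I understand it, you are trying to keep only the "200 OK" messages. You should look at awk here, so that you can do something like this in your bash script: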

$ wget -nv --spider line_n_url 2>&1 | awk '/200 OK/{print $4}'
http://someurl.com/somefile.ext
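To see what the awk filter does without touching the network, here is a minimal sketch that runs it on a few made-up log lines. The "200 OK" lines follow the format quoted in the question; the 404 line and the variable names sample_log and ok_urls are assumptions for illustration only.

```shell
# Made-up log lines: two successes in the format from the question and
# one assumed error line that should be filtered out.
sample_log='2017-10-10 11:35:46 URL: http://someurl.com/somefile.ext 200 OK
2017-10-10 11:35:47 ERROR 404: Not Found.
2017-10-10 11:35:48 URL: http://someurl.com/somefile2.ex2 200 OK'

# Keep only lines containing "200 OK" and print the fourth
# whitespace-separated field, which is the URL.
ok_urls=$(printf '%s\n' "$sample_log" | awk '/200 OK/{print $4}')
printf '%s\n' "$ok_urls"
```

Because the pattern matches on the status text rather than on the file name, the three possible extensions should not need any special handling.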

If you need unique URLs, you can do this:

awk '/200 OK/{print $4}' | sort | uniq

Or:

awk '/200 OK/{print $4}' | sort -u

Important: you have to redirect stderr to stdout, like this:

$ wget -nv --spider http://google.com 2>&1 | awk '/200 OK/{print $4}'
http://www.google.nl/?gfe_rd=cr&dcr=0&ei=qgHdWa2MEqTVXsONudgM
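Putting the pieces together, the per-line loop from the question could look roughly like the sketch below. The function name list_ok_urls and the one-URL-per-line input file are assumptions, not part of the original script. Note that wget's -a and -o options write messages to a log file instead of stderr, so they are left out here; otherwise nothing would reach the pipe.

```shell
# list_ok_urls FILE: print each URL from FILE that wget reports as "200 OK".
# Hypothetical helper; FILE is assumed to hold one URL per line.
list_ok_urls() {
    while IFS= read -r url || [ -n "$url" ]; do
        # Skip blank lines in the input file.
        [ -n "$url" ] || continue
        # wget writes its messages to stderr, so merge stderr into stdout.
        # No -a/-o here: those would divert the messages to a log file.
        wget -nv --spider "$url" 2>&1
    done < "$1" | awk '/200 OK/{print $4}' | sort -u
}
```

It would be called as list_ok_urls "$1" from the script. If a full log is still wanted, one option is to insert tee -a links.log between the loop and awk. As the google.com example above suggests, the URL that wget prints may be the final one after redirects rather than the one in the input file.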

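You might need to sort them.

Yes, you might. Let's be gentle. I'll add it to the answer.

I think uniq may need them sorted, depending on the input. I didn't mean to be blunt. :)

No worries. My mistake.

I have got it working with

cat output.log | awk '/200 OK/{print $4}'

but I can't get it to work with wget in the script. output.log and the output to the shell still include all the other lines and unwanted characters, even though I used the pipe as you mentioned. Am I doing something wrong, or does wget not like being piped?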