wget `--reject-regex` not working?


Why is the following command able to download index.html from www.example.com?

$ wget --reject-regex .* http://www.example.com/
--2018-03-05 11:21:26--  http://.keystone_install_lock/
Resolving .keystone_install_lock... failed: nodename nor servname provided, or not known.
wget: unable to resolve host address ‘.keystone_install_lock’
--2018-03-05 11:21:26--  http://www.example.com/
Resolving www.example.com... 93.184.216.34
Connecting to www.example.com|93.184.216.34|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1270 (1.2K) [text/html]
Saving to: ‘index.html’

index.html                                                    100%[=================================================================================================================================================>]   1.24K  --.-KB/s    in 0s

2018-03-05 11:21:27 (4.49 MB/s) - ‘index.html’ saved [1270/1270]

FINISHED --2018-03-05 11:21:27--
Total wall clock time: 0.4s
Downloaded: 1 files, 1.2K in 0s (4.49 MB/s)

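The wget man page says:

--accept-regex urlregex
--reject-regex urlregex
    Specify a regular expression to accept or reject the complete URL.

The regular expression `.*` matches everything. (This can be verified with any regex tester.)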
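One quick way to check that claim locally is with grep. This is only a sketch: `grep -E` is used here as a stand-in for a POSIX extended regex matcher and is not wget's own matching code, and the variable names are illustrative.

```shell
#!/bin/sh
# The URL from the question
url='http://www.example.com/'

# grep -Eq exits 0 when the extended regex matches somewhere in the line
if printf '%s\n' "$url" | grep -Eq '.*'; then
    result="match"
else
    result="no match"
fi

echo "$result"
```

Note that the pattern is single-quoted so the shell passes it to grep unchanged.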

I would expect every wget download to be rejected, because the `--reject-regex .*` option matches www.example.com, doesn't it?

Why doesn't wget ignore everything from www.example.com?

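`--reject-regex` only rejects URL links, not the markup text inside index.html. For example, if a website contains a URL pointing to the CSS file main.css, this command downloads the site recursively but excludes main.css: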

wget -r --reject-regex 'main.css' www.somewebsite.com
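To see which links a pattern like this would filter out, the idea can be sketched offline with grep. The URL list below is hypothetical, and `grep -Ev` is only an approximation of how wget applies the pattern to each complete URL. The dot is escaped here so that it matches a literal "." rather than any character.

```shell
#!/bin/sh
# Hypothetical links that a recursive crawl might discover
urls='http://www.somewebsite.com/index.html
http://www.somewebsite.com/css/main.css
http://www.somewebsite.com/js/app.js'

# -v keeps only the lines that do NOT match, i.e. the URLs that survive the reject filter
kept=$(printf '%s\n' "$urls" | grep -Ev 'main\.css')

echo "$kept"
```

With the unescaped pattern 'main.css', a URL containing something like "mainXcss" would be rejected as well, which is usually harmless but worth knowing.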

To ignore certain text in the page, use sed. A few examples:

# Ignores the word 'Sans'
wget -qO- example.com | sed "s/Sans//g" > index.html

# Ignores everything
wget -qO- example.com | sed "s/.*//g" > index.html

Use the -np option to reject the index file.

`--reject-regex` only applies to recursively retrieved files (any links from the index file).
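If, as this answer says, the filter is applied to links followed during recursion and not to the URL given on the command line, one possible workaround is to test the starting URL against the pattern before calling wget at all. A minimal sketch under that assumption: `should_fetch` is a hypothetical helper, and `grep -E` stands in for wget's matcher.

```shell
#!/bin/sh
# should_fetch PATTERN URL
# Hypothetical helper: succeeds (exit 0) only when URL does NOT match PATTERN.
should_fetch() {
    pattern=$1
    url=$2
    # grep -Eq exits 0 on a match, so a match means "do not fetch".
    # "--" keeps a pattern that starts with "-" from being read as an option.
    if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
        return 1
    fi
    return 0
}

start_url='http://www.example.com/'

if should_fetch '.*' "$start_url"; then
    decision="fetch"   # here one would run: wget --reject-regex '.*' "$start_url"
else
    decision="skip"
fi

echo "$decision $start_url"
```

The sketch only prints its decision and never touches the network, so it can be tried safely before wiring in the real wget call.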

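Part of the reason is that the `.*` in your command is most likely being expanded by the shell into a list of matching file names in the current working directory, because it is not enclosed in proper quotes. The .keystone_install_lock in your output is most likely the name of a file in your current working directory; wget reports it even before it tries to connect to www.example.com. Try quoting the pattern as '.*' (the full command is given at the end of this page).

You can also use " instead of ', depending on your shell.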
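The expansion described above can be observed without running wget. The sketch below creates a scratch directory containing one dotfile (the name is taken from the question's output) and prints the arguments a command would actually receive; `show_args` is a hypothetical helper, not part of wget.

```shell
#!/bin/sh
# Scratch directory with a single dotfile, so the glob has something to match
workdir=$(mktemp -d)
touch "$workdir/.keystone_install_lock"

# Print each received argument on its own line, wrapped in brackets
show_args() { printf '[%s]\n' "$@"; }

# Unquoted: the shell replaces .* with matching file names before the command runs
unquoted=$(cd "$workdir" && show_args --reject-regex .* http://www.example.com/)

# Quoted: the pattern reaches the command as the literal two characters .*
quoted=$(cd "$workdir" && show_args --reject-regex '.*' http://www.example.com/)

echo "unquoted:"; echo "$unquoted"
echo "quoted:";   echo "$quoted"

rm -rf "$workdir"
```

Depending on the shell and its settings, the unquoted form may also include `.` and `..`. In either case the first expanded name is consumed as the argument to --reject-regex and the remaining names are left over as extra arguments, which would explain why wget tried to resolve .keystone_install_lock as a host.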

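With that command I can still retrieve index.html, so my answer is not complete.

Using -np as suggested by Quantum7, I still get index.html, so that does not complete the answer either.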

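So why isn't www.example.com rejected by `--reject-regex .*`? www.example.com is a URL link, isn't it?

Because `--reject-regex .*` rejects all URLs found in www.example.com. It does not reject all of the text in www.example.com. In other words, `--reject-regex` only rejects URLs within the given website, not the site's actual text.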

The -np entry from the wget man page:

   -np
   --no-parent
       Do not ever ascend to the parent directory when retrieving recursively.
       This is a useful option, since it guarantees that only the
       files below a certain hierarchy will be downloaded.

The command with the pattern quoted:

wget --reject-regex '.*' http://www.example.com/