wget downloads nofollow links

web-scraping web-crawler

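I want to crawl/scrape a WordPress site with wget.
Problem: wget downloads documents/links even though they have the rel=nofollow attribute. And yes, I have robots.txt handling enabled.

Example: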

wget --mirror --page-requisites --adjust-extension --convert-links --restrict-file-names=windows --no-parent --span-hosts --domains=randomascii.wordpress.com,wp.com https://randomascii.wordpress.com/about/

Now open the about folder. After a few seconds you will see dozens of HTML files that come from nofollow links: index.html@share=reddit.html, index.html@share=twitter.html, index.html@replytocom=74214.html
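A possible workaround, sketched under the assumption that the unwanted links can all be recognized by their share= and replytocom= query parameters (as the file names above suggest): filter them by URL with wget's --reject-regex option. This does not make wget honor rel=nofollow; it only skips URLs matching a pattern. The names REJECT_REGEX, would_reject and mirror_site are made up for this illustration, and grep -E is used only as a local stand-in to try the pattern without downloading anything.

```shell
#!/bin/sh
# Assumption: unwanted links carry a share=... or replytocom=... query parameter.
# The leading/trailing .* keep the pattern valid whether or not it is anchored.
REJECT_REGEX='.*[?&](share|replytocom)=.*'

# Local check: print "reject" if the URL matches the pattern, otherwise "keep".
# grep -E approximates wget's default POSIX regex type.
would_reject() {
    if printf '%s\n' "$1" | grep -Eq "$REJECT_REGEX"; then
        echo reject
    else
        echo keep
    fi
}

# The mirror command from the question with --reject-regex added.
# Wrapped in a function so nothing is downloaded until it is called.
mirror_site() {
    wget --mirror --page-requisites --adjust-extension --convert-links \
         --restrict-file-names=windows --no-parent --span-hosts \
         --domains=randomascii.wordpress.com,wp.com \
         --reject-regex "$REJECT_REGEX" \
         https://randomascii.wordpress.com/about/
}

# Try the pattern against URLs like the ones behind the unwanted files.
would_reject 'https://randomascii.wordpress.com/about/?share=reddit'
would_reject 'https://randomascii.wordpress.com/about/?replytocom=74214'
would_reject 'https://randomascii.wordpress.com/about/'
```

If the pattern classifies the sample URLs as intended, calling mirror_site runs the actual download. Other query parameters that the site attaches to nofollow links would need to be added to the pattern by hand.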

GNU Wget 1.20.3 built on msys.

-cares +digest +gpgme +https +ipv6 +iri +large-file +metalink +nls
+ntlm +opie +psl +ssl/openssl

Wgetrc:
    /etc/wgetrc (system)
Locale:
    /usr/share/locale
Compile:
    gcc -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/etc/wgetrc"
    -DLOCALEDIR="/usr/share/locale" -I. -I../lib -I../lib -DHAVE_LIBSSL
    -DNDEBUG -march=x86-64 -mtune=generic -O2 -pipe
Link:
    gcc -DHAVE_LIBSSL -DNDEBUG -march=x86-64 -mtune=generic -O2 -pipe
    -pipe -lmetalink -lexpat -lpcre2-8 -luuid -lssl -lcrypto -lz -lz
    -lpsl -lidn2 -liconv -lunistring -lgpgme -lassuan -lgpg-error
    ftp-opie.o openssl.o http-ntlm.o ../lib/libgnu.a -liconv -lintl
    /usr/lib/libunistring.dll.a

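This should probably be asked on Super User rather than here.
I did consider that before posting, and there is also a topic about this (), but overall wget questions seem to be asked about four times as often here as there (800:3400), so I asked here. Maybe this is more of a bug report... At least someone could confirm that it does not work and that the problem is not on my end. It also happens on Debian, so it is not my wget build.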