Mirroring a website with Bash to download specific file types

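I'm trying to archive a collection of several websites. I'd like to keep them organized, so ideally they would be stored in a mirrored directory structure. Here is my attempt: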

wget -m -x -e robots=off --no-parent --accept "*.ext" http://example.com

When using the "-m" option, is there any limit on how far it will go? (Will it wander off the site? Will it run forever?)

wget -r -x -e robots=off --no-parent --accept "*.ext" --level 2 http://example.com

Is this the most reasonable approach? I know wget has a spider option; is it stable?

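Edit

This is the solution I found.

The files I'm looking for are tagged and stored in a single dir on the server side. When trying variations of wget, I was able to get the links and the structure of the various files, but I kept running into trouble when looping over the links. So I came up with this workaround. It works, but it is slow. Any suggestions on how to make it more efficient?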

Structure of the site and files I'm trying to get:

home
   ├──Foo
   │  ├──paul.mp3
   │  ├──saul.mp3
   │  ├──micheal.mp3
   │  ├──ring.mp3
   ├──Bar
      ├──nancy.mp3
      ├──jan.mp3
      ├──mary.mp3

So first I created taglist.txt, a file with the tags of the files I want:

taglist.txt
foo
bar

Script:

#!/bin/bash

# this script seems to work until the download part

URL="http://www.example.com"
LINK_FILE=taglist.txt

while read -r TAG; do
    mkdir -p "$TAG"
    cd "$TAG" || exit 1

    # Get the URLs from the tag page
    wget -q "$URL/$TAG" -O - | tr "\t\r\n'" '   "' | grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' | sed -e 's/^.*"\([^"]\+\)".*$/\1/g' > tmp.urls.txt
    # Keep only the storage_dir URLs, sorted and deduplicated
    grep -i 'http://www\.example\.com/storage_dir/' tmp.urls.txt | sort -u > tmp.curls.txt
    # Download each linked page
    while read -r TAPE_URL; do
        #wget -r -A.mp3 "$TAPE_URL"
        wget -O "tmp.$RANDOM" "$TAPE_URL"
    done < tmp.curls.txt
    # Find all the .mp3 links in the downloaded pages
    grep -h -o -E 'href="[^"#]+\.mp3"' tmp.* | cut -d'"' -f2 | sort -u > "$TAG.mp3.list"
    # Clean up
    rm tmp.*
    # Download the collected URLs
    wget -i "$TAG.mp3.list"
    cd ..
done < "$LINK_FILE"
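When pulling .mp3 links out of HTML with grep -E, keep in mind that a bracket expression such as [.mp3] matches a single character, so it would also accept a link ending in "m"; the extension is safer written as \.mp3. The following is a minimal offline sketch of that extraction step. The sample HTML and the helper name extract_mp3_links are made up for illustration, since the real pages are not shown in the post.

```shell
#!/bin/bash
# Hypothetical sample page; the markup of the real site is an assumption.
sample_html='<a href="http://www.example.com/storage_dir/paul.mp3">paul</a>
<a href="http://www.example.com/storage_dir/notes.mp3x">not audio</a>
<a href="http://www.example.com/page.m">also not audio</a>
<a href="http://www.example.com/storage_dir/saul.mp3">saul</a>
<a href="http://www.example.com/storage_dir/paul.mp3">paul again</a>'

# extract_mp3_links (hypothetical helper): read HTML on stdin and print
# the unique href values that end in a literal ".mp3".
extract_mp3_links() {
    grep -o -E 'href="[^"#]+\.mp3"' | cut -d'"' -f2 | sort -u
}

printf '%s\n' "$sample_html" | extract_mp3_links
```

Only the paul.mp3 and saul.mp3 URLs are printed, each once; the resulting list could then be handed to wget -i as in the script.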

Reading the man page for wget answers these questions:

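-m is equivalent to -r -N -l inf --no-remove-listing, which means it will (A) recurse, (B) only download files from the server that are newer than the copy you already have, (C) not limit itself to any recursion depth, and (D) keep the temporary .listing files generated by FTP retrievals.

Yes, recursion follows every link it is allowed to follow, which is why the default recursion depth is 5. By using -m you turn the depth limit off, so the crawl only stops when it runs out of links; by default wget stays on the starting host, but once other hosts are allowed (-H) it could in principle try to download the whole internet to your machine. That is why you should read the "Recursive Accept/Reject Options" section of the man page. It tells you all about how to limit recursion. For example, you can specify that only links within particular domains are followed.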

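-r with --level 2 will certainly limit your recursion, but (A) a depth limit by itself does nothing to keep you from visiting other sites, and (B) it will almost certainly miss large parts of the site you want to mirror.

--spider is not for downloading files; it only visits pages to check that they are there.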

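Note that even with the -m directive, you will most likely not capture all the files needed to truly mirror an entire site. You need the -p option to get all the page requisites for each page visited.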
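As a rough sketch of how these options might fit together, the snippet below prints two candidate command lines: one for a full mirror with page requisites, and one that keeps only a single file type. The values of START_URL, DOMAIN and PATTERN, the RUN switch and the show_or_run helper are all assumptions made for illustration, not part of the original question or answer. By default nothing is downloaded; the commands are only printed.

```shell
#!/bin/bash
# Assumed values for illustration; replace them with the real site and file type.
START_URL="http://example.com/"
DOMAIN="example.com"
PATTERN="*.ext"

# show_or_run (hypothetical helper): print the wget command line,
# or actually run it when RUN=1 is set in the environment.
show_or_run() {
    if [ "${RUN:-0}" = "1" ]; then
        wget "$@"
    else
        printf '%s\n' "wget $*"
    fi
}

# Full mirror of one site: unlimited depth (-m), page requisites (-p),
# never ascend above the start directory, follow links in DOMAIN only.
show_or_run -m -p --no-parent --domains "$DOMAIN" "$START_URL"

# Same crawl, but keep only files matching PATTERN and force the
# directory hierarchy (-x), as in the question's first attempt.
show_or_run -m -x --no-parent --domains "$DOMAIN" --accept "$PATTERN" "$START_URL"
```

The second form drops -p on purpose: with --accept in effect, files that do not match the pattern are not kept, so fetching page requisites would add little.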