Check if a remote file exists in bash (tags: bash, wget, gnu-parallel)


I am using this script to download files:

parallel --progress -j16 -a ./temp/img-url.txt 'wget -nc -q -P ./images/ {}; wget -nc -q -P ./images/ {.}_{001..005}.jpg'

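Is it possible to skip the download and only check on the remote side whether the files exist, creating a dummy file locally for each one that does?

Something like: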

if wget --spider "$url" 2>/dev/null; then
  #touch img.file
fi

should work, but I don't know how to combine this code with GNU Parallel.

Edit:

Based on Ole's answer, I wrote the following code:

#!/bin/bash
do_url() {
  url="$1"
  wget -q -nc  --method HEAD "$url" && touch ./images/${url##*/}   
  #get filename from $url
  url2=${url##*/}
  wget -q -nc  --method HEAD ${url%.jpg}_{001..005}.jpg && touch ./images/${url2%.jpg}_{001..005}.jpg
}
export -f do_url

parallel --progress -a urls.txt do_url {}

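It works, but not for all files. I can't find any consistency in why it works for some files and fails for others. Maybe it has something to do with the last filename. The second wget tries to access the correct URL, but the touch command after it simply does not create the desired files. The first wget always (correctly) downloads the main image (the one without _001.jpg, _002.jpg).

Example urls.txt:

(works, _001.jpg .. _005.jpg are downloaded)

(does not work, only the main image is downloaded)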

You can send a command over ssh to see whether the remote file exists, and cat it if it does:

ssh your_host 'test -e "somefile" && cat "somefile"' > somefile

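You can also try scp, which supports glob expressions and recursion.

Just loop over the names, so that each URL is tested on its own. When wget is given several URLs in one call, its exit status is non-zero if any one of them fails, so a single `&& touch` after it only runs when all of them exist: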

for uname in "${url%.jpg}"_{001..005}.jpg
do
  if wget --spider "$uname" 2>/dev/null; then
    touch ./images/"${uname##*/}"
  fi
done

You can use curl to check whether the URLs you are parsing exist, without downloading any file:

if curl --head --fail --silent "$url" >/dev/null; then
    touch ./images/"${url##*/}"
fi

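Explanation:

--fail makes the exit status non-zero for a failed request.

--head avoids downloading the file contents.

--silent keeps the check itself from emitting status or error messages.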
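The exit-status behaviour of these three flags can be wrapped in a small helper. This is only a minimal sketch: the function names `url_exists` and `mark_if_exists` are hypothetical, and `--location` (follow redirects) is an extra flag that is not part of the answer above, added on the assumption that some servers redirect.

```shell
#!/usr/bin/env bash

# url_exists URL
# Succeeds (exit status 0) if a HEAD request for URL succeeds;
# --fail turns HTTP errors such as 404 into a non-zero exit status.
# Hypothetical helper; --location is an assumption, drop it if unwanted.
url_exists() {
    curl --head --fail --silent --location "$1" >/dev/null
}

# mark_if_exists URL DIR
# Creates an empty placeholder in DIR, named after the last path
# component of URL, but only when the URL exists.
mark_if_exists() {
    if url_exists "$1"; then
        mkdir -p "$2" && touch "$2/${1##*/}"
    fi
}
```

A call such as `mark_if_exists "$url" ./images` would then replace the `if` block shown earlier.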

To solve the "looping" problem, you can do the following:

urls=( "${url%.jpg}"_{001..005}.jpg )
for url in "${urls[@]}"; do
    if curl --head --silent --fail "$url" > /dev/null; then
        touch ./images/"${url##*/}"
    fi
done

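As far as I can see, your question is not really about how to use wget to test whether a file exists, but about how to write a correct loop in a shell script.

Here is a simple solution: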

urls=( "${url%.jpg}"_{001..005}.jpg )
for url in "${urls[@]}"; do
    if wget -q --method=HEAD "$url"; then
        touch ./images/"${url##*/}"
    fi
done

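It invokes Wget with the --method=HEAD option. With a HEAD request, the server simply reports whether the file exists, without returning any data.

Of course, for a large data set this is very inefficient: you are creating a new connection to the server for every file you try. Instead, as suggested in another answer, you can use GNU Wget2. With wget2 you can test all of these in parallel and use the new --stats-site option to get a list of all the files together with the return code the server gave for each. For example: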

$ wget2 --spider --progress=none -q --stats-site example.com/{,1,2,3}                                                             
Site Statistics:

  http://example.com:
    Status    No. of docs
       404              3
         http://example.com/3  0 bytes (identity) : 0 bytes (decompressed), 238ms (transfer) : 238ms (response)
         http://example.com/1  0 bytes (gzip) : 0 bytes (decompressed), 241ms (transfer) : 241ms (response)
         http://example.com/2  0 bytes (identity) : 0 bytes (decompressed), 238ms (transfer) : 238ms (response)
       200              1
         http://example.com/  0 bytes (identity) : 0 bytes (decompressed), 231ms (transfer) : 231ms (response)

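You can even have this data printed as CSV or JSON for easier parsing.

It is hard to understand what you really want to accomplish. Let me try to rephrase your question.

I have urls.txt containing: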

http://example.com/dira/foo.jpg
http://example.com/dira/bar.jpg
http://example.com/dirb/foo.jpg
http://example.com/dirb/baz.jpg
http://example.org/dira/foo.jpg

On example.com the following URLs exist:

http://example.com/dira/foo.jpg
http://example.com/dira/foo_001.jpg
http://example.com/dira/foo_003.jpg
http://example.com/dira/foo_005.jpg
http://example.com/dira/bar_000.jpg
http://example.com/dira/bar_002.jpg
http://example.com/dira/bar_004.jpg
http://example.com/dira/fubar.jpg
http://example.com/dirb/foo.jpg
http://example.com/dirb/baz.jpg
http://example.com/dirb/baz_001.jpg
http://example.com/dirb/baz_005.jpg

On example.org the following URL exists:

http://example.org/dira/foo_001.jpg

Given urls.txt, I want to generate the combinations with _001.jpg .. _005.jpg in addition to the original URL. For example:

http://example.com/dira/foo.jpg

becomes:

http://example.com/dira/foo.jpg
http://example.com/dira/foo_001.jpg
http://example.com/dira/foo_002.jpg
http://example.com/dira/foo_003.jpg
http://example.com/dira/foo_004.jpg
http://example.com/dira/foo_005.jpg
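The expansion described above can be sketched with plain parameter expansion. The function name `expand_url` is hypothetical, and the sketch assumes every input URL ends in .jpg:

```shell
#!/usr/bin/env bash

# expand_url URL
# Prints URL itself followed by its five numbered variants, one per line.
# ${1%.jpg} strips the trailing ".jpg" so the suffix can be inserted.
expand_url() {
    printf '%s\n' "$1"
    for n in 001 002 003 004 005; do
        printf '%s_%s.jpg\n' "${1%.jpg}" "$n"
    done
}
```

In bash the loop could equally be written with brace expansion, `"${1%.jpg}"_{001..005}.jpg`; the explicit list is used here only so that the sketch also runs under a plain POSIX sh.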

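Then I want to test whether these URLs exist without downloading the files. As there are many URLs, I want to do this in parallel.

If the URL exists, I want an empty file created.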

(Version 1): I want the empty file created in a similar directory structure inside the dir images. This is needed because some of the images have the same name but are in different dirs.

So the files created should be:

images/http:/example.com/dira/foo.jpg
images/http:/example.com/dira/foo_001.jpg
images/http:/example.com/dira/foo_003.jpg
images/http:/example.com/dira/foo_005.jpg
images/http:/example.com/dira/bar_000.jpg
images/http:/example.com/dira/bar_002.jpg
images/http:/example.com/dira/bar_004.jpg
images/http:/example.com/dirb/foo.jpg
images/http:/example.com/dirb/baz.jpg
images/http:/example.com/dirb/baz_001.jpg
images/http:/example.com/dirb/baz_005.jpg
images/http:/example.org/dira/foo_001.jpg

(Version 2): I want the empty file created directly in the dir images. This is possible because all the images have unique names.

So the files created should be:

images/foo.jpg
images/foo_001.jpg
images/foo_003.jpg
images/foo_005.jpg
images/bar_000.jpg
images/bar_002.jpg
images/bar_004.jpg
images/baz.jpg
images/baz_001.jpg
images/baz_005.jpg

(Version 3): I want the empty file created in the dir images, named after the name in urls.txt. This works because only one of _001.jpg .. _005.jpg exists.

images/foo.jpg
images/bar.jpg
images/baz.jpg
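The `parallel` command in this answer calls a four-argument `do_url` whose body is not shown here. The following is a minimal sketch of what such a function could look like for the three versions, not necessarily the answer's own code. The `MODE` environment variable is a hypothetical switch introduced only for this illustration; the argument order is assumed to match `{1.}{2} {1//} {1/.}{2} {1/}`.

```shell
#!/usr/bin/env bash

# do_url URL DIR NAME ORIGNAME
#   URL      candidate URL to test, e.g. http://example.com/dira/foo_001.jpg
#   DIR      directory part of the original URL (GNU Parallel {1//})
#   NAME     file name of the candidate         (GNU Parallel {1/.}{2})
#   ORIGNAME file name as listed in urls.txt    (GNU Parallel {1/})
# MODE (hypothetical, default 2) selects version 1, 2 or 3.
do_url() {
    # HEAD request only; if the URL does not exist there is nothing to create.
    wget -q --method HEAD "$1" || return 0
    case "${MODE:-2}" in
        1) mkdir -p "images/$2" && touch "images/$1" ;;  # keep server dir structure
        2) mkdir -p images && touch "images/$3" ;;       # flat dir, candidate name
        3) mkdir -p images && touch "images/$4" ;;       # flat dir, name from urls.txt
    esac
}
```

To use it with GNU Parallel from bash, the function and the switch would need to be exported first (`export -f do_url; export MODE=2`).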

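GNU Parallel has an overhead of a few milliseconds per job. When your jobs are this short, the overhead will affect the timing. If none of your CPU cores is running at 100%, you can run more jobs in parallel: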
parallel -j0 do_url {1.}{2} {1//} {1/.}{2} {1/} :::: urls.txt ::: .jpg _{001..005}.jpg

You can also "unroll" the loop. This saves 5 overheads per URL:
do_url() {
  url="$1"            # URL without the .jpg extension
  name="${url##*/}"   # file name only, so the placeholder lands directly in images/
  # Version 2:
  # If all the images have unique names and you want all images in a single dir
  wget -q --method HEAD "$url".jpg && touch images/"$name".jpg
  wget -q --method HEAD "$url"_001.jpg && touch images/"$name"_001.jpg
  wget -q --method HEAD "$url"_002.jpg && touch images/"$name"_002.jpg
  wget -q --method HEAD "$url"_003.jpg && touch images/"$name"_003.jpg
  wget -q --method HEAD "$url"_004.jpg && touch images/"$name"_004.jpg
  wget -q --method HEAD "$url"_005.jpg && touch images/"$name"_005.jpg
}
export -f do_url

parallel -j0 do_url {.} :::: urls.txt
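The per-job overhead can be amortised further by letting each job handle a whole batch of URLs read from standard input. This is a sketch of that idea rather than something taken from the answer; the function name `check_batch` and the file name candidates.txt are hypothetical.

```shell
#!/usr/bin/env bash

# check_batch
# Reads candidate URLs from stdin, one per line, and creates
# images/<file name> for every URL that answers a HEAD request.
check_batch() {
    mkdir -p images
    while IFS= read -r u; do
        [ -n "$u" ] || continue   # skip empty lines
        # </dev/null keeps wget from touching the loop's stdin
        wget -q --method HEAD "$u" </dev/null && touch "images/${u##*/}"
    done
    return 0
}
```

Assuming candidates.txt already holds the expanded URL list, something like `export -f check_batch; parallel --pipe -N100 -j16 check_batch < candidates.txt` would hand 100 lines to each job; both numbers are arbitrary and would need tuning.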

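Finally, you can run more than 250 jobs; the GNU Parallel manual describes a workaround for this ("Running more than 250 jobs workaround").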
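No, the remote host is HTTP only.

curl -I can tell you whether the file exists.

I asked this question because I don't want to download any files, only check on the remote side and generate a local dummy file (with the same name) if the remote file exists.

Use --method HEAD to send a HEAD request instead of a GET request.

Possible duplicate.

@iamauser Are you serious? Where in that question is there a single word about checking a sequence of files on the remote side?

Yes, I am. I think your question should be how to loop over a series of files, because that is the input to wget/curl on each call.

Completely changing your question after some answers have been given is not good. It makes most of the answers here look wrong.

Is there no way to save all the images into the images/ directory? I have long URLs, and this script creates a weird folder structure.

Added images. I need "Version 2".

Very good, thanks. I ran a small benchmark and I am disappointed with the speed: it is much slower than downloading the files (results available if you are interested). Where do you think the bottleneck is?

With 250 jobs (-j0) the run time is now cut in half, but unfortunately it is still slower than wget --no-clobber (do not download if the file exists). It is a great answer anyway, and I will definitely use it in the future.

Something is odd with the latest example: $ ls images/ shows _001.jpg _002.jpg _003.jpg _004.jpg _005.jpg.

I was finally able to compile Wget2. For a quick test I ran: wget2 --spider --progress=none --stats-site=csv:stat.csv ${url%.jpg}_{001..005}.jpg. It queries the URLs fine (example.com/hello_001.jpg, etc.), but stat.csv contains only one entry, the last query. I also think I need to run Wget2 one more time for the main image (example.com/hello.jpg).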