Text wget: reading from a list with ID numbers and URLs

I have a .txt file with 500 lines, each containing an ID number and a website home page URL, like this:

id_345  http://www.example1.com
id_367  http://www.example2.org
...
id_10452 http://www.example3.net

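Using wget with the -i option, I am trying to recursively download part of these websites, but I would like the files to be stored in a way that ties them to the ID number: either in a directory named after the ID or (the best option, though I think the hardest to achieve) with the HTML content stored in a .txt file named after the ID. Unfortunately, the -i option cannot read a file laid out like the one I am using. How can I link each website's content to its associated ID?

Thanks

P.S.: I suppose that to do this I have to "step outside" wget and call it from a script. If so, please bear in mind that I am a beginner in this field (I only have some Python experience). In particular, I cannot yet follow the logic and code of bash scripts, so a step-by-step explanation for dummies would be very welcome.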

To get sites recursively from Python with wget -P … -r -l …, with parallel processing:
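A minimal sketch of how such a script could look, assuming wget is installed and on the PATH and that the list is a whitespace-separated file named site_list.txt. The function names, the depth of 2, and the worker count of 4 are illustrative choices and are not taken from the original answer.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def parse_site_list(lines):
    """Yield (site_id, url) pairs, skipping blank or incomplete lines."""
    for line in lines:
        fields = line.split()  # split on any run of whitespace
        if len(fields) >= 2:
            yield fields[0], fields[1]


def build_wget_command(site_id, url, depth=2):
    """Build the argument list for: wget -P <site_id> -r -l <depth> <url>."""
    return ["wget", "-P", site_id, "-r", "-l", str(depth), url]


def fetch_site(site_id, url, depth=2):
    """Run one recursive wget and return (site_id, exit code)."""
    result = subprocess.run(build_wget_command(site_id, url, depth))
    return site_id, result.returncode


def fetch_all(path="site_list.txt", workers=4, depth=2):
    """Download every site in the list, a few at a time."""
    with open(path) as f:
        sites = list(parse_site_list(f))
    # Threads are enough here: each worker just waits for its wget process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch_site, sid, url, depth) for sid, url in sites]
        for future in futures:
            site_id, code = future.result()
            print(site_id, "ok" if code == 0 else "wget exit code %d" % code)
```

Calling fetch_all() from the directory that holds site_list.txt should leave one directory per ID, each containing whatever wget retrieved for that site. Passing the command as a list avoids going through a shell, so characters such as & or ? in a URL are not reinterpreted.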
To get single pages into named files with Python:

import urllib2, re

input_file = "site_list.txt"

# open the site list file
with open(input_file) as f:
    # loop through lines
    for line in f:
        # split out the id and url
        id_url = re.compile("\s+").split(line)
        print "Grabbing " + id_url[1] + " into " + id_url[0] + ".html..."
        try:
            # try to get the web page
            u = urllib2.urlopen(id_url[1])
            # save the GET response data to the id file (appended with "html")
            localFile = open(id_url[0]+".html", 'wb+')
            localFile.write(u.read())
            localFile.close()
            print "got " + id_url[0] + "!"
        except:
            print "Could not get " + id_url[0] + "!"
            pass

Example site_list.txt:

id_345 http://www.stackoverflow.com
id_367 http://stats.stackexchange.com

Output:

Grabbing http://www.stackoverflow.com into id_345.html...
got id_345!
Grabbing http://stats.stackexchange.com into id_367.html...
got id_367!

Directory listing:

get_urls.py
id_345.html
id_367.html
site_list.txt
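
The script above targets Python 2 (urllib2 and print statements). On Python 3, where urllib2 became urllib.request, a rough equivalent might look like the sketch below. The helper names parse_line, save_page, and save_all are hypothetical, and the extension parameter is an addition that allows saving to .txt instead of .html, as the question asks.

```python
import urllib.request


def parse_line(line):
    """Return (site_id, url) for a list line, or None if it has fewer than two fields."""
    fields = line.split()
    return (fields[0], fields[1]) if len(fields) >= 2 else None


def save_page(site_id, url, extension=".html"):
    """Fetch one URL and write the raw response body to <site_id><extension>."""
    filename = site_id + extension
    # The URL is opened first, so no empty file is left behind if the request fails.
    with urllib.request.urlopen(url) as response, open(filename, "wb") as out:
        out.write(response.read())
    return filename


def save_all(path="site_list.txt", extension=".html"):
    """Save the home page of every site in the list under its ID."""
    with open(path) as f:
        for line in f:
            parsed = parse_line(line)
            if parsed is None:
                continue  # blank or incomplete line
            site_id, url = parsed
            try:
                print("got", save_page(site_id, url, extension))
            except (OSError, ValueError) as err:  # URLError and HTTPError are OSError subclasses
                print("Could not get", site_id, "-", err)
```

Calling save_all(extension=".txt") would store each home page as id_345.txt, id_367.txt, and so on. Like the original script, this fetches only the single page at each URL and does not follow links.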

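If you prefer the command line or a shell script, you can use awk to read each line (splitting on whitespace by default), pipe the result into a loop, and execute it with backticks: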

awk '{print "wget -O " $1 ".html " $2}' site_list.txt | while read line ; do `$line` ; done
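Breakdown

Use the awk tool to read each line of the site_list.txt file and split it at whitespace (the default) into variables ($1, $2, $3, and so on), so that your id is in $1 and your url is in $2.

Add the print awk command to construct the call to wget.

Add the pipe operator | to send the output to the next command.

Next, we execute the wget call: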

while read line ; do `$line` ; done

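Loop through the previous command's output line by line, storing each line in the $line variable, and execute it with the backtick operator, which interprets the text and runs it as a command.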
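
For readers more at home in Python, the awk step can be traced with the sketch below. It only builds and prints the command strings and does not run them. The function name wget_commands and the use of shlex.quote are additions that do not appear in the one-liner, and str.split() is used as an approximation of awk's default field splitting.

```python
import shlex


def wget_commands(lines):
    """Mimic the awk step: one 'wget -O <id>.html <url>' string per input line."""
    commands = []
    for line in lines:
        fields = line.split()  # similar to awk's default whitespace splitting
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        site_id, url = fields[0], fields[1]  # awk's $1 and $2
        commands.append(
            "wget -O " + shlex.quote(site_id + ".html") + " " + shlex.quote(url)
        )
    return commands


sample = [
    "id_345 http://www.stackoverflow.com\n",
    "id_367 http://stats.stackexchange.com\n",
]
for command in wget_commands(sample):
    print(command)
```

Each printed line corresponds to what the while loop receives in $line. Quoting matters once a URL contains characters the shell treats specially, which the plain awk print does not guard against.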

I think the first solution, in Python, is very useful except for one thing: I use wget because it has the -r option, which downloads the pages linked from a given URL down to a set depth. As far as I know, that is hard to get with any of the Python modules I have come across. I will try the second solution right away. In the meantime: is it possible to somehow include wget (instead of urllib2) in a Python program, or do these two worlds not mix easily?

The second solution works very well. I modified the code slightly, like this: awk '{print "wget -P " $1 " " $2 " -r -l 2"}' lista.txt. This way I get a directory named after the ID that contains a large part of the website.

Sorry, maybe I was not clear enough in the question: my problem is recursive website download, not single URLs (where Python of course works very well).

wwwslinger, so my last question is: can I integrate wget into the Python script you wrote?

I edited the answer to include an example in Python that uses recursive wget (run concurrently, since a recursive wget takes a while).

Thank you very much. In the end the Linux solution worked smoothly, but it is useful to know that wget can be integrated into Python this way.