Is there a better, simpler way to download multiple files in Python?

I'm downloading some turnstile data from the New York City MTA website, and I came up with a script in Python that downloads only the 2017 data.

Here is the script:

import urllib  # Python 2 urllib; in Python 3 this would be urllib.request
import re

# fetch the page that lists all the turnstile data files
html = urllib.urlopen('http://web.mta.info/developers/turnstile.html').read()
# match only the 2017 links (raw string avoids ambiguous backslash escapes)
links = re.findall(r'href="(data/\S*17[01]\S*[a-z])"', html)

for link in links:
    txting = urllib.urlopen('http://web.mta.info/developers/' + link).read()
    lin = link[20:40]  # slice the bare file name out of the relative path
    fhand = open(lin, 'w')
    fhand.write(txting)
    fhand.close()

Is there a simpler way to write this script?

As @dizzyf suggested, you can use BeautifulSoup to get the href attributes from the web page:

from bs4 import BeautifulSoup  # note: the module name is lowercase 'bs4'

soup = BeautifulSoup(html, 'html.parser')
# href=True skips anchors without an href, so get('href') never returns None
links = [link.get('href') for link in soup.find_all('a', href=True)
                          if 'turnstile_17' in link.get('href')]
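
This snippet assumes the html string fetched in the question; if you are already moving to third-party libraries, requests (used in the other answer below) gets it in one line. A minimal sketch:

import requests
from bs4 import BeautifulSoup

# fetch the listing page and hand the HTML straight to BeautifulSoup
html = requests.get('http://web.mta.info/developers/turnstile.html').text
soup = BeautifulSoup(html, 'html.parser')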
If you don't have to fetch the files with Python (and you're on a *nix system), you can write the links to a file:

with open('url_list.txt', 'w') as url_file:
    for url in links:
        # file objects have no writeline(); use write() with an explicit newline
        url_file.write(url + '\n')
Then download them with wget:

$ wget -i url_list.txt

wget -i downloads every URL listed in the file into the current directory, preserving the original file names.
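
If you would rather stay in Python instead of shelling out to wget, the same loop can be done with the standard library. A minimal Python 3 sketch, assuming the links list from the snippet above (Python 2's equivalent of urllib.request.urlretrieve is urllib.urlretrieve):

import os
import urllib.request

url_base = 'http://web.mta.info/developers/'
for link in links:
    # keep just the file name, as wget would
    filename = os.path.basename(link)
    urllib.request.urlretrieve(url_base + link, filename)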

The code below should do what you need:

import requests
import bs4
import time
import random
import re

pattern = '2017'
url_base = 'http://web.mta.info/developers/'
url_home = url_base + 'turnstile.html'
response = requests.get(url_home)
data = dict()

soup = bs4.BeautifulSoup(response.text, 'html.parser')
# keep only the anchors whose link text mentions 2017
links = [link.get('href')
         for link in soup.find_all('a', text=re.compile(pattern))]
for link in links:
    url = url_base + link
    print "Pulling data from:", url
    response = requests.get(url)
    data[link] = response.text  # stored in a dict here; you could write it to a file as in your example
    not_a_robot = random.randint(2, 15)
    print "Waiting %d seconds before next query." % not_a_robot
    time.sleep(not_a_robot)  # some APIs will throttle you if you hit them too quickly
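
The inline comment above leaves the storage step open; if you want each file on disk as in the original script, a minimal variation on the same loop (assuming the links, url_base, and imports from the block above):

import os

for link in links:
    response = requests.get(url_base + link)
    # write each download under its bare file name
    with open(os.path.basename(link), 'w') as f:
        f.write(response.text)
    time.sleep(random.randint(2, 15))  # keep the polite delay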

Don't use regular expressions to parse HTML; I suggest using BeautifulSoup instead.
Thank you very much for your time.