Downloading a file periodically with a Python scheduler and wget


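I wrote a simple script that uses the schedule module to download a file from a web page once a week. Before downloading, it uses BeautifulSoup to check whether the file has been updated. If it has, the script downloads the file with wget. Other scripts then use this file for calculations.

The problem is that the file does not appear in the directory until I interrupt the script manually. So every time, I have to interrupt the script and run it again so that the download is scheduled for the following week.

Is there a way to download and save the file "on the fly", without interrupting the script?

The code is as follows: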

import wget
import ssl
import schedule
import time
from urllib.request import urlopen  # needed by check_for_updates()
from bs4 import BeautifulSoup
import datefinder
from datetime import datetime

# disable certificate checks
ssl._create_default_https_context = ssl._create_unverified_context


#checking if file was updated, if yes, download file, if not waiting until updated
def download_file():
    if check_for_updates():
        print("downloading")
        url = 'https://fgisonline.ams.usda.gov/ExportGrainReport/CY2020.csv'
        wget.download(url)
        print("downloading complete")
    else:
        print("sleeping")
        time.sleep(60)
        download_file()

# Checking if website was updated
def check_for_updates():
    url2 = 'https://fgisonline.ams.usda.gov/ExportGrainReport/default.aspx'
    html = urlopen(url2).read()
    soup = BeautifulSoup(html, "lxml")
    text_to_search = soup.body.ul.li.string
    matches = list(datefinder.find_dates(text_to_search[30:]))
    found_date = matches[0].date()
    today = datetime.today().date()
    return found_date == today


schedule.every().tuesday.at('09:44').do(download_file)

while True:
    schedule.run_pending()
    time.sleep(1)

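You should be able to solve your problem with the following approach:

from bs4 import BeautifulSoup
import requests
import urllib3

urllib3.disable_warnings()


def main(url):
    r = requests.head(url, verify=False)
    print(r.headers['Last-Modified'])


main("https://fgisonline.ams.usda.gov/ExportGrainReport/CY2020.csv")

Output: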

Mon, 28 Sep 2020 15:02:22 GMT

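Now you can run the script every day, at whatever time you like, via a Cron job, and loop on the file's Last-Modified header until it equals today's date, then download the file.

Note that I used a HEAD request, which fetches only the headers and not the file body, so checking this way will be about 100x faster. After that you can use requests.get to download the file.

I also prefer to work within the same session.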
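The polling idea described above could be sketched roughly as follows. This is only an illustration, not the answerer's code: the helper names `modified_on` and `wait_and_download`, the 60-second polling interval, the timeouts, and the choice to compare dates in UTC are all assumptions. Certificate verification is switched off only because the original snippet does so.

```python
import time
from datetime import date, datetime, timezone
from email.utils import parsedate_to_datetime


def modified_on(last_modified: str, day: date) -> bool:
    """Return True if an HTTP Last-Modified value falls on `day` (in UTC)."""
    stamp = parsedate_to_datetime(last_modified)
    return stamp.astimezone(timezone.utc).date() == day


def wait_and_download(url: str, dest: str, poll_seconds: int = 60) -> None:
    """Poll `url` with HEAD until it was modified today, then save it to `dest`."""
    import requests  # third-party; imported here so modified_on() works without it

    # One session reuses the connection for the HEAD polls and the final GET.
    with requests.Session() as session:
        session.verify = False  # mirrors the original answer; avoid in general
        while True:
            head = session.head(url, timeout=30)
            header = head.headers.get("Last-Modified")
            today = datetime.now(timezone.utc).date()
            if header and modified_on(header, today):
                break
            time.sleep(poll_seconds)

        response = session.get(url, timeout=30)
        response.raise_for_status()
        with open(dest, "wb") as fh:
            fh.write(response.content)
```

A Cron job could then call `wait_and_download(url, "CY2020.csv")` once on the expected update day. If the site's update time is close to midnight UTC, comparing in the server's local timezone may be more appropriate.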
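You need to specify the output directory. I think that unless you do, PyCharm saves the file somewhere in a temp directory, and copies it over when you stop the script.
Change the call to:
wget.download(url, out=output_directory)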
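As a rough sketch of that suggestion, the output directory can be made explicit and created up front. The helper names `prepare_output_dir` and `download_to` are hypothetical, and the behaviour described in the comment (a directory passed as `out` keeps the file name taken from the URL) is my understanding of the `wget` package rather than something stated in this thread.

```python
from pathlib import Path


def prepare_output_dir(directory: str) -> Path:
    """Create `directory` if needed and return it as an absolute path."""
    path = Path(directory).expanduser().resolve()
    path.mkdir(parents=True, exist_ok=True)
    return path


def download_to(url: str, directory: str) -> str:
    """Download `url` into `directory` and return the saved file's path."""
    import wget  # third-party; imported here so the helper above works without it

    out_dir = prepare_output_dir(directory)
    # Passing an existing directory as `out` should keep the URL's file name.
    return wget.download(url, out=str(out_dir))
```

Using an absolute path avoids depending on whatever working directory the IDE or scheduler happens to use when it starts the script.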
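What do you mean by the script being interrupted? Which error did you get? What do you mean by checking for file updates?

Please try wget.download(url, out=output_directory)

First, the file is usually updated at 09:00, but sometimes it is updated two or three minutes later. However, in an HTML tag they write "updated at <date and time>", so I use BeautifulSoup to extract this date and time string from the tag, convert it to a date, and compare it with today's date to make sure the file has been updated.

@αη About the script interruption: at the moment I am running the script from PyCharm. As I mentioned, while the script is running, the downloaded file does not appear in the folder. So I have to click "Stop" and terminate the script. After that, the file shows up in the folder. To schedule the download again, I need to click "Run" again.

It worked! Thank you very much for your help, @zvi.

Thanks a lot! Switching from Beautiful Soup to requests is indeed faster.

@AlexRiabukha You're welcome. If my answer helped you, feel free to accept it.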