How do I write Python code that reads the output file to work out how far it got in scraping a website, then picks up from where it left off?


python for-loop web-scraping


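I'm writing a program that extracts the article title, date, and body text from every article in this website's archive and exports them to a csv file. The website seems to block me at some point, and I get this error: HTTPError: Service Unavailable

I believe this is because I'm trying to access their website too many times in a short period. I want my code to be able to read where the error occurred and pick up from where it left off.

I have tried adding a 2-second delay after reading every 10 articles. I have also tried a random delay after every ten articles. I could add longer delays, but I want the code to be able to pick up where it left off so that it feels foolproof.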

from bs4 import BeautifulSoup
from urllib.request import urlopen
import csv
from time import sleep
from random import randint

csvfile = "C:/Users/k/Dropbox/granularitygrowth/Politico/pol.csv"
with open(csvfile, mode='w', newline='', encoding='utf-8') as pol:
    csvwriter = csv.writer(pol, delimiter='~', quoting=csv.QUOTE_MINIMAL)
    csvwriter.writerow(["Date", "Title", "Article"])

    #for each page on Politico archive
    for p in range(0,412):
        url = urlopen("https://www.politico.com/newsletters/playbook/archive/%d" % p)
        content = url.read()

        #Parse article links from page
        soup = BeautifulSoup(content,"lxml")
        articleLinks = soup.findAll('article', attrs={'class':'story-frag format-l'})

        #Each article link on page
        for article in articleLinks:
            link = article.find('a', attrs={'target':'_top'}).get('href')

            #Open and read each article link
            articleURL = urlopen(link)
            articleContent = articleURL.read()

            #Parse body text from article page
            soupArticle = BeautifulSoup(articleContent, "lxml")

            #Limits to div class = story-text tag (where article text is)
            articleText = soupArticle.findAll('div', attrs={'class':'story-text'})
            for div in articleText:

                #Find date
                footer = div.find('footer', attrs={'class':'meta'})
                date = footer.find('time').get('datetime')
                print(date)

                #Find title
                headerSection = div.find('header')
                title = headerSection.find('h1').text
                print(title)

                #Find body text
                textContent = ""
                bodyText = div.findAll('p')
                for p in bodyText:
                    p_string = str(p.text)
                    textContent += p_string + ' '
                print(textContent)

                #Adds data to csv file
                csvwriter.writerow([date, title, textContent])

        # sleep is imported directly from time, so call it without the module prefix
        sleep(randint(3,8))

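I expect my code to still hit this error, but to pick up from where it stopped and continue printing and exporting the data to the csv file.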

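You can count the number of articles you have saved in the csv and integer-divide by 10 to get the last page you were on: page = 1 + records // 10 (the +1 accounts for the first page).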
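As a rough sketch of that arithmetic, assuming 10 articles per archive page, one header row, and the `~` delimiter from the question (the function name `resume_page` and the sample rows are hypothetical):

```python
import csv
import io


def resume_page(csv_file, per_page=10, delimiter="~"):
    """Return the archive page to resume from, given the CSV written so far."""
    # Count parsed rows, then subtract the header row to get saved articles.
    rows = sum(1 for _ in csv.reader(csv_file, delimiter=delimiter))
    records = max(rows - 1, 0)
    return 1 + records // per_page


# Placeholder data: a header plus 25 saved articles.
sample_text = "Date~Title~Article\r\n" + "".join(
    f"2018-03-01~Title {i}~Body {i}\r\n" for i in range(25)
)
sample = io.StringIO(sample_text, newline="")
print(resume_page(sample))  # 25 records: 1 + 25 // 10 = 3
```

With a real file you would pass `open(csvfile, newline='', encoding='utf-8')` instead of the in-memory sample. Note that a partially saved page is scraped again from its first article, so the last few rows can end up duplicated.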

I refactored your code as follows:

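import csv
import time
from random import randint
from urllib.request import urlopen
from bs4 import BeautifulSoup

HEADERS = ["Date", "Title", "Article"]


def count_rows(csv_path: str) -> int:
    with open(csv_path, encoding='utf-8', newline='') as f:
        reader = csv.DictReader(f)
        return len(list(reader))


def write_articles(csv_path: str, articles: list):
    # note the append mode; write mode would delete everything and start fresh
    with open(csv_path, 'a', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f,
                                quoting=csv.QUOTE_MINIMAL,
                                fieldnames=HEADERS)
        writer.writerows(articles)


def init_csv(csv_path: str):
    with open(csv_path, 'w', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=HEADERS, quoting=csv.QUOTE_MINIMAL)
        writer.writeheader()


def get_page_soup(url: str) -> BeautifulSoup:
    response = urlopen(url)
    html = response.read()

    soup = BeautifulSoup(html, "lxml")
    return soup


def scrape_article(url: str) -> dict:
    article_soup = get_page_soup(url)

    # Limits to div class = story-text tag (where article text is)
    story = article_soup.select_one('.story-text')

    # find date
    date = story.select_one('.timestamp time')['datetime']

    # find title
    title = story.find('h1').text

    # find body text
    article_text = ''
    for p in story.find_all('p'):
        article_text += p.text + ' '

    return {
        'Title': title,
        'Date': date,
        'Article': article_text
    }


def main():
    csvfile = "test.csv"

    try:
        record_count = count_rows(csvfile)
    except FileNotFoundError:
        init_csv(csvfile)
        print('Initialized CSV file')
        record_count = 0

    article_per_page = 10
    page = 1 + record_count // article_per_page

    print('Continuing from page', page)

    for p in range(page, 413):
        url = "https://www.politico.com/newsletters/playbook/archive/%d" % p
        soup = get_page_soup(url)
        article_links = soup.select('article.story-frag.format-l')

        # Reset per page so each article is appended to the CSV only once
        articles = []

        # Each article link on page
        for article in article_links:
            link = article.select_one('a[target=_top]')['href']
            scraped_article = scrape_article(link)
            print(scraped_article)
            articles.append(scraped_article)

        write_articles(csvfile, articles)
        print('Finished page', p)
        time.sleep(randint(3, 8))


if __name__ == '__main__':
    main()

This will give you output like the following: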

Finished page 48
{'Title': 'Playbook: Scalise takes several Republicans to ...
{'Title': 'Playbook: Four unfolding events that show the  ...
{'Title': 'Playbook: Texas kicks off primary season, as D ...
{'Title': 'Playbook: The next gen: McCarthy and Crowley’s ...
{'Title': 'INSIDE THE GRIDIRON DINNER: What Trump said an ...
{'Title': 'DEMS spending millions already to boost vulner ...
{'Title': 'Playbook: Inside the Republican super PAC mone ...
{'Title': 'Playbook: Who would want to be White House com ...
{'Title': "Playbook: Jared Kushner's bad day", 'Date': '2 ...
{'Title': 'Playbook: Gun control quickly stalls in the Se ...
Finished page 49

By catching the HTTPError in an except block, you can use a module to save the point where your code stopped.
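The answer does not say which module it has in mind, so the following is only a minimal sketch that assumes the standard-library `json` module as one possible choice. The helper names (`load_checkpoint`, `save_checkpoint`, `scrape_pages`), the checkpoint file name, and the fake scraper are all hypothetical; the fake scraper stands in for the real per-page scraping so the example runs without network access.

```python
import json
import os
import tempfile
from urllib.error import HTTPError


def load_checkpoint(path, default=0):
    """Return the saved page number, or `default` if no checkpoint exists."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)["page"]
    except FileNotFoundError:
        return default


def save_checkpoint(path, page):
    """Record the page that still needs to be scraped."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"page": page}, f)


def scrape_pages(scrape_page, checkpoint_path, last_page):
    """Call scrape_page(p) from the checkpointed page through last_page.

    Returns True if every page finished, False if an HTTPError stopped the run.
    """
    start = load_checkpoint(checkpoint_path)
    for p in range(start, last_page + 1):
        try:
            scrape_page(p)
        except HTTPError as err:
            # Save the failing page so the next run retries it first.
            save_checkpoint(checkpoint_path, p)
            print(f"Stopped at page {p}: HTTP {err.code} {err.reason}")
            return False
    return True


def make_fake_scraper(fail_on):
    """Hypothetical stand-in for real scraping: fails once on page `fail_on`."""
    state = {"failed": False}
    done = []

    def scrape_page(p):
        if p == fail_on and not state["failed"]:
            state["failed"] = True
            raise HTTPError("https://example.invalid", 503,
                            "Service Unavailable", None, None)
        done.append(p)

    return scrape_page, done


checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
scraper, done = make_fake_scraper(fail_on=3)
first = scrape_pages(scraper, checkpoint, last_page=5)   # interrupted at page 3
second = scrape_pages(scraper, checkpoint, last_page=5)  # resumes at page 3
print(done)
```

In real use, `scrape_page` would be the body of the per-page loop, and the second run would normally happen only after waiting long enough for the site to stop returning 503.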