Python: HTTP Error 413 occurs frequently when scraping multiple pages


python pandas web-scraping


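I am scraping posts from Wykop.pl ("Poland's Reddit") by looping through the multiple pages returned when I search for a keyword I am interested in. I wrote a loop to iterate over each page and pull out the target content; however, the loop terminates at certain pages (consistently the same ones) with the error "HTTP Error 413: Request Entity Too Large".

I tried scraping the problematic pages one by one, but the same error message kept appearing. To work around it I had to set the page ranges manually, at the cost of missing a lot of data. Is there a Pythonic way to handle this error? I also tried longer pauses, since I might be at risk of sending too many requests, but that does not seem to be the cause.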

from time import sleep, time
from random import randint
from warnings import warn

import requests
from requests import get
from bs4 import BeautifulSoup
from mtranslate import translate
from IPython.display import clear_output

posts = []
votes = []
dates = []
images = []
users = []

start_time = time()
requests = 0  # request counter (note: this name shadows the imported requests module)
pages = [str(i) for i in range(1, 10)]

for page in pages:
    url = "https://www.wykop.pl/szukaj/wpisy/smog/strona/" + page + "/"
    response = get(url)

    # Pause the loop
    sleep(randint(8, 15))

    # Monitor the requests
    requests += 1
    elapsed_time = time() - start_time
    print('Request:{}; Frequency: {} requests/s'.format(requests, requests / elapsed_time))
    clear_output(wait=True)

    # Throw a warning for non-200 status codes
    if response.status_code != 200:
        warn('Request: {}; Status code: {}'.format(requests, response.status_code))

    # Break the loop if the number of requests is greater than expected
    if requests > 10:
        warn('Number of requests was greater than expected.')
        break

    soup = BeautifulSoup(response.text, 'html.parser')
    results = soup.find_all('li', class_="entry iC")

    for result in results:
        # Error handling: skip entries that lack one of the expected tags
        try:
            post = result.find('div', class_="text").text
            # Translate each post from Polish to English while scraping
            post = translate(post, 'en', 'auto')
            posts.append(post)

            date = result.time['title']
            dates.append(date)

            vote = result.p.b.span.text
            vote = int(vote)
            votes.append(vote)

            user = result.div.b.text
            users.append(user)

            image = result.find('img', class_='block lazy')
            images.append(image)

        except AttributeError as e:
            print(e)

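If I could run the whole script in one go, I would set the range to 1 through 163 (there are 163 pages of results mentioning my keyword). Instead, I have had to set smaller ranges and collect the data incrementally, again at the cost of missing pages.

As a fallback, I could also download the HTML of the problematic pages to my desktop and scrape those files.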
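One way to avoid shrinking the range by hand is to let the loop record a failing page and move on, so that the failed pages can be inspected or retried later. The following is only a minimal sketch of that idea: `scrape_pages`, `fetch_page`, `max_attempts`, and `pause` are hypothetical names, and `fetch_page` stands in for whatever function downloads and parses a single page and raises on failure.

```python
def scrape_pages(page_numbers, fetch_page, max_attempts=3, pause=lambda attempt: None):
    """Fetch every page, retrying failures instead of stopping the loop.

    `fetch_page(page)` is a hypothetical callable that returns the parsed
    rows for one page and raises an exception on failure.
    `pause(attempt)` is called after each failed attempt (e.g. to sleep).
    Returns (results, failed), where `failed` maps page -> last error.
    """
    results = {}
    failed = {}
    for page in page_numbers:
        for attempt in range(1, max_attempts + 1):
            try:
                results[page] = fetch_page(page)
                failed.pop(page, None)  # a later attempt succeeded
                break
            except Exception as exc:  # broad on purpose: this is a sketch
                failed[page] = exc
                pause(attempt)
    return results, failed
```

With this shape, a page that keeps failing ends up in `failed` together with its last exception, while the remaining pages are still collected.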

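You may have hit some kind of IP address limit. When I run the script it works fine for me, without any rate limiting (so far). However, I would suggest using requests.Session() (you will need to rename the requests variable, otherwise it overwrites the import). This can help reduce possible memory-leak issues.

For example: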

from bs4 import BeautifulSoup
from time import sleep
from time import time
from random import randint
import requests

posts = []
votes = []
dates = []
images = []
users = []

start_time = time()
request_count = 0
req_sess = requests.Session()

for page_num in range(1, 100):
    response = req_sess.get(f"https://www.wykop.pl/szukaj/wpisy/smog/strona/{page_num}/")

    # Pause the loop
    #sleep(randint(1,3))

    # Monitor the requests
    request_count += 1
    elapsed_time = time() - start_time
    print('Page {}; Request:{}; Frequency: {} requests/s'.format(page_num, request_count, request_count/elapsed_time))

    #clear_output(wait = True)
    # Print the status code and response headers for non-200 responses
    if response.status_code != 200:
        print('Request: {}; Status code: {}'.format(request_count, response.status_code))
        print(response.headers)

    # Break the loop if the number of requests is greater than expected
    #if request_count > 10:
    #    print('Number of requests was greater than expected.')
    #    break

    soup = BeautifulSoup(response.text, 'html.parser')
    results = soup.find_all('li', class_="entry iC")

    for result in results:
        # Error handling
        try:
            post = result.find('div', class_="text").text
            #post = translate(post,'en','auto')
            posts.append(post)

            date = result.time['title']
            dates.append(date)

            vote = result.p.b.span.text
            vote = int(vote)
            votes.append(vote)

            user = result.div.b.text
            users.append(user)

            image = result.find('img',class_='block lazy')
            images.append(image)

        except AttributeError as e:
            print(e)

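This gave the following output:

Page 1; Request:1; Frequency: 1.24613732973911 requests/s
Page 2; Request:2; Frequency: 1.3021880233774552 requests/s
Page 3; Request:3; Frequency: 1.2663757427416629 requests/s
Page 4; Request:4; Frequency: 1.1807827876080845 requests/s
.
.
.
Page 96; Request:96; Frequency: 0.888853607003809 requests/s
Page 97; Request:97; Frequency: 0.8891876183362001 requests/s
Page 98; Request:98; Frequency: 0.8801819672809 requests/s
Page 99; Request:99; Frequency: 0.8900784741536467 requests/s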

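This also worked fine when I started from higher page numbers. In theory, when you get a 413 status code, the script should now print the response headers. According to the HTTP specification, the server should return a Retry-After header field when the condition is temporary, which you can use to determine how long to back off before the next request.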
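As an illustration of using that header, here is a minimal sketch that turns a Retry-After value into a number of seconds to wait. The header can carry either a number of seconds or an HTTP-date, so both forms are handled. The function name `retry_after_seconds` and the `default` fallback are assumptions for the example, not part of requests.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(headers, default=30.0, now=None):
    """Return how many seconds to wait, based on a Retry-After header.

    `headers` is any mapping of header names to values (for example
    `response.headers` from requests). `default` is a hypothetical
    fallback used when the header is missing or cannot be parsed.
    """
    value = headers.get("Retry-After")
    if value is None:
        return default
    value = value.strip()
    # Form 1: a non-negative integer number of seconds.
    if value.isdigit():
        return float(value)
    # Form 2: an HTTP-date such as "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means there is no need to wait.
    return max(0.0, (when - now).total_seconds())
```

In the loop above, a non-200 response could then be followed by `sleep(retry_after_seconds(response.headers))` before the page is requested again.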


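OK, here is the gist:

The 413 error has nothing to do with Wykop, the site being scraped. It comes from the mtranslate package, which relies on the Google Translate API. In my original code, each post was translated from Polish to English as Wykop was being scraped. However, the Google Translate API limits each user to 100,000 characters per 100 seconds, so by the time the code reached page 13, mtranslate had hit Google Translate's request limit. That is why Martin's solution scrapes the data fine with the translate call disabled.

I reached this conclusion when I used the module to translate the posts already stored in a DataFrame, because I hit the same error 8% of the way through the translation loop.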
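If translation is done as a separate pass over the stored posts, the character quota can be respected by throttling that pass. The sketch below is one possible sliding-window limiter; the defaults of 100,000 characters per 100 seconds are taken from the figure quoted above, while `CharBudget`, `translate_all`, and `translate_fn` are hypothetical names (`translate_fn` could be, for instance, `lambda p: translate(p, 'en', 'auto')`).

```python
import time
from collections import deque


class CharBudget:
    """Sliding-window limiter: at most `max_chars` per `window` seconds."""

    def __init__(self, max_chars=100_000, window=100.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_chars = max_chars
        self.window = window
        self.clock = clock
        self.sleep = sleep
        self.spent = deque()  # (timestamp, chars) pairs inside the window
        self.total = 0

    def _expire(self, now):
        # Forget spending that has fallen out of the window.
        while self.spent and now - self.spent[0][0] >= self.window:
            _, chars = self.spent.popleft()
            self.total -= chars

    def acquire(self, chars):
        """Block until `chars` more characters fit in the current window."""
        if chars > self.max_chars:
            raise ValueError("text is larger than the whole budget")
        now = self.clock()
        self._expire(now)
        while self.total + chars > self.max_chars:
            # Wait until the oldest entry expires, then re-check.
            wait = self.window - (now - self.spent[0][0])
            self.sleep(max(wait, 0.0))
            now = self.clock()
            self._expire(now)
        self.spent.append((now, chars))
        self.total += chars


def translate_all(posts, translate_fn, budget):
    """Translate posts one by one, pausing whenever the budget is used up."""
    translated = []
    for post in posts:
        budget.acquire(len(post))
        translated.append(translate_fn(post))
    return translated
```

Keeping scraping and translation in separate passes also means a quota error in the translation step no longer interrupts the scraping loop.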


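Missing BS4 import, friend! Most likely you just made a mistake when pasting. Also, memory leaks are a very common issue with the requests library. You should look into asynchronous requests to deal with it. Better yet, start using Scrapy!

Hi Martin; works like a charm! Interestingly: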