Python: HTTP Error 404 even though the page can be viewed in the browser


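I am trying to map this website, but I run into a problem when I try to crawl it completely. I get a 404 error even though the URL exists.

Here is my code: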

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

csvFile = open("C:/Users/Pichau/codigo/govbr/brasil/govfederal/govbr/arquivos/teste.txt",'wt')
paginas = set()
def getLinks(pageUrl):
    global paginas
    html = urlopen("https://www.gov.br/pt-br/"+pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    writer = csv.writer(csvFile)
    for link in bsObj.findAll("a"):
        if 'href' in link.attrs:
            if link.attrs['href'] not in paginas:
                # new page found
                newPage = link.attrs['href']
                print(newPage)
                paginas.add(newPage)
                getLinks(newPage)
                csvRow = []
                csvRow.append(newPage)
                writer.writerow(csvRow)

   
getLinks("")
csvFile.close()

This is the error message I get after trying to run the code:

#wrapper
/
#main-navigation
#nolivesearchGadget
#tile-busca-input
#portal-footer
http://brasil.gov.br
Traceback (most recent call last):
  File "c:\Users\Pichau\codigo\govbr\brasil\govfederal\govbr\teste2.py", line 26, in <module>
    getLinks("")
  File "c:\Users\Pichau\codigo\govbr\brasil\govfederal\govbr\teste2.py", line 20, in getLinks
    getLinks(newPage)
  File "c:\Users\Pichau\codigo\govbr\brasil\govfederal\govbr\teste2.py", line 20, in getLinks
    getLinks(newPage)
  File "c:\Users\Pichau\codigo\govbr\brasil\govfederal\govbr\teste2.py", line 20, in getLinks
    getLinks(newPage)
  [Previous line repeated 4 more times]
  File "c:\Users\Pichau\codigo\govbr\brasil\govfederal\govbr\teste2.py", line 10, in getLinks
    html = urlopen("https://www.gov.br/pt-br/"+pageUrl)
  File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 214, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 523, in open
    response = meth(req, response)
  File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 632, in http_response
    response = self.parent.error(
  File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 561, in error
    return self._call_chain(*args)
  File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 494, in _call_chain
    result = func(*args)
  File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 641, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
PS C:\Users\Pichau\codigo\govbr>

I tried using only the main link and it works fine, but as soon as I add the pageUrl variable to the URL, it gives me this error. How can I fix it?

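From what I can see, you are right: the page is there... for those of us using a browser. My assumption is that some basic anti-bot mechanism is in place that rejects uncommon User-Agents, or in other words only lets browsers view the page. However, since the User-Agent is a header we control, we can set it ourselves so that the request no longer fails with a 404 error.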

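I can't type out the code for it right now, but you will need to write some code that builds the request and changes the "User-Agent" header to a value like

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36

which is a desktop Chrome User-Agent string I copied from elsewhere. After changing the User-Agent header, you should be able to download the page successfully.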
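
A minimal sketch of the header change described in this answer, assuming the question's urllib.request is still used. The names build_request and fetch are hypothetical and not part of the question's code, and whether the site really filters on User-Agent remains the answer's assumption.

```python
from urllib.request import Request, urlopen

# Base URL taken from the question's code.
BASE_URL = "https://www.gov.br/pt-br/"

# Browser-like User-Agent string quoted in the answer.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
)


def build_request(page_url=""):
    """Build a Request for BASE_URL + page_url with a browser-like User-Agent.

    Hypothetical helper: it only constructs the request and sends nothing.
    """
    return Request(BASE_URL + page_url, headers={"User-Agent": BROWSER_USER_AGENT})


def fetch(page_url=""):
    """Download the page body as bytes (this performs a network call)."""
    with urlopen(build_request(page_url)) as response:
        return response.read()
```

In the question's getLinks, the call urlopen("https://www.gov.br/pt-br/"+pageUrl) would then become urlopen(build_request(pageUrl)); the rest of the function can stay as it is.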
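We can't help you without knowing what pageUrl contains. Please take some time to read the linked articles and revise your question. Following the tips in those articles will get you better results.
So I have changed the header now, thanks for clarifying that, but now I get a different error: urllib.error.URLError
Either the URL you typed is wrong or the server is down. If my answer helped you, make sure to upvote it and click the check mark button to its left!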
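
On the question of what pageUrl contains: the output printed in the question shows href values such as #wrapper, / and http://brasil.gov.br, each of which is appended directly to the base URL. The last request before the traceback would therefore have been for https://www.gov.br/pt-br/http://brasil.gov.br. Whether that is the actual cause of the 404 is an assumption, but resolving each href against the base before fetching it is a common precaution. A minimal sketch, where resolve_link is a hypothetical helper and not part of the question's code:

```python
from urllib.parse import urldefrag, urljoin, urlparse

# Base URL taken from the question's code.
BASE_URL = "https://www.gov.br/pt-br/"


def resolve_link(href, base=BASE_URL):
    """Resolve href against base and drop any #fragment.

    Returns None for links that point to a different host, so the
    crawler stays on the site it started from.
    """
    absolute, _fragment = urldefrag(urljoin(base, href))
    if urlparse(absolute).netloc != urlparse(base).netloc:
        return None
    return absolute


# The hrefs printed in the question's output:
for href in ["#wrapper", "/", "http://brasil.gov.br"]:
    print(href, "->", resolve_link(href))
```

With this approach, the crawler would store and fetch the resolved absolute URLs instead of the raw href values, and skip any link for which resolve_link returns None.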