Python: scraping from a specific website has stopped working


python web-scraping


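A few weeks ago I wrote a program that successfully pulled some information from an online store. It has now stopped working, even though I have not changed the code.

Could the website itself have changed, or is something wrong with my code?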

import requests
from bs4 import BeautifulSoup

url = 'https://www.continente.pt/stores/continente/pt-pt/public/Pages/ProductDetail.aspx?ProductId=7104665(eCsf_RetekProductCatalog_MegastoreContinenteOnline_Continente)'

res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')

priceInfo = soup.find('div', class_='pricePerUnit').text

priceInfo = priceInfo.replace('\n', '').replace('\r', '').replace(' ', '')

productName = soup.find('div', class_='productTitle').text.replace('\n', ' ')

productInfo = (soup.find('div', class_='productSubtitle').text
               + ', ' + soup.find('div', class_='productSubsubtitle').text)

print('Nome do produto: ' + productName)
print('Detalhes: ' + productInfo)
print('Custo: ' + priceInfo)

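I know that the content I am searching for exists and that the URL is still valid, so where is the problem?

I split the priceInfo assignment across two lines because the error occurs in the first statement: soup.find returns None, and NoneType has no text attribute.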
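The error described above appears whenever soup.find matches nothing, because it then returns None. Below is a minimal sketch of a defensive lookup that reports the missing element instead of crashing. It runs against an inline HTML string rather than the live page; the helper name text_or_none and the sample markup are illustrative assumptions, not part of the original program.

```python
from bs4 import BeautifulSoup


def text_or_none(soup, tag, css_class):
    """Return the stripped text of the first matching element, or None."""
    element = soup.find(tag, class_=css_class)
    if element is None:
        return None
    return element.get_text(strip=True)


# Hypothetical markup standing in for a downloaded page.
sample_html = '<div class="productTitle"> Arroz Agulha </div>'
soup = BeautifulSoup(sample_html, 'html.parser')

for css_class in ('productTitle', 'pricePerUnit'):
    value = text_or_none(soup, 'div', css_class)
    if value is None:
        # The element is absent from the HTML that was actually received.
        print('No <div class="' + css_class + '"> in the response')
    else:
        print(css_class + ': ' + value)
```

When the helper reports a missing element on the real page, printing res.status_code and searching res.text for the class name shows whether the server sent different HTML from what the browser displays.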

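The solution takes several steps.

Open the page you want to scrape once in Firefox.

Use the browser_cookie3 library to extract the cookies.

Make sure they have not expired.

Pass the cookies to requests.get: requests.get(url, cookies=browser_cookie3.firefox())

Send headers like the ones below.

Hope it works! Happy scraping.

I tried this myself and it worked.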

headers = {
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Accept-Language': 'en-US,en;q=0.9,de;q=0.8',
}
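The step "make sure they have not expired" can be checked in code. The sketch below assumes that browser_cookie3.firefox() returns a standard http.cookiejar.CookieJar, which is my understanding of that library. To stay self-contained it builds a small jar by hand instead of reading the Firefox profile; the helper names live_cookies and make_cookie and the cookie values are hypothetical.

```python
import time
from http.cookiejar import Cookie, CookieJar


def live_cookies(jar, domain_part):
    """Return unexpired cookies whose domain contains domain_part."""
    return [
        cookie for cookie in jar
        if domain_part in cookie.domain and not cookie.is_expired()
    ]


def make_cookie(name, value, domain, expires):
    """Build a cookie by hand; only needed to keep this demo self-contained."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith('.'),
        path='/', path_specified=True,
        secure=True, expires=expires, discard=False,
        comment=None, comment_url=None, rest={},
    )


# Stand-in for the jar that would come from the browser (assumption).
jar = CookieJar()
now = int(time.time())
jar.set_cookie(make_cookie('session', 'abc', '.continente.pt', now + 3600))
jar.set_cookie(make_cookie('old', 'xyz', '.continente.pt', now - 3600))

# Print only the cookies that are still valid for the target site.
for cookie in live_cookies(jar, 'continente.pt'):
    print(cookie.name, cookie.domain, cookie.expires)
```

If the filtered list is empty for the real jar, the site was probably not opened in Firefox recently, or the cookies were stored under a different profile.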

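1) A website can very easily block a scraper based on its user agent. 2) BeautifulSoup does not run JavaScript, so it cannot see content that a site renders dynamically. Most websites publish a robots.txt file that you can read to find out what may be scraped.

It could also mean the site changed its layout, so the element you are scraping is no longer where it was when you wrote the code.

Thanks for the answer, but I still get the same error. The only line I changed is

res = requests.get(url, cookies=browser_cookie3.firefox(), headers=headers)

with headers set to the ones you provided. Is there anything else I need to do about the cookies?

Did you open the site in Firefox and extract the cookies? Print the CookieJar and inspect it.

I have opened the site, but how do I extract the cookies? Sorry if this is obvious.

You can simply extract the cookies as suggested.

Thanks for the reply, I'll try your suggestion.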
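The robots.txt point raised in the comments can be checked with the standard library. This is a rough sketch that parses a made-up robots.txt from a string; the rules, the example.com host, and the my-scraper user agent are all assumptions for illustration, not the real rules of any site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one would be fetched from the site.
robots_txt = """\
User-agent: *
Disallow: /checkout/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether a given user agent may fetch each path.
for path in ('/produto/123', '/checkout/cart'):
    allowed = parser.can_fetch('my-scraper', 'https://example.com' + path)
    print(path, 'allowed' if allowed else 'disallowed')
```

For a live site, calling set_url with the robots.txt address and then read() downloads the file instead of parsing a string.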