Python: trying to scrape URLs (Python, BeautifulSoup, screen scraping)

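I'm trying to get all the URLs from the free games page on SteamDB, but the result always comes back empty. I don't know what I'm doing wrong; the image below shows the path.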

import requests
from bs4 import BeautifulSoup

result = requests.get("https://steamdb.info/upcoming/free/")
src = result.content
soup = BeautifulSoup(src, 'lxml')

urls = []
for td_tag in soup.find_all('td'):
    a_tag = td_tag.find('a')
    urls.append(a_tag.attrs['href'])

print(urls)

You have to send a User-Agent header, and it can't be just the short form Mozilla/5.0; it has to be the full string from a real web browser.

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent":"Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
}

result = requests.get("https://steamdb.info/upcoming/free/", headers=headers)
soup = BeautifulSoup(result.content, 'lxml')

# print(result.content)  # uncomment to inspect the raw response
urls = []
for td_tag in soup.find_all('td'):
    a_tag = td_tag.find('a')
    if a_tag:
        urls.append(a_tag.attrs['href'])

print(urls)
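
To see why the header matters, it can help to look at what requests would send when no User-Agent is supplied. The following is a minimal sketch that only builds the request and never sends it, so no network access is needed. The names URL, FULL_UA, default_ua and prepared are illustrative and not part of the original answer.

```python
import requests

URL = "https://steamdb.info/upcoming/free/"
FULL_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"

# The User-Agent requests falls back to when none is given,
# of the form "python-requests/<version>"
default_ua = requests.utils.default_user_agent()

# Prepare (but do not send) the request through a Session, so the
# session's default headers and our custom header are merged
session = requests.Session()
prepared = session.prepare_request(
    requests.Request("GET", URL, headers={"User-Agent": FULL_UA})
)

print("default:", default_ua)
print("sent:   ", prepared.headers["User-Agent"])
```

A site that filters on this header can easily tell the default value apart from a browser string, which is presumably what happens here.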

It may be checking some headers, usually User-Agent. Or it may add the data with JavaScript, and requests/BeautifulSoup can't run JavaScript. First print result.content to see what you actually get; there may be a message about bots/scripts. You can also turn off JavaScript in your browser and reload the URL to see what you get.

Is there a better way to scrape URLs than Beautiful Soup?

Is there a specific way to get data only from the div? I.e. using find('div', {'id': 'live-promotions'}).find_all('a')?

BTW: you can also use CSS selectors: soup.select('div#live-promotions a'). If you use lxml instead of BeautifulSoup, you can use XPath, something like soup.xpath('//div[@id="live-promotions"]//a').

That only returns its own values, because live-promotions is separate from the game URLs. I had to use urls = soup.find('table', {'class': 'table-products table-responsive-flex table-hover text-left table-sortable'}).find_all('a'). Thank you very much.
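
The selector-based approach from the comments can be tried offline. The sketch below parses a small made-up HTML fragment and collects links only from inside the div whose id is live-promotions. The markup, the /app/ paths and the helper name links_in are assumptions for illustration, not SteamDB's real page. It uses the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the id mentioned in the comments
SAMPLE_HTML = """
<div id="live-promotions">
  <table>
    <tr><td><a href="/app/1/">Game one</a></td><td>no link here</td></tr>
    <tr><td><a href="/app/2/">Game two</a></td></tr>
  </table>
</div>
<div id="footer"><a href="/about/">About</a></div>
"""

def links_in(html, css_selector):
    """Return the href of every element matched by css_selector."""
    soup = BeautifulSoup(html, "html.parser")
    # .get() returns None for anchors without an href, which are skipped
    return [a.get("href") for a in soup.select(css_selector) if a.get("href")]

urls = links_in(SAMPLE_HTML, "div#live-promotions a")
print(urls)
```

Selecting on the container first means cells without a link never have to be special-cased, unlike the loop over every td.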