Python Beautiful Soup scraping - login credentials not working
Trying to scrape a page using login credentials:
payload = {
'email': '*******@gmail.com',
'password': '***'
}
urls = []
login_url = 'https://www.spotrac.com/signin/'
url = 'https://www.spotrac.com/nba/contracts/breakdown/2010/'
webpage = requests.get(login_url, payload)
content = webpage.content
soup = BeautifulSoup(content)
a = soup.find('table',{'class':'datatable'})
urls.append(a)
This is the first time I've scraped a page that requires credentials, and I can't seem to figure out how to submit them correctly. I've also looked through several related answers.
I searched the source page for a csrf token but found nothing. I know every website has its own specific login flow when scraping; could someone look at this particular login site and see where I can improve this code?

The tricky part is that the site pairs its cookies with Google Analytics, and requests doesn't pick those up on its own; you won't get the values you would need to put in the headers. However, you can obtain those cookies by logging in with Selenium. Once you have the cookies that way, you can use them with the requests module to browse the pages.
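One way to hand the Selenium cookies over to requests is a small helper like the following (a sketch; the helper name and the example cookie values are mine, not from the original answer — in practice you would pass in `driver.get_cookies()`):

```python
import requests

def session_from_selenium_cookies(cookies):
    """Build a requests.Session from the list of dicts that
    Selenium's driver.get_cookies() returns."""
    s = requests.Session()
    for c in cookies:
        # Each Selenium cookie dict has at least 'name' and 'value';
        # domain/path are optional but worth carrying over when present
        s.cookies.set(c['name'], c['value'],
                      domain=c.get('domain'), path=c.get('path', '/'))
    return s

# Offline demo with a made-up cookie; really you'd use driver.get_cookies()
demo = [{'name': 'PHPSESSID', 'value': 'abc123',
         'domain': 'www.spotrac.com', 'path': '/'}]
s = session_from_selenium_cookies(demo)
print(s.cookies.get('PHPSESSID'))  # abc123
```

After this, `s.get(url)` sends the logged-in cookies with every request, so pages behind the login can be fetched without keeping the browser open.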
I haven't fully worked out how to handle the pop-up ads (so sometimes this works, and sometimes you need to run it again), but once you get past the initial login it seems to work. Since it has to visit every player's link, it takes 2-3 minutes or so to go through the full list of 375 players from 2010 to 2020:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import re
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome("C:/chromedriver_win32/chromedriver.exe")

# Use Selenium to login and get all cookies
loginURL = 'https://www.spotrac.com/signin/'
username = 'xxxxxx'
password = 'xxxxxx'

driver.get(loginURL)
try:
    # Wait for cookie message
    accept_cookie = WebDriverWait(driver, 5, 0.25).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, '.cookie-alert-accept')))
    accept_cookie.click()
    print("Cookies accepted")
except TimeoutException:
    print("no alert")

try:
    # Wait for pop-up ad
    popup = WebDriverWait(driver, 5, 0.25).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, '.cls-btn')))
    popup.click()
except TimeoutException:
    print("timed out")

time.sleep(5)
driver.find_element_by_name("email").send_keys(username)
driver.find_element_by_name("password").send_keys(password)
submit = WebDriverWait(driver, 100).until(
    EC.visibility_of_element_located((By.XPATH, '//*[@id="contactForm"]/div[2]/input')))
submit.click()
print('Logged in!')

# Now that the session is logged in, collect the player links season by season
playerDict = {}
for seas in range(2020, 2009, -1):
    print(seas)
    url = 'https://www.spotrac.com/nba/contracts/breakdown/%s/' % seas
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    players = soup.find_all('td', {'class': 'player'})
    for player in players:
        name = player.find('a').text
        link = player.find('a')['href']
        if name not in playerDict.keys():
            playerDict[name] = link

# Visit each player's page and pull out the contract tables
results = pd.DataFrame()
count = 1
for player, link in playerDict.items():
    driver.get(link)
    dfs = pd.read_html(driver.page_source)
    df = pd.DataFrame()
    for i, table in enumerate(dfs):
        # The contract summary tables are 2 columns x 5 rows
        if len(table.columns) == 2 and len(table) == 5:
            idx = i
            temp_df = table.T
            temp_df.columns = temp_df.iloc[0]
            temp_df = temp_df.rename(columns={'Average Salary:': 'Avg. Salary:',
                                              'Avg Salary:': 'Avg. Salary:'})
            try:
                # The table two positions earlier holds the contract's season span
                seasonContract = dfs[idx - 2].iloc[0, 0]
                year = re.findall(r"\d\d\d\d-\d\d\d\d", seasonContract)[0]
                seasonContract = year + ' ' + re.split(year, seasonContract)[-1]
            except Exception:
                seasonContract = 'Current Contract'
            temp_df['Player'] = player
            temp_df['Contract Years'] = seasonContract
            df = df.append(temp_df.iloc[1:], sort=False).reset_index(drop=True)
    results = results.append(df, sort=False).reset_index(drop=True)
    print('%03d of %d - %s data acquired...' % (count, len(playerDict), player))
    count += 1
driver.close()
Output:
print (results.head(25).to_string())
0 Contract: Signing Bonus: Avg. Salary: Signed Using: Free Agent: Player Contract Years
0 2 yr(s) / $48,500,000 - $24,250,000 Bird 2016 / UFA Kobe Bryant 2014-2015
1 3 yr(s) / $83,547,447 - $27,849,149 Bird 0 / Kobe Bryant 2011-2013
2 7 yr(s) / $136,434,375 - $19,490,625 NaN 2011 / UFA Kobe Bryant 2004-2010
3 5 yr(s) / $56,255,000 - $11,251,000 NaN 2004 / UFA Kobe Bryant 1999-2003
4 3 yr(s) / $3,501,240 - $1,167,080 NaN 0 / Kobe Bryant 1996-1998 Entry Level
5 2 yr(s) / $2,751,688 - $1,375,844 Minimum 2014 / UFA Rashard Lewis 2012-2013
6 1 yr(s) / $13,765,000 - $13,765,000 NaN 0 / Rashard Lewis 2012-2012
7 6 yr(s) / $118,200,000 - $19,700,000 NaN 2013 / UFA Rashard Lewis 2007-2012
8 4 yr(s) / $32,727,273 - $8,181,818 NaN 2007 / UFA Rashard Lewis 2003-2006
9 3 yr(s) / $14,567,141 - $4,855,714 NaN 2003 / UFA Rashard Lewis 2000-2002
10 2 yr(s) / $672,500 - $336,250 NaN 2000 / RFA Rashard Lewis 1998-1999 Entry Level
11 2 yr(s) / $10,850,000 - $5,425,000 Bird 2017 / UFA Tim Duncan 2015-2016
12 3 yr(s) / $30,361,446 - $10,120,482 Bird 2015 / UFA Tim Duncan 2012-2014
13 4 yr(s) / $40,000,000 - $10,000,000 NaN 2012 / UFA Tim Duncan 2010-2011
14 7 yr(s) / $122,007,706 - $17,429,672 NaN 2010 / UFA Tim Duncan 2003-2009
15 3 yr(s) / $31,902,500 - $10,634,167 NaN 2003 / UFA Tim Duncan 2000-2002
16 3 yr(s) / $10,239,080 - $3,413,027 NaN 2000 / UFA Tim Duncan 1997-1999 Entry Level
17 2 yr(s) / $16,500,000 - $8,250,000 Bird 2017 / UFA Kevin Garnett 2015-2016
18 3 yr(s) / $36,000,000 - $12,000,000 Bird 2015 / UFA Kevin Garnett 2012-2014
19 3 yr(s) / $51,300,000 - $17,100,000 NaN 2012 / UFA Kevin Garnett 2009-2011
20 5 yr(s) / $100,000,000 - $20,000,000 NaN 2009 / UFA Kevin Garnett 2004-2008
21 6 yr(s) / $126,016,300 - $21,002,717 NaN 0 / Kevin Garnett 1998-2003
22 3 yr(s) / $5,397,120 - $1,799,040 Rookie 0 / Kevin Garnett 1995-1997 Entry Level
23 1 yr(s) / $1,308,506 - $1,308,506 NaN 2012 / UFA Michael Redd Current Contract
24 6 yr(s) / $90,100,000 - $15,016,667 NaN 2011 / UFA Michael Redd 2005-2010
....
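A note on the accumulation pattern in the answer's code: `DataFrame.append` was removed in pandas 2.0, so on current pandas the same per-player accumulation is written by collecting frames in a list and concatenating once (a minimal sketch with made-up rows):

```python
import pandas as pd

frames = []
for i in range(3):
    # Each iteration stands in for one player's scraped contract table
    frames.append(pd.DataFrame({'Player': ['P%d' % i],
                                'Contract': ['%d yr(s)' % (i + 1)]}))

# DataFrame.append was removed in pandas 2.0; concat a list of frames instead
results = pd.concat(frames, sort=False).reset_index(drop=True)
print(len(results))  # 3
```

Besides working on pandas 2.x, this is also faster, since it avoids copying the growing `results` frame on every iteration.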
Hello, for logging in I do:
import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}

s = requests.Session()
try:
    url = 'https://empresa.computrabajo.com.mx/login'
    r = s.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    # The login page has a hidden input named __RequestVerificationToken
    # whose value must be posted back along with the credentials
    token = soup.find("input", attrs={'name': '__RequestVerificationToken'}).get('value')
    login_data = {
        '__RequestVerificationToken': token,
        'UserName': 'YOURUSERNAME',
        'Password': 'YOURPASS',
        'KeepMeLoggedIn': 'false'
    }
    r = s.post(url, data=login_data, headers=headers)
except Exception as getError:
    print('Error Login:', getError)
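The hidden-token extraction above can be checked offline against a minimal form (the HTML snippet below is made up, just to show the lookup):

```python
from bs4 import BeautifulSoup

html = ('<form><input type="hidden" '
        'name="__RequestVerificationToken" value="tok123"></form>')
soup = BeautifulSoup(html, 'html.parser')

# Same lookup as in the login code: find the hidden input by name
token = soup.find('input', attrs={'name': '__RequestVerificationToken'}).get('value')
print(token)  # tok123
```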
You can find plenty of guides on YouTube showing how to use the Network tool in Google Chrome to capture the headers and URL you need for the login page you're interested in.

Try requests.post instead of get.

From attempting a login I can see that the login URL should be https://www.spotrac.com/signin/submit/ and that it should be a POST request. You'll need to extract some kind of token or cookie from the response to perform any subsequent requests, though without a valid login I can't see further. @RobP thank you for the response. If I provided you with login credentials, do you think you could explore further? I think the simpler approach for you may be to use a tool like Selenium, which will execute any JavaScript for you and keep track of cookies. That is basically what's needed here, and your existing logic would work quickly with such a tool :) @AdamA, send me an email. I can help you. jason.schvach@gmail.com
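As a sketch of the suggested POST to the submit endpoint (the URL is the one inferred in the comments and is unverified; the credentials are placeholders), a prepared request shows the shape of the form submission without actually sending anything:

```python
import requests

login_url = 'https://www.spotrac.com/signin/submit/'  # inferred, unverified
payload = {'email': 'you@example.com', 'password': 'secret'}

# Prepare the request without sending it, just to inspect the POST shape
req = requests.Request('POST', login_url, data=payload).prepare()
print(req.method)  # POST
print(req.body)    # email=you%40example.com&password=secret
```

With a real account you would send it through a `requests.Session` instead, so that any cookies set by the response are kept for the follow-up page fetches.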