Python Beautiful Soup scraping - login credentials not working
Trying to scrape a page using login credentials:
payload = {
'email': '*******@gmail.com',
'password': '***'
}
urls = []
login_url = 'https://www.spotrac.com/signin/'
url = 'https://www.spotrac.com/nba/contracts/breakdown/2010/'
webpage = requests.get(login_url, payload)
content = webpage.content
soup = BeautifulSoup(content)
a = soup.find('table',{'class':'datatable'})
urls.append(a)
This is the first time I've scraped a page that requires credentials, and I can't seem to figure out how to submit them correctly. I've also looked through several related answers.
I searched the source page for a csrf token but found nothing. I know every website has its own specific login flow when scraping; could someone look at this particular login site and see where I can improve this code?

The tricky part is that the site pairs its cookies with Google Analytics, and requests doesn't pick those up on its own; you won't get the values you would need to put in the headers. However, you can obtain those cookies by logging in with Selenium. Once you have the cookies that way, you can use them with the requests module to browse the pages.
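One way to hand the Selenium cookies over to requests is a small helper like the following (a sketch; the helper name and the example cookie values are mine, not from the original answer — in practice you would pass in `driver.get_cookies()`):

```python
import requests

def session_from_selenium_cookies(cookies):
    """Build a requests.Session from the list of dicts that
    Selenium's driver.get_cookies() returns."""
    s = requests.Session()
    for c in cookies:
        # Each Selenium cookie dict has at least 'name' and 'value';
        # domain/path are optional but worth carrying over when present
        s.cookies.set(c['name'], c['value'],
                      domain=c.get('domain'), path=c.get('path', '/'))
    return s

# Offline demo with a made-up cookie; really you'd use driver.get_cookies()
demo = [{'name': 'PHPSESSID', 'value': 'abc123',
         'domain': 'www.spotrac.com', 'path': '/'}]
s = session_from_selenium_cookies(demo)
print(s.cookies.get('PHPSESSID'))  # abc123
```

After this, `s.get(url)` sends the logged-in cookies with every request, so pages behind the login can be fetched without keeping the browser open.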
I haven't fully worked out how to handle the pop-up ads (so sometimes this works, and sometimes you need to run it again), but once you get past the initial login it seems to work. Since it has to visit every player's link, it takes 2-3 minutes or so to go through the full list of 375 players from 2010 to 2020:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import re
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome("C:/chromedriver_win32/chromedriver.exe")

# Use Selenium to login and get all cookies
loginURL = 'https://www.spotrac.com/signin/'
username = 'xxxxxx'
password = 'xxxxxx'

driver.get(loginURL)
try:
    # Wait for cookie message
    accept_cookie = WebDriverWait(driver, 5, 0.25).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, '.cookie-alert-accept')))
    accept_cookie.click()
    print("Cookies accepted")
except TimeoutException:
    print("no alert")

try:
    # Wait for pop-up ad
    popup = WebDriverWait(driver, 5, 0.25).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, '.cls-btn')))
    popup.click()
except TimeoutException:
    print("timed out")

time.sleep(5)
driver.find_element_by_name("email").send_keys(username)
driver.find_element_by_name("password").send_keys(password)
submit = WebDriverWait(driver, 100).until(
    EC.visibility_of_element_located((By.XPATH, '//*[@id="contactForm"]/div[2]/input')))
submit.click()
print('Logged in!')

# Now that the session is logged in, collect the player links season by season
playerDict = {}
for seas in range(2020, 2009, -1):
    print(seas)
    url = 'https://www.spotrac.com/nba/contracts/breakdown/%s/' % seas
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    players = soup.find_all('td', {'class': 'player'})
    for player in players:
        name = player.find('a').text
        link = player.find('a')['href']
        if name not in playerDict.keys():
            playerDict[name] = link

# Visit each player's page and pull out the contract tables
results = pd.DataFrame()
count = 1
for player, link in playerDict.items():
    driver.get(link)
    dfs = pd.read_html(driver.page_source)
    df = pd.DataFrame()
    for i, table in enumerate(dfs):
        # The contract summary tables are 2 columns x 5 rows
        if len(table.columns) == 2 and len(table) == 5:
            idx = i
            temp_df = table.T
            temp_df.columns = temp_df.iloc[0]
            temp_df = temp_df.rename(columns={'Average Salary:': 'Avg. Salary:',
                                              'Avg Salary:': 'Avg. Salary:'})
            try:
                # The table two positions earlier holds the contract's season span
                seasonContract = dfs[idx - 2].iloc[0, 0]
                year = re.findall(r"\d\d\d\d-\d\d\d\d", seasonContract)[0]
                seasonContract = year + ' ' + re.split(year, seasonContract)[-1]
            except Exception:
                seasonContract = 'Current Contract'
            temp_df['Player'] = player
            temp_df['Contract Years'] = seasonContract
            df = df.append(temp_df.iloc[1:], sort=False).reset_index(drop=True)
    results = results.append(df, sort=False).reset_index(drop=True)
    print('%03d of %d - %s data acquired...' % (count, len(playerDict), player))
    count += 1
driver.close()
Output:
print (results.head(25).to_string())
0 Contract: Signing Bonus: Avg. Salary: Signed Using: Free Agent: Player Contract Years
0 2 yr(s) / $48,500,000 - $24,250,000 Bird 2016 / UFA Kobe Bryant 2014-2015
1 3 yr(s) / $83,547,447 - $27,849,149 Bird 0 / Kobe Bryant 2011-2013
2 7 yr(s) / $136,434,375 - $19,490,625 NaN 2011 / UFA Kobe Bryant 2004-2010
3 5 yr(s) / $56,255,000 - $11,251,000 NaN 2004 / UFA Kobe Bryant 1999-2003
4 3 yr(s) / $3,501,240 - $1,167,080 NaN 0 / Kobe Bryant 1996-1998 Entry Level
5 2 yr(s) / $2,751,688 - $1,375,844 Minimum 2014 / UFA Rashard Lewis 2012-2013
6 1 yr(s) / $13,765,000 - $13,765,000 NaN 0 / Rashard Lewis 2012-2012
7 6 yr(s) / $118,200,000 - $19,700,000 NaN 2013 / UFA Rashard Lewis 2007-2012
8 4 yr(s) / $32,727,273 - $8,181,818 NaN 2007 / UFA Rashard Lewis 2003-2006
9 3 yr(s) / $14,567,141 - $4,855,714 NaN 2003 / UFA Rashard Lewis 2000-2002
10 2 yr(s) / $672,500 - $336,250 NaN 2000 / RFA Rashard Lewis 1998-1999 Entry Level
11 2 yr(s) / $10,850,000 - $5,425,000 Bird 2017 / UFA Tim Duncan 2015-2016
12 3 yr(s) / $30,361,446 - $10,120,482 Bird 2015 / UFA Tim Duncan 2012-2014
13 4 yr(s) / $40,000,000 - $10,000,000 NaN 2012 / UFA Tim Duncan 2010-2011
14 7 yr(s) / $122,007,706 - $17,429,672 NaN 2010 / UFA Tim Duncan 2003-2009
15 3 yr(s) / $31,902,500 - $10,634,167 NaN 2003 / UFA Tim Duncan 2000-2002
16 3 yr(s) / $10,239,080 - $3,413,027 NaN 2000 / UFA Tim Duncan 1997-1999 Entry Level
17 2 yr(s) / $16,500,000 - $8,250,000 Bird 2017 / UFA Kevin Garnett 2015-2016
18 3 yr(s) / $36,000,000 - $12,000,000 Bird 2015 / UFA Kevin Garnett 2012-2014
19 3 yr(s) / $51,300,000 - $17,100,000 NaN 2012 / UFA Kevin Garnett 2009-2011
20 5 yr(s) / $100,000,000 - $20,000,000 NaN 2009 / UFA Kevin Garnett 2004-2008
21 6 yr(s) / $126,016,300 - $21,002,717 NaN 0 / Kevin Garnett 1998-2003
22 3 yr(s) / $5,397,120 - $1,799,040 Rookie 0 / Kevin Garnett 1995-1997 Entry Level
23 1 yr(s) / $1,308,506 - $1,308,506 NaN 2012 / UFA Michael Redd Current Contract
24 6 yr(s) / $90,100,000 - $15,016,667 NaN 2011 / UFA Michael Redd 2005-2010
....
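A note on the accumulation pattern in the answer's code: `DataFrame.append` was removed in pandas 2.0, so on current pandas the same per-player accumulation is written by collecting frames in a list and concatenating once (a minimal sketch with made-up rows):

```python
import pandas as pd

frames = []
for i in range(3):
    # Each iteration stands in for one player's scraped contract table
    frames.append(pd.DataFrame({'Player': ['P%d' % i],
                                'Contract': ['%d yr(s)' % (i + 1)]}))

# DataFrame.append was removed in pandas 2.0; concat a list of frames instead
results = pd.concat(frames, sort=False).reset_index(drop=True)
print(len(results))  # 3
```

Besides working on pandas 2.x, this is also faster, since it avoids copying the growing `results` frame on every iteration.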
Hello, for logging in I do:
import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}

s = requests.Session()
try:
    url = 'https://empresa.computrabajo.com.mx/login'
    r = s.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    # The login page has a hidden input named __RequestVerificationToken
    # whose value must be posted back along with the credentials
    token = soup.find("input", attrs={'name': '__RequestVerificationToken'}).get('value')
    login_data = {
        '__RequestVerificationToken': token,
        'UserName': 'YOURUSERNAME',
        'Password': 'YOURPASS',
        'KeepMeLoggedIn': 'false'
    }
    r = s.post(url, data=login_data, headers=headers)
except Exception as getError:
    print('Error Login:', getError)
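The hidden-token extraction above can be checked offline against a minimal form (the HTML snippet below is made up, just to show the lookup):

```python
from bs4 import BeautifulSoup

html = ('<form><input type="hidden" '
        'name="__RequestVerificationToken" value="tok123"></form>')
soup = BeautifulSoup(html, 'html.parser')

# Same lookup as in the login code: find the hidden input by name
token = soup.find('input', attrs={'name': '__RequestVerificationToken'}).get('value')
print(token)  # tok123
```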
You can find plenty of guides on YouTube showing how to use the Network tool in Google Chrome to capture the headers and URL you need for the login page you're interested in.

Try requests.post instead of get.

From attempting a login I can see that the login URL should be https://www.spotrac.com/signin/submit/ and that it should be a POST request. You'll need to extract some kind of token or cookie from the response to perform any subsequent requests, though without a valid login I can't see further. @RobP thank you for the response. If I provided you with login credentials, do you think you could explore further? I think the simpler approach for you may be to use a tool like Selenium, which will execute any JavaScript for you and keep track of cookies. That is basically what's needed here, and your existing logic would work quickly with such a tool :) @AdamA, send me an email. I can help you. jason.schvach@gmail.com
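As a sketch of the suggested POST to the submit endpoint (the URL is the one inferred in the comments and is unverified; the credentials are placeholders), a prepared request shows the shape of the form submission without actually sending anything:

```python
import requests

login_url = 'https://www.spotrac.com/signin/submit/'  # inferred, unverified
payload = {'email': 'you@example.com', 'password': 'secret'}

# Prepare the request without sending it, just to inspect the POST shape
req = requests.Request('POST', login_url, data=payload).prepare()
print(req.method)  # POST
print(req.body)    # email=you%40example.com&password=secret
```

With a real account you would send it through a `requests.Session` instead, so that any cookies set by the response are kept for the follow-up page fetches.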