Scraping a site that requires JavaScript input with Python

python web-scraping

I am trying to scrape a website with the following Python code:

import re
import requests

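def get_csrf(page):
    # Pull the CSRF token out of the hidden form field in the login page HTML.
    # [^"]* stops at the closing quote, so it cannot run past the token.
    matchme = r'name="csrfToken" value="([^"]*)" /'
    csrf = re.search(matchme, str(page))
    csrf = csrf.group(1)
    return csrf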

def login():
    login_url = 'https://www.edline.net/InterstitialLogin.page'

    with requests.Session() as s:
        login_page = s.get(login_url)
        csrf = get_csrf(login_page.text)

        username = 'USER'
        password = 'PASS'

        login = {'screenName': username,
                 'kclq': password,
                 'csrfToken': csrf,
                 'TCNK':'authenticationEntryComponent',
                 'submitEvent':'1',
                 'enterClicked':'true',
                 'ajaxSupported':'yes'}
        page = s.post(login_url, data=login)
        r = s.get("https://www.edline.net/UserDocList.page?")
        print(r.text)

login()

This code logs in successfully, but it fails when it reaches these lines:

r = s.get("https://www.edline.net/UserDocList.page?")
print(r.text)

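Instead of printing the expected page, it returns an error. After further testing, I found that the same error appears even if you go to the page directly from a browser. That means the only way to reach the page is to run the code that executes when you click the button leading to it. When I looked at the page source, I found that the button linking to the page I am trying to scrape uses the following code: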

<a href="javascript:submitEvent('viewUserDocList', 'TCNK=headerComponent')" tabindex="-1">Private Reports</a>

So, essentially, I am looking for a way to trigger the JavaScript above from Python so that I can scrape the resulting page.

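Since the website uses JavaScript, you need something like Selenium, which drives a real browser to visit the page. The following code logs in to Edline the same way your other code does: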

from selenium import webdriver
from selenium.webdriver.common.by import By  # locator strategies (Selenium 4 API)

driver = webdriver.Firefox()  # any browser really
url = 'https://www.edline.net/InterstitialLogin.page'
driver.get(url)
username_text = driver.find_element(By.XPATH, '//*[@id="screenName"]')  # finds the username text box
username_text.send_keys('username')  # sends 'username' to the username text box
password_text = driver.find_element(By.XPATH, '//*[@id="kclq"]')  # finds the password text box
password_text.send_keys('password')  # sends 'password' to the password text box
# finds the submit button and clicks on it
driver.find_element(By.XPATH, '/html/body/form[3]/div/div[2]/div/div[1]/div[3]/button').click()

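Once you are logged in, you can get the full expected page. The Selenium documentation makes it easy to find out how to use it! Let me know if you have any other questions.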
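The answer stops at the login step, so here is a rough sketch of how the remaining step might look: after logging in, run the same submitEvent call that the "Private Reports" link runs and read the resulting HTML. The helper names (submit_event_script, fetch_private_reports), the CSS selector used to detect the post-login page, and the 10-second timeout are assumptions made for illustration and have not been checked against the live site. The headless flag is optional and only hides the Firefox window.

```python
import json

LOGIN_URL = "https://www.edline.net/InterstitialLogin.page"
# XPath of the submit button, copied from the login code above.
SUBMIT_XPATH = "/html/body/form[3]/div/div[2]/div/div[1]/div[3]/button"


def submit_event_script(event, params):
    """Build the JavaScript call that the "Private Reports" link runs."""
    # json.dumps yields valid, safely quoted JavaScript string literals.
    return "submitEvent({}, {});".format(json.dumps(event), json.dumps(params))


def fetch_private_reports(username, password, headless=True, timeout=10):
    """Log in, trigger the link's JavaScript, and return the resulting HTML."""
    # Selenium is imported here so the helper above works without it installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.FirefoxOptions()
    if headless:
        options.add_argument("-headless")  # run Firefox without a visible window
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(LOGIN_URL)
        driver.find_element(By.ID, "screenName").send_keys(username)
        driver.find_element(By.ID, "kclq").send_keys(password)
        driver.find_element(By.XPATH, SUBMIT_XPATH).click()

        # ASSUMPTION: the post-login page contains the link quoted in the question.
        wait = WebDriverWait(driver, timeout)
        link = wait.until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, "a[href*='viewUserDocList']")))
        driver.execute_script(
            submit_event_script("viewUserDocList", "TCNK=headerComponent"))
        # The old link goes stale once the browser has navigated away.
        wait.until(EC.staleness_of(link))
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling fetch_private_reports('USER', 'PASS') should then return the HTML of the reports page, provided the assumptions above hold for the real site.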

Use a browser automation tool such as Selenium, because it lets you interact with the page from Python the same way a user in a browser would.

In Chrome/Firefox, use DevTools to see which values and URL the browser uses when you click this button.

@furas What should I look at/for in DevTools?

The "Network" tab, where you can see all requests sent from the browser to the server. You can use the "Clear" button to remove all requests before you click the link on the page; after you click the link, you should then see only the requests it sent.

Is there a way to do the same thing without having it open a browser? Can I make it do this in the background?

You don't need to do it another way. You can hide the browser if needed.
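If the Network tab shows that the link simply submits a form, the same request could in principle be replayed with requests instead of a browser. The sketch below only splits the link's href into its parts and assembles a form payload. The helper names are hypothetical, and the form field that carries the event name is deliberately left as a parameter (event_field), because the question does not show it; it has to be read from the request recorded in DevTools. Reusing 'csrfToken' from the login form is likewise an assumption.

```python
import re
from urllib.parse import parse_qsl

# Matches hrefs such as:
#   javascript:submitEvent('viewUserDocList', 'TCNK=headerComponent')
HREF_RE = re.compile(r"javascript:submitEvent\('([^']*)'(?:\s*,\s*'([^']*)')?\)")


def parse_submit_event(href):
    """Split a javascript:submitEvent(...) href into (event, extra_fields)."""
    match = HREF_RE.match(href)
    if match is None:
        raise ValueError("not a submitEvent link: {!r}".format(href))
    event = match.group(1)
    query = match.group(2) or ""
    # The second argument looks like a query string, e.g. 'TCNK=headerComponent'.
    return event, dict(parse_qsl(query))


def build_payload(href, csrf_token, event_field):
    """Assemble form data for replaying the click.

    ASSUMPTION: event_field is the form field name seen in DevTools for the
    event value; 'csrfToken' is reused from the login form in the question.
    """
    event, extra_fields = parse_submit_event(href)
    payload = {event_field: event, "csrfToken": csrf_token}
    payload.update(extra_fields)
    return payload


def replay(session, url, payload):
    """POST the payload with an already logged-in requests.Session."""
    return session.post(url, data=payload)
```

For example, parse_submit_event applied to the href from the question yields the event name 'viewUserDocList' and the extra field TCNK=headerComponent. Which URL to post to, and whether the server accepts the request at all, still has to be confirmed against what DevTools records.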