Web scraping with Python and Selenium

python selenium

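I'm new to Stack Overflow and have been learning Python for a few months. I'm writing a script that logs in to a website (I'm a subscriber) and scrapes article titles and text.

So far I've been able to log in, reach the page that lists the article titles, and pull the titles from the first page. However, I'm having trouble navigating through the remaining pages.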

from selenium import webdriver

chrome_path = r"C:\Users\user.name\Desktop\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)

driver.get("http://www.WEBSITE.co.uk/")
driver.find_element_by_name("ctl00$LoginView1$Login1$UserName").send_keys('USERNAME')  # Enters username
driver.find_element_by_name("ctl00$LoginView1$Login1$Password").send_keys('PASSWORD')  # Enters password
driver.find_element_by_name("ctl00$LoginView1$Login1$Submit").click()  # Submits username/password
driver.find_element_by_xpath('//*[@id="middle_col"]/div[2]/div[1]/a[1]').click()  # Clicks on more articles


def title_scraper(max_pages):  # A loop to cycle through xpaths of various pages (?)
    page = 2  # Set at 2 for test circa 40 in total
    while page < max_pages:
        newPage = '//*[@id="ctl00_mainContentArea_ArticleListing1_gvwArticles"]/tbody/tr[11]/td/table/tbody/tr/td[' + str(page) + ']/a'  # xpath = //*[@id="ctl00_mainContentArea_ArticleListing1_gvwArticles"]/tbody/tr[11]/td/table/tbody/tr/td[1]/a - it is td[1] which increases depending on page number

driver.find_element_by_xpath(newPage).click()  # Scrapes article titles, currently only does the first page

titles = driver.find_elements_by_class_name("articletitle")
for title in titles:
    print(title.text)

Apologies if this has already been answered; I haven't had any luck with online resources so far.

Update:

def title_scraper(max_pages):
    page = 2
    while page < max_pages:
        path = '//*[@id="ctl00_mainContentArea_ArticleListing1_gvwArticles"]/tbody/tr[11]/td/table/tbody/tr/td[' + str(
            max_pages) + ']/a'
        driver.find_element_by_xpath(path)

    titles = driver.find_elements_by_class_name("articletitle")
    for title in titles:
        print(title.text)
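
One thing worth noting about the updated function: page is never incremented inside the while loop, and the XPath is built from max_pages rather than page, so the loop would keep locating the same link (without clicking it) and never finish. A minimal sketch of just the XPath-building part follows. It assumes that the pager cell td[n] links to page n, as the comment in the original code suggests, and that max_pages is meant to be inclusive; the names PAGER_XPATH and pager_xpaths are hypothetical.

```python
# Hypothetical helper: builds one pager-link XPath per page number.
# Assumes td[n] in the pager row links to page n, as described in the question.
PAGER_XPATH = (
    '//*[@id="ctl00_mainContentArea_ArticleListing1_gvwArticles"]'
    '/tbody/tr[11]/td/table/tbody/tr/td[{page}]/a'
)


def pager_xpaths(max_pages, first_page=2):
    """Return the XPaths for pages first_page..max_pages (inclusive)."""
    # range() advances the page number itself, so the loop always terminates.
    return [
        PAGER_XPATH.format(page=page)
        for page in range(first_page, max_pages + 1)
    ]


if __name__ == "__main__":
    # Show the links that would be clicked for a four-page listing.
    for xpath in pager_xpaths(4):
        print(xpath)
```

Each returned string differs only in the td[...] index, which is the part that has to change from one iteration to the next.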

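What exactly is wrong with your loop? Are you not redirected to the next page, or can you not get values from the following pages...?

Thanks Andersson. I thought my "# A loop to cycle through xpaths of various pages (?)" would redirect to each page and the rest of the code would scrape those pages? Python is the first language I've started learning, and I've only been using selenium for the past few days.

Is the indentation in the code sample correct? Or should the bottom code be indented inside the title_scraper function? Also, you never call the function, and you reference a local variable of that function (newPage) outside it, unless that is an indentation error...

Thanks. I've indented the bottom code inside the title_scraper function, and PyCharm shows no errors now... However, when I run it, it makes no difference. How would I call this function? I thought I had (?)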
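
On the question of how to call the function: def title_scraper(...) only defines it, and nothing inside runs until a separate line such as title_scraper(driver, 5) is executed. The sketch below is one possible arrangement, not a tested fix for this particular site. It assumes driver is the already logged-in WebDriver from the question, that td[n] in the pager links to page n, and that max_pages is inclusive. The helper print_titles and the driver parameter are hypothetical additions; the strings "xpath" and "class name" are the values of Selenium's By.XPATH and By.CLASS_NAME.

```python
# Sketch only: assumes `driver` is an already logged-in Selenium WebDriver
# sitting on the first article-listing page.
XPATH = "xpath"            # same value as selenium's By.XPATH
CLASS_NAME = "class name"  # same value as selenium's By.CLASS_NAME

PAGER_LINK = (
    '//*[@id="ctl00_mainContentArea_ArticleListing1_gvwArticles"]'
    '/tbody/tr[11]/td/table/tbody/tr/td[{page}]/a'
)


def print_titles(driver):
    """Print and return the article titles on the page currently shown."""
    elements = driver.find_elements(CLASS_NAME, "articletitle")
    titles = [element.text for element in elements]
    for title in titles:
        print(title)
    return titles


def title_scraper(driver, max_pages):
    """Collect titles from page 1 through max_pages (assumed inclusive)."""
    all_titles = print_titles(driver)  # page 1 is already open
    page = 2
    while page <= max_pages:
        # Click the pager link for this page, then scrape what it shows.
        driver.find_element(XPATH, PAGER_LINK.format(page=page)).click()
        all_titles.extend(print_titles(driver))
        page += 1  # without this the loop never ends
    return all_titles


# The call is what actually runs the function, for example:
# title_scraper(driver, 5)
```

Depending on how quickly the site reloads after each click, an explicit wait before reading the titles may also be needed.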