Selenium in Python skips articles when trying to scrape data


python-3.x selenium selenium-webdriver xpath web-scraping


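I'm trying to extract data from articles using Selenium in Python. The code identifies the articles, but when the loop runs it randomly skips some of them. Any help resolving this would be appreciated.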

#Importing libraries
import requests
import os
import json
from selenium import webdriver
import pandas as pd
from bs4 import BeautifulSoup  
import time
import requests
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
import traceback
from webdriver_manager.chrome import ChromeDriverManager  

#opening a chrome instance
options = webdriver.ChromeOptions() 
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

driver = webdriver.Chrome(options=options, executable_path=r"C:/selenium/chromedriver.exe")

#getting into the website
driver.get('https://academic.oup.com/rof/issue/2/2')

#getting the articles
articles = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, '/html/body/div[3]/main/section/div/div/div[1]/div/div[3]/div[2]/div[3]/div/div/div/div/h5')))

#loop to get in and out of articles
for article in articles:
    try:
        ActionChains(driver).key_down(Keys.CONTROL).click(article).key_up(Keys.CONTROL).perform()
        WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
        window1 = driver.window_handles[1]
        driver.switch_to_window(window1)
        driver.close()
        driver.switch_to_window(window0)
    except:
        print("couldnt get the article")

First, to collect all the article elements, you can use this CSS selector:

articles = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '.customLink.item-title a')))

Second, this is the wrong method:

driver.switch_to_window(window1)

It should be:

driver.switch_to.window(window1)

Note the difference above between switch_to_window and switch_to.window.

Third, you forgot to initialize the window0 variable:

window0 = driver.window_handles[0]
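The second and third points can be combined into one small helper. This is only a sketch: the name `opened_window` is hypothetical, and it assumes exactly one extra window is open besides the original. It uses only `driver.window_handles`, `driver.switch_to.window` and `driver.close`.

```python
from contextlib import contextmanager


@contextmanager
def opened_window(driver, original_handle):
    """Switch to the window that is not `original_handle`; on exit,
    close it and return focus to the original window."""
    # Pick the handle of the newly opened window (assumes one extra window).
    new_handle = next(h for h in driver.window_handles if h != original_handle)
    driver.switch_to.window(new_handle)
    try:
        yield new_handle
    finally:
        # Close the new window and restore focus, even if the body raised.
        driver.close()
        driver.switch_to.window(original_handle)


# Possible usage with a real driver (not executed here):
# window0 = driver.current_window_handle  # remember the listing page once
# with opened_window(driver, window0):
#     print(driver.title)
```

Because the original handle is stored before the loop starts, it cannot be left undefined when it is time to switch back.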

Finally, try the following code:

#getting into the website
driver.get('https://academic.oup.com/rof/issue/2/2')

#getting the articles
articles = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '.customLink.item-title a')))

#loop to get in and out of articles
for article in articles:
    try:
        ActionChains(driver).key_down(Keys.CONTROL).click(article).key_up(Keys.CONTROL).perform()
        WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
        window1 = driver.window_handles[1]
        driver.switch_to.window(window1)
        driver.close()
        window0 = driver.window_handles[0]
        driver.switch_to.window(window0)
    except:
        print("couldnt get the article")

driver.quit()
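As a possible alternative to Ctrl-clicking each element and juggling windows, the link targets could be read first and then visited one at a time with `driver.get`. This is a sketch and not part of the original answer: `collect_hrefs` is a hypothetical helper, and it assumes the matched elements are `a` tags with an `href` attribute, as the CSS selector above suggests.

```python
def collect_hrefs(elements):
    """Return the unique, non-empty href values of WebElement-like
    objects, preserving page order."""
    hrefs = []
    for element in elements:
        # WebElement.get_attribute returns None when the attribute is absent.
        href = element.get_attribute("href")
        if href and href not in hrefs:
            hrefs.append(href)
    return hrefs


# Possible usage with a real driver (not executed here):
# links = collect_hrefs(articles)  # `articles` from the WebDriverWait call
# for link in links:
#     driver.get(link)
#     ...  # extract the data for this article
```

Reading the URLs up front means the loop no longer depends on element references from the listing page or on window handles.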

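This may seem a bit simple, but have you tried increasing the wait time for each click? 10 seconds may not be enough to open an article. There isn't much wrong with your code. It may be worth shortening the XPath selector: (By.XPATH, '//h5[@class="customLink item-title"]') is a little cleaner. Having looked at the selector, are you sure the h5 is what you want to click, rather than the link that is its direct child?

@AaronS I tried increasing the wait time, but it didn't work. Yes, using the XPath you suggested makes the code look a bit cleaner. Thanks.

Thank you. I'm an amateur at scraping, so could you spare a few seconds to explain the difference between switch_to_window and switch_to.window?

@VenuBhaskar Glad it worked. Actually, so far I haven't found a switch_to_window method in Selenium for Python. I think that's the wrong method.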
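One reason the skipped articles were hard to diagnose is that a bare `except:` hides the actual error. In the question's code, for example, `window0` is used before it is assigned, which raises a NameError. A minimal sketch of a loop that records what went wrong is shown here. The names `visit_each` and `visit` are hypothetical, and `visit` stands in for whatever is done per article.

```python
import traceback


def visit_each(items, visit):
    """Call `visit(item)` for every item and return a list of
    (index, exception type name) pairs for the items that failed."""
    failures = []
    for index, item in enumerate(items):
        try:
            visit(item)
        except Exception as exc:
            # Print the full traceback instead of a generic message,
            # so the real cause of a skipped item is visible.
            traceback.print_exc()
            failures.append((index, type(exc).__name__))
    return failures
```

The returned list shows which items were skipped and why, which is more informative than printing the same message for every failure.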