Python web scraping with Selenium: finding elements in a modal box and downloading in a loop

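I want to download the file named "sprawozdanie merytoryczne" for 2017 for every organization, in a Python loop. To download a file manually, you have to visit the website, click the "Znajdź" button, and then click the organization's name; a modal box appears with the "sprawozdanie merytoryczne" links for that organization. I want to automate this for all organizations, but I am running into problems. On the first iteration of the loop everything works and the first file is downloaded. On the second iteration the modal window opens, but the script does not see "sprawozdanie merytoryczne" even though it is there. I think the problem is in switching between windows. I would be very grateful for any help. Here is my code: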

import urllib
import urllib.request
import requests
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
import re
import unicodecsv  # import whole module
import requests  # import whole module
from bs4 import BeautifulSoup  # import only things that we need
import time
import smtplib
from selenium import webdriver
chrome_path = r"C:\Users\username\AppData\Local\Programs\Python\Python35-32\Scripts\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("http://sprawozdaniaopp.mpips.gov.pl/")

rok = driver.find_element_by_xpath("//*[@id='instanceYear']")
rok.send_keys('2017') 

wojewodztwo = driver.find_element_by_xpath("//*[@id='Province']")
wojewodztwo.clear()
wojewodztwo.send_keys('MAZOWIECKIE')  
elem = driver.find_element_by_xpath("//*[@id='btnsearch']/span")
elem.click()
for i in range(1, 1348):
    winhandle = driver.current_window_handle
    p1 = r'#form1 > div > div.grid > table > tbody > tr:nth-child('
    p2 = ') > td:nth-child(3) > a'
    p3 = p1 + str(i) + p2
    elem1 = driver.find_element_by_css_selector(p3)
    p1 = r'#form1 > div > div.grid > table > tbody > tr:nth-child('
    p2 = ') > td:nth-child(5)'
    p3 = p1 + str(i) + p2
    miejscowosc = driver.find_element_by_css_selector(p3)
    print(miejscowosc.text) #miejscowosc means city
    miejscowosc1=miejscowosc.text
    p1 = r'#form1 > div > div.grid > table > tbody > tr:nth-child('
    p2 = ') > td:nth-child(4)'
    p3 = p1 + str(i) + p2
    wojewodztwo = driver.find_element_by_css_selector(p3)
    elem1.click()

    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".ui-dialog.ui-widget.ui-widget-content.ui-corner-all")))


    try:
        elem2 = driver.find_element_by_link_text("Sprawozdanie merytoryczne").click()
        organizationName = driver.find_elements_by_class_name("td1")
        orgname = str(organizationName[11].text)

        orgname1 = orgname.replace('"', "")
        print(organizationName[11].text)

        driver.switch_to.window(driver.window_handles[1])
        urltemp = driver.current_url
        urltodownload=  requests.get(urltemp)

        path1 = r'C:/Users/adunajsk/Desktop/pdf17maz/'
        path2 = '.pdf'
        path3 = path1 + orgname1 + path2
        print(path3)
        with open(path3, 'wb') as f:
                f.write(urltodownload.content)
        driver.close()

        del organizationName[:] 
    except NoSuchElementException:
        print("Plik nie istnieje")

    driver.switch_to.window(winhandle)

    WebDriverWait(driver, 8).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "body > div.ui-dialog.ui-widget.ui-widget-content.ui-corner-all > div.ui-dialog-titlebar.ui-widget-header.ui-corner-all.ui-helper-clearfix > a > span")))

    closebutton = driver.find_element_by_css_selector("body > div.ui-dialog.ui-widget.ui-widget-content.ui-corner-all > div.ui-dialog-titlebar.ui-widget-header.ui-corner-all.ui-helper-clearfix > a")
    closebutton.click()

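The problem is that once a modal dialog has been opened, it stays in the DOM even after you close it. When you open the second dialog, your locator still finds the first one and tries to click the link in it. You can also configure the driver to download the PDF directly instead of opening it.

Here is the code:

Note: I wrote and tested it in Java, so the Python code may contain syntax errors.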

    # Selenium 3 style calls (find_element_by_*, chrome_options), as in the question
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Set Chrome options to download the PDF instead of opening it in the browser;
    # this removes the need to handle windows and makes the script much faster.
    options = webdriver.ChromeOptions()
    downloadPath = r'C:\Users\username\Downloads'
    profile = {"plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}], "download.default_directory": downloadPath}
    options.add_experimental_option("prefs", profile)
    driver = webdriver.Chrome(r"C:\Users\username\AppData\Local\Programs\Python\Python35-32\Scripts\chromedriver.exe", chrome_options=options)

    driver.get("http://sprawozdaniaopp.mpips.gov.pl/")
    # The locator must be passed as a single (by, value) tuple
    WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.ID, 'Province'))).send_keys('MAZOWIECKIE')
    driver.find_element_by_id('instanceYear').send_keys('2017')
    driver.find_element_by_id('btnsearch').click()

    # After the search, wait for the table to load rows containing a cell with the text MAZOWIECKIE
    WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '//table[@class="webgrid"]/tbody//td[normalize-space(.)="MAZOWIECKIE"]')))

    # Get all rows and iterate through them, so the code does not depend on a hard-coded row count
    rows = driver.find_elements_by_css_selector('table.webgrid tbody tr')
    for row in rows:
        # Read the KRS number from the second column (.text is a property, not a method)
        krs = row.find_element_by_css_selector('td:nth-child(2)').text
        # Click the link in the Nazwa column
        row.find_element_by_css_selector('td:nth-child(3) a').click()
        # Find the modal box DIV that contains the KRS number of the clicked row.
        # As an option, you can get all modal boxes and pick the visible one.
        modalBoxLocator = "(//table[@id='tbldetails']//td[contains(.,'" + krs + "')]/ancestor::div[contains(@class,'ui-dialog')][2])[last()]"
        modalBox = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, modalBoxLocator)))
        # Find the TD with the text 2017, then click the first "Sprawozdanie merytoryczne" link after it
        modalBox.find_element_by_xpath('.//tr[./td[.="2017"]]/following-sibling::tr[.//a[.="Sprawozdanie merytoryczne"]][1]//a').click()
        # Close the modal box
        modalBox.find_element_by_css_selector('a.ui-dialog-titlebar-close').click()

        # Alternative: only click the close button if it exists
        #if len(modalBox.find_elements_by_css_selector('a.ui-dialog-titlebar-close')) > 0:
        #   modalBox.find_element_by_css_selector('a.ui-dialog-titlebar-close').click()
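
The answer's code comments mention, as an alternative, collecting all modal boxes and keeping the visible one. A minimal sketch of that idea could look like the following. Here `visible_dialog` is a hypothetical helper name, the `div.ui-dialog` selector is taken from the class names in the question, and the literal `"css selector"` is the string value of `By.CSS_SELECTOR`. It assumes that closed dialogs stay in the DOM but are no longer displayed, which has not been checked against the site.

```python
def visible_dialog(driver, selector="div.ui-dialog"):
    """Return the last displayed element matching `selector`, or None.

    `driver` is any object exposing find_elements(by, value), such as a
    Selenium WebDriver; the helper name and default selector are illustrative.
    """
    # "css selector" is the string behind By.CSS_SELECTOR
    dialogs = driver.find_elements("css selector", selector)
    # Closed dialogs are assumed to remain in the DOM but hidden
    shown = [dialog for dialog in dialogs if dialog.is_displayed()]
    return shown[-1] if shown else None
```

With a real driver, `modalBox = visible_dialog(driver)` could stand in for the KRS-based XPath lookup, ideally after an explicit wait so the dialog has had time to appear.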

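In the second run of the loop, driver.find_element_by_link_text("Sprawozdanie merytoryczne").click() does not complete successfully, although it works in the first run.

There are several "Sprawozdanie merytoryczne" links in the modal box. Do you need to click all of them or a specific one?

Where is your except block? Which line causes the exception?

@sers Only the one for 2017.

@Andersson except NoSuchElementException: print("Plik nie istnieje")  # means the file does not exist

Thank you for your reply. I started implementing it, but I got an error: modalBox = driver.find_element_by_xpath("(//table[@id='tbldetails']//td[contains(.,'" + krs + "')]/ancestor::div[contains(@class,'ui-dialog')][2])[last()]") TypeError: Can't convert 'WebElement' object to str implicitly

Change it to modalBox = driver.find_element_by_xpath("(//table[@id='tbldetails']//td[contains(.,'" + krs + "')]/ancestor::div[contains(@class,'ui-dialog')][2])[last()]")

After following your instructions and changing krs = row.find_element_by_css_selector('td:nth-child(2)').text() to krs = row.find_element_by_css_selector('td:nth-child(2)').text, I get: Message: no such element: Unable to locate element: {"method":"xpath","selector":"(//table[@id='tbldetails']//td[contains(.,'0000009458')]/ancestor::div[contains(@class,'ui-dialog')][2])[last()]"}

I added a wait. You can also use an implicit wait by adding driver.implicitly_wait(10).

Another problem occurred :( modalBox = WebDriverWait(driver, 10).until(EC.visibility_of_element_located(By.XPATH, "(//table[@id='tbldetails']//td[contains(.,'" + krs + "')]/ancestor::div[contains(@class,'ui-dialog')][2])[last()]")) TypeError: __init__() takes 2 positional arguments but 3 were given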
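
Both TypeErrors in this thread come down to argument types: the XPath was concatenated with a WebElement instead of its text, and By.XPATH and the expression were passed as two arguments where Selenium's expected conditions take one `(by, value)` tuple. One way to guard against both slips is to build the locator in a single place. The sketch below is only an illustration: `modal_box_locator` is a hypothetical helper name, the XPath is the one from the answer, and `"xpath"` is the string value of `By.XPATH`.

```python
def modal_box_locator(krs):
    """Build a (by, value) locator tuple for the dialog showing this KRS number."""
    if not isinstance(krs, str):
        # Catches passing the WebElement itself instead of its .text
        raise TypeError("krs must be a string, for example element.text")
    xpath = (
        "(//table[@id='tbldetails']//td[contains(.,'" + krs + "')]"
        "/ancestor::div[contains(@class,'ui-dialog')][2])[last()]"
    )
    # "xpath" is the string behind By.XPATH
    return ("xpath", xpath)
```

The tuple can then be passed as one argument, as in `WebDriverWait(driver, 10).until(EC.visibility_of_element_located(modal_box_locator(krs)))`.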