Python web scraping with Selenium: finding an element in a modal box and downloading in a loop
I want to download, in a Python loop, the 2017 file named "sprawozdanie merytoryczne" for every organization. To download a file manually, you visit the website http://sprawozdaniaopp.mpips.gov.pl/, click the "Znajdź" button, and then click an organization's name. A modal box appears with the "sprawozdanie merytoryczne" link for that organization. I want to automate this for all organizations, but I have run into a problem. On the first pass through the loop everything works and the first file is downloaded. On the second pass the modal window opens, but the script does not see "sprawozdanie merytoryczne" even though the link is there. I think the problem is with switching between windows. I would be very grateful for any help. Here is my code:
import urllib
import urllib.request
import requests
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
import re
import unicodecsv # import whole module
import requests # import whole module
from bs4 import BeautifulSoup # import only things that we need
import time
import smtplib
from selenium import webdriver
chrome_path = r"C:\Users\username\AppData\Local\Programs\Python\Python35-32\Scripts\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("http://sprawozdaniaopp.mpips.gov.pl/")
rok = driver.find_element_by_xpath("//*[@id='instanceYear']")
rok.send_keys('2017')
wojewodztwo = driver.find_element_by_xpath("//*[@id='Province']")
wojewodztwo.clear()
wojewodztwo.send_keys('MAZOWIECKIE')
elem = driver.find_element_by_xpath("//*[@id='btnsearch']/span")
elem.click()
for i in range(1, 1348):
    winhandle = driver.current_window_handle
    p1 = r'#form1 > div > div.grid > table > tbody > tr:nth-child('
    p2 = ') > td:nth-child(3) > a'
    p3 = p1 + str(i) + p2
    elem1 = driver.find_element_by_css_selector(p3)
    p1 = r'#form1 > div > div.grid > table > tbody > tr:nth-child('
    p2 = ') > td:nth-child(5)'
    p3 = p1 + str(i) + p2
    miejscowosc = driver.find_element_by_css_selector(p3)
    print(miejscowosc.text)  # miejscowosc means city
    miejscowosc1 = miejscowosc.text
    p1 = r'#form1 > div > div.grid > table > tbody > tr:nth-child('
    p2 = ') > td:nth-child(4)'
    p3 = p1 + str(i) + p2
    wojewodztwo = driver.find_element_by_css_selector(p3)
    elem1.click()
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".ui-dialog.ui-widget.ui-widget-content.ui-corner-all")))
    try:
        elem2 = driver.find_element_by_link_text("Sprawozdanie merytoryczne").click()
        organizationName = driver.find_elements_by_class_name("td1")
        orgname = str(organizationName[11].text)
        orgname1 = orgname.replace('"', "")
        print(organizationName[11].text)
        driver.switch_to.window(driver.window_handles[1])
        urltemp = driver.current_url
        urltodownload = requests.get(urltemp)
        path1 = r'C:/Users/adunajsk/Desktop/pdf17maz/'
        path2 = '.pdf'
        path3 = path1 + orgname1 + path2
        print(path3)
        with open(path3, 'wb') as f:
            f.write(urltodownload.content)
        driver.close()
        del organizationName[:]
    except NoSuchElementException:
        print("Plik nie istnieje")  # "The file does not exist"
    driver.switch_to.window(winhandle)
    WebDriverWait(driver, 8).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "body > div.ui-dialog.ui-widget.ui-widget-content.ui-corner-all > div.ui-dialog-titlebar.ui-widget-header.ui-corner-all.ui-helper-clearfix > a > span")))
    closebutton = driver.find_element_by_css_selector("body > div.ui-dialog.ui-widget.ui-widget-content.ui-corner-all > div.ui-dialog-titlebar.ui-widget-header.ui-corner-all.ui-helper-clearfix > a")
    closebutton.click()
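The problem is that once a modal dialog has been opened, it stays in the DOM even after you close it. When you open the second dialog, your locator finds the link in the first (now hidden) dialog and tries to click that one. You can also configure the driver to download the PDF directly instead of opening it. The code is below. Note: I wrote and tested this in Java, so the Python version may contain syntax errors.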
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Set Chrome options to download PDFs instead of opening them in the browser.
# This removes the need to handle windows and makes the run much faster.
options = webdriver.ChromeOptions()
downloadPath = r'C:\Users\username\Downloads'
profile = {"plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}], "download.default_directory": downloadPath}
options.add_experimental_option("prefs", profile)
# Selenium 3 style API, as in the question; Selenium 4 uses find_element(By..., ...) instead of find_element_by_*.
driver = webdriver.Chrome(r"C:\Users\username\AppData\Local\Programs\Python\Python35-32\Scripts\chromedriver.exe", chrome_options=options)
driver.get("http://sprawozdaniaopp.mpips.gov.pl/")
# The expected condition takes a single (By, value) tuple, hence the double parentheses.
WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.ID, 'Province'))).send_keys('MAZOWIECKIE')
driver.find_element_by_id('instanceYear').send_keys('2017')
driver.find_element_by_id('btnsearch').click()
# After the search, wait until the table has loaded a cell with the text MAZOWIECKIE.
WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '//table[@class="webgrid"]/tbody//td[normalize-space(.)="MAZOWIECKIE"]')))
# Get all rows and iterate through them, so the code does not depend on a hard-coded row count.
rows = driver.find_elements_by_css_selector('table.webgrid tbody tr')
for row in rows:
    # Get the KRS number from column 2 (.text is a property, not a method).
    krs = row.find_element_by_css_selector('td:nth-child(2)').text
    # Click the link in the Nazwa column.
    row.find_element_by_css_selector('td:nth-child(3) a').click()
    # Find the modal box DIV that contains the KRS number of the clicked row.
    # Alternatively, get all modal boxes and pick the visible one.
    modalBoxLocator = "(//table[@id='tbldetails']//td[contains(.,'" + krs + "')]/ancestor::div[contains(@class,'ui-dialog')][2])[last()]"
    modalBox = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, modalBoxLocator)))
    # Find the row with the text 2017, then click the first "Sprawozdanie merytoryczne" link after it.
    modalBox.find_element_by_xpath('.//tr[./td[.="2017"]]/following-sibling::tr[.//a[.="Sprawozdanie merytoryczne"]][1]//a').click()
    # Close the modal box.
    modalBox.find_element_by_css_selector('a.ui-dialog-titlebar-close').click()
    # Optional guard if the close button may be missing:
    # closeButtons = modalBox.find_elements_by_css_selector('a.ui-dialog-titlebar-close')
    # if len(closeButtons) > 0:
    #     closeButtons[0].click()
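The answer's code comments mention an alternative: collect all modal boxes and keep the one that is visible. Below is a minimal sketch of that idea. The helper name first_visible is hypothetical, and StubElement only stands in for Selenium WebElement objects so the snippet runs without a browser. With a real driver, the list would presumably come from a call such as driver.find_elements_by_css_selector('div.ui-dialog'), where the selector is an assumption based on the class names in the question.

```python
def first_visible(elements):
    """Return the first element whose is_displayed() is true, or None."""
    for element in elements:
        if element.is_displayed():
            return element
    return None


class StubElement:
    """Stand-in for a Selenium WebElement (illustration only)."""

    def __init__(self, name, displayed):
        self.name = name
        self._displayed = displayed

    def is_displayed(self):
        return self._displayed


# Dialogs closed earlier stay in the DOM but are hidden; only the last one is open.
dialogs = [
    StubElement("dialog-1", False),
    StubElement("dialog-2", False),
    StubElement("dialog-3", True),
]
open_dialog = first_visible(dialogs)
print(open_dialog.name)  # prints "dialog-3"
```

Searching for the link inside the returned dialog, rather than from the driver, keeps hidden dialogs from earlier iterations from matching first.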
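In the second run of the loop, driver.find_element_by_link_text("Sprawozdanie merytoryczne").click() does not complete successfully, although it works in the first run.
There are several "Sprawozdanie merytoryczne" links in the modal box. Do you need to click all of them or a specific one?
Where is the except block for your try block? Which line raises the exception?
@sers Only the one for the year 2017.
@Andersson except NoSuchElementException: print("Plik nie istnieje") # means the file does not exist
Thank you for your reply. I started implementing it, but I got an error: modalBox = driver.find_element_by_xpath('(//table[@id="tbldetails"]//td[contains(.,"' + krs + '")]/ancestor::div[contains(@class,"ui-dialog")][2])[last()]') TypeError: Can't convert 'WebElement' object to str implicitly
Change it to modalBox = driver.find_element_by_xpath("(//table[@id='tbldetails']//td[contains(.,'" + krs + "')]/ancestor::div[contains(@class,'ui-dialog')][2])[last()]")
After following your instructions and changing krs = row.find_element_by_css_selector('td:nth-child(2)').text() to krs = row.find_element_by_css_selector('td:nth-child(2)').text, I get: Message: no such element: Unable to locate element: {"method":"xpath","selector":"(//table[@id='tbldetails']//td[contains(.,'0000009458')]/ancestor::div[contains(@class,'ui-dialog')][2])[last()]"}
I added a wait. You can also use an implicit wait by adding driver.implicitly_wait(10).
Another problem occurs :( modalBox = WebDriverWait(driver, 10).until(EC.visibility_of_element_located(By.XPATH, "(//table[@id='tbldetails']//td[contains(.,'" + krs + "')]/ancestor::div[contains(@class,'ui-dialog')][2])[last()]")) TypeError: __init__() takes 2 positional arguments but 3 were given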
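The last TypeError in the comments is consistent with passing By.XPATH and the XPath string as two separate arguments: visibility_of_element_located takes a single locator tuple. Below is a sketch of building that tuple. The helper name modal_locator is hypothetical, and the XPath is the one from the answer, not verified against the live site. The literal "xpath" is the value of By.XPATH in Selenium and is used here only so the snippet runs without Selenium installed.

```python
def modal_locator(krs):
    """Build a (strategy, value) locator tuple for the dialog showing this KRS number."""
    xpath = (
        "(//table[@id='tbldetails']//td[contains(.,'" + krs + "')]"
        "/ancestor::div[contains(@class,'ui-dialog')][2])[last()]"
    )
    return ("xpath", xpath)  # "xpath" is the value of By.XPATH


locator = modal_locator("0000009458")
print(locator)

# With a live driver, the tuple is passed as ONE argument:
# modalBox = WebDriverWait(driver, 10).until(EC.visibility_of_element_located(locator))
```

Keeping the locator in one variable also makes it easy to reuse the same tuple for the wait and for any later lookup.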