Python 2.7: Scraping an .ASPX site using Selenium and/or Scrapy


python-2.7 web-scraping scrapy


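I am new to Python and Selenium, and I wrote the code below in Python on Windows in order to scrape the physician profiles site that the code opens.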

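My question: the site is .aspx, so I initially chose Selenium. However, I would appreciate any insights or suggestions on coding the next steps (listed below). More specifically, is it more efficient to continue with Selenium or to bring in Scrapy? Any insight is greatly appreciated!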

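1. On the "ChooseAPhysician" page, select each physician's hyperlink (1-10 per page) by clicking each PhysicianProfile.aspx?PhysicianID=XXXX link.
2. Follow each link and extract the demographic info: physician name, license issue date, primary work setting, etc.
3. Return to the "ChooseAPhysician" page and click "Next".
4. Repeat the steps above for the other 5474 physicians.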

from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://profiles.ehs.state.ma.us/Profiles/Pages/ChooseAPhysician.aspx?Page=1')

#Locate the elements (zip_field avoids shadowing the built-in zip)
zip_field = driver.find_element_by_xpath("//*[@id=\"ctl00_ContentPlaceHolder1_txtZip\"]")
select = Select(driver.find_element_by_xpath("//select[@id=\"ctl00_ContentPlaceHolder1_cmbDistance\"]"))
print select.options
print [o.text for o in select.options]
select.select_by_visible_text("15")
prim_care_checkbox = driver.find_element_by_xpath("//*[@id=\"ctl00_ContentPlaceHolder1_SpecialtyGroupsCheckbox_6\"]")
find_phy_button = driver.find_element_by_xpath("//*[@id=\"ctl00_ContentPlaceHolder1_btnSearch\"]")


#Input zipcode, check "primary care box", and click "find phy" button
zip_field.send_keys("02109")
prim_care_checkbox.click()
find_phy_button.click()

#Wait for the "ChooseAPhysician" results grid to load.
#element_to_be_selected only suits checkboxes/options, so wait for a clickable link instead.
wait = WebDriverWait(driver, 10)
profile_link_xpath = "//*[@id=\"PhysicianSearchResultGrid\"]/tbody/tr/td[1]/a"
wait.until(EC.element_to_be_clickable((By.XPATH, profile_link_xpath)))

#Collect every profile URL on the page first: the link elements go stale after navigating away
links = driver.find_elements_by_xpath(profile_link_xpath)
hrefs = [link.get_attribute("href") for link in links]
for href in hrefs:
    driver.get(href)

def parse(self, response):
    #Fragment of a Scrapy spider method. Assumes SummaryItem is defined elsewhere
    #and that self.driver holds a Selenium WebDriver instance.
    item = SummaryItem()
    driver = self.driver
    driver.get(response.url)
    #Wait for the profile container instead of sleeping for a fixed time
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "content")))

    def texts(xpath):
        #WebElements have no .extract(); read the visible text of each match
        return [element.text for element in driver.find_elements_by_xpath(xpath)]

    item["phy_name"] = texts("//*[@id=\"content\"]/center/p[1]")
    item["lic_status"] = texts("//*[@id=\"content\"]/center/table[2]/tbody/tr[3]/td/table/tbody/tr/td[1]/table/tbody/tr[2]/td[2]/a[1]")
    item["lic_issue_date"] = texts("//*[@id=\"content\"]/center/table[2]/tbody/tr[3]/td/table/tbody/tr/td[1]/table/tbody/tr[3]/td[2]")
    item["prim_worksetting"] = texts("//*[@id=\"content\"]/center/table[2]/tbody/tr[3]/td/table/tbody/tr/td[1]/table/tbody/tr[5]/td[2]")
    item["npi"] = texts("//*[@id=\"content\"]/center/table[2]/tbody/tr[3]/td/table/tbody/tr/td[2]/table/tbody/tr[6]/td[2]")
    item["Med_sch_grad_date"] = texts("//*[@id=\"content\"]/center/table[3]/tbody/tr[3]/td/table/tbody/tr[2]/td[2]")
    item["Area_of_speciality"] = texts("//*[@id=\"content\"]/center/table[4]/tbody/tr[3]/td/table/tbody/tr/td[2]")
    #The results grid is not present on the profile page, so record the profile URL itself
    item["link"] = response.url

    return item

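Selenium and browser interaction are always the bottleneck in a Scrapy setup. To make it more efficient, avoid Selenium altogether, or at least use it as little as possible.

I'm guessing this is also related to the wait times Selenium requires?

Installing Selenium is a tedious process, so refer to the other questions on that.

Are there any good resources on scraping .ASPX sites with Python that you can recommend? Thanks in advance, @alecxe.

@EricJohn Try using Scrapy, a Python framework.

I reviewed the Scrapy link you recommended, along with many others, to rework my code, but I now get the following error: TypeError: 'Rule' object is not iterable. The details are in my updated question. Thanks again for your help!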
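
On the advice above to use Selenium as little as possible: many .ASPX pages can be driven without a browser, because an ASP.NET WebForms postback is typically an ordinary form POST that echoes back hidden fields such as `__VIEWSTATE` and `__EVENTVALIDATION`. The following is only a minimal sketch of that idea using the standard library. The sample markup, the `btnNext` control name, and the `$`-separated field names are assumptions for illustration and are not taken from the site in the question; whether that site accepts such a POST would need to be checked.

```python
# Sketch: build an ASP.NET postback payload without a browser (stdlib only).
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class HiddenFieldParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback(html, event_target, extra_fields=None):
    """Return the form data for a postback raised by the control event_target."""
    parser = HiddenFieldParser()
    parser.feed(html)
    payload = dict(parser.fields)            # echo the hidden state back
    payload["__EVENTTARGET"] = event_target  # control that raises the postback
    payload.setdefault("__EVENTARGUMENT", "")
    payload.update(extra_fields or {})       # visible inputs, e.g. the zip code
    return payload


# Hypothetical markup standing in for a downloaded search page.
SAMPLE_HTML = """
<form method="post" action="ChooseAPhysician.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />
  <input type="hidden" name="__EVENTVALIDATION" value="wEWAgL" />
  <input type="text" name="ctl00$ContentPlaceHolder1$txtZip" />
</form>
"""

payload = build_postback(
    SAMPLE_HTML,
    event_target="ctl00$ContentPlaceHolder1$btnNext",  # hypothetical control name
    extra_fields={"ctl00$ContentPlaceHolder1$txtZip": "02109"},
)
```

The resulting dictionary could then be sent as the body of a POST request to the page's own URL. In Scrapy, `FormRequest.from_response` does a similar pre-population of form fields from a response, so the hidden fields usually do not have to be collected by hand there.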