JavaScript: Programmatically download text that does not appear in the page source


I am writing a crawler in Python. Given a web page, I extract its HTML content as follows:

import urllib2
response = urllib2.urlopen('http://www.example.com/')
html = response.read()
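However, some text components do not appear in the HTML page source. For example, on a page that redirects to an index, open one of the dates and view a specific mail. If you view the page source, you will see that the mail text does not appear in it and instead seems to be loaded by JS.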


How can I download this text programmatically?

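I'm not a Python expert, but any function like urlopen only fetches the static HTML; it does not execute it. What you need is some kind of browser engine that actually parses and runs the JavaScript. A related question seems to have been answered elsewhere.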
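
To illustrate the point about static HTML, here is a minimal sketch (Python 3, standard library only) of what a plain HTTP fetch gets to see. The sample page, the `mail-body` id and the `VisibleTextParser` class are made up for illustration; the sketch only shows that text a script would insert at run time is absent from the static markup.

```python
from html.parser import HTMLParser

# Hypothetical page: the mail text is only inserted by the script at run time.
SAMPLE_HTML = """
<html><body>
<h1>Mail viewer</h1>
<div id="mail-body"></div>
<script>
document.getElementById("mail-body").textContent = "Hello from the server";
</script>
</body></html>
"""


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the non-empty text nodes found in the static markup."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks


# The parser never runs the script, so the mail text is not among the results.
print(visible_text(SAMPLE_HTML))
```

The `div` stays empty because nothing executes the script, which is the situation a urlopen-based crawler is in.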




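The simplest option here is to make a POST request to the URL responsible for the email search and parse the JSON result (crediting @recursive, who suggested the idea first). Doing so prints: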

1999-05-20T00:48:23-05:00 Re: FW: The Reason Study of Rail Transportation in Hillsborough
1999-05-20T04:07:26-05:00 Escambia County School Board
1999-05-20T06:29:23-05:00 RE: Escambia County School Board
...
1999-05-20T22:56:16-05:00 RE: School Board
1999-05-20T22:56:19-05:00 RE: Emergency Supplemental just passed 64-36
1999-05-20T22:59:32-05:00 RE:
1999-05-20T22:59:33-05:00 RE: (no subject)
Another approach is to let a real browser handle the dynamic JavaScript part of the page load, with the help of a browser automation framework:

# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Chrome()  # can also be, for example, webdriver.Firefox()
driver.get('http://jebbushemails.com/email/search')

# click 1999-2000
# (find_element(By.XPATH, ...) replaces find_element_by_xpath, which newer Selenium releases removed)
button = driver.find_element(By.XPATH, '//button[contains(., "1999 – 2000")]')
button.click()

# click 20
cell = driver.find_element(By.XPATH, '//table[@role="grid"]//span[. = "20"]')
cell.click()

# click Submit
submit = driver.find_element(By.XPATH, '//button[span[1]/text() = "Submit"]')
submit.click()

# wait up to 10 seconds for the result rows to appear
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//tr[@analytics-event]")))

# get the results: each row holds a date cell and a subject cell
for row in driver.find_elements(By.XPATH, '//tr[@analytics-event]'):
    date, subject = row.find_elements(By.TAG_NAME, 'td')
    print('%s %s' % (date.text, subject.text))  # works on Python 2 and 3

# close the browser when done
driver.quit()
Prints:

6:24:27am Fw: Support Coordination
6:26:18am Last nights meeting
6:52:16am RE: Support Coordination
7:09:54am St. Pete Times article
8:05:35am semis on the interstate
...
6:07:25pm Re: Appointment
6:18:07pm Re: Mayor Hood
8:13:05pm Re: Support Coordination
Note that the browser here can also be headless. And if no display is available for the browser to run in, you can start a virtual one.



You can make requests to the actual AJAX service instead of trying to drive the web interface.

For example, a POST request with this form data yields about 80 KB of easily parseable JSON:

year:1999
month:05
day:20
locale:en-us
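
As a rough sketch of that request in Python 3 (standard library only): the form fields are the ones listed above, but `API_URL`, the helper names, and the `emails`, `date` and `subject` keys in the sample payload are placeholders, since the real endpoint and response layout are not given here. The actual URL would have to be read from the browser's network tab first.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Placeholder: replace with the URL the page's AJAX call really posts to.
API_URL = "http://www.example.com/api/email"


def build_request(year, month, day, locale="en-us"):
    """Build a form-encoded POST request carrying the search fields."""
    form = {"year": year, "month": month, "day": day, "locale": locale}
    body = urlencode(form).encode("ascii")
    # Passing data makes urllib send a POST instead of a GET.
    return Request(API_URL, data=body)


def parse_emails(raw_json):
    """Return (date, subject) pairs; the key names are assumed, not confirmed."""
    payload = json.loads(raw_json)
    return [(item["date"], item["subject"]) for item in payload["emails"]]


def fetch_emails(year, month, day):
    """Send the request and parse the JSON reply (needs network access)."""
    with urlopen(build_request(year, month, day)) as response:
        return parse_emails(response.read().decode("utf-8"))


# Offline demonstration with a made-up payload of the assumed shape.
sample = '{"emails": [{"date": "1999-05-20", "subject": "School Board"}]}'
for date, subject in parse_emails(sample):
    print(date, subject)
```

Keeping the request building and the JSON parsing in separate functions lets the parsing be checked against a saved response without hitting the server again.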


Good answer. I would also add something about headless frameworks such as phantomjs or casper.js.