Unable to retrieve JavaScript-generated data using Python


javascript python web-scraping


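I have spent most of the day trying to scrape data from the URL used in the code below, and I know I have been very inefficient about it. I only recently learned how to scrape plain HTML sites and seem to have gotten the hang of that. JavaScript-driven sites are proving painful.

The scraper I have been working on so far produces the same result no matter which angle I approach it from. Here is the code I am using: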

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait

PHANTOMJS_PATH = './phantomjs.exe'

#Using PhantomJS to navigate the url
browser = webdriver.PhantomJS(PHANTOMJS_PATH)
browser.get('http://www.thesait.org.za/search/newsearch.asp?bst=&cdlGroupID=&txt_country=South+Africa&txt_statelist=&txt_state=&ERR_LS_20161018_041816_21233=txt_statelist%7CLocation%7C20%7C0%7C%7C0')

# this wait object is created but never used below
wait = WebDriverWait(browser, 15)
# parse the HTML that the browser currently holds
soup = BeautifulSoup(browser.page_source, "html5lib")

# get all the table rows
test = soup.find_all('tr')

print(test)

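My biggest problem is that I cannot find the details I want. See the image below:

I cannot get the URL associated with that particular name. Once I have the URL, I want to navigate further to that user's page to get more details.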
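For the step described above (read the link behind each name, then visit it), a minimal sketch using only the standard library might look like the following. The sample markup, the /members/?id=123 href, and the names LinkCollector and extract_links are invented for illustration; the real page's markup may differ, and the links can only be extracted if the HTML handed to the parser actually contains them.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (text, absolute_url) pairs for every <a href=...> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (text, absolute_url) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html, base_url):
    """Return a list of (link text, absolute URL) found in html."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup, not copied from the real site.
sample = '<table><tr><td><a href="/members/?id=123">Jane Doe</a></td></tr></table>'
print(extract_links(sample, "http://www.thesait.org.za/search/newsearch.asp"))
```

Each absolute URL returned this way could then be passed to browser.get() to open the member's detail page.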

So my questions are as follows:

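Is there a more efficient way to programmatically return the data I am looking for (within a limited time)?

Is there a better way to work out how to navigate a JavaScript-generated site when scraping?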
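One thing worth checking when rendered content never shows up in page_source is whether it sits inside an iframe, since page_source only covers the document the driver is currently focused on. Whether that is the case for this site is an assumption and not something confirmed here. The sketch below lists the iframes in an HTML string using only the standard library; the sample markup and the names IframeFinder and list_iframes are hypothetical.

```python
from html.parser import HTMLParser


class IframeFinder(HTMLParser):
    """Record the id, name and src of every <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.frames = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            attrs = dict(attrs)
            self.frames.append({
                "id": attrs.get("id"),
                "name": attrs.get("name"),
                "src": attrs.get("src"),
            })


def list_iframes(html):
    """Return one dict per <iframe> found in html."""
    finder = IframeFinder()
    finder.feed(html)
    finder.close()
    return finder.frames


# Hypothetical markup, not copied from the real site.
sample = '<html><body><iframe id="results" src="/search/results.asp"></iframe></body></html>'
for frame in list_iframes(sample):
    print(frame)
```

With Selenium, the same check could be run on browser.page_source. If a frame turns up, browser.switch_to.frame(...) moves the driver into it, after which page_source and the find methods operate on the frame's document.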

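Please let me know if I need to clarify anything.

Thanks

Part two:

I took a different approach and ran into another problem.

I tried to get the tags above using the following: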

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

# load the page in Chrome and grab the HTML source
browser = webdriver.Chrome()
browser.get('http://www.thesait.org.za/search/newsearch.asp?bst=&cdlGroupID=&txt_country=South+Africa')
html_source = browser.page_source
browser.quit()

# collect every anchor tag from the source
soup = BeautifulSoup(html_source, 'html.parser')
comments = soup.find_all('a')
print(comments)

The specific element I am looking for does not appear in the 'comments' list I printed, i.e.:

I then tried using Selenium's own functionality:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

browser = webdriver.Chrome('C:/Users/rschilder/Desktop/Finance24 Scrape/Accountant_scraper/chromedriver.exe')
browser.get('http://www.thesait.org.za/search/newsearch.asp?bst=&cdlGroupID=&txt_country=South+Africa')  
browser.implicitly_wait(30)
html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
#browser.quit()

print(html)

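The challenges I am facing with this are:

I am not quite sure how to use Selenium's find functions to search for and fetch a specific element (it is not as intuitive as Beautiful Soup).

Even when navigating with Selenium, the element I am looking for (as described above) still does not appear in the output.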
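For the two points above, here is a sketch of polling until matching elements appear and then reading their attributes. It assumes Selenium 4, where find_elements takes a locator strategy and a value. The helper wait_for_elements, the FakeDriver and FakeElement stand-ins, and the 'a' selector are illustrative only, and the polling loop is a simplified hand-written substitute for Selenium's own WebDriverWait so that the example runs without a browser.

```python
import time


def wait_for_elements(driver, by, value, timeout=30.0, poll=0.5):
    """Poll driver.find_elements(by, value) until it returns something.

    Returns the list of elements, or an empty list if the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while True:
        elements = driver.find_elements(by, value)
        if elements:
            return elements
        if time.monotonic() >= deadline:
            return []
        time.sleep(poll)


class FakeElement:
    """Stand-in for a Selenium WebElement (illustration only)."""

    def __init__(self, text, href):
        self.text = text
        self._href = href

    def get_attribute(self, name):
        return self._href if name == "href" else None


class FakeDriver:
    """Stand-in driver whose links 'appear' on the third lookup."""

    def __init__(self):
        self.calls = 0

    def find_elements(self, by, value):
        self.calls += 1
        if self.calls < 3:
            return []  # simulate content that has not rendered yet
        return [FakeElement("Jane Doe", "http://example.com/member/1")]


driver = FakeDriver()
# "css selector" is the string value behind Selenium's By.CSS_SELECTOR.
links = wait_for_elements(driver, "css selector", "a", timeout=5, poll=0.01)
print([(a.text, a.get_attribute("href")) for a in links])
```

With a real browser, the driver would be the webdriver.Chrome() instance and the strategy would normally be written as By.CSS_SELECTOR from selenium.webdriver.common.by. WebDriverWait(browser, 30).until(...) from selenium.webdriver.support.ui is the library's built-in way to do this kind of polling.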

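If you just want it to perform an implicit wait, simply call

browser.implicitly_wait(15)

It also helps to use a headed browser (such as Firefox or Chrome) while developing your code, so you can see what is happening. When I tried loading the page (from the US), it took a while to load, possibly more than 15 seconds. Also, please post any errors you are getting from this code.

@Gator_Python: I think the implicit wait and the headed browser helped, but I still cannot return the data I am looking for in the JS section. An example of the element I am looking for is as follows:

I think this is the main part I am trying to return.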