NoSuchElement exception when scraping ESPN with RSelenium
I'm using R (and RSelenium) to scrape data from ESPN. This isn't the first time I've used it, but in this case I get an error that I can't resolve.
Consider the match page that the code below navigates to. Let's try to scrape the timeline. If I inspect the page, I get the CSS selector
#liveLeft
As usual, I proceed with
library(RSelenium)
checkForServer()
remDr <- remoteDriver()
remDr$open()
matchId <- "142562"
leagueString <- "premiership"
seasonString <- "2011-12"
url <- paste0("http://en.espn.co.uk/",leagueString,"-",seasonString,"/rugby/match/",matchId,".html")
remDr$navigate(url)
I'm confused. I also tried XPath, without success. I also tried getting other elements of the page, with no luck. The only selector that returns anything is
#scrumContent
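As discussed in the comments, the element is inside an iframe, so it cannot be selected from the top-level page. You can see this in the Chrome console: running document.getElementById('liveLeft') against the full page returns null, i.e. the element does not exist in that document, even though it is clearly visible on screen. To work around this, load the iframe directly.
If you inspect the page, you will see that the iframe's src is /premiership-2011-12/rugby/current/match/142562.html?view=scorecard. Navigating to that URL instead of the "full" page makes the element part of the loaded document, so RSelenium can select it: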
checkForServer()
remDr <- remoteDriver()
remDr$open()
matchId <- "142562"
leagueString <- "premiership"
seasonString <- "2011-12"
url <- paste0("http://en.espn.co.uk/",leagueString,"-",seasonString,"/rugby/current/match/",matchId,".html?view=scorecard")
# Amend url to return iframe
remDr$navigate(url)
div <- remDr$findElement(using = 'css selector', value = '#liveLeft')
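The iframe's src quoted above is root-relative, while remDr$navigate() needs a full URL. The following is a minimal sketch of that resolution step in base R. The helper name absolute_src is hypothetical (it is not part of RSelenium), and it assumes the src is either already absolute or relative to the site root.

```r
# Hypothetical helper (not part of RSelenium): turn an iframe's
# root-relative src into an absolute URL.
# Assumption: src is either absolute or relative to the site root.
absolute_src <- function(src, base = "http://en.espn.co.uk") {
  if (grepl("^https?://", src)) return(src)  # already absolute, leave as-is
  # Drop trailing slashes from base and leading slashes from src, then join.
  paste0(sub("/+$", "", base), "/", sub("^/+", "", src))
}

iframe_src <- "/premiership-2011-12/rugby/current/match/142562.html?view=scorecard"
url <- absolute_src(iframe_src)
print(url)
# With a live session one would then call: remDr$navigate(url)
```

This keeps the site root in one place, so the same helper can be reused for other matches whose iframe src follows the same pattern.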
Typically with Selenium, when a page contains frames/iframes, you need to use the switchToFrame method of the remoteDriver class:
library(RSelenium)
library(XML) # provides htmlParse, xmlGetAttr and readHTMLTable
selServ <- startServer()
remDr <- remoteDriver()
remDr$open()
matchId <- "142562"
leagueString <- "premiership"
seasonString <- "2011-12"
url <- paste0("http://en.espn.co.uk/",leagueString,"-",seasonString,"/rugby/match/",matchId,".html")
remDr$navigate(url)
# check the iframes
iframes <- htmlParse(remDr$getPageSource()[[1]])["//iframe", fun = function(x){xmlGetAttr(x, "id")}]
# iframes[[3]] == "win_old" contains the data; switch to this frame
remDr$switchToFrame(iframes[[3]])
# check you can access the element
div<- remDr$findElement(using = 'css selector','#liveLeft')
div$highlightElement()
# get data
ifSource <- htmlParse(remDr$getPageSource()[[1]])
out <- readHTMLTable(ifSource["//div[@id = 'liveLeft']"][[1]], header = TRUE)
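After switching into a frame, every later lookup happens inside that frame until the driver is switched back. As a hedged sketch of one way to keep that contained, the wrapper below switches in, runs a callback, and always switches back to the top-level document. The name with_frame is hypothetical; it relies only on findElement(using, value) and switchToFrame(Id), where passing NULL returns to the default content, and it assumes the iframe id "win_old" seen above.

```r
# Hypothetical wrapper (not part of RSelenium): run f() inside an iframe
# and restore the top-level document afterwards, even if f() fails.
with_frame <- function(remDr, frame_css, f) {
  # Locate the iframe element itself in the current document.
  frame <- remDr$findElement(using = "css selector", value = frame_css)
  # switchToFrame accepts a WebElement as well as an id or index.
  remDr$switchToFrame(frame)
  # NULL switches back to the page's default content.
  on.exit(remDr$switchToFrame(NULL), add = TRUE)
  f(remDr)
}

# Usage with a live session (assumes remDr is open and on the match page):
# txt <- with_frame(remDr, "iframe#win_old", function(d) {
#   d$findElement(using = "css selector", value = "#liveLeft")$getElementText()[[1]]
# })
```

Returning plain text rather than the WebElement avoids holding a reference to an element whose frame is no longer in focus.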
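This is what I see when I inspect the page: code…… If I hover over the div "liveLeft", the table I want lights up.
Yes, I just noticed that. Interestingly, if you load the page and enter document.getElementById('liveLeft') in the developer console, it returns null. However, if you then inspect the element and re-run document.getElementById('liveLeft'), it returns the element. I'm no JS expert, but there may be some AJAX going on, which would mean the element is not available in the original tree and cannot be found until the node tree is re-evaluated.
It's in an iframe. Don't load the current page; load the iframe instead. If you look at the source, you'll see a reference to /premiership-2011-12/rugby/current/match/142562.html?view=scorecard. If you load that and then look for the element, it should work. I haven't tested it in RSelenium, so I won't post it as an answer yet, but it works in the developer tools in Chrome: http://en.espn.co.uk/premiership-2011-12/rugby/current/match/142562.html?view=scorecard
That works great! Thanks. Please add it as an answer :)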
In the browser console, the separate DOM of the iframe can be checked as follows:
document.getElementById('liveLeft') // returns null, because the iframe has a separate DOM
var doc = document.getElementById('win_old').contentDocument // the iframe's own document
doc.getElementById('liveLeft') // now returns the desired element