NoSuchElement exception when scraping ESPN with RSelenium


css r xpath web-scraping


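I am using R (and RSelenium) to scrape data from ESPN. This is not the first time I have used it, but in this case I get an error that I cannot resolve.

Consider this page (the URL is the one built in the code below):

Let's try to scrape the timeline. If I inspect the page, I get the CSS selector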

#liveLeft

As usual, I go with

library(RSelenium)
checkForServer()          # make sure the Selenium server binary is available
remDr <- remoteDriver()
remDr$open()              # start a browser session

matchId <- "142562"
leagueString <- "premiership"
seasonString <- "2011-12"


url <- paste0("http://en.espn.co.uk/",leagueString,"-",seasonString,"/rugby/match/",matchId,".html")

remDr$navigate(url)

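I am confused. I also tried XPath, without success. I also tried selecting different elements of the page, with no luck. The only selector that returns anything is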

#scrumContent

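As discussed in the comments:

The element is located inside an iframe, which is why it cannot be selected. You can see this in the Chrome console by running document.getElementById('liveLeft'). Run against the whole page, it returns null, meaning the element does not exist in that document even though it is clearly visible. To get around this, simply load the iframe itself.

If you look at the page, you will see that the src of the iframe is /premiership-2011-12/rugby/current/match/142562.html?view=scorecard. Navigating to this page instead of the "full" page makes the element "visible", so it can be selected with RSelenium.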

library(RSelenium)
checkForServer()
remDr <- remoteDriver()
remDr$open()

matchId <- "142562"
leagueString <- "premiership"
seasonString <- "2011-12"

url <- paste0("http://en.espn.co.uk/",leagueString,"-",seasonString,"/rugby/current/match/",matchId,".html?view=scorecard")
# Amend url to return iframe

remDr$navigate(url)

div<- remDr$findElement(using = 'css selector','#liveLeft')
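
The full-page URL and the iframe URL differ only in the extra current/ path segment and the ?view=scorecard query string. A minimal sketch of a helper that builds either one, assuming the pattern seen for match 142562 also holds for other matches (this is untested, and the function name espn_match_url and its iframe argument are hypothetical, not part of RSelenium):

```r
# Hypothetical helper: builds either the full match page URL or the URL of
# the iframe that holds the scorecard. Assumes the URL pattern observed for
# match 142562 applies to other matches as well.
espn_match_url <- function(match_id, league, season, iframe = TRUE) {
  base <- paste0("http://en.espn.co.uk/", league, "-", season, "/rugby/")
  if (iframe) {
    # The iframe document lives under current/match and needs the view parameter
    paste0(base, "current/match/", match_id, ".html?view=scorecard")
  } else {
    paste0(base, "match/", match_id, ".html")
  }
}

full_url   <- espn_match_url("142562", "premiership", "2011-12", iframe = FALSE)
iframe_url <- espn_match_url("142562", "premiership", "2011-12")
```

Calling remDr$navigate(iframe_url) would then take the place of the navigate call above.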

Normally with Selenium, when a web page contains frames/iframes, you need to use the switchToFrame method of the remoteDriver class:

library(RSelenium)
library(XML)  # provides htmlParse, xmlGetAttr and readHTMLTable used below
selServ <- startServer()
remDr <- remoteDriver()
remDr$open()
matchId <- "142562"
leagueString <- "premiership"
seasonString <- "2011-12"
url <- paste0("http://en.espn.co.uk/",leagueString,"-",seasonString,"/rugby/match/",matchId,".html")
remDr$navigate(url)
# check the iframes
iframes <- htmlParse(remDr$getPageSource()[[1]])["//iframe", fun = function(x){xmlGetAttr(x, "id")}]
# iframes[[3]] == "win_old" contains the data switch to this frame
remDr$switchToFrame(iframes[[3]])
# check you can access the element
div<- remDr$findElement(using = 'css selector','#liveLeft')
div$highlightElement()
# get data
ifSource <- htmlParse(remDr$getPageSource()[[1]])
out <- readHTMLTable(ifSource["//div[@id = 'liveLeft']"][[1]], header = TRUE)
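
Picking the frame by position (iframes[[3]]) depends on how many iframes the page happens to contain. A small sketch of looking the frame up by its id instead, run here on a made-up page source so that it works without a Selenium session. The HTML string and the ad_top iframe are invented for illustration; only the win_old id and its src come from the page discussed above:

```r
library(XML)

# Invented stand-in for remDr$getPageSource()[[1]]
page_source <- '<html><body>
  <iframe id="ad_top" src="/ads/top.html"></iframe>
  <iframe id="win_old" src="/premiership-2011-12/rugby/current/match/142562.html?view=scorecard"></iframe>
</body></html>'

doc <- htmlParse(page_source, asText = TRUE)
iframe_nodes <- getNodeSet(doc, "//iframe")

# Collect id and src of every iframe; NA when an attribute is missing
iframe_ids <- vapply(iframe_nodes, function(node) {
  xmlGetAttr(node, "id", default = NA_character_)
}, character(1))
iframe_src <- vapply(iframe_nodes, function(node) {
  xmlGetAttr(node, "src", default = NA_character_)
}, character(1))

# Position of the frame holding the data, found by id rather than hard-coded
target <- match("win_old", iframe_ids)

# With a live session one would then call, as in the answer above:
# remDr$switchToFrame(iframe_ids[target])
```

The src collected this way is also the address that the other answer navigates to directly.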

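Comments:

This is what I see when I inspect the page: code… If I hover over the div "liveLeft", the table I want lights up.

Yes, I just noticed. Interestingly, if you load the page and enter document.getElementById('liveLeft') in the developer console, it returns null. However, once you inspect the element and re-run document.getElementById('liveLeft'), it returns the element. I am not a JS expert, but there may be some AJAX going on, which would mean the element is not available in the original tree, hence why it cannot be found until the node tree is re-evaluated.

It is inside an iframe. Don't load the current page, load the iframe. If you look at the source, you will see a reference to /premiership-2011-12/rugby/current/match/142562.html?view=scorecard. If you load that and then look for the element, it should work. I haven't tested it in RSelenium, so I won't post it as an answer for now, but it works in the developer tools in Chrome: http://en.espn.co.uk/premiership-2011-12/rugby/current/match/142562.html?view=scorecard

This works great! Thanks. Please add it as an answer :)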


The iframe behaviour can be checked in the browser's JavaScript console:

document.getElementById('liveLeft'); // Returns null, because the iframe has a separate DOM

var doc = document.getElementById('win_old').contentDocument; // Loads the iframe's DOM into the variable doc
doc.getElementById('liveLeft'); // Now returns the desired element
