R: How can I extract the underlying links from a web page with my code (error: subscript out of bounds)?

Tags: r, web-scraping, rselenium

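I am new to web scraping but need data for my PhD project. For this, I am extracting data on the various activities of MEPs from the European Parliament website. Specifically, and this is where I have a problem, I want to extract the title of each speech from an MEP's personal page and, in particular, the link behind the title. The code I use has already worked several times, but here I only manage to get the titles of the speeches, not the links. For the links, I get the error message "subscript out of bounds". I am using RSelenium because I have to click several "load more" buttons on each page before I can extract the data (which, as far as I can see, makes rvest a complicated option).

I have been struggling with this for a while and really do not know how to proceed. My impression is that the CSS selector does not actually capture the underlying link (it extracts the titles without any problem), but the class has a compound name ("ep-a_heading ep-layout_level2"), so that route does not seem to work either. I also tried rvest (ignoring the "load more" button problem), but still could not get at the links.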

```{r}
library(RSelenium)
library(wdman)
library(rvest, warn.conflicts=FALSE)
library(stringr)

server <- phantomjs(port=7005L)
browser <- remoteDriver(browserName = "phantomjs", port=7005L)

## this is one of the urls I will use, there are others, constructed all
## the same way and all with the same problem
url <- 'http://www.europarl.europa.eu/meps/en/124936/MARIA_ARENA/all-activities/plenary-speeches/8'

browser$open()
browser$navigate(url)

## now I identify the load more button and click on it as long as there
## is a "load more" button on the page

more <- browser$findElement(using = "css", value=".erpl-activities-loadmore-button .ep_name")

while (!is.null(more)){
  more$clickElement()
  Sys.sleep(1)
}

## I get an error message doing this in the end but it is working anyway
## (yes, I really am a beginner!)

## Now, what I want to extract are the title of the speech and most
## importantly: the URL.

links <- browser$findElements(using="css", ".ep-layout_level2 .ep_title")
length(links)

## there are 128 Speeches listed on the page

URL <- rep(NA, length(links))
Title <- rep(NA, length(links))

## after having created vectors to store the results, I apply the loop
## function that had worked fine already many times to extract the data I
## want

for (i in 1:length(links)){
  URL[i] <- links[[i]]$getElementAttribute('href')[[1]]
  Title[i] <- links[[i]]$getElementText()[[1]]
}

speeches <- data.frame(Title, URL)
```

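Thank you very much for your help. I have already read many posts on this forum about "subscript out of bounds" problems, but unfortunately I still could not solve this one.

Have a nice day.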

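I seem to have no problem getting the information with rvest, so there is no need for the overhead of Selenium. You want to target the "a" tag child of that class, i.e. ".ep-layout_level2 a", so that you can access the href attribute. The same selector also works with Selenium.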

library(rvest)
library(magrittr)

page <- read_html('https://www.europarl.europa.eu/meps/en/124936/MARIA_ARENA/all-activities/plenary-speeches/8')

titles <- page %>% html_nodes('.ep-layout_level2 .ep_title') %>% html_text()  %>% gsub("\\r\\n\\t+", "", .) 
links <- page %>% html_nodes('.ep-layout_level2 a') %>% html_attr(., "href") 
results <- data.frame(titles,links)
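
A note on where the error message itself comes from. The sketch below assumes that getElementAttribute('href') returns an empty list for an element that has no href attribute, which is what the "subscript out of bounds" error suggests; this is an assumption, not something taken from the original code. Indexing an empty list with [[1]] fails in base R, and a small guard (the hypothetical helper first_or_na) turns that case into NA instead:

```r
# Stand-ins for the assumed return values of getElementAttribute('href'):
# an empty list when the element has no href, a one-element list otherwise.
no_href  <- list()
has_href <- list("https://example.org/speech")  # made-up URL for illustration

# Hypothetical helper: first element of a list, or NA if the list is empty.
first_or_na <- function(x) {
  if (length(x) == 0) NA_character_ else x[[1]]
}

# Indexing the empty list directly raises an error; capture its message.
err <- tryCatch(no_href[[1]], error = function(e) conditionMessage(e))
print(err)

# The guarded version never fails.
print(first_or_na(no_href))
print(first_or_na(has_href))
```

A guard like this only hides the symptom. The selector still has to point at an element that actually carries the href.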

Here is a working solution based on the code you provided:
library(RSelenium)
library(wdman)
library(rvest, warn.conflicts=FALSE)
library(stringr)

server <- phantomjs(port = 7005L)
browser <- remoteDriver(browserName = "phantomjs", port = 7005L)

## One of the URLs to scrape; the others are constructed the same way.
url <- 'http://www.europarl.europa.eu/meps/en/124936/MARIA_ARENA/all-activities/plenary-speeches/8'

browser$open()
browser$navigate(url)

## Click the "load more" button for as long as its class still appears
## in the page source.
more <- browser$findElement(using = "class name", value = "erpl-activity-loadmore-button")

while (grepl("erpl-activity-loadmore-button", browser$getPageSource()[[1]], fixed = TRUE)) {
  more$clickElement()
  Sys.sleep(1)
}

## Select the heading elements; each one contains the link to a speech.
links <- browser$findElements(using = "class name", value = "ep-layout_level2")

## there are 128 Speeches listed on the page
URL <- rep(NA, length(links))
Title <- rep(NA, length(links))

for (i in seq_along(links)) {
  ## The href sits on the <a> child, not on the heading element itself.
  l <- links[[i]]$findChildElement(using = "css selector", value = "a")

  URL[i] <- l$getElementAttribute("href")[[1]]
  Title[i] <- links[[i]]$getElementText()[[1]]
}

speeches <- data.frame(Title, URL)

speeches

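Welcome to the community! Selenium is a tough nut to crack, but since rvest fortunately gave you the same problem, I suspect this has more to do with choosing the wrong selector. I tried to go to the hyperlink mentioned above, but it does not seem to work. Can you check it? Also, have you seen this? I will be online, so check back later.

@AmitKohli Thank you! I found my answer, and I am glad to be part of the community now! I will pay more attention to hyperlinks next time!

Thank you so much!!! It does work with Selenium as well (I do need Selenium to get at what is behind the "load more" button, right? With your code I only get the first 20 results). But I do not understand how you found this selector. I thought mine was wrong and tried several combinations, but I could not work out how to find the right one: SelectorGadget gave me ".ep-layout_level2 .ep_title", while the class in the source code is "ep-a_heading ep-layout_level2". How did you know that the right one is ".ep-layout_level2 a"? Sorry, this question is probably very basic, and thanks again!

I don't like SelectorGadget. It is often too verbose. Read up on selectors and practice. The gadget gave you the parent rather than the child.

Thanks! I wanted to use your suggestion to fix the while loop. But if I just write while (more$isElementDisplayed()[[1]]) { more$clickElement(); Sys.sleep(1) }, it only clicks once. If I add a break, as in while (more$isElementDisplayed()[[1]]) { more$clickElement(); Sys.sleep(1); if (!more$isElementDisplayed()[[1]]) break }, it runs to the end, but I get the error message "Summary: StaleElementReference Detail: An element command failed because the referenced element is no longer attached to the DOM". Do you have any advice for me?

Hi @Anne Sophie, I found a working solution using grepl("erpl-activity-loadmore-button", more$getPageSource(), fixed = TRUE). This checks whether the class associated with the button is still present in the page source. This way I get the 128 links without an error. I have updated the code in my answer. Let me know whether it works for you.

Thank you very much @Chelmy88! It does work this way, so I can run a complete loop that extracts all of this automatically. That is great. The only problem I have now: when there is no "load more" button, I get an error message, but maybe I can just add an "else" clause?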
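
Regarding the last comment (the error when a page has no "load more" button), one possible approach is to look the button up again on every pass with findElements(), which returns an empty list rather than raising an error when nothing matches. The helper name click_until_gone and its pause and max_clicks parameters are made up for this sketch, and it assumes that the button is removed from the page once everything is loaded, as the grepl check in the answer implies:

```r
# Hypothetical helper: click a button repeatedly until it no longer exists.
# `browser` is assumed to be an open RSelenium remoteDriver session.
click_until_gone <- function(browser, css, pause = 1, max_clicks = 100) {
  clicks <- 0
  while (clicks < max_clicks) {
    # Re-query on every pass: an empty list means there is no button
    # (either none was ever present or all content has been loaded).
    buttons <- browser$findElements(using = "css selector", value = css)
    if (length(buttons) == 0) break
    buttons[[1]]$clickElement()
    clicks <- clicks + 1
    Sys.sleep(pause)  # give the page time to load the next batch
  }
  clicks  # number of clicks performed
}

# Intended use with an open session (not run here):
# click_until_gone(browser, ".erpl-activity-loadmore-button")
```

Looking the button up again before each click may also avoid the StaleElementReference error mentioned above, since no element reference is reused after the page has changed. The max_clicks cap is only a safety net against an endless loop if the button never disappears.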