rvest vs. RSelenium results for text extraction


So far I have been using RSelenium to extract the text of a homepage, but I would like to switch to a faster solution such as rvest.

library(rvest)
url = 'https://www.r-bloggers.com'
rvestResults <- read_html(url) %>%
  html_node('body') %>%
  html_text()

library(RSelenium)
remDr$navigate(url)
rSelResults <- remDr$findElement(
  using = "xpath",
  value = "//body"
)$getElementText()

Perhaps a PhantomJS implementation would do better (I cannot test it against RSelenium at the moment). It uses the webdriver and rvest packages; the code is listed after the comments below.

You can also try cleaning the data with a regex:

library(rvest)

url <- "https://www.r-bloggers.com"

res <- url %>% 
  read_html() %>% 
  html_nodes('body') %>%
  html_text()

library(stringr)

# clean up text data
res %>%
  str_replace_all(pattern = "\n", replacement = " ") %>%
  str_replace_all(pattern = "[\\^]", replacement = " ") %>%
  str_replace_all(pattern = "\"", replacement = " ") %>%
  str_replace_all(pattern = "\\s+", replacement = " ") %>%
  str_trim(side = "both")
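
The newline replacement, the whitespace collapsing and the final trim in the chain above can also be written as a single call. A minimal sketch, assuming a stringr version that provides str_squish(); the sample string is made up for illustration, and the two replacements for "^" and the double quote are not covered by it:

```r
library(stringr)

# Made-up sample resembling the raw html_text() output (illustration only)
raw <- "\n\n\t\t  R news and tutorials \n  Home\nAbout\n\n"

# str_squish() trims both ends and collapses every run of
# whitespace (spaces, tabs, newlines) into a single space
clean <- str_squish(raw)
clean
```

This only shortens the whitespace handling; any character-specific replacements would still need their own str_replace_all() calls.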

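Thanks for your answer. It is indeed an improvement. However, res still contains JavaScript code, so I will upvote but, if that is OK, not accept it yet, …

That's fine. I would be happy to see a better solution than my own. But out of interest, which JS code do you see in res? I can see some chunks of HTML, but no JS.

The PhantomJS implementation mentioned above, with the first 500 characters of its output:
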
library("webdriver")
library("rvest")

pjs <- run_phantomjs()
ses <- Session$new(port = pjs$port)
url <- 'https://www.r-bloggers.com'
ses$go(url)

res <- ses$getSource() %>% 
  read_html() %>%
  html_node('body') %>%
  html_text()

substring(res, 1, 500)
#> [1] "\n\n\n\t\t    \t    \t\n        \n        R news and tutorials contributed by (750) R bloggers         \n    Home\nAbout\nRSS\nadd your blog!\nLearn R\nR jobs\nSubmit a new job (it’s free)\n\tBrowse latest jobs (also free)\n\nContact us\n\n\n\n\n\n\n\n    \n\t\tWelcome!\t\t\t\n\n\n\n\nHere you will find daily news and tutorials about R, contributed by over 750 bloggers. \n\nThere are many ways to follow us - \nBy e-mail:\n\n\n<img src=\"https://feeds.feedburner.com/~fc/RBloggers?bg=99CCFF&amp;fg=444444&amp;anim=0\" height=\"26\" width=\"88\" sty"
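
On the JavaScript that the comments report in res: html_text() returns the text of every descendant node, which includes the contents of script and style elements. One option is to remove those nodes before extracting the text. A minimal sketch, assuming the xml2 package is installed; the inline HTML string stands in for the real page and is made up for illustration, and whether this removes everything unwanted on r-bloggers.com has not been checked here:

```r
library(rvest)
library(xml2)

# Made-up page standing in for the real site (illustration only)
page <- read_html(
  "<html><body><h1>Welcome!</h1><script>var x = 1;</script>
   <style>h1 {color: red;}</style><p>Daily news about R.</p></body></html>"
)

# Remove nodes that hold code rather than visible text;
# xml_remove() modifies the parsed document in place
xml_remove(xml_find_all(page, "//script | //style | //noscript"))

text <- page %>%
  html_node("body") %>%
  html_text()
text
```

The same two lines with xml_find_all() and xml_remove() could be applied to the document returned by read_html() in either approach above, before the html_text() call.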