Web scraping the Google Play Store in R
Tags: r, web-scraping, rvest, data-extraction

I want to scrape review data for a few apps from the Google Play Store. For each review I want:

the name field
how many stars they got
the review they wrote

When I looked into it further, I found that Name_data_html is empty:
> Name_data_html
{xml_nodeset (0)}
I am new to web scraping. Can anyone help me with this?

You should use XPath to select objects on the web page:
#Loading the rvest package
library('rvest')
#Specifying the url of the website to be scraped
url <- 'https://play.google.com/store/apps/details?id=com.phonegap.rxpal&hl=en_IN'
#Reading the HTML code from the website
webpage <- read_html(url)
# Using Xpath
Name_data_html <- webpage %>% html_nodes(xpath='/html/body/div[1]/div[4]/c-wiz/div/div[2]/div/div[1]/div/c-wiz[1]/c-wiz[1]/div/div[2]/div/div[1]/c-wiz[1]/h1/span')
#Converting the Name data to text
Name_data <- html_text(Name_data_html)
#Look at the Name
head(Name_data)
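
To see what `{xml_nodeset (0)}` means in practice, here is a minimal sketch that runs against an inline HTML snippet instead of the live page. The snippet, the `title` class and the variable names are made up for illustration and are not the real Play Store markup. It shows that a selector matching nothing returns an empty nodeset rather than an error, and that a short relative XPath may be less brittle than a long absolute one.

```r
library(rvest)

# Hypothetical, simplified page standing in for a static app page
page <- read_html('<html><body><div><h1 class="title"><span>Example App</span></h1></div></body></html>')

# A selector that matches nothing gives an empty nodeset, not an error
no_match <- html_nodes(page, xpath = "//h2/span")
length(no_match)

# A short relative XPath anchored on an attribute
found <- html_nodes(page, xpath = "//h1[@class='title']/span")
app_name <- html_text(found)
app_name
```

Checking `length()` on the nodeset before calling `html_text()` is a cheap way to notice that a selector has stopped matching.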
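After analyzing your code and the source of the page at the URL you posted, I think the reason you cannot scrape anything is that the content is generated dynamically, so rvest cannot retrieve it correctly.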
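
The effect can be reproduced offline with a small sketch. The HTML below is a made-up example, not the Play Store page: the review element would only exist after a browser runs the script, and `read_html()` parses the markup without executing JavaScript, so the selector finds nothing.

```r
library(rvest)

# Hypothetical page: the review <div> is only created by JavaScript in a browser
raw_html <- '<html><body><div id="reviews"></div>
<script>
  var d = document.createElement("div");
  d.className = "review";
  d.textContent = "Great app";
  document.getElementById("reviews").appendChild(d);
</script></body></html>'

static_page <- read_html(raw_html)

# The empty container is in the static HTML ...
container <- html_nodes(static_page, "#reviews")
length(container)

# ... but the script-generated review is not, because the script never ran
review_nodes <- html_nodes(static_page, ".review")
length(review_nodes)
```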
Here is my solution:
#Loading the rvest package
library(rvest)
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html
#Specifying the url of the website to be scraped
url <- 'https://play.google.com/store/apps/details?id=com.phonegap.rxpal&hl=en_IN'
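# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
# retcommand = TRUE returns the launch command as a string instead of starting the server
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
# note: shell() exists only on Windows; on other systems system(selCommand, wait = FALSE) may work instead
shell(selCommand, wait = FALSE, minimized = TRUE)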
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# 1) name field (assuming that with 'name' you refer to the name of the reviewer)
names <- html_obj %>% html_nodes(".kx8XBd .X43Kjb") %>% html_text()
# 2) How much star they got
stars <- html_obj %>% html_nodes(".kx8XBd .nt2C1d [role='img']") %>% html_attr("aria-label")
# 3) review they wrote
reviews <- html_obj %>% html_nodes(".UD7Dzf") %>% html_text()
# create the df with all the info
review_data <- data.frame(names = names, stars = stars, reviews = reviews, stringsAsFactors = FALSE)
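
The `stars` column above holds the raw `aria-label` text, not a number. As a follow-up, here is a minimal sketch of turning those labels into numeric ratings and checking the vector lengths before building the data frame. The sample values, including the exact label wording, are assumptions for illustration, since the real labels depend on the page language and layout.

```r
# Hypothetical values in the shape the scraping code is assumed to return
names   <- c("Reviewer A", "Reviewer B")
stars   <- c("Rated 5 stars out of five stars", "Rated 3 stars out of five stars")
reviews <- c("Works well.", "Crashes sometimes.")

# data.frame() needs columns of matching length, so check before combining
stopifnot(length(names) == length(stars), length(stars) == length(reviews))

# Keep the first run of digits in each label and convert it to a number
star_count <- as.numeric(sub("[^0-9]*([0-9]+).*", "\\1", stars))

review_data <- data.frame(names = names, stars = star_count,
                          reviews = reviews, stringsAsFactors = FALSE)
review_data
```

If the three selectors return different numbers of nodes, the length check fails early instead of letting `data.frame()` stop with a less obvious error or silently recycle a short vector.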
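I tried following your script, but I got an error: selCommand

I think this should be asked as a separate question. Do you have Java installed on your computer? Run Sys.which("java"), and if it does not return a path to java, you should install it first.

@Kardu check these links on how to start RSelenium: or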
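
The Java check suggested above can be sketched as follows. It only uses base R, and the messages are illustrative wording rather than output from any package.

```r
# Sys.which() returns the full path of an executable, or "" if it is not on the PATH
java_path <- unname(Sys.which("java"))
java_found <- nzchar(java_path)

if (java_found) {
  message("Java found at: ", java_path)
} else {
  message("Java was not found on the PATH; install it before starting Selenium.")
}
```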