Web scraping the Google Play Store in R

Tags: r, web-scraping, rvest, data-extraction

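I want to scrape review data for several apps from the Google Play Store. For each review I want:

  • the name field

  • how many stars they gave

  • the review they wrote

Later, when I tried to dig further, I found that Name_data_html is empty:

    > Name_data_html
    {xml_nodeset (0)}
    

I am new to web scraping. Can anyone help me with this?

You should use XPath to select objects on the web page: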

    #Loading the rvest package
    library('rvest')
    #Specifying the url of the website to be scraped
    url <- 'https://play.google.com/store/apps/details?id=com.phonegap.rxpal&hl=en_IN'
    #Reading the HTML code from the website
    webpage <- read_html(url)
    # Using Xpath
    Name_data_html <- webpage %>% html_nodes(xpath='/html/body/div[1]/div[4]/c-wiz/div/div[2]/div/div[1]/div/c-wiz[1]/c-wiz[1]/div/div[2]/div/div[1]/c-wiz[1]/h1/span')
    #Converting the Name data to text
    Name_data <- html_text(Name_data_html)
    #Look at the Name
    head(Name_data)
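
As a side note on the XPath approach above, here is a minimal sketch comparing an absolute path with a shorter relative one. It runs on a small made-up HTML string, not on the Play Store page, so the markup and the text "Example App" are assumptions for illustration only. A relative expression such as `//h1/span` does not depend on every ancestor `div`, so it tends to survive layout changes better than a path that starts at `/html/body`.

```r
library(rvest)

# Made-up markup that loosely imitates a deeply nested page header
html <- '<html><body><div><div><div>
  <h1><span>Example App</span></h1>
</div></div></div></body></html>'

page <- read_html(html)

# Absolute path: breaks as soon as one wrapper <div> is added or removed
absolute <- html_nodes(page, xpath = "/html/body/div/div/div/h1/span")

# Relative path: matches any <span> directly inside any <h1>
relative <- html_nodes(page, xpath = "//h1/span")

html_text(absolute)
html_text(relative)
```

Both calls select the same node here; on a real page the relative form would still need checking against the actual source.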
    
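After analyzing your code and the source of the page at the URL you posted, I think the reason you cannot scrape anything is that the content is generated dynamically, so rvest cannot retrieve it.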
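
The effect can be reproduced without the live page. The sketch below uses a small inline HTML string; the markup and the class name `review-author` are invented for illustration and are not taken from the Play Store. rvest parses only the HTML it is handed and never runs JavaScript, so a node that a script would add later is absent and the selector returns an empty nodeset.

```r
library(rvest)

# Hypothetical static HTML: <div id="reviews"> is empty because, on a
# dynamic page, a script would fill it only after the browser loads it.
static_html <- '<html><body>
  <h1 class="title">Some app</h1>
  <div id="reviews"></div>
  <script>/* reviews would be injected here by JavaScript */</script>
</body></html>'

page <- read_html(static_html)

# Present in the static source, so rvest finds it
title_nodes <- html_nodes(page, "h1.title")

# Not in the static source, so the result is an empty nodeset
author_nodes <- html_nodes(page, "#reviews .review-author")

length(title_nodes)   # 1
length(author_nodes)  # 0, printed as {xml_nodeset (0)}
```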

Here is my solution:

    #Loading the rvest package
    library(rvest)
    library(magrittr) # for the '%>%' pipe symbols
    library(RSelenium) # to get the HTML of the page after it has loaded
    
    #Specifying the url of the website to be scraped
    url <- 'https://play.google.com/store/apps/details?id=com.phonegap.rxpal&hl=en_IN'
    
    # start a local Selenium server (the only way of starting RSelenium that works for me at the moment)
    selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
    shell(selCommand, wait = FALSE, minimized = TRUE) # note: shell() is available on Windows only
    remDr <- remoteDriver(port = 4567L, browserName = "chrome")
    remDr$open()
    
    # go to website
    remDr$navigate(url)
    
    # get page source and save it as an html object with rvest
    html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
    
    # 1) name field (assuming that with 'name' you refer to the name of the reviewer)
    names <- html_obj %>% html_nodes(".kx8XBd .X43Kjb") %>% html_text()
    
    # 2) how many stars each reviewer gave
    stars <- html_obj %>% html_nodes(".kx8XBd .nt2C1d [role='img']") %>% html_attr("aria-label")
    
    # 3) review they wrote
    reviews <- html_obj %>% html_nodes(".UD7Dzf") %>% html_text()
    
    # create the df with all the info
    review_data <- data.frame(names = names, stars = stars, reviews = reviews, stringsAsFactors = FALSE)
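
A possible follow-up step, sketched under assumptions: the `stars` vector above holds `aria-label` strings rather than numbers, and `data.frame()` stops with an error if the three vectors differ in length. The label wording used below ("Rated 5 stars out of five stars") is a guess at the format, and `parse_stars`, `stars_raw`, `names_vec` and `reviews_vec` are hypothetical names standing in for the scraped vectors.

```r
# Hypothetical aria-label values standing in for the scraped `stars` vector
stars_raw <- c("Rated 5 stars out of five stars",
               "Rated 3 stars out of five stars",
               NA)

# Extract the first run of digits from each label as an integer
parse_stars <- function(labels) {
  digits <- sub("^[^0-9]*([0-9]+).*$", "\\1", labels)
  # sub() leaves labels without digits unchanged; treat those (and NA) as missing
  digits[!grepl("^[0-9]+$", digits)] <- NA_character_
  as.integer(digits)
}

# Hypothetical reviewer names and review texts of matching length
names_vec   <- c("Reviewer A", "Reviewer B", "Reviewer C")
reviews_vec <- c("Great app", "Okay", "Crashes on start")

# Fail early with a clear message if the selectors matched different counts
stopifnot(length(names_vec) == length(stars_raw),
          length(stars_raw) == length(reviews_vec))

review_data <- data.frame(names = names_vec,
                          stars = parse_stars(stars_raw),
                          reviews = reviews_vec,
                          stringsAsFactors = FALSE)
```

If the length check fails on real data, the selectors are probably matching nodes outside the review list and need tightening.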
    
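I tried to follow your script, but I get an error: selCommand

I think this should be asked as a separate question. Do you have Java installed on your computer? Run Sys.which("java"); if it does not return a path to Java, you should install it first.

@Kardu, check these links on ways to start RSelenium: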
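
Regarding the Java check suggested in the comments, a minimal sketch of how it could be run before starting the Selenium server. `Sys.which()` returns an empty string when the program is not on the PATH; the messages are only illustrative.

```r
# Look up the java executable on the PATH ("" means it was not found)
java_path <- Sys.which("java")

if (nzchar(java_path)) {
  message("Java found at: ", java_path)
} else {
  message("Java not found on the PATH; install it before starting Selenium.")
}
```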