Extracting data from an HTML page using R

html xml r web-scraping


I'm trying to extract data from the following site:

https://www.zomato.com/ncr/restaurants/north-indian

using R. I'm a learner and a beginner in this area.

Here is what I tried:

> library(XML)

> doc<-htmlParse("the url mentioned above")

> Warning message:
> XML content does not seem to be XML: 'https://www.zomato.com/ncr/restaurants/north-indian'

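I understand that the page is not XML, as the message says, but what other way is there to capture data from this site? I tried using tidy html to convert it to XML or XHTML and then process that, but I made no progress. Perhaps I don't yet know the proper way to use tidy html. Please suggest a way to solve this, and correct me where I'm wrong.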

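I recommend getURL from the RCurl package to fetch the document content. The result can then be parsed with htmlParse. htmlParse sometimes has trouble fetching certain content on its own, and in those cases downloading with getURL first is the suggested route.

Also note that readLines does not support https, so the error message is not that surprising.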
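A minimal sketch of the two-step approach described above, assuming the RCurl and XML packages are installed. The helper name `fetch_and_parse`, the `followlocation` option, and the inline sample markup are illustrative assumptions, not part of the original answer. The parsing step is shown on a small made-up HTML string so that it does not depend on the live site.

```r
library(RCurl)  # getURL() for downloading over https
library(XML)    # htmlParse() and XPath helpers

# Hypothetical helper (not from the original answer): download the page as
# text with getURL(), then hand that text to htmlParse().
fetch_and_parse <- function(url) {
  page_text <- getURL(url, followlocation = TRUE)
  htmlParse(page_text, asText = TRUE)
}

# Offline illustration on a made-up snippet; the class name mirrors the
# selector used later in this thread, but the content is invented.
sample_html <- "<html><body><a class='result-title'>Example Restaurant </a></body></html>"
doc <- htmlParse(sample_html, asText = TRUE)

# Pull the text of every matching link with an XPath query
titles <- xpathSApply(doc, "//a[@class='result-title']", xmlValue)
print(trimws(titles))
```

For the real page, the call would be `doc <- fetch_and_parse("https://www.zomato.com/ncr/restaurants/north-indian")`. Whether the site still serves the same markup is not something this sketch can guarantee.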

The rvest package is also very handy. It was built on top of the XML package, among others (newer releases build on xml2 instead):

library(rvest)

# read_html() supersedes the older html() call
pg <- read_html("https://www.zomato.com/ncr/restaurants/north-indian")

# extract all the restaurant names
pg %>% html_nodes("a.result-title") %>% html_text()

##  [1] "Bukhara - ITC Maurya "                "Karim's "                            
##  [3] "Gulati "                              "Dhaba By Claridges "                 
## ...
## [27] "Dum-Pukht - ITC Maurya "              "Maal Gaadi "                         
## [29] "Sahib Sindh Sultan "                  "My Bar & Restaurant "                

# extract the ratings
pg %>% html_nodes("div.rating-div") %>% html_text() %>% gsub("[[:space:]]", "", .)

##  [1] "4.3" "4.1" "4.2" "3.9" "3.8" "4.1" "4.1" "3.4" "4.1" "4.3" "4.2" "4.2" "3.9" "3.8" "3.8" "3.4" "4.0" "3.7" "4.1"
## [20] "4.0" "3.8" "3.8" "3.9" "3.8" "4.0" "4.0" "4.7" "3.8" "3.8" "3.4"
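The names and ratings above come back as two separate vectors. As a rough sketch, assuming every restaurant entry carries exactly one rating, they could be combined into a data frame. The inline HTML below is a made-up stand-in for the real page (the restaurant names and ratings in it are placeholders), and only the two CSS selectors are taken from the answer.

```r
library(rvest)

# Made-up markup that imitates the structure the selectors expect
sample_page <- read_html('
<html><body>
  <div class="result">
    <a class="result-title">Restaurant A </a>
    <div class="rating-div"> 4.5 </div>
  </div>
  <div class="result">
    <a class="result-title">Restaurant B </a>
    <div class="rating-div"> 3.9 </div>
  </div>
</body></html>')

# Same selectors as in the answer; trim = TRUE drops surrounding whitespace
restaurant_names <- sample_page %>%
  html_nodes("a.result-title") %>%
  html_text(trim = TRUE)

ratings <- sample_page %>%
  html_nodes("div.rating-div") %>%
  html_text(trim = TRUE) %>%
  as.numeric()

# Pair them up; this only lines up if both vectors have the same length
restaurants <- data.frame(name = restaurant_names,
                          rating = ratings,
                          stringsAsFactors = FALSE)
print(restaurants)
```

If some entries on the real page have no rating, the two vectors will differ in length and `data.frame()` will stop with an error, so it is worth checking `length(restaurant_names) == length(ratings)` first.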

Thanks a lot, that helped! I have one question: what does the str(doc) command do here? Does it declare the class of doc?

@ParulChauhan str gives a compact display of an object's structure, a bit like summary. But for this document, str shows nothing beyond the class.

OK, I hadn't noticed that you used summary instead of str in your latest edit. Quick save! Thanks again.
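To illustrate the point made in these comments, here is a small sketch, assuming the XML package and a made-up one-paragraph document in place of the real page. It calls `class()`, `str()` and `summary()` on a parsed document so the difference can be seen directly.

```r
library(XML)

# Tiny made-up document standing in for the parsed page
doc <- htmlParse("<html><body><p>hello</p></body></html>", asText = TRUE)

# class() only reports the class vector of the object
print(class(doc))

# str() inspects the object, it does not declare anything; for a parsed
# document there is little to show besides the classes
str(doc)

# summary() on the parsed document reports which element names occur
print(summary(doc))
```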
