Data extraction of user reviews

xml r web-crawler

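I am trying to learn R out of personal interest; I am neither a coder nor an analyst. I want to extract user reviews from Trip Advisor. A single page holds 10 reviews, but with the code below I also get unwanted comments/rows, and I am not sure I am using the right html nodes. I also want the complete text of each review, but what I get is cut off at the end, so I only have a partial review. Can you help me extract all 10 user reviews in full? Your help is much appreciated.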

  dat <- readLines("http://www.tripadvisor.in/Hotel_Review-g60763-d93450-Reviews-Grand_Hyatt_New_York-New_York_City_New_York.html", warn=FALSE)
  raw2 <- htmlTreeParse(dat, useInternalNodes = TRUE)
  ##User Review
  plain.text <- xpathSApply(raw2, "//div[@class='col2of2']//p[@class='partial_entry']", xmlValue)
  UR <-gsub("\\\n","",plain.text)
  Result <- unlist(UR)
  Result

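This is more of a web scraping exercise than an R programming one.

In R, I prefer to use the httr package to get the http response and extract the content as parsed html. Using readLines(...) is just about the worst way to do it. The code below extracts the review summaries.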
library(httr)
library(XML)
url <- "http://www.tripadvisor.in/Hotel_Review-g60763-d93450-Reviews-Grand_Hyatt_New_York-New_York_City_New_York.html"
response <- GET(url)
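# Parse the body with the XML package so that xpathSApply() can query it
# (recent httr versions return an xml2 document from content(type="text/html"))
doc      <- htmlParse(content(response, as="text", encoding="UTF-8"), asText=TRUE)
# Each review summary is a <p class="partial_entry"> directly under <div class="entry">
smry     <- xpathSApply(doc,'//div[@class="entry"]/p[@class="partial_entry"]',xmlValue)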
length(smry)
# [1] 10
smry[1]
# [1] "\nThats all that matters really...I wonder if anyone would chose this hotel for any other factor at all...located right next to Grand central station in midtown and within walking distance of many tourist attractions, top restaurants and corp offices. Stayed 3 nights here on a business trip, I chose this hotel over others purely based on its location. Price is...\n\n\nMore  \n\n"
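The summaries come back with leading newlines and a trailing "More" link label, as in the output above. A minimal sketch of tidying them up with base R, assuming the residue always looks like that sample; `clean_review` is a hypothetical helper name, not part of httr or XML, and a review that genuinely ends in the word "More" would lose that word:

```r
# Hypothetical helper: tidy the text returned by xmlValue for one or more reviews.
clean_review <- function(x) {
  x <- sub("\\s*More\\s*$", "", x)   # drop the trailing "More" link label, if present
  x <- gsub("\\s+", " ", x)          # collapse newlines and runs of spaces
  trimws(x)                          # strip leading/trailing whitespace
}

# Shortened, made-up sample in the same shape as smry[1]
sample_smry <- "\nStayed 3 nights here on a business trip. Price is...\n\n\nMore  \n\n"
clean_review(sample_smry)
```

Because the helper is vectorised, `clean_review(smry)` would tidy all summaries in one call.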

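The full reviews come from a separate request to a url of the form http://www.tripadvisor.com/ExpandedUserReviews-g{xxx}-d{yyy}, where {xxx} and {yyy} are unique to the hotel and are the same as in the original url. The query string can be identified in full with a network monitoring tool. So we form a new http request using that url and the appropriate query string, and parse the result, as follows.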
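# class attributes of the "More" links; each one embeds a review id as tr<digits>
cls   <- doc['//div[@class="entry"]//span[contains(@class,"moreLink")]/@class']
# pull the digits that follow "tr" out of each class string
xr.refno <- sapply(cls,function(x)sub(".*\\str(\\d+)\\s.*","\\1",x))
# the "-g{xxx}-d{yyy}" part of the original url identifies the hotel
code     <- sub(".*Hotel_Review(\\-g\\d+\\-d\\d+)\\-Reviews.*","\\1",url)
xr.url   <- paste0("http://www.tripadvisor.com/ExpandedUserReviews",code)
# request the expanded text for all truncated reviews at once
xr.response <- GET(xr.url,query=list(target=xr.refno[1],
                                     context=1,
                                     reviews=paste(xr.refno,collapse=","),
                                     servlet="Hotel_Review",
                                     expand=1))
# parse with the XML package, as for the first request
xr.doc   <- htmlParse(content(xr.response, as="text", encoding="UTF-8"), asText=TRUE)
xr.full  <- xpathSApply(xr.doc,'//div[@class="entry"]/p',xmlValue)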
length(xr.full)
# [1] 6
xr.full[1]
# [1] "\nThats all that matters really...I wonder if anyone would chose this hotel for any other factor at all...located right next to Grand central station in midtown and within walking distance of many tourist attractions, top restaurants and corp offices. Stayed 3 nights here on a business trip, I chose this hotel over others purely based on its location. Price is about average in NYC I think. Asked for a room with a good view and was given a 2 BR on the 30th floor. After checking in I realized there may not be the kind of view that I expected at all from any room in this hotel - due to it being surrounded by high rises in all directions. However, no other complaints as such - except may that the bathroom was a bit too cramped. That I guess is the norm in NYC. I would stay here again if it was a business visit based on the location. Faster than avg wifi (free) was a good plus.\n"
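
The two `sub()` calls above carry most of the logic, so it may help to see them applied to literal strings, with no network access needed. This is only a sketch: the class strings and review ids are made up for illustration, and the real markup may differ.

```r
# Made-up class attributes in the shape the regex expects: "... tr<digits> ..."
cls <- c("someClass tr111111111 moreLink otherClass",
         "someClass tr222222222 moreLink otherClass")

# Capture the digits that follow a whitespace-delimited "tr" prefix.
xr.refno <- sub(".*\\str(\\d+)\\s.*", "\\1", cls)
xr.refno

url  <- "http://www.tripadvisor.in/Hotel_Review-g60763-d93450-Reviews-Grand_Hyatt_New_York-New_York_City_New_York.html"
# Capture the "-g{xxx}-d{yyy}" part that identifies the hotel.
code <- sub(".*Hotel_Review(-g\\d+-d\\d+)-Reviews.*", "\\1", url)
xr.url <- paste0("http://www.tripadvisor.com/ExpandedUserReviews", code)
xr.url
```

Note that `sub()` returns its input unchanged when the pattern does not match, so an unexpected class string would show up in `xr.refno` as the whole string rather than as an id.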

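Thank you... this is really helpful for a learner like me, and it gives me plenty of opportunity to learn new things. Your explanation is simple, and it motivates me to explore more. I am now trying to get, in a single run, both the reviews that are already shown in full and the full text of the "partial" reviews. As you mentioned, we get 6 full reviews, which are the "expanded reviews". Can we also get the remaining 4 reviews from that page, so that there are 10 reviews in total?

You will have to find the IDs of all the partial reviews. Inspect the html.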
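
One way to end up with all 10 reviews is to key everything by review ID and let the expanded text override the truncated summary wherever one exists. A minimal sketch with made-up IDs and texts; `merge_reviews` is a hypothetical helper, and it assumes you have already collected the ID of every review on the page and named both vectors with those IDs:

```r
# Hypothetical helper: prefer the expanded text when one exists for a review ID.
# `smry` and `full` are character vectors named by review ID.
merge_reviews <- function(smry, full) {
  out <- smry
  hit <- intersect(names(smry), names(full))  # IDs that have an expanded version
  out[hit] <- full[hit]
  out
}

# Made-up data: two reviews, one shown in full and one truncated on the page.
smry <- c("101" = "Short review, shown in full.",
          "102" = "Long review, cut off at...")
full <- c("102" = "Long review, cut off at this point on the page but complete here.")

res <- merge_reviews(smry, full)
res
```

The result keeps the order of the summaries on the page, so it still has one entry per review.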