Unable to extract text from a scraped HTML page using the R XML package
I am trying to extract the body of New York Times movie reviews so that I can run some semantic analysis on them. Unfortunately, my HTML + R + XML package skills are not up to the job. I can get the movie details from the XML output of the NYT movies API, but I cannot work out how to use either the articles API or a direct web page scrape to get the body of the review.
Working code for the movie details:
library(RCurl)
nyt.x.url<-'http://api.nytimes.com/svc/movies/v2/reviews/search.xml?query=The+Hangover&api-key=YOUR-OWN-FREE-API-KEY-GOES-HERE'
nyt.x.out<-getURLContent(nyt.x.url,curl=getCurlHandle())
library(XML)
a <- xmlTreeParse(nyt.x.url)
r <- xmlRoot(a)
# need to put the separate list items together into a matrix before they can be turned into a data frame
nyt.df <- as.data.frame(stringsAsFactors=FALSE,
matrix(c(as.character(r[[4]][[1]][[1]][[1]])[6], # display name
as.character(r[[4]][[1]][[3]][[1]])[6], # rating - agrees with rotten tomatoes, but not imdb
as.character(r[[4]][[1]][[4]][[1]])[6], # is it a critics pick
as.character(r[[4]][[1]][[5]][[1]])[6], # is it a thousand best
as.character(r[[4]][[1]][[11]][[1]])[6], # opening date
as.character(r[[4]][[1]][[15]][[1]][[1]])[6]), # this is really the URL....
nrow=1,
ncol=6))
# now apply the right names
colnames(nyt.df) <- c("Title","MPAA-Rating", "Critics.Pick", "Thousand.Best", "Release.Date", "Article.URL")
I think this is what you are after. There may be a way to do what you want directly from the API, but I have not looked into it.
# load package
library(XML)
# grabs text from new york times movie page.
grab_nyt_text <- function(u) {
doc <- htmlParse(u)
txt <- xpathSApply(doc, '//div[@class="articleBody"]//p', xmlValue)
txt <- paste(txt, collapse = "\n")
free(doc)
return(txt)
}
###--- Main ---###
# Step 1: api URL
nyt.x.url <- 'http://api.nytimes.com/svc/movies/v2/reviews/search.xml?query=The+Hangover&api-key=YOUR-OWN-FREE-API-KEY-GOES-HERE'
# Step 2: Parse XML of webpage pointed to by URL
doc <- xmlParse(nyt.x.url)
# Step 3: Parse XML and extract some values using XPath expressions
df <- data.frame(display.title = xpathSApply(doc, "//results//display_title", xmlValue),
critics.pick = xpathSApply(doc, "//results//critics_pick", xmlValue),
thousand.best = xpathSApply(doc, "//results//thousand_best", xmlValue),
opening.date = xpathSApply(doc, "//results//opening_date", xmlValue),
url = xpathSApply(doc, "//results//link[@type='article']/url", xmlValue),
stringsAsFactors=FALSE)
df
# display.title critics.pick thousand.best opening.date url
#1 The Hangover 0 0 2009-06-05 http://movies.nytimes.com/2009/06/05/movies/05hang.html
#2 The Hangover Part II 0 0 2011-05-26 http://movies.nytimes.com/2011/05/26/movies/the-hangover-part-ii-3-men-and-a-monkey-baby.html
# Step 4: clean up - remove doc from memory
free(doc)
# Step 5: crawl article links and grab text
df$text <- sapply(df$url, grab_nyt_text)
# Step 6: inspect txt
cat(df$text[1])
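To make the scraping step easier to follow, here is a minimal offline sketch of what the XPath expression in `grab_nyt_text` does. The HTML string `sample_html` is invented for illustration and is only an assumption about the page layout; the real NYT markup may differ or may have changed, in which case the `articleBody` class would need to be adjusted.

```r
library(XML)

# Hypothetical stand-in for a review page; not actual NYT markup.
sample_html <- '<html><body>
  <div class="nav"><p>Site navigation</p></div>
  <div class="articleBody"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>'

# Parse the string as HTML (asText = TRUE means "this is content, not a file or URL").
doc <- htmlParse(sample_html, asText = TRUE)

# Select only the <p> nodes nested inside the div whose class is "articleBody".
paras <- xpathSApply(doc, '//div[@class="articleBody"]//p', xmlValue)

# Release the parsed document once the text has been copied out.
free(doc)

# Join the paragraphs into one string, one paragraph per line.
review_text <- paste(paras, collapse = "\n")
cat(review_text, "\n")
```

The paragraph inside the `nav` div is skipped because the expression only descends from the `articleBody` div, which is how the function isolates the review paragraphs from the rest of the page.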
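That improves on my working code, and adding tryCatch here and there would make it better still. Thanks. However, my problem is that the code in the second-to-last paragraph is badly formatted and I cannot fix it on an iPad. The issue is that I cannot pick the HTML apart to get at the six or so paragraphs of the review, which is what I am after. (I want to do some sentiment analysis on movie reviews.)
@AndrewDempsey Sorry mate, I don't quite follow. If you are asking how to get the full text of the article through the API, then I don't know. My code does a screen scrape (per the "direct web page scrape" in your question) to get the main text of the article. (I have now updated the code to grab just the paragraphs. Is that what you meant?)
(Note to self: don't read code late at night on an iPad.) Sorry, I misread the code a bit last night. Everything works fine. Thanks, mate.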
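The tryCatch idea mentioned in the comments could look something like the sketch below. `safe_grab` and `fake_grab` are hypothetical names that do not appear in the original answer; `fake_grab` only stands in for a real grabber so that the example runs without network access.

```r
# Hypothetical helper: wraps any single-URL grabber so that a failed
# download or parse yields NA instead of aborting the whole loop.
safe_grab <- function(u, grab_fun) {
  tryCatch(
    grab_fun(u),
    error = function(e) {
      warning("Could not fetch ", u, ": ", conditionMessage(e), call. = FALSE)
      NA_character_
    }
  )
}

# Stand-in grabber (an assumption for illustration): fails on anything
# that does not look like a URL.
fake_grab <- function(u) {
  if (!grepl("^http", u)) stop("not a URL")
  paste("text of", u)
}

urls <- c("http://example.com/a.html", "broken")

# Warnings are suppressed here only to keep the demo output short.
texts <- suppressWarnings(
  vapply(urls, safe_grab, character(1), grab_fun = fake_grab, USE.NAMES = FALSE)
)
print(texts)
```

With the answer's function, the same wrapper would presumably be used as `df$text <- sapply(df$url, safe_grab, grab_fun = grab_nyt_text)`, so one unreachable review page leaves an `NA` in that row rather than stopping the crawl.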