Improving a function in R that fetches stock news data from Google
Tags: r, timezone, xts, quantmod, google-finance

I have written a function to fetch and parse news data from Google for a given stock symbol, but I am sure it can be improved. For starters, my function returns an object in the GMT time zone rather than the user's current time zone, and it fails if it is passed a number greater than 299 (probably because Google only returns 300 stories per stock). It is partly based on Stack Overflow and relies heavily on [...].

tl;dr: how can I improve this function?

getNews <- function(symbol, number){
  # Warn about length
  if (number>300) {
    warning("May only get 300 stories from google")
  }
  # load libraries
  require(XML); require(plyr); require(stringr); require(lubridate)
  require(xts); require(RDSTK)
  # construct url to news feed rss and encode it correctly
  url.b1 = 'http://www.google.com/finance/company_news?q='
  url = paste(url.b1, symbol, '&output=rss', "&start=", 1,
              "&num=", number, sep = '')
  url = URLencode(url)
  # parse xml tree, get item nodes, extract data and return data frame
  doc = xmlTreeParse(url, useInternalNodes = TRUE)
  nodes = getNodeSet(doc, "//item")
  mydf = ldply(nodes, as.data.frame(xmlToList))
  # clean up names of data frame
  names(mydf) = str_replace_all(names(mydf), "value\\.", "")
  # convert pubDate to date-time object and convert time zone
  # (zone names are case-sensitive on some platforms: 'America/New_York')
  pubDate = strptime(mydf$pubDate,
                     format = '%a, %d %b %Y %H:%M:%S', tz = 'GMT')
  pubDate = with_tz(pubDate, tz = 'America/New_York')
  mydf$pubDate = NULL
  # Parse the description field
  mydf$description <- as.character(mydf$description)
  parseDescription <- function(x) {
    out <- html2text(x)$text
    out <- strsplit(out,'\n|--')[[1]]
    # Find Lead
    TextLength <- sapply(out,nchar)
    Lead <- out[TextLength==max(TextLength)]
    # Find Site
    Site <- out[3]
    # Return cleaned fields
    out <- c(Site,Lead)
    names(out) <- c('Site','Lead')
    out
  }
  description <- lapply(mydf$description,parseDescription)
  description <- do.call(rbind,description)
  mydf <- cbind(mydf,description)
  # Format as XTS object
  mydf = xts(mydf,order.by=pubDate)
  # drop Extra attributes that we don't use yet
  mydf$guid.text = mydf$guid..attrs = mydf$description = mydf$link = NULL
  return(mydf)
}
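
One way to address the failure above 299 is to cap the count before the URL is built. The sketch below uses a hypothetical helper, `build_news_url` (the name and the `max_stories` argument are assumptions, not part of the original post), and takes 299 as the largest value the feed accepts, as reported in the question. It only assembles the string and does not contact Google.

```r
# Hypothetical helper (not from the original post): build the feed URL
# and cap the requested story count at an assumed maximum of 299.
build_news_url <- function(symbol, number, max_stories = 299L) {
  stopifnot(is.character(symbol), length(symbol) == 1L,
            is.numeric(number), length(number) == 1L, number >= 1)
  if (number > max_stories) {
    warning("Requested ", number, " stories; capping at ", max_stories)
    number <- max_stories
  }
  # Encode only the symbol, so the '&' and '=' separators stay intact
  paste0("http://www.google.com/finance/company_news?q=",
         utils::URLencode(symbol, reserved = TRUE),
         "&output=rss&start=1&num=", as.integer(number))
}

# Usage: a request for 500 stories is reduced to 299
url <- suppressWarnings(build_news_url("NASDAQ:AAPL", 500))
```

Capping inside the helper, rather than only warning, means the caller still gets a result when it asks for too many stories.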

Here is a shorter (and probably more efficient) version of the getNews function:

getNews2 <- function(symbol, number){
  # load libraries
  require(XML); require(plyr); require(stringr); require(lubridate)
  # construct url to news feed rss and encode it correctly
  url.b1 = 'http://www.google.com/finance/company_news?q='
  url = paste(url.b1, symbol, '&output=rss', "&start=", 1,
              "&num=", number, sep = '')
  url = URLencode(url)
  # parse xml tree, get item nodes, extract data and return data frame
  doc = xmlTreeParse(url, useInternalNodes = TRUE)
  nodes = getNodeSet(doc, "//item")
  mydf = ldply(nodes, as.data.frame(xmlToList))
  # clean up names of data frame
  names(mydf) = str_replace_all(names(mydf), "value\\.", "")
  # convert pubDate to date-time object and convert time zone
  # (zone names are case-sensitive on some platforms: 'America/New_York')
  mydf$pubDate = strptime(mydf$pubDate,
                          format = '%a, %d %b %Y %H:%M:%S', tz = 'GMT')
  mydf$pubDate = with_tz(mydf$pubDate, tz = 'America/New_York')
  # drop guid.text and guid..attrs
  mydf$guid.text = mydf$guid..attrs = NULL
  return(mydf)
}
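
On the time-zone point from the question: both versions hard-code New York. Below is a minimal base-R sketch of showing the timestamps in the user's own zone instead. The helper name `to_local` and the sample date strings are made up for illustration, and the parsing assumes an English locale because `%a` and `%b` are locale-dependent.

```r
# Made-up pubDate strings in the RSS date layout used above
raw <- c("Tue, 03 Jan 2012 14:30:00 GMT", "Wed, 04 Jan 2012 02:05:00 GMT")

# Parse as GMT. A POSIXct value stores an absolute instant, so changing
# its "tzone" attribute later only changes how it is displayed.
pub_gmt <- as.POSIXct(raw, format = "%a, %d %b %Y %H:%M:%S", tz = "GMT")

# Hypothetical helper: re-label a POSIXct vector with a display time zone
to_local <- function(x, tz = Sys.timezone()) {
  # Sys.timezone() can return NA on some systems; fall back to UTC
  if (is.na(tz) || !nzchar(tz)) tz <- "UTC"
  attr(x, "tzone") <- tz
  x
}

pub_ny    <- to_local(pub_gmt, "America/New_York")  # explicit zone
pub_local <- to_local(pub_gmt)                      # the user's own zone
format(pub_ny, "%Y-%m-%d %H:%M %Z")
```

Whether an xts object built from these values keeps the zone depends on how xts handles its index time zone, so it is worth checking the result after conversion.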
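
That's great, thanks! When you update the code to parse the description column, do you think it could return an xts object? I am also trying to write a regular expression to extract the site name from the link.

@Zach To convert to an xts object, just write myxts = as.xts(mydf[, names(mydf) != "pubDate"], order.by = mydf$pubDate)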
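
For the site-name regular expression mentioned above, here is a rough base-R sketch. The function name `extract_site` and the sample links are invented for illustration. It does not handle links that contain user credentials, and if a feed wraps the article URL in a redirect link, it would return the wrapper's host rather than the publisher's.

```r
# Hypothetical helper: pull the host name out of a URL with base-R regex
extract_site <- function(link) {
  host <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", link)  # drop the scheme
  host <- sub("[/?#:].*$", "", host)                    # drop port, path, query
  sub("^www\\.", "", host)                              # drop a leading "www."
}

# Made-up example links
links <- c("http://www.reuters.com/article/2012/01/03/some-story",
           "https://finance.example.com:8080/news?id=42")
sites <- extract_site(links)
```

Because `sub` is vectorised, the helper can be applied directly to a whole column, for example `mydf$Site = extract_site(as.character(mydf$link))`.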
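
Any suggestions for parsing the "description" field? I would like to strip out everything extra and keep only the first line of the article.

I believe the Google Finance API has changed.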
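
On parsing the description field, one possible approach without extra packages is to turn the HTML tags into line breaks and then pick a chunk. The sketch below is only illustrative: `split_description` is a hypothetical helper, the sample HTML is made up and may not match what the feed returned, and a regular expression is not a robust HTML parser.

```r
# Hypothetical helper: split one HTML description into clean text chunks
split_description <- function(html) {
  text <- gsub("<[^>]+>", "\n", html)               # tags become line breaks
  text <- gsub("&nbsp;", " ", text, fixed = TRUE)   # common HTML entity
  parts <- trimws(strsplit(text, "\n", fixed = TRUE)[[1]])
  parts[nzchar(parts)]                              # drop empty chunks
}

# Made-up sample description
desc <- paste0('<div><a href="http://example.com/a">Example headline</a><br>',
               '<font size="-1">Example Site</font><br>',
               '<font size="-1">This made-up opening sentence is the longest ',
               'chunk in the sample.</font></div>')

chunks <- split_description(desc)
# Same heuristic as getNews: treat the longest chunk as the lead sentence
lead <- if (length(chunks) > 0) chunks[which.max(nchar(chunks))] else NA_character_
```

Taking the longest chunk mirrors the heuristic in the question's `parseDescription`. If the layout were known to be stable, selecting a chunk by position would be simpler.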