Scaling / mean-centering / demeaning a variable in sqldf/SQLite?
I'm trying to use the sqldf package in R to mean-center (demean) a variable across three dimensions: year, month and region.
Here is what I want to do, using the plyr package:
## create example data
set.seed(145)
v = Sys.Date()-seq(1,425)
regions = LETTERS[1:6]
VAR1_DATA = as.data.frame(expand.grid(v,regions))
VAR1_DATA$VAR1 = rpois(nrow(VAR1_DATA), 4) + runif(nrow(VAR1_DATA), 25,35)
names(VAR1_DATA) = c("DATE","REG","VAR1")
## mean center VAR1 by year, month and region using plyr:
lapply(c('chron','plyr'), require, character.only=T)
table1 = cbind(MONTH = months(as.POSIXlt(VAR1_DATA[,'DATE'])),
YEAR = years(as.POSIXlt(VAR1_DATA[,'DATE'])),
VAR1_DATA)
table2 = ddply(table1, c('YEAR','MONTH','REG'), transform, MEAN.V1 = mean(VAR1), DEMEANED.V1 = VAR1 - mean(VAR1))
head(table2)
## MONTH YEAR DATE REG VAR1 MEAN.V1 DEMEANED.V1
## 1 December 2011 2011-12-31 A 30.03605 34.69316 -4.6571064
## 2 December 2011 2011-12-30 A 31.69130 34.69316 -3.0018600
## 3 December 2011 2011-12-29 A 35.46342 34.69316 0.7702634
## 4 December 2011 2011-12-28 A 32.09727 34.69316 -2.5958876
## 5 December 2011 2011-12-27 A 36.51519 34.69316 1.8220386
## 6 December 2011 2011-12-26 A 35.65338 34.69316 0.9602236
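Now I'd like to reproduce the result above using SQLite/SQL. Below is my current attempt in SQLite (warning: the code below does not work!). I'm including it to illustrate my SQL-ish thought process: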
require(sqldf)
sqldf("
SELECT
strftime('%m', t1.DATE) AS 'MONTH',
strftime('%Y', t1.DATE) AS 'YEAR',
t1.DATE,
t1.REG,
t1.VAR1,
t2.MVAR1 AS 'MO_AVG_VAR1',
(t1.VAR1-t2.MVAR1) AS 'DEMEANED_VAR1',
FROM VAR1_DATA AS t1,
(
SELECT
DATE,
REG,
avg(VAR1) AS MVAR1,
FROM VAR1_DATA
GROUP BY strftime('%Y', DATE), strftime('%m', DATE), REG
) AS t2
WHERE t1.REGION = t2.REGION
AND t1.DATE = t2.DATE
GROUP BY strftime('%Y', t1.DATE), strftime('%m', t1.DATE), t1.REGION
ORDER BY YEAR, MONTH, REG
;")
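Question: is this calculation possible in SQLite/sqldf, and if so, how? Bonus points if the answer also provides a (slightly modified?) implementation in regular SQL (i.e. MySQL, PostgreSQL, etc.).
Thanks very much!
Try this: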
## set order so we can compare it later
table2 <- table2[order(table2$DATE, table2$REG), ]
## use a single SQL statement
s1 <- "select
rowid,
*,
strftime('%Y-%m', DATE * 3600 * 24, 'unixepoch') AS 'YM'
from VAR1_DATA"
s2a <- "select a.*,
avg(b.VAR1) 'MEAN.V1',
a.VAR1 - avg(b.VAR1) 'DEMEANED.V1'
from ($s1) a, ($s1) b using (YM, REG)
group by a.rowid
order by a.DATE, a.REG"
# substitute s1 into s2a giving the single sql statement:
# cat(fn$identity(s2a), "\n")
tab2 <- fn$sqldf(s2a)
# ensure they compare to the plyr solution
all.equal(table2$MEAN.V1, tab2$MEAN.V1) # TRUE
all.equal(table2$DEMEANED.V1, tab2$DEMEANED.V1) # TRUE
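The idea in the question, joining every row to a per-group average computed in a subquery, can also be made to work. The following is only a minimal sketch and not part of the answer above: the data frame toy and the output names MEAN_V1 and DEMEANED_V1 are made up for illustration, and it assumes (as s1 does) that sqldf hands Date columns to SQLite as day counts since 1970-01-01.

```r
library(sqldf)

# Hypothetical toy data with the same column names as VAR1_DATA
set.seed(1)
toy <- data.frame(
  DATE = rep(as.Date("2012-01-30") + 0:3, times = 2),
  REG  = rep(c("A", "B"), each = 4),
  VAR1 = runif(8, 25, 35)
)

joined <- sqldf("
  select t1.DATE, t1.REG, t1.VAR1,
         t2.MEAN_V1,
         t1.VAR1 - t2.MEAN_V1 as DEMEANED_V1
  from (select *,
               strftime('%Y-%m', DATE * 3600 * 24, 'unixepoch') as YM
        from toy) t1
  join (select strftime('%Y-%m', DATE * 3600 * 24, 'unixepoch') as YM,
               REG,
               avg(VAR1) as MEAN_V1
        from toy
        group by YM, REG) t2   -- SQLite lets GROUP BY refer to the alias YM
    on t1.YM = t2.YM and t1.REG = t2.REG
  order by t1.DATE, t1.REG")

# Cross-check against base R: group means by year-month and region
chk <- toy[order(toy$DATE, toy$REG), ]
all.equal(joined$MEAN_V1, ave(chk$VAR1, format(chk$DATE, "%Y-%m"), chk$REG))
```

The subquery t2 has one row per year-month-region group, but the join attaches that row to every matching observation in t1, so the result keeps one row per input row. Apart from the strftime call, the statement is ordinary SQL, so the same shape should carry over to MySQL or PostgreSQL once the year-month expression is replaced with that database's own date function.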
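Thanks for your reply! One comment: in the output of your answer there is only a single observation for each year-month-region combination; however, I'd like the result to include all observations from t1, so that each observation has a demeaned VAR1. It looks like tab2 produces the same result as the original answer (only 90 rows). The desired result should have 2550 rows (just like table2 from the plyr output).
Taking a step back, what I really want to do is carry out the entire demeaning in a single sqldf statement.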
# s1 is as above
tab1 <- sqldf(s1)
s2b <- "select a.*,
avg(b.VAR1) 'MEAN.V1',
a.VAR1 - avg(b.VAR1) 'DEMEANED.V1'
from tab1 a, tab1 b using (YM, REG)
group by a.rowid
order by a.DATE, a.REG"
tab2 <- sqldf(s2b)
# ensure they compare to the plyr solution
all.equal(table2$MEAN.V1, tab2$MEAN.V1) # TRUE
all.equal(table2$DEMEANED.V1, tab2$DEMEANED.V1) # TRUE
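If the SQLite library behind sqldf is version 3.25 or newer (an assumption about your installation; older builds have no window functions), the group mean can be attached to each row without a self-join. This is a sketch of that alternative rather than the method used above; toy, MEAN_V1 and DEMEANED_V1 are hypothetical names chosen for illustration.

```r
library(sqldf)

# Hypothetical toy data with the same column names as VAR1_DATA
set.seed(1)
toy <- data.frame(
  DATE = rep(as.Date("2012-01-30") + 0:3, times = 2),
  REG  = rep(c("A", "B"), each = 4),
  VAR1 = runif(8, 25, 35)
)

# The inner query derives the year-month key YM (DATE arrives in SQLite
# as days since 1970-01-01); the outer query averages VAR1 within each
# (YM, REG) partition while keeping every input row.
win <- sqldf("
  select DATE, REG, VAR1,
         avg(VAR1) over (partition by YM, REG) as MEAN_V1,
         VAR1 - avg(VAR1) over (partition by YM, REG) as DEMEANED_V1
  from (select *,
               strftime('%Y-%m', DATE * 3600 * 24, 'unixepoch') as YM
        from toy)
  order by DATE, REG")

# Cross-check against base R: group means by year-month and region
chk <- toy[order(toy$DATE, toy$REG), ]
all.equal(win$MEAN_V1, ave(chk$VAR1, format(chk$DATE, "%Y-%m"), chk$REG))
```

The avg(...) over (partition by ...) form is standard SQL and is also available in PostgreSQL and in MySQL 8.0 or later; only the expression that builds YM is SQLite-specific.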