Count of unique values in a rolling date range in R

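This question already has an answer for SQL, and I was able to implement that solution in R using sqldf. However, I have not been able to find a way to implement it using data.table.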

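The problem is to count the distinct values of one column over a rolling date range. For example (quoting directly from the linked question), if the data looked like this: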

Date   | email 
-------+----------------
1/1/12 | test@test.com
1/1/12 | test1@test.com
1/1/12 | test2@test.com
1/2/12 | test1@test.com
1/2/12 | test2@test.com
1/3/12 | test@test.com
1/4/12 | test@test.com
1/5/12 | test@test.com
1/5/12 | test@test.com
1/6/12 | test@test.com
1/6/12 | test@test.com
1/6/12 | test1@test.com
If we use a date period of 3 days, the result set would look something like this:

date   | count(distinct email)
-------+------
1/1/12 | 3
1/2/12 | 3
1/3/12 | 3
1/4/12 | 3
1/5/12 | 2
1/6/12 | 2
Here is code to create the same data in R as a data.table:

library(data.table)
date <- as.Date(c('2012-01-01','2012-01-01','2012-01-01',
                  '2012-01-02','2012-01-02','2012-01-03',
                  '2012-01-04','2012-01-05','2012-01-05',
                  '2012-01-06','2012-01-06','2012-01-06'))
email <- c('test@test.com', 'test1@test.com','test2@test.com',
           'test1@test.com', 'test2@test.com','test@test.com',
           'test@test.com','test@test.com','test@test.com',
           'test@test.com','test@test.com','test1@test.com')
dt <- data.table(date, email)
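
As a reference point for the answers, the rolling count described above can be sketched in base R alone. This is a minimal sketch, not taken from the thread: `rolling_distinct` is a hypothetical helper name, and the window is assumed to cover each date plus the two preceding days.

```r
# Example data from the question, rebuilt here so the block runs on its own.
date <- as.Date(c('2012-01-01','2012-01-01','2012-01-01',
                  '2012-01-02','2012-01-02','2012-01-03',
                  '2012-01-04','2012-01-05','2012-01-05',
                  '2012-01-06','2012-01-06','2012-01-06'))
email <- c('test@test.com', 'test1@test.com','test2@test.com',
           'test1@test.com', 'test2@test.com','test@test.com',
           'test@test.com','test@test.com','test@test.com',
           'test@test.com','test@test.com','test1@test.com')

# Hypothetical helper (not part of any package): for every distinct day,
# count the distinct values seen in the `window` days ending on that day.
rolling_distinct <- function(date, value, window = 3L) {
  days <- sort(unique(date))
  counts <- vapply(seq_along(days), function(i) {
    d <- days[i]
    in_window <- date >= d - (window - 1L) & date <= d
    length(unique(value[in_window]))
  }, integer(1))
  data.frame(date = days, count = counts)
}

res <- rolling_distinct(date, email)
print(res)
```

On this sample the window ending 2012-01-05 covers 2012-01-03 through 2012-01-05 and contains only test@test.com, so the sketch returns 1 for that day. This loop scans the whole vector once per day, so it is only meant as a correctness check for small inputs.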
Edit 2: At @jangorecki's request, here are some timings based on @MichaelChirico's solution:

# The data
> dim(temp)
[1] 2627785       4
> head(temp)
         date category1 category2 itemId
1: 2013-11-08         0         2   1713
2: 2013-11-08         0         2  90485
3: 2013-11-08         0         2  74249
4: 2013-11-08         0         2   2592
5: 2013-11-08         0         2   2592
6: 2013-11-08         0         2    765
> uniqueN(temp$itemId)
[1] 13510
> uniqueN(temp$date)
[1] 127

# Timing for data.table
> system.time(dtTime <- temp[,
+   .(count = temp[.(seq.Date(.BY$date - 6L, .BY$date, "day"), 
+   .BY$category1, .BY$category2 ), uniqueN(itemId), nomatch = 0L]), 
+  by = c("date","category1","category2")])
   user  system elapsed 
  6.913   0.130   6.940 
> 
# Time for sqldf
> system.time(sqlDfTime <- 
+ sqldf(c("create index ldx on temp(date, category1, category2)",
+ "SELECT date, category1, category2,
+ (SELECT count(DISTINCT itemId)
+   FROM   temp
+   WHERE category1 = t.category1 AND category2 = t.category2 AND
+   date BETWEEN t.date - 6 AND t.date 
+   ) AS numItems
+ FROM temp t
+ GROUP BY date, category1, category2
+ ORDER BY 1;"))
   user  system elapsed 
 87.225   0.098  87.295 

Using the new non-equi join feature of data.table, here is something that works:

dt[dt[ , .(date3=date, date2 = date - 2, email)], 
   on = .(date >= date2, date<=date3), 
   allow.cartesian = TRUE
   ][ , .(count = uniqueN(email)), 
      by = .(date = date + 2)]
#          date V1
# 1: 2011-12-30  3
# 2: 2011-12-31  3
# 3: 2012-01-01  3
# 4: 2012-01-02  3
# 5: 2012-01-03  1
# 6: 2012-01-04  2
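
To make the non-equi join above easier to follow, the sketch below splits the same idea into separate steps. It is an illustration under assumptions rather than the answerer's code: the names `windows`, `win_start`, `win_end`, and `joined` are made up here, and a data.table version with non-equi joins (1.9.8 or later) is assumed.

```r
library(data.table)

# Example data from the question, rebuilt so the block runs on its own.
date <- as.Date(c('2012-01-01','2012-01-01','2012-01-01',
                  '2012-01-02','2012-01-02','2012-01-03',
                  '2012-01-04','2012-01-05','2012-01-05',
                  '2012-01-06','2012-01-06','2012-01-06'))
email <- c('test@test.com', 'test1@test.com','test2@test.com',
           'test1@test.com', 'test2@test.com','test@test.com',
           'test@test.com','test@test.com','test@test.com',
           'test@test.com','test@test.com','test1@test.com')
dt <- data.table(date, email)

# Step 1: one 3-day window [win_start, win_end] per distinct date.
windows <- unique(dt[, .(win_start = date - 2L, win_end = date)])

# Step 2: non-equi join. Every row of dt whose date falls inside a window
# is paired with that window; i.win_end keeps the window's end date.
joined <- dt[windows,
             .(window_end = i.win_end, email),
             on = .(date >= win_start, date <= win_end),
             allow.cartesian = TRUE]

# Step 3: count distinct emails per window.
res <- joined[, .(count = uniqueN(email)), by = window_end]
print(res)
```

Carrying the window end through the join explicitly avoids having to add 2 back to the join column afterwards, at the cost of materialising the joined rows before aggregating.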

Using the non-equi join feature recently implemented in data.table v1.9.7, this can be done as follows:

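# Assumed setup step, not shown in the source: date2 is inferred to be
# date - 2, which matches the dates in the output below.
dt[, date2 := date - 2]
dt[.(date3=unique(dt$date2)), .(count=uniqueN(email)), on=.(date>=date3, date2<=date3), by=.EACHI]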
#          date      date2 count
# 1: 2011-12-30 2011-12-30     3
# 2: 2011-12-31 2011-12-31     3
# 3: 2012-01-01 2012-01-01     3
# 4: 2012-01-02 2012-01-02     3
# 5: 2012-01-03 2012-01-03     1
# 6: 2012-01-04 2012-01-04     2
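
The dates in the output above are the start of each 3-day window, which is why they begin at 2011-12-30. The sketch below repeats the join and adds a column holding the window's end date. It assumes `date2` is `date - 2` (inferred from the output, not shown in the thread), and `window_end` is a name chosen here for illustration.

```r
library(data.table)

# Example data from the question, rebuilt so the block runs on its own.
date <- as.Date(c('2012-01-01','2012-01-01','2012-01-01',
                  '2012-01-02','2012-01-02','2012-01-03',
                  '2012-01-04','2012-01-05','2012-01-05',
                  '2012-01-06','2012-01-06','2012-01-06'))
email <- c('test@test.com', 'test1@test.com','test2@test.com',
           'test1@test.com', 'test2@test.com','test@test.com',
           'test@test.com','test@test.com','test@test.com',
           'test@test.com','test@test.com','test1@test.com')
dt <- data.table(date, email)

# Assumption: date2 marks the first day of the window that ends on `date`.
dt[, date2 := date - 2L]

# For each candidate start date3, match rows with date2 <= date3 <= date,
# i.e. rows dated within [date3, date3 + 2], and count distinct emails
# once per row of i (by = .EACHI).
res <- dt[.(date3 = unique(dt$date2)), .(count = uniqueN(email)),
          on = .(date >= date3, date2 <= date3), by = .EACHI]

# Both join columns in the result hold date3 (the window start);
# adding 2 days relabels each row by the window's end date.
res[, window_end := date + 2L]
print(res)
```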

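Hi, can you explain how the distinct email counts are obtained? That would be helpful. Thanks. I would think 2012-01-05 should have count = 2. If you just want the number of unique emails for each day, how about dt[, length(unique(email)), by = date]?

@R.S. That does not appear to be the desired output. The OP seems to want the distinct emails across today, yesterday, and the day before. uniqueN is also preferable to length(unique()).

Non-equi joins are currently being implemented in data.table. Once that is fully done, you should be able to do:
dt[.(Date = unique(Date)), uniqueN(email), by = .EACHI, on = .(Date - 2L …

Regarding the cleanup: the Date field comes from the i argument. This is unexpected behaviour that follows from consistency with base R. You can use x.Date to get the Date from x.
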
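
The difference between the per-day count suggested in the first comment and the rolling count the question asks for can be checked directly. The sketch below is illustrative only; `per_day_a` and `per_day_b` are names chosen here, and it also shows that `uniqueN(email)` gives the same numbers as `length(unique(email))`.

```r
library(data.table)

# Example data from the question, rebuilt so the block runs on its own.
date <- as.Date(c('2012-01-01','2012-01-01','2012-01-01',
                  '2012-01-02','2012-01-02','2012-01-03',
                  '2012-01-04','2012-01-05','2012-01-05',
                  '2012-01-06','2012-01-06','2012-01-06'))
email <- c('test@test.com', 'test1@test.com','test2@test.com',
           'test1@test.com', 'test2@test.com','test@test.com',
           'test@test.com','test@test.com','test@test.com',
           'test@test.com','test@test.com','test1@test.com')
dt <- data.table(date, email)

# Distinct emails per single day (no rolling window), written both ways.
per_day_a <- dt[, .(n = length(unique(email))), by = date]
per_day_b <- dt[, .(n = uniqueN(email)), by = date]
print(per_day_b)
```

On this data the per-day counts for 2012-01-02 through 2012-01-04 are 2, 1, and 1, whereas the 3-day windows ending on those days each contain 3 distinct emails, so the per-day grouping does not answer the question.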
setkey(dt, date)

dt[ , .(count = dt[.(seq.Date(.BY$date - 2L, .BY$date, "day")),
                   uniqueN(email), nomatch = 0L]), by = date]
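
The keyed approach above looks up, for each date, the rows belonging to the three days of its window. A minimal sketch of a single such lookup follows; `d`, `window_days`, and `rows` are names introduced here for illustration.

```r
library(data.table)

# Example data from the question, rebuilt so the block runs on its own.
date <- as.Date(c('2012-01-01','2012-01-01','2012-01-01',
                  '2012-01-02','2012-01-02','2012-01-03',
                  '2012-01-04','2012-01-05','2012-01-05',
                  '2012-01-06','2012-01-06','2012-01-06'))
email <- c('test@test.com', 'test1@test.com','test2@test.com',
           'test1@test.com', 'test2@test.com','test@test.com',
           'test@test.com','test@test.com','test@test.com',
           'test@test.com','test@test.com','test1@test.com')
dt <- data.table(date, email)
setkey(dt, date)

# The window ending on the first day of the data.
d <- as.Date('2012-01-01')
window_days <- seq.Date(d - 2L, d, "day")   # 2011-12-30, 2011-12-31, 2012-01-01

# Keyed join on date: nomatch = 0L drops window days that have no rows,
# so the two December dates contribute nothing.
rows <- dt[.(window_days), nomatch = 0L]
n <- uniqueN(rows$email)
print(rows)
print(n)
```

In the full expression, `.BY$date` plays the role of `d`, and the inner lookup is repeated once per group.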