R: How to correlate time series with different sampling frequencies
I measured room temperature once per minute for 36 minutes, and skin temperature 32 times per second over the same period. I repeated the experiment 35 times, each run labeled with an ID. I need to look at the correlation between the two, but the sample sizes are unequal.
Data: I have a data.frame df1 with room temperature measured once per minute, and another data.frame df2 with skin temperature measured 32 times per second. I have 36 minutes of data. There is also a column called ID that gives the experiment number (1-35), but I don't know how to represent it in the sample data below. So technically, I'm looking for the correlation between SkinTemp and RoomTemp by ID.
df1 <- data.frame(
  roomTemp = rnorm(1*36)
)
df2 <- data.frame(
  skinTemp = rnorm(32*60*36)
)
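Below, I provide a minimal example of how to compute this correlation.
You can check my comments in the code below, but essentially I create bins (or "buckets") for each room-temperature observation time. I then aggregate the skin-temperature observations, which far outnumber the room-temperature ones, into the corresponding bins. In this example there is one room-temperature observation for every 36*60*32 skin-temperature observations, so the first 36*60*32 skin-temperature observations go into bin "1". The process continues from there: skin-temperature observations with indices in (36*60*32, 36*60*32*2] go into bin "2", and so on.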
library(lubridate)
library(dplyr)
# create the times of our observations
time.room.temp <- seq.POSIXt(from = as.POSIXct('02/20/2017', format = '%m/%d/%Y'), to = as.POSIXct('02/21/2017', format = '%m/%d/%Y'), by = 36*60)
time.skin.temp <- seq.POSIXt(from = as.POSIXct('02/20/2017', format = '%m/%d/%Y'), to = as.POSIXct('02/21/2017', format = '%m/%d/%Y'), by = 1/32)
n.obs.room.temp <- length(time.room.temp)
n.obs.skin.temp <- length(time.skin.temp)
# create some "actual" temperature data
obs.room.temp <- rnorm(n.obs.room.temp, mean = 60, sd = 10)
obs.skin.temp <- rnorm(n.obs.skin.temp, mean = 95, sd = 5)
room.temp.df <- data.frame(room.temp = obs.room.temp, time = time.room.temp)
skin.temp.df <- data.frame(skin.temp = obs.skin.temp, time = time.skin.temp)
# Every 32 indices the time advances by one second, so a new bin starts every
# 36*60*32 = 69120 indices: there are 69120 skin-temp observations for every room-temp observation.
# So we can "bin" the skin-temperature observations in order to compute a mean temperature per bin,
# i.e. a mean skin temperature for every time at which room temp was recorded
bins <- cut(1:n.obs.skin.temp, seq(0, n.obs.skin.temp, 36*60*32), labels = 1:40)
skin.temp.df$bins <- bins
# Now, we can effectively group skin temperature observations by room temperature observations, and get the average (or median, if you like)
# temperature for each bin
shorter.skin.temp.df <- skin.temp.df %>%
  group_by(bins) %>%
  summarise(average.skin.temp = mean(skin.temp))
# Now we can get the correlation between the two types of temperatures!
cor(room.temp.df$room.temp, shorter.skin.temp.df$average.skin.temp)
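So you can rest assured that for every unique room-temperature observation time there is a corresponding unique bin of skin-temperature observations.
A rolling join or interpolation might be useful for imputing room temperature at the times when skin temperature was measured. Below are examples of both. The first part is an update that handles multiple IDs, followed by the original answer for a single ID.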
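Update: new version that handles multiple IDs
This update addresses the case of data with multiple IDs, where we want to interpolate or do a rolling join separately for each ID.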
library(data.table)
library(reshape2)
library(dplyr)
library(purrr)
library(ggplot2)
theme_set(theme_classic(base_size=16))
First, we'll create fake autocorrelated data for two separate IDs:
set.seed(395)
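# time is in seconds; the sampling grids are an assumption
# (room: one reading every 60 s, skin: one reading every 1/32 s)
df1 <- data.frame(roomTemp = c(cumsum(rnorm(1*36)), cumsum(rnorm(1*36))),
                  time = rep(seq(0, by=60, length.out=36), 2),
                  ID = rep(c("A","B"), each=36))
df2 <- data.frame(skinTemp = c(cumsum(rnorm(32*60*36,0,0.01)),
                               cumsum(rnorm(32*60*36,0,0.01))),
                  time = rep(seq(0, by=1/32, length.out=32*60*36), 2),
                  ID = rep(c("A","B"), each=32*60*36))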
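Convert the data frames to data tables. This time, in addition to time, we also make ID a key column so that the rolling join is done separately for each ID.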
# Convert data frames to data tables
setDT(df1)
setDT(df2)
# Make ID and time key columns in both data frames (for joining)
setkey(df1, ID, time)
setkey(df2, ID, time)
# Rolling join roomTemp to nearest time value of skinTemp
df2 = df1[df2, roll="nearest"]
# Rename rolling joined room temperature column
names(df2)[grep("roomTemp", names(df2))] = "roomTempRoll"
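To add the interpolated roomTemp by ID, I used map_df from the purrr package. map_df operates on each ID separately, and approx takes care of the interpolation. In the original answer I first used approxfun to create an approximation function, but here I just do the interpolation directly in one step. map_df returns a data frame, but we only need the y column, which holds the interpolated roomTemp values, so I extract it at the end of the function chain and assign it to roomTempInterp in df2.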
# Add interpolated room temperature by ID
df2$roomTempInterp = unique(df2$ID) %>%
  map_df(~ approx(df1$time[df1$ID==.x], df1$roomTemp[df1$ID==.x],
                  xout=df2$time[df2$ID==.x]), .id="ID") %>% .$y
In the plot below, we facet by ID so that we can see the imputed temperature values for each ID separately.
# Plot so we can see what the rolling joined room temperature and
# interpolated room temperature look like
ggplot(melt(df2, id.var=c("ID", "time")), aes(time, value, colour=variable)) +
geom_line(size=0.7) +
geom_point(data=df1, aes(time, roomTemp), colour="black") +
facet_grid(ID ~ .)
Here is one way to get the correlations by ID:
df2 %>% group_by(ID) %>%
summarise(r_interp = cor(skinTemp, roomTempInterp, use="pairwise.complete.obs"),
r_roll = cor(skinTemp, roomTempRoll, use="pairwise.complete.obs"))
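A side note on why use="pairwise.complete.obs" can matter here: with its default rule=1, approx() returns NA for any requested time that falls outside the range of the measured times, so skin readings taken after the last room reading would get no interpolated value. The sketch below shows this on made-up numbers (the vectors room.time, room.temp and skin.time are hypothetical, not from the experiment), along with rule=2, which instead carries the nearest end value forward.

```r
# Hypothetical room readings at 0, 60 and 120 seconds
room.time <- c(0, 60, 120)
room.temp <- c(20, 21, 23)

# Hypothetical skin-reading times; the last one lies past the final room reading
skin.time <- c(0, 30, 90, 150)

# Default rule = 1: linear interpolation inside the range, NA outside it
interp <- approx(room.time, room.temp, xout = skin.time)
interp$y

# rule = 2: outside the range, use the value at the closest end point
held <- approx(room.time, room.temp, xout = skin.time, rule = 2)$y
held
```

Whether NA or a carried-forward value is more appropriate depends on how much you trust the last room reading beyond its own timestamp.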
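Original answer
First, I modified the sample data frames to add some autocorrelation, since that seems closer to your real experiment and makes the visualization easier.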
library(data.table)
library(reshape2)
library(dplyr)
library(ggplot2)
theme_set(theme_classic(base_size=16))
# Fake data with autocorrelation
set.seed(395)
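# time is in seconds; the sampling grids are an assumption
# (room: one reading every 60 s, skin: one reading every 1/32 s)
df1 <- data.frame(roomTemp = cumsum(rnorm(1*36)),
                  time = seq(0, by=60, length.out=36))
df2 <- data.frame(skinTemp = cumsum(rnorm(32*60*36,0,0.01)),
                  time = seq(0, by=1/32, length.out=32*60*36))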
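For interpolation, we need a function that interpolates room temperature at times when skin temperature was measured between room-temperature measurements. approxfun performs linear interpolation between points. You could also use splinefun in a similar way to interpolate with splines.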
# Function to interpolate room temperature between measurements
roomTempInterp = approxfun(df1$time, df1$roomTemp)
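To make the difference between the two interpolators concrete, here is a small sketch on made-up readings (room.time and room.temp are hypothetical values, not the experiment's data). Both approxfun and splinefun return a function of time; the linear one connects the readings with straight segments, while the spline passes through the same readings with a smooth curve.

```r
# Hypothetical room readings, one per minute (time in seconds)
room.time <- c(0, 60, 120, 180)
room.temp <- c(20, 21, 23, 22)

# Each call returns a function that can be evaluated at arbitrary times
lin <- approxfun(room.time, room.temp)   # piecewise linear
spl <- splinefun(room.time, room.temp)   # cubic spline through the points

# Evaluate both halfway between the 60 s and 120 s readings
lin(90)
spl(90)

# Both reproduce the measured values at the measurement times
lin(room.time)
spl(room.time)
```

A spline can overshoot between readings, so the linear version is the more conservative choice when the room temperature is only sampled once per minute.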
Convert the data frames to data tables so that we can use data.table's rolling join feature.
# Convert data frames to data tables
setDT(df1)
setDT(df2)
# Make time a key column in both data frames (for joining)
setkey(df1, time)
setkey(df2, time)
Now perform the rolling join to the nearest time value.
# Rolling join roomTemp to nearest time value of skinTemp
df2 = df1[df2, roll="nearest"]
# Rename rolling joined room temperature column
names(df2)[grep("roomTemp", names(df2))] = "roomTempRoll"
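For readers new to rolling joins, the following toy sketch shows what roll="nearest" does on a handful of rows. The tables room and skin and their values are made up for illustration; only the join syntax mirrors the code above.

```r
library(data.table)

# Hypothetical room readings at 0, 60 and 120 seconds
room <- data.table(time = c(0, 60, 120), roomTemp = c(20, 21, 23))

# Hypothetical skin readings at irregular times
skin <- data.table(time = c(10, 29, 31, 95),
                   skinTemp = c(33.1, 33.2, 33.0, 33.4))

setkey(room, time)
setkey(skin, time)

# For each skin row, take the room reading whose time is closest
joined <- room[skin, roll = "nearest"]
joined
```

The result has one row per skin reading: the readings at 10 s and 29 s pick up the room value from 0 s, the one at 31 s picks up the value from 60 s, and the one at 95 s picks up the value from 120 s.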
Merge the original roomTemp measurements from df1 into df2.
df2 = df1[df2, ] # Equivalent to dplyr: df2 = left_join(df2, df1)
Add the interpolated room temperature using the function created above.
# Add interpolated room temperature
df2$roomTempInterp = roomTempInterp(df2$time)
The interpolation approach seems more realistic to me, especially if we can assume that roomTemp varies relatively smoothly and monotonically between measurements. You can now use this data frame to assess the correlation and other relationships between roomTemp and skinTemp. Here are the first 10 rows of df2, which contains the original df2 data plus the new roomTempRoll and roomTempInterp columns and the original roomTemp measurements from df1:
    roomTemp    time roomTempRoll     skinTemp roomTempInterp
 1: -1.21529 0.00000     -1.21529 -0.006511475      -1.215290
 2:       NA 0.03125     -1.21529 -0.014058076      -1.215531
 3:       NA 0.06250     -1.21529 -0.017741690      -1.215773
 4:       NA 0.09375     -1.21529 -0.030211177      -1.216014
 5:       NA 0.12500     -1.21529 -0.027105225      -1.216255
 6:       NA 0.15625     -1.21529 -0.035784295      -1.216497
 7:       NA 0.18750     -1.21529 -0.031319748      -1.216738
 8:       NA 0.21875     -1.21529 -0.033758959      -1.216979
 9:       NA 0.25000     -1.21529 -0.040667384      -1.217220
10:       NA 0.28125     -1.21529 -0.026291442      -1.217462
Below is a plot showing what the rolling join and the interpolation look like. The black points mark the original roomTemp measurements.
ggplot(melt(df2 %>% select(-roomTemp), id.var="time"), aes(time, value, colour=variable)) +
  geom_line(size=1) +
  geom_point(data=df2, aes(time, roomTemp), colour="black")
This is a really great answer, thank you. I found that in each experiment my skin and room measurements are not of equal size. That is, the thermometers were not switched off at the same time in each experiment. So I added a column called ID to both the skin temperature and the room temperature data, which ties them to each experiment. Is there a way to bin the observations by ID?
IIUC, you already have "category bins" for the IDs in your data. You could probably extend the group_by above to group_by(ID, bins), but you would need to make sure that the bins for each ID correspond to the right time intervals. For example, for ID=1 you might have time-interval bins [[t0,t1],[t1,t2],…], while for ID=2 you might have [[t1,t2],[t3,t4],…], so if you tried to compare correlations between ID 1 and ID 2 you might miss the first interval. Something to consider. Reading your comment again, you will find ...
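Following up on the group_by(ID, bins) suggestion: one way to keep each ID's bins tied to the right time interval is to derive the bin from the timestamp itself (here, the minute in which a skin reading was taken) rather than from the row index. The sketch below uses base R and made-up data for two short experiments. The column names minute and time, the 5-minute duration, and the temperature values are all assumptions for illustration, not the asker's data.

```r
set.seed(1)
n.min <- 5  # hypothetical experiment length in minutes

# Room temperature: one reading per minute for each of two IDs
room <- data.frame(ID = rep(1:2, each = n.min),
                   minute = rep(0:(n.min - 1), 2),
                   roomTemp = rnorm(2 * n.min, mean = 22, sd = 1))

# Skin temperature: 32 readings per second for each ID (time in seconds)
n.skin <- n.min * 60 * 32
skin <- data.frame(ID = rep(1:2, each = n.skin),
                   time = rep(seq(0, by = 1/32, length.out = n.skin), 2))
skin$skinTemp <- rnorm(nrow(skin), mean = 34, sd = 0.5)

# Bin each skin reading by the minute it falls in
skin$minute <- floor(skin$time / 60)

# Mean skin temperature per ID and minute
skin.avg <- aggregate(skinTemp ~ ID + minute, data = skin, FUN = mean)

# Match bins to room readings on both ID and minute, so unmatched bins drop out
both <- merge(room, skin.avg, by = c("ID", "minute"))

# One correlation per ID
r.by.id <- sapply(split(both, both$ID),
                  function(d) cor(d$roomTemp, d$skinTemp))
r.by.id
```

Because the merge is keyed on ID and minute, an experiment whose thermometers stopped at different times simply contributes fewer matched rows instead of shifting the bins.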