How to cluster time series data with vectors of different sizes in R

Tags: r, machine-learning, time-series, cluster-analysis, hierarchical-clustering

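I am interested in dividing the time series data I have into 6 groups. Each row of my data represents a single time series, and I have about 800 to 1000 of them. The series have different lengths, however: for example, time series "1" has 102 values, time series "2" has 56 values, time series "3" has 180 values, and so on. My sample data in Excel looks like this: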

  A    B    C    D    E    F    G    H    I    J    K    L    M    N    O    P    Q    R    S    T
1 7.4  8.1  8.5  9.1  9.6  10.2 10.7 11.3 11.9
2 7.3  7.6  7.9  8.2  8.5  8.8  9.1  9.4  9.7  10.1 10.4 10.7 11.5
3 7.6  8.1  8.6  9.1  9.6  10.2 10.7 11.8
4 7.4  7.8  8.4  8.9  9.4  10.0 10.5 11.1 11.6 12.3 12.8 13.4 13.5 13.9 14.4 14.9 15.4

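I don't know how to cluster time series of unequal length.

How can I compute the DTW between two time series?

For time series of equal length, the following code works: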

library(dtw)

dm <- dist(sample1, method = "DTW")
hc <- hclust(dm, method = "average")

plot(hc, labels = Labels,
     cex = 0.5,
     hang = -1,
     col = 'blue',
     main = "cluster dendrogram")

rect.hclust(hc, k = 6) # displays the groups in the plot

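I don't know much about clustering time series, but I tried the method from the OP's example on time series of unequal length, and it seems to work well. And it should: according to the authors of the dtw package,

"The function performs Dynamic Time Warping (DTW) and computes the optimal alignment between two time series x and y, given as numeric vectors. The 'optimal' alignment minimizes the sum of distances between aligned elements. Lengths of x and y may differ."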
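To make the "minimizes the sum of distances between aligned elements" idea concrete, here is a minimal base-R sketch of the classic DTW recurrence. The function name dtw_distance is hypothetical, and the sketch uses unit step weights, so its values will generally differ from those of the dtw package, whose default step pattern weights diagonal moves differently. It is meant only to show why the two inputs may have different lengths.

```r
# Illustrative DTW distance in base R (not the dtw package implementation).
# cost[i + 1, j + 1] holds the cheapest alignment of x[1:i] with y[1:j].
dtw_distance <- function(x, y) {
  n <- length(x)
  m <- length(y)
  cost <- matrix(Inf, nrow = n + 1, ncol = m + 1)
  cost[1, 1] <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      d <- abs(x[i] - y[j])  # local distance between the two aligned points
      cost[i + 1, j + 1] <- d + min(cost[i, j + 1],  # advance in x only
                                    cost[i + 1, j],  # advance in y only
                                    cost[i, j])      # advance in both
    }
  }
  cost[n + 1, m + 1]
}

# Two series of different length that follow the same shape
a <- c(1, 2, 3)
b <- c(1, 2, 2, 3)
dtw_distance(a, b)
```

Because one point of a series may be aligned with several points of the other, nothing in the recurrence requires length(x) == length(y).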
First load the libraries:
library(dtw); library(ggplot2)

Then create the time series:
xlist <- list(x1 = c(7.4, 8.1, 8.5, 9.1, 9.6, 10.2, 10.7, 11.3, 11.9),
              x2 = c(7.3, 7.6, 7.9, 8.2, 8.5, 8.8, 9.1, 9.4, 9.7, 10.1,
                     10.4, 10.7, 11.5),
              x3 = c(7.6, 8.1, 8.6, 9.1, 9.6, 10.2, 10.7, 11.8),
              x4 = c(7.4, 7.8, 8.4, 8.9, 9.4, 10, 10.5, 11.1, 11.6, 12.3,
                     12.8, 13.4, 13.5, 13.9, 14.4, 14.9, 15.4))
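One way to go from such a list to clusters, as a minimal sketch rather than the answer's own code: compute the DTW distance for every pair of series with dtw(), collect the results in a symmetric matrix, and pass that to hclust(). The names dmat and groups are illustrative, and k = 2 is chosen only because the sample has four series; the question asks for 6 groups on the full data.

```r
library(dtw)

xlist <- list(x1 = c(7.4, 8.1, 8.5, 9.1, 9.6, 10.2, 10.7, 11.3, 11.9),
              x2 = c(7.3, 7.6, 7.9, 8.2, 8.5, 8.8, 9.1, 9.4, 9.7, 10.1,
                     10.4, 10.7, 11.5),
              x3 = c(7.6, 8.1, 8.6, 9.1, 9.6, 10.2, 10.7, 11.8),
              x4 = c(7.4, 7.8, 8.4, 8.9, 9.4, 10, 10.5, 11.1, 11.6, 12.3,
                     12.8, 13.4, 13.5, 13.9, 14.4, 14.9, 15.4))

# Pairwise DTW distances; each pair is computed once and mirrored
n <- length(xlist)
dmat <- matrix(0, n, n, dimnames = list(names(xlist), names(xlist)))
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    d <- dtw(xlist[[i]], xlist[[j]], distance.only = TRUE)$distance
    dmat[i, j] <- d
    dmat[j, i] <- d
  }
}

# Hierarchical clustering on the precomputed distances
hc <- hclust(as.dist(dmat), method = "average")
groups <- cutree(hc, k = 2)  # illustrative k for the 4-series sample
groups
```

With 800 to 1000 series this loop performs several hundred thousand DTW computations, so it may be slow; the raw distance also grows with series length, which may favor grouping by length unless the distances are normalized.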

Using the data shown reproducibly in the Note at the end, we can fit a straight line to each series and then cluster the slopes:
library(Ckmeans.1d.dp) # univariate clustering package

slopes <- coef(lm(t(DF) ~ seq_along(DF)))[2, ]
fm <- Ckmeans.1d.dp(slopes)

# graph the slopes on X axis identifying each and
# coloring each cluster with a different color
plot(fm)
text(slopes, 1, 1:4, adj = 0:-1)

[continued after the plot]
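A variant of the slope idea, sketched under two assumptions that are not part of the answer. First, as far as I can tell, lm() with a matrix response and the default NA handling drops every time point at which any series is missing, so the fit above would use only the time points shared by all series; fitting each series separately uses all of its points. Second, base R hclust() stands in for Ckmeans.1d.dp here so that the sketch needs no extra packages. The names series, slope_of and slopes_each are illustrative.

```r
# Each series kept at its own length
series <- list(
  c(7.4, 8.1, 8.5, 9.1, 9.6, 10.2, 10.7, 11.3, 11.9),
  c(7.3, 7.6, 7.9, 8.2, 8.5, 8.8, 9.1, 9.4, 9.7, 10.1, 10.4, 10.7, 11.5),
  c(7.6, 8.1, 8.6, 9.1, 9.6, 10.2, 10.7, 11.8),
  c(7.4, 7.8, 8.4, 8.9, 9.4, 10, 10.5, 11.1, 11.6, 12.3,
    12.8, 13.4, 13.5, 13.9, 14.4, 14.9, 15.4)
)

# Slope of a least-squares line through one series against its time index
slope_of <- function(y) {
  tt <- seq_along(y)
  unname(coef(lm(y ~ tt))[2])
}

slopes_each <- sapply(series, slope_of)

# Cluster the one-dimensional slopes (swapped-in technique: hclust)
hc_slopes <- hclust(dist(slopes_each), method = "average")
groups <- cutree(hc_slopes, k = 2)  # illustrative k for the 4-series sample
data.frame(series = seq_along(series), slope = slopes_each, group = groups)
```

A single slope discards everything about a series except its linear trend, so this only makes sense when the series are roughly linear, as the sample rows appear to be.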

Note
The input in reproducible form:
Lines <- "row A    B    C    D    E    F    G    H    I    J    K    L    M    N    O    P    Q    R    S    T
1 7.4  8.1  8.5  9.1  9.6  10.2 10.7 11.3 11.9
2 7.3  7.6  7.9  8.2  8.5  8.8  9.1  9.4  9.7  10.1 10.4 10.7 11.5
3 7.6  8.1  8.6  9.1  9.6  10.2 10.7 11.8
4 7.4  7.8  8.4  8.9  9.4  10.0 10.5 11.1 11.6 12.3 12.8 13.4 13.5 13.9 14.4 14.9 15.4"
DF <- read.table(text = Lines, header = TRUE, fill = TRUE)[-1]
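If the series are needed as separate vectors of their true lengths (for example to pass pairs of them to dtw()), the NA padding introduced by fill = TRUE can be stripped row by row. This is a small sketch; the name series is illustrative.

```r
Lines <- "row A    B    C    D    E    F    G    H    I    J    K    L    M    N    O    P    Q    R    S    T
1 7.4  8.1  8.5  9.1  9.6  10.2 10.7 11.3 11.9
2 7.3  7.6  7.9  8.2  8.5  8.8  9.1  9.4  9.7  10.1 10.4 10.7 11.5
3 7.6  8.1  8.6  9.1  9.6  10.2 10.7 11.8
4 7.4  7.8  8.4  8.9  9.4  10.0 10.5 11.1 11.6 12.3 12.8 13.4 13.5 13.9 14.4 14.9 15.4"
DF <- read.table(text = Lines, header = TRUE, fill = TRUE)[-1]

# One numeric vector per row, with the trailing NA padding removed
series <- lapply(seq_len(nrow(DF)), function(i) {
  v <- unlist(DF[i, ], use.names = FALSE)
  v[!is.na(v)]
})

lengths(series)
```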

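Comments

You could try discovering motifs in the time series as a way of finding similarity, that is, finding shorter segments that are similar rather than comparing the series over their whole length. However, without deeper knowledge of the domain, the problem at hand, and so on, it is hard to say whether that is appropriate.

Cluster-based algorithms can do this.

Could you explain in detail what the syntax does? I have a large set of time series of unequal length (each row * n columns is one data point/time series). Can we cluster the data using the proposed approach? Thanks in advance.

When I tested the code with 33 variable-length time series, I got the following error: Error in model.frame.default(formula = t(asqw1) ~ seq_along(asqw1), drop.unused.levels = TRUE): variable lengths differ (found for 'seq_along(asqw1)'). What should the values in the (starred) syntax be: slopes1

Reduce it to a minimal reproducible example and show it.

I have posted an answer to the question, please have a look at it.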
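The "variable lengths differ" error reported in the comments is consistent with asqw1 being a matrix rather than a data frame, although that is an assumption, since the commenter's data was never shown. seq_along() counts the columns of a data frame but every element of a matrix, so for a matrix the predictor no longer has as many entries as t(asqw1) has rows. A toy illustration, with hypothetical names m and df:

```r
m  <- matrix(c(1, 2, 4, 3, 5, 9), nrow = 2)  # 2 series, 3 time points
df <- as.data.frame(m)

n_time   <- nrow(t(m))             # rows of the response: 3 time points
n_seq_df <- length(seq_along(df))  # data frame: one index per column, 3
n_seq_m  <- length(seq_along(m))   # matrix: one index per element, 6

# With a matrix the formula has mismatched lengths and lm() fails
res <- tryCatch(lm(t(m) ~ seq_along(m)), error = function(e) e)
inherits(res, "error")

# Converting to a data frame first restores matching lengths
fit <- lm(t(as.data.frame(m)) ~ seq_along(as.data.frame(m)))
coef(fit)[2, ]
```

If that is the cause, wrapping the input in as.data.frame() before using the formula from the answer should avoid the error.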

Continuation of the slope-clustering answer, after the plot:

library(zoo)

# plot each series with each cluster having a different color
plot(zoo(t(DF)), screen = 1, col = fm$cluster)
