Python: how to cluster multivariate time series?
I have a toy time series dataframe in the following format:
>>> df
dtime dev sw1 sw2
0 2020-01-01 00:00:00 A1 5.496714 5.593792
1 2020-01-01 00:15:00 A1 5.417291 6.385936
2 2020-01-01 00:30:00 A1 6.758800 6.056747
3 2020-01-01 00:45:00 A1 8.189697 7.862034
4 2020-01-01 01:00:00 A1 6.988069 6.595961
5 2020-01-01 01:15:00 A1 7.543641 6.080126
6 2020-01-01 01:30:00 A1 9.912546 10.208666
7 2020-01-01 01:45:00 A1 9.656324 9.917379
8 2020-01-01 02:00:00 A1 8.974970 8.980084
9 2020-01-01 02:15:00 A1 10.542560 10.307973
0 2020-01-01 00:00:00 B1 4.536582 3.121212
1 2020-01-01 00:15:00 B1 5.089826 4.669180
2 2020-01-01 00:30:00 B1 6.353073 6.010359
3 2020-01-01 00:45:00 B1 4.753386 3.951109
4 2020-01-01 01:00:00 B1 5.497304 5.336019
.. ... .. ... ...
5 2020-01-01 01:15:00 H3 3.044125 3.456906
6 2020-01-01 01:30:00 H3 1.753714 2.575774
7 2020-01-01 01:45:00 H3 0.812104 2.708897
8 2020-01-01 02:00:00 H3 0.647316 0.401928
9 2020-01-01 02:15:00 H3 -1.987569 -2.741305
0 2020-01-01 00:00:00 I3 4.780328 3.890814
1 2020-01-01 00:15:00 I3 4.801557 3.985747
2 2020-01-01 00:30:00 I3 5.366783 5.289681
3 2020-01-01 00:45:00 I3 2.815063 3.156215
4 2020-01-01 01:00:00 I3 1.969284 2.245975
5 2020-01-01 01:15:00 I3 1.720465 2.547648
6 2020-01-01 01:30:00 I3 2.582069 2.595071
7 2020-01-01 01:45:00 I3 1.439862 2.893396
8 2020-01-01 02:00:00 I3 0.025795 -0.238861
9 2020-01-01 02:15:00 I3 0.513267 3.233437
[90 rows x 4 columns]
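Each row shows the d(ate)time, the device, and the positions of the device's two switches.
I have to cluster the time series based on the switch positions.
So devices with similar switch settings (minimum distance) should form one cluster.
The dataframe was created so that devices A1, B1 and C1 should form cluster 1, devices D2, E2 and F2 cluster 2, and devices G3, H3 and I3 cluster 3.
The code that builds the time series is shown below: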
import pandas as pd
import numpy as np
nper = 10 # number of periods
dtime_fr = '2020-01-01' # datetime from
freq = '15min' # 15-minute steps ('T' is deprecated as the minute alias in recent pandas)
dtime_range = pd.date_range(dtime_fr, periods=nper, freq=freq) # timestamps creation
df = pd.DataFrame()
# rising baseline
baseline_start = 5
baseline_stop = 10
baseline_linspace = np.linspace(baseline_start, baseline_stop, nper)
ts = baseline_linspace + np.random.normal(size=nper)
df = pd.concat([df, pd.DataFrame({'dtime' : dtime_range, 'dev': 'A1', 'sw1': ts, 'city':'C1'})], sort=False)
ts = baseline_linspace + np.random.normal(size=nper)
df = pd.concat([df, pd.DataFrame({'dtime' : dtime_range, 'dev': 'B1', 'sw1': ts, 'city':'C1'})], sort=False)
ts = baseline_linspace + np.random.normal(size=nper)
df = pd.concat([df, pd.DataFrame({'dtime' : dtime_range, 'dev': 'C1', 'sw1': ts, 'city':'C1'})], sort=False)
# steady baseline
baseline_start = 5
baseline_stop = 5
baseline_linspace = np.linspace(baseline_start, baseline_stop, nper)
ts = baseline_linspace + np.random.normal(size=nper)
df = pd.concat([df, pd.DataFrame({'dtime' : dtime_range, 'dev': 'D2', 'sw1': ts, 'city':'C1'})], sort=False)
ts = baseline_linspace + np.random.normal(size=nper)
df = pd.concat([df, pd.DataFrame({'dtime' : dtime_range, 'dev': 'E2', 'sw1': ts, 'city':'C1'})], sort=False)
ts = baseline_linspace + np.random.normal(size=nper)
df = pd.concat([df, pd.DataFrame({'dtime' : dtime_range, 'dev': 'F2', 'sw1': ts, 'city':'C1'})], sort=False)
# falling baseline
baseline_start = 5
baseline_stop = 0
baseline_linspace = np.linspace(baseline_start, baseline_stop, nper)
ts = baseline_linspace + np.random.normal(size=nper)
df = pd.concat([df, pd.DataFrame({'dtime' : dtime_range, 'dev': 'G3', 'sw1': ts, 'city':'C1'})], sort=False)
ts = baseline_linspace + np.random.normal(size=nper)
df = pd.concat([df, pd.DataFrame({'dtime' : dtime_range, 'dev': 'H3', 'sw1': ts, 'city':'C1'})], sort=False)
ts = baseline_linspace + np.random.normal(size=nper)
df = pd.concat([df, pd.DataFrame({'dtime' : dtime_range, 'dev': 'I3', 'sw1': ts, 'city':'C1'})], sort=False)
df.insert(3, 'sw2', df.sw1 + np.random.normal(size=len(df)))
df
If I had only one switch (sw1), I would cluster as follows:
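import matplotlib.pyplot as plt
import scipy.cluster.hierarchy
from scipy.spatial.distance import pdist
from sklearn.cluster import AgglomerativeClustering
# one row per device, one column per timestamp
df_sw1 = df.pivot(index='dev', values='sw1', columns='dtime')
dist = pdist(df_sw1, metric='euclidean') # condensed pairwise distances between devices
Z = scipy.cluster.hierarchy.linkage(dist)
fig, ax = plt.subplots(figsize=(20,15))
scipy.cluster.hierarchy.dendrogram(Z, labels=df_sw1.index.tolist(), orientation='top', ax=ax);
hclust = AgglomerativeClustering(n_clusters=3)
hclust.fit(df_sw1)
hclust.labels_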
array([0, 0, 0, 2, 2, 2, 1, 1, 1], dtype=int64)
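When both switches are considered, two devices are similar if SW1 of one device has (almost) the same values as SW1 of the other device and, at the same time, SW2 is (almost) the same as SW2 of the other device. I don't know how to do this.
What format should the dataframe have in order to compute pdist? How do I do the clustering if both switches have to be taken into account?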
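To make the question concrete, here is a minimal sketch of one possible layout, not necessarily the best one: pivot both switches at once so that each device becomes a single row holding its sw1 series followed by its sw2 series, then feed that wide matrix to pdist. It assumes that both switches are on a comparable scale and that plain Euclidean distance over the concatenated series is an acceptable notion of similarity. The stand-in data below is hypothetical: it mimics the generator above, but uses a fixed seed and a smaller noise level (0.3 instead of 1) so the result is reproducible. The names df_wide, baselines and clusters are illustrative only.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical stand-in for df: 9 devices, 10 timestamps, two switches.
# Assumption: seeded, lower-noise variant of the question's generator.
rng = np.random.default_rng(0)
nper = 10
dtime_range = pd.date_range('2020-01-01', periods=nper, freq='15min')
baselines = {'A1': (5, 10), 'B1': (5, 10), 'C1': (5, 10),   # rising
             'D2': (5, 5), 'E2': (5, 5), 'F2': (5, 5),      # steady
             'G3': (5, 0), 'H3': (5, 0), 'I3': (5, 0)}      # falling
frames = []
for dev, (start, stop) in baselines.items():
    sw1 = np.linspace(start, stop, nper) + rng.normal(scale=0.3, size=nper)
    sw2 = sw1 + rng.normal(scale=0.3, size=nper)
    frames.append(pd.DataFrame({'dtime': dtime_range, 'dev': dev,
                                'sw1': sw1, 'sw2': sw2}))
df = pd.concat(frames, ignore_index=True)

# One row per device; the columns form a (switch, timestamp) MultiIndex,
# i.e. 2 * nper = 20 features per device.
df_wide = df.pivot(index='dev', columns='dtime', values=['sw1', 'sw2'])

# Condensed pairwise Euclidean distances over the concatenated series.
dist = pdist(df_wide.to_numpy(), metric='euclidean')
Z = linkage(dist, method='average')

# Cut the hierarchy into at most 3 flat clusters.
labels = fcluster(Z, t=3, criterion='maxclust')
clusters = dict(zip(df_wide.index, labels))
print(clusters)
```

With this layout, two devices are close only if their sw1 series match and their sw2 series match, because both contribute to the same squared-distance sum. If the switches were on different scales, standardising each switch before pivoting would probably be needed so that one of them does not dominate the distance.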