Python: Finding the minimum and maximum of a cluster in cyclic data


python machine-learning


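How can I determine the minimum and maximum of a cluster in cyclic data (here, values in the range 0 to 24), given that a cluster may extend across the boundary of the value range?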

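Looking at the blue cluster, I would like to identify the values 22 and 2 as the boundaries of the cluster. Which algorithm can solve this problem?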
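To make the problem concrete, here is a minimal sketch with made-up hour values (the array `blue` is an assumption for illustration, not the asker's data). A plain minimum and maximum ignores the wrap-around and reports almost the entire value range:

```python
import numpy as np

# Hypothetical hour-of-day values for a cluster that wraps around midnight.
blue = np.array([22, 23, 0, 1, 2])

# The ordinary min/max treats the values as lying on a line, not a circle.
naive_bounds = (int(blue.min()), int(blue.max()))
print(naive_bounds)  # (0, 23), although the cluster really runs from 22 to 2
```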

I found a solution to this problem. Suppose the data has the following format:

#!/usr/bin/env python3

import numpy as np

data = np.array([0, 1, 2, 12, 13, 14, 15, 21, 22, 23])
labels = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 0])
# get_cluster_bounds is defined in the listing that follows.
bounds = get_cluster_bounds(data, labels)
print(bounds) # {0: array([21,  2]), 1: array([12, 15])}
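A wrapping cluster is reported with a start that is greater than its end, so the pair has to be interpreted modulo the period. The helpers below are only a sketch and not part of the original answer; the names `cyclic_span` and `in_cyclic_interval` and the default period of 24 are assumptions.

```python
def cyclic_span(start: float, end: float, period: float = 24) -> float:
    """Length of the arc that runs forward from start to end."""
    return (end - start) % period


def in_cyclic_interval(x: float, start: float, end: float,
                       period: float = 24) -> bool:
    """True if x lies on the arc that runs forward from start to end."""
    return (x - start) % period <= (end - start) % period


print(cyclic_span(21, 2))             # 5
print(cyclic_span(12, 15))            # 3
print(in_cyclic_interval(23, 21, 2))  # True
print(in_cyclic_interval(12, 21, 2))  # False
```

Python's `%` returns a non-negative result for a positive period, which is what makes the wrap-around case work without a special branch.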

The function is defined as follows:

#!/usr/bin/env python3

import numpy as np


def get_cluster_bounds(data: np.ndarray, labels: np.ndarray) -> dict:
    """
    A cluster can be positioned within the cyclic value range in five
    ways. The bounds to be determined are marked with arrows (A = start,
    B = end).

    In the first case, the cluster data is distributed beyond the edge of
    the cycle:
         ↓B           ↓A
    |#####____________#####|

    In the second case, the data lies exactly at the beginning of the value
    range, but without exceeding it.
    ↓A        ↓B
    |##########____________|

    In the third case, the data lies exactly at the end of the value
    range, but without exceeding it.
                 ↓A       ↓B
    |____________##########|

    In the fourth, the data lies within the value range
    without touching a border.
            ↓A       ↓B
    |_______##########_____|

    In the fifth and simplest case, the data lies in the entire area without
    another label existing.
     ↓A                   ↓B
    |######################|

    Args:
        data:      (n,) numpy array containing all data points.
        labels:    (n,) numpy array containing the cluster label of each
                   data point.

    Returns:
        bounds:   A dictionary whose key is the index of the cluster and
                  whose value specifies the start and end point of the
                  cluster.
    """

    # Sort the data in ascending order.
    shuffle = data.argsort()
    data = data[shuffle]
    labels = labels[shuffle]

    # Iterate over the labels that actually occur, so the labels do not
    # have to be the consecutive integers 0..k-1 (e.g. a noise label of -1).
    labels_unique = np.unique(labels)

    bounds = {}

    for c_index in labels_unique.tolist():
        mask = labels == c_index
        # Case 1 or 5
        if mask[0] and mask[-1]:
            # Case 5
            if np.all(mask):
                start = data[0]
                end = data[-1]
            # Case 1
            else:
                edges = np.where(np.invert(mask))[0]
                start = data[edges[-1] + 1]
                end = data[edges[0] - 1]

        # Case 2
        elif mask[0] and not mask[-1]:
            edges = np.where(np.invert(mask))[0]
            start = data[0]
            end = data[edges[0] - 1]

        # Case 3
        elif not mask[0] and mask[-1]:
            edges = np.where(np.invert(mask))[0]
            start = data[edges[-1] + 1]
            end = data[-1]

        # Case 4
        elif not mask[0] and not mask[-1]:
            edges = np.where(mask)[0]
            start = data[edges[0]]
            end = data[edges[-1]]

        else:
            raise ValueError('This should not happen.')

        bounds[c_index] = np.array([start, end])

    return bounds
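As an alternative sketch (a different technique from the function above, not the answerer's method), the bounds of a single cluster can also be found from its largest circular gap: sort the values, measure the distance from each value to its successor including the wrap-around, and treat the widest gap as the space outside the cluster. The function name `bounds_by_largest_gap` and the default period of 24 are assumptions, and the input is assumed to be non-empty.

```python
import numpy as np


def bounds_by_largest_gap(values: np.ndarray, period: float = 24) -> tuple:
    """Return (start, end) of one cluster on a cycle of length `period`.

    Assumes the largest empty arc between cluster points lies outside
    the cluster, and that `values` contains at least one point.
    """
    v = np.sort(np.asarray(values) % period)
    # Gap from each value to its successor; the last entry wraps around.
    gaps = np.diff(np.append(v, v[0] + period))
    i = int(np.argmax(gaps))
    start = v[(i + 1) % v.size]  # first value after the largest gap
    end = v[i]                   # last value before the largest gap
    return start.item(), end.item()


print(bounds_by_largest_gap(np.array([0, 1, 2, 21, 22, 23])))  # (21, 2)
print(bounds_by_largest_gap(np.array([12, 13, 14, 15])))       # (12, 15)
```

This variant looks at one cluster at a time and does not need the points of the other clusters. It relies on the values themselves rather than on which neighbouring points carry a different label, so the two approaches can disagree when a cluster contains an internal gap wider than the space around it.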

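How do you define the blue cluster? Clustering means grouping similar samples according to some criterion. To me, the two parts of the blue cluster should not be clustered together.

It is cyclic data, in this example the time of day. I have written a clustering algorithm that returns this result. The data is intended to be grouped across midnight.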