How can GridSearchCV be used for clustering (MeanShift or DBSCAN)?

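I am trying to cluster some text documents using scikit-learn. I am trying out DBSCAN and MeanShift, and want to determine which hyperparameters (e.g. bandwidth for MeanShift, eps for DBSCAN) best suit the kind of data I am using (news articles).

I have some test data consisting of pre-labelled clusters. I have been trying to use scikit-learn's GridSearchCV, but I don't understand how (or whether) it can be applied in this case, since it needs the data to be split into training and test sets, whereas I want to run the evaluation on the whole dataset and compare the results with the pre-labelled data.

I have been trying to specify a scoring function that compares the estimator's labels with the true labels, but of course it doesn't work, because only a sample of the data has been clustered rather than all of it.

What is the appropriate approach here?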

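Have you considered implementing the search yourself?

It is not particularly hard to implement a for loop. Even if you want to optimize two parameters, it is still fairly easy.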
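As a minimal sketch of such a hand-written search: the loop below fits DBSCAN on the whole dataset for each parameter pair and scores the resulting labels against the known ones. The synthetic make_blobs data, the candidate values, and the choice of the adjusted Rand index as the score are all assumptions made for illustration, not part of the original question.

```python
from itertools import product

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a pre-labelled dataset (assumption for illustration)
X, y_true = make_blobs(n_samples=300, centers=[(-5, -5), (0, 0), (5, 5)],
                       cluster_std=0.6, random_state=0)
X = StandardScaler().fit_transform(X)

best_score, best_params = -1.0, None

# Two nested parameters, flattened into one loop with itertools.product
for eps, min_samples in product(np.arange(0.1, 1.01, 0.1), (3, 5, 10)):
    # Cluster the full dataset; no train/test split is involved
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)

    # Compare the predicted labels with the pre-assigned ones
    score = adjusted_rand_score(y_true, labels)

    if score > best_score:
        best_score = score
        best_params = {"eps": float(eps), "min_samples": min_samples}

print(best_params, round(best_score, 3))
```

Because the score is computed on the same labelled data used to pick the parameters, the overfitting risk raised in this answer still applies.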

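For both DBSCAN and MeanShift, however, I advise you to first understand your similarity measure. It makes more sense to choose the parameters based on an understanding of your measure than to optimize parameters to match some labels (which carries a high risk of overfitting).

In other words, at which distance are two articles supposed to be clustered?

If this distance varies too much from one data point to another, these algorithms will fail badly; you may need to find a normalized distance function such that the actual similarity values are meaningful again. TF-IDF is standard on text, but mostly in a retrieval context. It may work much worse in a clustering context.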
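One way to get a feel for the distance scale before choosing eps is to look at the pairwise distances directly. The sketch below is only an illustration under stated assumptions: the four toy documents are invented, cosine distance is one possible normalized distance (not something this answer prescribes), and eps=0.8 is picked by eye from the printed values.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances

# Toy corpus (hypothetical): two finance headlines, two sports headlines
docs = [
    "stock markets fell sharply today",
    "stock markets rallied after earnings",
    "football team won league match",
    "football team lost cup match",
]

# TF-IDF vectors; cosine distance ignores document length and,
# for non-negative vectors like these, lies in [0, 1]
tfidf = TfidfVectorizer().fit_transform(docs)
dist = pairwise_distances(tfidf, metric="cosine")

# Inspect the distribution of off-diagonal distances before choosing eps
pair_dist = dist[np.triu_indices_from(dist, k=1)]
print(np.round(np.quantile(pair_dist, [0.0, 0.25, 0.5, 0.75, 1.0]), 3))

# eps is chosen with the distances above in mind, not tuned against labels
labels = DBSCAN(eps=0.8, min_samples=2, metric="precomputed").fit_predict(dist)
print(labels)
```

If the quantiles are all bunched close to 1, as often happens with short, sparse texts, no single eps will separate topics well, which is the failure mode described above.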

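Also beware that MeanShift (similar to k-means) needs to recompute coordinates; on text data this may yield undesired results, where the updated coordinates actually get worse instead of better.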

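The following function for DBSCAN might help. I wrote it to iterate over the hyperparameters eps and min_samples, and included optional arguments for the minimum and maximum number of clusters. As DBSCAN is unsupervised, I have not included an evaluation parameter.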

def dbscan_grid_search(X_data, lst, clst_count, eps_space = (0.5,),
                       min_samples_space = (5,), min_clust = 0, max_clust = 10):

    """
    Performs a hyperparameter grid search for DBSCAN.

    Parameters:
        * X_data            = data used to fit the DBSCAN instance
        * lst               = a list to store the results of the grid search
        * clst_count        = a list to store the per-label sample counts (a Counter) of each stored result
        * eps_space         = an iterable of values to try for the eps parameter
        * min_samples_space = an iterable of values to try for the min_samples parameter
        * min_clust         = the minimum number of clusters required after each search iteration in order for a result to be appended to lst
        * max_clust         = the maximum number of clusters allowed after each search iteration in order for a result to be appended to lst


    Example:

    # Loading Libraries
    import numpy as np
    from sklearn import datasets
    from sklearn.preprocessing import StandardScaler

    # Loading iris dataset
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # Scaling X data
    dbscan_scaler = StandardScaler()

    dbscan_scaler.fit(X)

    dbscan_X_scaled = dbscan_scaler.transform(X)

    # Setting empty lists in global environment
    dbscan_clusters = []
    cluster_count   = []


    # Inputting function parameters
    dbscan_grid_search(X_data = dbscan_X_scaled,
                       lst = dbscan_clusters,
                       clst_count = cluster_count,
                       eps_space = np.arange(0.1, 5, 0.1),
                       min_samples_space = np.arange(1, 50, 1),
                       min_clust = 3,
                       max_clust = 6)

    """

    # Importing Counter to count the amount of data in each cluster,
    # and DBSCAN, which the loop below instantiates
    from collections import Counter
    from sklearn.cluster import DBSCAN


    # Starting a tally of total iterations
    n_iterations = 0


    # Looping over each combination of hyperparameters
    for eps_val in eps_space:
        for samples_val in min_samples_space:

            dbscan_grid = DBSCAN(eps = eps_val,
                                 min_samples = samples_val)


            # Fitting the model and assigning a cluster label to each sample
            clusters = dbscan_grid.fit_predict(X = X_data)


            # Counting the amount of data in each cluster
            cluster_count = Counter(clusters)


            # Saving the number of clusters, excluding the noise label (-1)
            n_clusters = len(set(clusters) - {-1})


            # Increasing the iteration tally with each run of the loop
            n_iterations += 1


            # Appending to lst each time the n_clusters criteria are met
            if min_clust <= n_clusters <= max_clust:

                lst.append([eps_val,
                            samples_val,
                            n_clusters])


                clst_count.append(cluster_count)

    # Printing grid search summary information
    print(f"""Search Complete. \nYour list is now of length {len(lst)}. """)
    print(f"""Hyperparameter combinations checked: {n_iterations}. \n""")

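Yes, I am implementing it myself. I was just wondering whether scikit-learn supports this out of the box and I had overlooked something. My plan is to run the grid search over a few different pre-labelled datasets and look into the potential issues you pointed out. Thanks for pointing out the risks.

sklearn.cross_validation (sklearn.model_selection in current scikit-learn releases) has various iterators that generate splits of a dataset (cross-validation, random splits, etc.). These should make this loop very easy to write.

What did you end up doing?

scikit-learn provides ParameterGrid in sklearn.model_selection, which will help you loop over a grid of hyperparameters.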
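A rough sketch of how ParameterGrid could drive that loop is shown below. The iris data, the candidate values, and the adjusted Rand index used for scoring are assumptions chosen for illustration; only ParameterGrid itself comes from the comment above.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import StandardScaler

# Labelled example data (assumption: stands in for a pre-labelled corpus)
iris = load_iris()
X = StandardScaler().fit_transform(iris.data)
y_true = iris.target

# ParameterGrid expands the dict into every parameter combination
grid = ParameterGrid({"eps": [0.3, 0.5, 0.7, 0.9],
                      "min_samples": [3, 5, 10]})

results = []
for params in grid:
    # Each item is a plain dict, so it can be unpacked into the estimator
    labels = DBSCAN(**params).fit_predict(X)
    results.append((adjusted_rand_score(y_true, labels), params))

# Pick the combination whose labels agree best with the known ones
best_score, best_params = max(results, key=lambda r: r[0])
print(best_params, round(best_score, 3))
```

Unlike GridSearchCV, this never splits the data: every combination is fitted and scored on the full dataset, which is what the question asks for.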