Python code to find feature importance after k-means clustering

Tags: python, scikit-learn, cluster-analysis, k-means, feature-selection

I have researched ways to find feature importance (my dataset has only 9 features).

I would like to rank each feature by how much it influences cluster formation.

  • Compute the variance of the centroids along each dimension. The dimensions with the largest variance matter most for distinguishing the clusters.
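A minimal sketch of this centroid-variance ranking, on made-up toy data (the data, shapes, and the `random_state` below are illustrative assumptions, not from the question):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# toy data: 4 features, where feature 0 separates two groups strongly
X = rng.normal(size=(200, 4))
X[:100, 0] += 3.0

# scale first, otherwise per-dimension variances are not comparable
X_scaled = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_scaled)

# variance of the centroid coordinates along each dimension:
# dimensions where the centroids differ most matter most for the split
centroid_variance = km.cluster_centers_.var(axis=0)
ranking = np.argsort(centroid_variance)[::-1]
print(ranking)  # most important feature index first
```

Scaling matters here: without it, a feature measured on a large scale would dominate the variance even if it does not separate the clusters.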

  • If you only have a small number of variables, you can do some kind of leave-one-out test (remove one variable and redo the clustering). Also keep in mind that k-means depends on the initialization, so you want to keep that fixed when redoing the clustering.
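The leave-one-feature-out test can be sketched like this: drop each feature in turn, recluster with the same `random_state`, and compare the new labels with the original ones via the adjusted Rand index (the toy data and all parameter values are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
# toy data: feature 2 drives the clustering, the rest are noise
X = rng.normal(size=(300, 5))
X[:150, 2] += 4.0

base = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

drop_impact = {}
for j in range(X.shape[1]):
    X_drop = np.delete(X, j, axis=1)  # remove feature j
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_drop)
    # low ARI => removing feature j changed the clustering a lot => important
    drop_impact[j] = adjusted_rand_score(base, labels)

# features sorted from most to least important
print(sorted(drop_impact, key=drop_impact.get))
```

With only 9 features this loop is cheap; fixing `random_state` (and using `n_init` > 1) keeps the comparison fair across reruns, as the answer suggests.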


  • Is there any Python code that can do this?

    Please consider doing feature selection like this:

    import pandas as pd
    import numpy as np
    import seaborn as sns
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import chi2
    
    # UNIVARIATE SELECTION
    
    data = pd.read_csv('C:\\Users\\Excel\\Desktop\\Briefcase\\PDFs\\1-ALL PYTHON & R CODE SAMPLES\\Feature Selection - Machine Learning\\train.csv')
    X = data.iloc[:,0:20]  #independent columns
    y = data.iloc[:,-1]    #target column i.e price range
    
    #apply SelectKBest class to extract top 10 best features
    bestfeatures = SelectKBest(score_func=chi2, k=10)
    fit = bestfeatures.fit(X,y)
    dfscores = pd.DataFrame(fit.scores_)
    dfcolumns = pd.DataFrame(X.columns)
    #concat two dataframes for better visualization 
    featureScores = pd.concat([dfcolumns,dfscores],axis=1)
    featureScores.columns = ['Specs','Score']  #naming the dataframe columns
    print(featureScores.nlargest(10,'Score'))  #print 10 best features
    
    
    # FEATURE IMPORTANCE
    data = pd.read_csv('C:\\your_path\\train.csv')
    X = data.iloc[:,0:20]  #independent columns
    y = data.iloc[:,-1]    #target column i.e price range
    from sklearn.ensemble import ExtraTreesClassifier
    import matplotlib.pyplot as plt
    model = ExtraTreesClassifier()
    model.fit(X,y)
    print(model.feature_importances_) #use inbuilt class feature_importances of tree based classifiers
    #plot graph of feature importances for better visualization
    feat_importances = pd.Series(model.feature_importances_, index=X.columns)
    feat_importances.nlargest(10).plot(kind='barh')
    plt.show()
    


What have you tried so far, and what went wrong with your attempts? — The question is about k-means, but isn't the above a supervised problem? How are the two connected? This answer only works if you know your dependent variable, which implies a supervised problem; that is not the case for k-means clustering, which is unsupervised.
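One common way to reconcile the two, sketched below on made-up data (column names, sizes, and parameters are all hypothetical): run k-means first, then use the resulting cluster labels as a pseudo-target and let a tree ensemble rank the features that best predict cluster membership. This keeps the workflow unsupervised overall while reusing the supervised importance machinery from the answer above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(250, 4)),
                  columns=['f0', 'f1', 'f2', 'f3'])
df.loc[:124, 'f1'] += 5.0  # make f1 the separating feature

# unsupervised step: cluster labels stand in for the missing target y
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(df)

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(df, labels)
importances = pd.Series(model.feature_importances_, index=df.columns)
print(importances.sort_values(ascending=False))
```

The ranking then answers "which features define the clusters that k-means found", not "which features predict an external label".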
    # Correlation Matrix with Heatmap
    data = pd.read_csv('C:\\your_path\\train.csv')
    X = data.iloc[:,0:20]  #independent columns
    y = data.iloc[:,-1]    #target column i.e price range
    #get correlations of each features in dataset
    corrmat = data.corr()
    top_corr_features = corrmat.index
    plt.figure(figsize=(20,20))
    #plot heat map
    g = sns.heatmap(data[top_corr_features].corr(), annot=True, cmap="RdYlGn")
    plt.show()
    
    Dataset is available here:
    
    https://www.kaggle.com/iabhishekofficial/mobile-price-classification#train.csv