Python 为什么DBSCAN群集在电影镜头数据集上返回单个群集? 情景:
我正在对电影镜头数据集执行聚类,其中我有两种格式的数据集: 旧格式:Python 为什么DBSCAN群集在电影镜头数据集上返回单个群集? 情景:,python,pandas,cluster-analysis,dbscan,Python,Pandas,Cluster Analysis,Dbscan,我正在对电影镜头数据集执行聚类,其中我有两种格式的数据集: 旧格式: uid iid rat 941 1 5 941 7 4 941 15 4 941 117 5 941 124 5 941 147 4 941 181 5 941 222 2 941 257 4 941 258 4 941 273 3 941 294 4 uid 1 2 3 4 1 5 3
uid iid rat
941 1 5
941 7 4
941 15 4
941 117 5
941 124 5
941 147 4
941 181 5
941 222 2
941 257 4
941 258 4
941 273 3
941 294 4
uid 1 2 3 4
1 5 3 4 3
2 4 3.6185548023 3.646073985 3.9238342172
3 2.8978348799 2.6692556753 2.7693015618 2.8973463681
4 4.3320762062 4.3407749532 4.3111995162 4.3411425423
940 3.7996234581 3.4979386925 3.5707888503 2
941 5 NaN NaN NaN
942 4.5762594612 4.2752554573 4.2522440019 4.3761477591
943 3.8252406362 5 3.3748860659 3.8487417604
print "\n\n FOR IRIS DATA-SET:"
from sklearn.datasets import load_iris
iris = load_iris()
dbscan = DBSCAN()
d = pd.DataFrame(iris.data)
dbscan.fit(d)
print "Clusters", set(dbscan.labels_)
FOR IRIS DATA-SET:
Clusters set([0, 1, -1])
Out[30]:
array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, 1, -1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1,
-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, -1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, -1, 1, 1, 1,
1, 1, 1, -1, -1, 1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1, -1,
1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, -1, -1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
import pandas as pd
from sklearn.cluster import DBSCAN
data_set = pd.DataFrame
ch = int(input("Extended Cluster Methods for:\n1. Main Matrix IBCF \n2. Main Matrix UBCF\nCh:"))
if ch is 1:
data_set = pd.read_csv("MainMatrix_IBCF.csv")
data_set = data_set.iloc[:, 1:]
data_set = data_set.dropna()
elif ch is 2:
data_set = pd.read_csv("MainMatrix_UBCF.csv")
data_set = data_set.iloc[:, 1:]
data_set = data_set.dropna()
else:
print "Enter Proper choice!"
print "Starting with DBSCAN for Clustering on\n", data_set.info()
db_cluster = DBSCAN()
db_cluster.fit(data_set)
print "Clusters assigned are:", set(db_cluster.labels_)
Extended Cluster Methods for:
1. Main Matrix IBCF
2. Main Matrix UBCF
Ch:>? 1
Starting with DBSCAN for Clustering on
<class 'pandas.core.frame.DataFrame'>
Int64Index: 942 entries, 0 to 942
Columns: 1682 entries, 1 to 1682
dtypes: float64(1682)
memory usage: 12.1 MB
None
Clusters assigned are: set([-1])
新格式:
uid iid rat
941 1 5
941 7 4
941 15 4
941 117 5
941 124 5
941 147 4
941 181 5
941 222 2
941 257 4
941 258 4
941 273 3
941 294 4
uid 1 2 3 4
1 5 3 4 3
2 4 3.6185548023 3.646073985 3.9238342172
3 2.8978348799 2.6692556753 2.7693015618 2.8973463681
4 4.3320762062 4.3407749532 4.3111995162 4.3411425423
940 3.7996234581 3.4979386925 3.5707888503 2
941 5 NaN NaN NaN
942 4.5762594612 4.2752554573 4.2522440019 4.3761477591
943 3.8252406362 5 3.3748860659 3.8487417604
print "\n\n FOR IRIS DATA-SET:"
from sklearn.datasets import load_iris
iris = load_iris()
dbscan = DBSCAN()
d = pd.DataFrame(iris.data)
dbscan.fit(d)
print "Clusters", set(dbscan.labels_)
FOR IRIS DATA-SET:
Clusters set([0, 1, -1])
Out[30]:
array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, 1, -1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1,
-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, -1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, -1, 1, 1, 1,
1, 1, 1, -1, -1, 1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1, -1,
1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, -1, -1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
import pandas as pd
from sklearn.cluster import DBSCAN
data_set = pd.DataFrame
ch = int(input("Extended Cluster Methods for:\n1. Main Matrix IBCF \n2. Main Matrix UBCF\nCh:"))
if ch is 1:
data_set = pd.read_csv("MainMatrix_IBCF.csv")
data_set = data_set.iloc[:, 1:]
data_set = data_set.dropna()
elif ch is 2:
data_set = pd.read_csv("MainMatrix_UBCF.csv")
data_set = data_set.iloc[:, 1:]
data_set = data_set.dropna()
else:
print "Enter Proper choice!"
print "Starting with DBSCAN for Clustering on\n", data_set.info()
db_cluster = DBSCAN()
db_cluster.fit(data_set)
print "Clusters assigned are:", set(db_cluster.labels_)
Extended Cluster Methods for:
1. Main Matrix IBCF
2. Main Matrix UBCF
Ch:>? 1
Starting with DBSCAN for Clustering on
<class 'pandas.core.frame.DataFrame'>
Int64Index: 942 entries, 0 to 942
Columns: 1682 entries, 1 to 1682
dtypes: float64(1682)
memory usage: 12.1 MB
None
Clusters assigned are: set([-1])
在此基础上,我需要使用KMeans、DBSCAN和HDBSCAN执行集群。
使用KMeans,我可以设置和获取集群
问题
只有DBSCAN和HDBSCAN的问题仍然存在,我无法获得足够数量的集群(我知道我们无法手动设置集群)
尝试过的技术:
- 我用鸢尾花尝试了这个方法,我发现这个物种没有被包括在内。很明显,这是一个字符串,而且是可以预测的,所有的一切都是通过这个数据集(代码片段1)
- 尝试使用旧格式的电影镜头100K(带和不带UID),因为我尝试了一个类比,UID==物种,因此尝试不带它。(片段2)
- 尝试了相同的新格式(有UID和没有UID),但结果是相同的
uid iid rat
941 1 5
941 7 4
941 15 4
941 117 5
941 124 5
941 147 4
941 181 5
941 222 2
941 257 4
941 258 4
941 273 3
941 294 4
uid 1 2 3 4
1 5 3 4 3
2 4 3.6185548023 3.646073985 3.9238342172
3 2.8978348799 2.6692556753 2.7693015618 2.8973463681
4 4.3320762062 4.3407749532 4.3111995162 4.3411425423
940 3.7996234581 3.4979386925 3.5707888503 2
941 5 NaN NaN NaN
942 4.5762594612 4.2752554573 4.2522440019 4.3761477591
943 3.8252406362 5 3.3748860659 3.8487417604
print "\n\n FOR IRIS DATA-SET:"
from sklearn.datasets import load_iris
iris = load_iris()
dbscan = DBSCAN()
d = pd.DataFrame(iris.data)
dbscan.fit(d)
print "Clusters", set(dbscan.labels_)
FOR IRIS DATA-SET:
Clusters set([0, 1, -1])
Out[30]:
array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, 1, -1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1,
-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, -1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, -1, 1, 1, 1,
1, 1, 1, -1, -1, 1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1, -1,
1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, -1, -1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
import pandas as pd
from sklearn.cluster import DBSCAN
data_set = pd.DataFrame
ch = int(input("Extended Cluster Methods for:\n1. Main Matrix IBCF \n2. Main Matrix UBCF\nCh:"))
if ch is 1:
data_set = pd.read_csv("MainMatrix_IBCF.csv")
data_set = data_set.iloc[:, 1:]
data_set = data_set.dropna()
elif ch is 2:
data_set = pd.read_csv("MainMatrix_UBCF.csv")
data_set = data_set.iloc[:, 1:]
data_set = data_set.dropna()
else:
print "Enter Proper choice!"
print "Starting with DBSCAN for Clustering on\n", data_set.info()
db_cluster = DBSCAN()
db_cluster.fit(data_set)
print "Clusters assigned are:", set(db_cluster.labels_)
Extended Cluster Methods for:
1. Main Matrix IBCF
2. Main Matrix UBCF
Ch:>? 1
Starting with DBSCAN for Clustering on
<class 'pandas.core.frame.DataFrame'>
Int64Index: 942 entries, 0 to 942
Columns: 1682 entries, 1 to 1682
dtypes: float64(1682)
memory usage: 12.1 MB
None
Clusters assigned are: set([-1])
片段1(输出):
uid iid rat
941 1 5
941 7 4
941 15 4
941 117 5
941 124 5
941 147 4
941 181 5
941 222 2
941 257 4
941 258 4
941 273 3
941 294 4
uid 1 2 3 4
1 5 3 4 3
2 4 3.6185548023 3.646073985 3.9238342172
3 2.8978348799 2.6692556753 2.7693015618 2.8973463681
4 4.3320762062 4.3407749532 4.3111995162 4.3411425423
940 3.7996234581 3.4979386925 3.5707888503 2
941 5 NaN NaN NaN
942 4.5762594612 4.2752554573 4.2522440019 4.3761477591
943 3.8252406362 5 3.3748860659 3.8487417604
print "\n\n FOR IRIS DATA-SET:"
from sklearn.datasets import load_iris
iris = load_iris()
dbscan = DBSCAN()
d = pd.DataFrame(iris.data)
dbscan.fit(d)
print "Clusters", set(dbscan.labels_)
FOR IRIS DATA-SET:
Clusters set([0, 1, -1])
Out[30]:
array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, 1, -1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1,
-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, -1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, -1, 1, 1, 1,
1, 1, 1, -1, -1, 1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1, -1,
1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, -1, -1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
import pandas as pd
from sklearn.cluster import DBSCAN
data_set = pd.DataFrame
ch = int(input("Extended Cluster Methods for:\n1. Main Matrix IBCF \n2. Main Matrix UBCF\nCh:"))
if ch is 1:
data_set = pd.read_csv("MainMatrix_IBCF.csv")
data_set = data_set.iloc[:, 1:]
data_set = data_set.dropna()
elif ch is 2:
data_set = pd.read_csv("MainMatrix_UBCF.csv")
data_set = data_set.iloc[:, 1:]
data_set = data_set.dropna()
else:
print "Enter Proper choice!"
print "Starting with DBSCAN for Clustering on\n", data_set.info()
db_cluster = DBSCAN()
db_cluster.fit(data_set)
print "Clusters assigned are:", set(db_cluster.labels_)
Extended Cluster Methods for:
1. Main Matrix IBCF
2. Main Matrix UBCF
Ch:>? 1
Starting with DBSCAN for Clustering on
<class 'pandas.core.frame.DataFrame'>
Int64Index: 942 entries, 0 to 942
Columns: 1682 entries, 1 to 1682
dtypes: float64(1682)
memory usage: 12.1 MB
None
Clusters assigned are: set([-1])
片段2:
uid iid rat
941 1 5
941 7 4
941 15 4
941 117 5
941 124 5
941 147 4
941 181 5
941 222 2
941 257 4
941 258 4
941 273 3
941 294 4
uid 1 2 3 4
1 5 3 4 3
2 4 3.6185548023 3.646073985 3.9238342172
3 2.8978348799 2.6692556753 2.7693015618 2.8973463681
4 4.3320762062 4.3407749532 4.3111995162 4.3411425423
940 3.7996234581 3.4979386925 3.5707888503 2
941 5 NaN NaN NaN
942 4.5762594612 4.2752554573 4.2522440019 4.3761477591
943 3.8252406362 5 3.3748860659 3.8487417604
print "\n\n FOR IRIS DATA-SET:"
from sklearn.datasets import load_iris
iris = load_iris()
dbscan = DBSCAN()
d = pd.DataFrame(iris.data)
dbscan.fit(d)
print "Clusters", set(dbscan.labels_)
FOR IRIS DATA-SET:
Clusters set([0, 1, -1])
Out[30]:
array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, 1, -1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1,
-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, -1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, -1, 1, 1, 1,
1, 1, 1, -1, -1, 1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1, -1,
1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, -1, -1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
import pandas as pd
from sklearn.cluster import DBSCAN
data_set = pd.DataFrame
ch = int(input("Extended Cluster Methods for:\n1. Main Matrix IBCF \n2. Main Matrix UBCF\nCh:"))
if ch is 1:
data_set = pd.read_csv("MainMatrix_IBCF.csv")
data_set = data_set.iloc[:, 1:]
data_set = data_set.dropna()
elif ch is 2:
data_set = pd.read_csv("MainMatrix_UBCF.csv")
data_set = data_set.iloc[:, 1:]
data_set = data_set.dropna()
else:
print "Enter Proper choice!"
print "Starting with DBSCAN for Clustering on\n", data_set.info()
db_cluster = DBSCAN()
db_cluster.fit(data_set)
print "Clusters assigned are:", set(db_cluster.labels_)
Extended Cluster Methods for:
1. Main Matrix IBCF
2. Main Matrix UBCF
Ch:>? 1
Starting with DBSCAN for Clustering on
<class 'pandas.core.frame.DataFrame'>
Int64Index: 942 entries, 0 to 942
Columns: 1682 entries, 1 to 1682
dtypes: float64(1682)
memory usage: 12.1 MB
None
Clusters assigned are: set([-1])
代码片段2(输出):
uid iid rat
941 1 5
941 7 4
941 15 4
941 117 5
941 124 5
941 147 4
941 181 5
941 222 2
941 257 4
941 258 4
941 273 3
941 294 4
uid 1 2 3 4
1 5 3 4 3
2 4 3.6185548023 3.646073985 3.9238342172
3 2.8978348799 2.6692556753 2.7693015618 2.8973463681
4 4.3320762062 4.3407749532 4.3111995162 4.3411425423
940 3.7996234581 3.4979386925 3.5707888503 2
941 5 NaN NaN NaN
942 4.5762594612 4.2752554573 4.2522440019 4.3761477591
943 3.8252406362 5 3.3748860659 3.8487417604
print "\n\n FOR IRIS DATA-SET:"
from sklearn.datasets import load_iris
iris = load_iris()
dbscan = DBSCAN()
d = pd.DataFrame(iris.data)
dbscan.fit(d)
print "Clusters", set(dbscan.labels_)
FOR IRIS DATA-SET:
Clusters set([0, 1, -1])
Out[30]:
array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, 1, -1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1,
-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, -1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, -1, 1, 1, 1,
1, 1, 1, -1, -1, 1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1, -1,
1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, -1, -1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
import pandas as pd
from sklearn.cluster import DBSCAN
data_set = pd.DataFrame
ch = int(input("Extended Cluster Methods for:\n1. Main Matrix IBCF \n2. Main Matrix UBCF\nCh:"))
if ch is 1:
data_set = pd.read_csv("MainMatrix_IBCF.csv")
data_set = data_set.iloc[:, 1:]
data_set = data_set.dropna()
elif ch is 2:
data_set = pd.read_csv("MainMatrix_UBCF.csv")
data_set = data_set.iloc[:, 1:]
data_set = data_set.dropna()
else:
print "Enter Proper choice!"
print "Starting with DBSCAN for Clustering on\n", data_set.info()
db_cluster = DBSCAN()
db_cluster.fit(data_set)
print "Clusters assigned are:", set(db_cluster.labels_)
Extended Cluster Methods for:
1. Main Matrix IBCF
2. Main Matrix UBCF
Ch:>? 1
Starting with DBSCAN for Clustering on
<class 'pandas.core.frame.DataFrame'>
Int64Index: 942 entries, 0 to 942
Columns: 1682 entries, 1 to 1682
dtypes: float64(1682)
memory usage: 12.1 MB
None
Clusters assigned are: set([-1])
扩展群集方法用于:
1.主矩阵IBCF
2.主矩阵UBCF
Ch:>?1.
从DBSCAN开始,在
INT64索引:942个条目,0到942
列:1682个条目,1到1682
数据类型:float64(1682)
内存使用:12.1 MB
没有一个
分配的群集为:集合([-1])
如图所示,它只返回1个集群。我想知道我做错了什么。您需要选择合适的参数。如果ε太小,一切都会变成噪音。sklearn不应该为该参数设置默认值,需要为每个数据集选择不同的值 您还需要对数据进行预处理 用kmeans获得“集群”是很简单的,因为它是没有意义的
不要只调用随机函数。您需要了解自己在做什么,否则就是在浪费时间。首先,您需要预处理数据,删除任何无用的属性,如ID和不完整的实例(以防您选择的距离度量无法处理) 很好地理解,这些算法来自两种不同的范例,基于质心(KMeans)和基于密度(DBSCAN和HDBSCAN*)。基于质心的算法通常以簇数作为输入参数,而基于密度的算法则需要邻域数(minPts)和邻域半径(eps) 通常在文献中,邻域数(MINPT)设置为4,半径(eps)通过分析不同的值来确定。您可能会发现HDBSCAN*更易于使用,因为您只需告知邻居的数量(MINPT)
如果在尝试了不同的配置之后,仍然得到了无用的集群,那么可能您的数据根本没有集群,KMeans输出也没有意义。正如@faraway和@Anony mouse所指出的,解决方案在数据集上比编程更具数学性 我们终于可以找出这些集群了。以下是方法:
db_cluster = DBSCAN(eps=9.7, min_samples=2, algorithm='ball_tree', metric='minkowski', leaf_size=90, p=2)
arr = db_cluster.fit_predict(data_set)
print "Clusters assigned are:", set(db_cluster.labels_)
uni, counts = np.unique(arr, return_counts=True)
d = dict(zip(uni, counts))
print d
Epsilon和Out-lier概念从中变得更加明亮。您是否尝试过使用PCA(例如)在2D空间中观察集群的外观。若整个数据是密集的,并且实际上形成了单个组,那个么您可能会得到单个集群
更改其他参数,如min_samples=5、算法、度量。您可以从sklearn.neights.VALID_METRICS检查算法和度量的可能值。很好的建议,但我不能在这里给出我的实际目标和代码。请理解,我需要完成这两种聚类方法。如果你能指出所需的预处理和要使用的参数,那对我真的很有用。参数记录在那里。如果使用欧几里德距离,预处理与使kmeans返回有意义结果所需的类似(但与kmeans不同,您可以使用与您的神秘目标更相关的其他距离)。使用min_samples=2,您实际上不是在进行DBSCAN,而是单链接聚类。对于真正的DBSCAN,选择更大的最小大小(否则,一切都是密集的)。我尝试增加,但它会返回更多的异常值。有什么解决办法吗@ErichI尝试增加,但它返回的异常值更多。有什么解决办法吗@ErichSo,有没有标准不将min_样本设置为2?有什么方程可以保留min_样本w.r.t.数据集吗?正如前面所说,min_样本