Apache Spark: assigning clusters to data points stored in a Spark DataFrame
I have two Spark DataFrames.

Schema of DataFrame A (stores cluster centroids):
cluster_id, dim1_pos, dim2_pos, dim3_pos, ..., dimN_pos
Schema of DataFrame B (stores data points):
entity_id, dim1_pos, dim2_pos, dim3_pos, ..., dimN_pos
There are about 100 rows in DataFrame A, which means I have 100 cluster centroids. I need to map each entity in DataFrame B to its nearest cluster (by Euclidean distance).

How can I do this? I would like a DataFrame with the schema entity_id, cluster_id as my final result.

If the Spark DataFrames are not very large, you can use
toPandas()
to convert them to pandas DataFrames, and then use scipy.spatial.distance.cdist()
(see its documentation for more information).
Sample code:
import pandas as pd
from scipy.spatial.distance import cdist

cluster = pd.DataFrame({'cluster_id': [1, 2, 3, 7],
                        'dim1_pos': [201, 204, 203, 204],
                        'dim2_pos': [55, 40, 84, 31]})
entity = pd.DataFrame({'entity_id': ['A', 'B', 'C'],
                       'dim1_pos': [201, 204, 203],
                       'dim2_pos': [55, 40, 84]})
cluster.set_index('cluster_id', inplace=True)
entity.set_index('entity_id', inplace=True)

# rows = clusters, columns = entities
result_metric = cdist(cluster, entity, metric='euclidean')
result_df = pd.DataFrame(result_metric, index=cluster.index.values,
                         columns=entity.index.values)
print(result_df)
A B C
1 0.000000 15.297059 29.068884
2 15.297059 0.000000 44.011362
3 29.068884 44.011362 0.000000
7 24.186773 9.000000 53.009433
Then you can use idxmin()
with the axis argument to find the minimum pair from each row of the metric, like this:
# get the min. pair from each row
result = pd.DataFrame(result_df.idxmin(axis=1, skipna=True))
# turn the index value into a column
result.reset_index(level=0, inplace=True)
# rename and order the columns
result.columns = ['cluster_id', 'entity_id']
result = result.reindex(columns=['entity_id', 'cluster_id'])
print(result)
entity_id cluster_id
0 A 1
1 B 2
2 C 3
3 B 7
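One caveat: idxmin(axis=1) picks the nearest entity for each cluster (which is why B appears twice above), while the question asks for the nearest cluster for each entity. A minimal sketch of the reverse direction on the same toy data, using axis=0:

```python
import pandas as pd
from scipy.spatial.distance import cdist

cluster = pd.DataFrame({'cluster_id': [1, 2, 3, 7],
                        'dim1_pos': [201, 204, 203, 204],
                        'dim2_pos': [55, 40, 84, 31]}).set_index('cluster_id')
entity = pd.DataFrame({'entity_id': ['A', 'B', 'C'],
                       'dim1_pos': [201, 204, 203],
                       'dim2_pos': [55, 40, 84]}).set_index('entity_id')

dist = pd.DataFrame(cdist(cluster, entity, metric='euclidean'),
                    index=cluster.index, columns=entity.index)
# axis=0 scans each entity column for the cluster row with the
# smallest distance, i.e. the nearest centroid per entity
nearest = dist.idxmin(axis=0).rename('cluster_id').reset_index()
print(nearest)
```

This yields one row per entity (A→1, B→2, C→3), matching the desired entity_id, cluster_id schema.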
Finally, I used VectorAssembler to put the values of all the dimX columns into a single column (for each DataFrame). Once that was done, I simply used a combination of UDFs to get the answer:
import numpy as np
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType, IntegerType

featureCols = ['dim1_pos', 'dim2_pos', ..., 'dimN_pos']
vecAssembler = VectorAssembler(inputCols=featureCols, outputCol="features")
dfA = vecAssembler.transform(dfA)
dfB = vecAssembler.transform(dfB)

def distCalc(a, b):
    # squared Euclidean distance between two feature vectors
    return float(np.sum(np.square(a - b)))

def closestPoint(point_x, centers):
    udf_dist = udf(lambda x: distCalc(x, point_x), DoubleType())
    centers = centers.withColumn('distance', udf_dist(centers.features))
    centers.registerTempTable('t1')
    bestIndex = ...  # write a query to get the minimum distance from the centers df
    return bestIndex

udf_closestPoint = udf(lambda x: closestPoint(x, dfA), IntegerType())
dfB = dfB.withColumn('cluster_id', udf_closestPoint(dfB.features))
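A caveat worth noting: Spark does not allow a UDF to reference another DataFrame such as dfA, so closestPoint as written would fail on the executors. Since DataFrame A holds only about 100 centroids, a common workaround is to collect them to the driver and close over a plain NumPy array in a single UDF. A sketch of that per-row logic, with hypothetical centroid values standing in for the collected dfA rows:

```python
import numpy as np

# Hypothetical centroid matrix, standing in for something like
# centers = np.array([row['features'] for row in dfA.collect()])
centers = np.array([[201.0, 55.0],
                    [204.0, 40.0],
                    [203.0, 84.0],
                    [204.0, 31.0]])
cluster_ids = [1, 2, 3, 7]

def closest_cluster(point):
    # squared Euclidean distance from the point to every centroid;
    # argmin picks the row index of the nearest one
    d2 = np.sum(np.square(centers - np.asarray(point)), axis=1)
    return cluster_ids[int(np.argmin(d2))]

print(closest_cluster([204.0, 40.0]))  # entity B -> cluster 2
```

In Spark this function could then be wrapped once, e.g. udf(closest_cluster, IntegerType()), and applied via dfB.withColumn('cluster_id', ...), avoiding a per-row temp table and query.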
Will this work for Spark DataFrames? The DataFrames in question are Spark DataFrames. If I convert to pandas, won't that bring the entire df onto a single machine?

You can apply
toPandas()
to convert a Spark DataFrame to a pandas DataFrame just fine. Read the docs for more information. Cheers :)

But the data volume is huge; converting to a pandas DataFrame would create an in-memory DataFrame far too large for local memory to handle. See:

I am not familiar with using SciPy with a PySpark DataFrame, so you may need to compute the Euclidean distance with a udf
from pyspark.sql.functions.

Yes, I used VectorAssembler to turn each DataFrame's dimX columns into a single column, and then a udf with the Euclidean distance. :)
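If toPandas() on the full DataFrame B is too large for driver memory, one middle ground (a sketch, not a Spark-specific recipe) is to keep only the small centroid table fully in memory and stream the entities through in chunks, e.g. rows obtained via dfB.toLocalIterator(). Using the toy data from above, with an artificially small chunk size of 2:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Small centroid table (stands in for the ~100-row DataFrame A)
cluster = pd.DataFrame({'cluster_id': [1, 2, 3, 7],
                        'dim1_pos': [201, 204, 203, 204],
                        'dim2_pos': [55, 40, 84, 31]}).set_index('cluster_id')

def assign_chunk(chunk):
    """Map one chunk of entities to their nearest centroid."""
    d = cdist(chunk[['dim1_pos', 'dim2_pos']], cluster, metric='euclidean')
    return pd.DataFrame({'entity_id': chunk['entity_id'].values,
                         'cluster_id': cluster.index[np.argmin(d, axis=1)]})

entity = pd.DataFrame({'entity_id': ['A', 'B', 'C'],
                       'dim1_pos': [201, 204, 203],
                       'dim2_pos': [55, 40, 84]})
# Only the chunk and the centroid table are in memory at once
parts = [assign_chunk(entity.iloc[i:i + 2]) for i in range(0, len(entity), 2)]
print(pd.concat(parts, ignore_index=True))
```

In practice the chunk size would be thousands of rows, and each assigned chunk could be written out incrementally instead of concatenated.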