
Python PySpark: Difficulty implementing KMeans with map-reduce functions


I am currently tasked, in a distributed databases class, with creating an implementation of k-means using a map-reduce-based approach (yes, I know there is a pre-built function for it, but the assignment is specifically to implement your own approach). While I have worked out the method itself, I am struggling to implement it through proper use of the map and reduce functions.

from pyspark.sql import Row

# Squared Euclidean distance between two points.
def Find_dist(x,y):
  sum = 0
  vec1 = list(x)
  vec2 = list(y)
  for i in range(len(vec1)):
    sum = sum + (vec1[i]-vec2[i])*(vec1[i]-vec2[i])
  return sum

# Assign a data point to the closest of the current centroids.
def mapper(cent, datapoint):
  min = Find_dist(datapoint,cent[0])
  closest = cent[0]
  for i in range(1,len(cent)):
    curr = Find_dist(datapoint,cent[i])
    if curr < min:
      min = curr
      closest = cent[i]
  yield closest,datapoint

# Intended combiner: count the points of a cluster and build their partial sum.
def combine(x):
  Key = x[0]
  Values = x[1]
  sum = [0]*len(Key)
  counter = 0
  for datapoint in Values:
    vec = list(datapoint[0])
    counter = counter+1
    sum = sum+vec
  point = Row(vec)
  result = (counter,point)
  yield Key, result

# Intended reducer: average the accumulated sums into the new centroid.
def Reducer(x):
  Key = x[0]
  Values = x[1]
  sum = [0]*len(Key)
  counter = 0
  for datapoint in Values:
    vec = list(datapoint[0])
    counter = counter+1
    sum = sum+vec
  avg = [0]*len(Key)
  for i in range(len(Key)):
    avg[i] = sum[i]/counter
  centroid = Row(avg)
  yield Key, centroid

# Driver loop: sample initial centers, then iterate map/combine/reduce
# until the centroids stop changing or max_iter is reached.
def kmeans_fit(data,k,max_iter):
  centers = data.rdd.takeSample(False,k,seed=42)
  for i in range(max_iter):
    mapped = data.rdd.map(lambda x: mapper(centers,x))
    combined = mapped.reduceByKeyLocally(lambda x: combiner(x))
    reduced = combined.reduceByKey(lambda x: Reducer(x)).collect()
    flag = True
    for i in range(k):
      if(reduced[i][1] != reduced[i][0] ):
        for j in range(k):
          centers[i] = reduced[i][1]
        flag = False
        break
    if (flag):
      break
  return centers

data = spark.read.parquet("/mnt/ddscoursedatabricksstg/ddscoursedatabricksdata/random_data.parquet")
kmeans_fit(data,5,10)
My main problem is that I am having trouble working with the DataFrame and with the map, reduceByKeyLocally and reduceByKey functions. Currently the run fails when calling reduceByKeyLocally(lambda x: combiner(x)) with "ValueError: not enough values to unpack (expected 2, got 1)". I really need to get this working as soon as possible, so please, I would greatly appreciate any help with this, and thank you in advance.
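
A likely cause of that error: because mapper contains a yield, mapper(centers, x) returns a generator object, so data.rdd.map(...) produces an RDD of generators rather than (key, value) pairs, and reduceByKeyLocally fails when it tries to unpack each element into a key and a value. Two further things worth checking: reduceByKeyLocally returns a plain Python dict on the driver rather than an RDD, so it cannot be followed by reduceByKey, and both of those methods expect a two-argument function that merges two values sharing the same key, not a one-argument function over a whole (key, values) pair (the lambda also calls combiner while the function is defined as combine). Below is a minimal sketch of one way to arrange the map and reduce steps, under the assumption that every column of the DataFrame is numeric; the helper names squared_dist and closest_center are illustrative and not part of the original code.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def squared_dist(p, q):
  # Squared Euclidean distance between two equal-length numeric tuples.
  return sum((a - b) ** 2 for a, b in zip(p, q))

def closest_center(point, centers):
  # Index of the centroid nearest to `point`.
  return min(range(len(centers)), key=lambda i: squared_dist(point, centers[i]))

def kmeans_fit(df, k, max_iter, tol=1e-9):
  # Work on plain tuples of floats instead of Row objects.
  points = df.rdd.map(lambda row: tuple(float(v) for v in row)).cache()
  centers = points.takeSample(False, k, seed=42)

  for _ in range(max_iter):
    # Map: emit (centroid index, (point, 1)) so the reduce side can build
    # both the coordinate-wise sum and the point count for each cluster.
    assigned = points.map(lambda p: (closest_center(p, centers), (p, 1)))

    # Reduce: reduceByKey takes a two-argument function that merges two
    # values with the same key; here it adds vectors and counts.
    sums = assigned.reduceByKey(
      lambda a, b: (tuple(x + y for x, y in zip(a[0], b[0])), a[1] + b[1])
    )

    # New centroid = coordinate-wise mean of its cluster; clusters that
    # received no points keep their previous centroid.
    new_by_idx = sums.mapValues(lambda v: tuple(x / v[1] for x in v[0])).collectAsMap()
    new_centers = [new_by_idx.get(i, centers[i]) for i in range(k)]

    # Stop once no centroid has moved by more than `tol`.
    converged = all(squared_dist(c, n) <= tol for c, n in zip(centers, new_centers))
    centers = new_centers
    if converged:
      break

  return centers

With this structure the driver call keeps the same shape as before, e.g. centers = kmeans_fit(data, 5, 10). If the assignment requires an explicit combine step, combineByKey or aggregateByKey can carry the same (partial sum, count) pair per partition before the shuffle, which is what the combine function above appears to be aiming for.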