Python 2.7 pyspark-distributed-kmodes library error

I am trying to run the pyspark-distributed-kmodes example:

import numpy as np
data = np.random.choice(["a", "b", "c"], (50000, 10))
data2 = np.random.choice(["e", "f", "g"], (50000, 10))
data = list(data) + list(data2)

from random import shuffle
shuffle(data)

# Create a Spark RDD from our sample data and decrease partitions to max_partitions
max_partitions = 32

rdd = sc.parallelize(data)
rdd = rdd.coalesce(max_partitions)

for x in rdd.take(10):
    print x

from pyspark_kmodes import EnsembleKModes

method = EnsembleKModes(n_clusters, max_iter)
model = method.fit(df.rdd)

print(model.clusters)
print(method.mean_cost)

predictions = method.predictions
datapoints = method.indexed_rdd
combined = datapoints.zip(predictions)
print(combined.take(10))

model.predict(rdd).take(5)
I am using Python 2.7, Apache Zeppelin 0.7.1, and Apache Spark 2.1.0.
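For reference, both versions can be confirmed from inside the Zeppelin pyspark interpreter itself (sc is the SparkContext that Zeppelin provides):

import sys
print(sys.version)   # Python interpreter version, 2.7.x here
print(sc.version)    # Spark version reported by the running SparkContext, 2.1.0 here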

This is the error output:

('Iteration ', 0)

Traceback (most recent call last):
      File "/tmp/zeppelin_pyspark-1298251609305129154.py", line 349, in <module>
        raise Exception(traceback.format_exc())
    Exception: Traceback (most recent call last):
      File "/tmp/zeppelin_pyspark-1298251609305129154.py", line 337, in <module>
        exec(code)
      File "<stdin>", line 13, in <module>
      File "/usr/local/lib/python2.7/dist-packages/pyspark_kmodes/pyspark_kmodes.py", line 430, in fit
        self.n_clusters,self.max_dist_iter)
      File "/usr/local/lib/python2.7/dist-packages/pyspark_kmodes/pyspark_kmodes.py", line 271, in k_modes_partitioned
        clusters = check_for_empty_cluster(clusters, rdd)
      File "/usr/local/lib/python2.7/dist-packages/pyspark_kmodes/pyspark_kmodes.py", line 317, in check_for_empty_cluster
        random_element = random.choice(clusters[biggest_cluster].members)
      File "/usr/lib/python2.7/random.py", line 275, in choice
        return seq[int(self.random() * len(seq))]  # raises IndexError if seq is empty
    IndexError: list index out of range
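The last frame shows the mechanical failure: random.choice raises IndexError on an empty sequence, so check_for_empty_cluster apparently selected a cluster whose members list is empty. The same error reproduces in isolation:

import random
random.choice([])   # IndexError: list index out of range, as in the traceback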
The RDD used to fit the model is not empty; I have checked that. I suspect a version incompatibility between pyspark-distributed-kmodes and Spark, but I cannot downgrade Spark.
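The emptiness check I ran was along these lines:

print(rdd.isEmpty())   # False, so the input RDD does contain records
print(rdd.count())     # 100000, matching the two 50000-row blocks generated above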


Do you have any idea how to solve this?

What is df? This does not look like a Spark error. The code from the example works for me under Spark 2.1.0. It even works when I change this line of your code:

method = EnsembleKModes(n_clusters, max_iter)
model = method.fit(rdd)
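For completeness, here is a minimal end-to-end version that runs for me under Spark 2.1.0. The import path follows the package layout shown in your traceback, and the n_clusters and max_iter values are placeholders I picked, since your snippet never defines them:

import numpy as np
from random import shuffle
from pyspark_kmodes import EnsembleKModes  # package name taken from your traceback

# Same synthetic data as in your post: two clearly separated groups
data = list(np.random.choice(["a", "b", "c"], (50000, 10)))
data += list(np.random.choice(["e", "f", "g"], (50000, 10)))
shuffle(data)

rdd = sc.parallelize(data).coalesce(32)

n_clusters = 2   # placeholder: the generated data has two obvious groups
max_iter = 10    # placeholder iteration cap

method = EnsembleKModes(n_clusters, max_iter)
model = method.fit(rdd)   # fit on the RDD itself, not df.rdd

print(model.clusters)
print(method.mean_cost)
model.predict(rdd).take(5)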

You are right, df was the mistake. It works fine now. Thanks!