Python: an efficient way to count unique values for each key


python apache-spark pyspark


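I have a list of members, each with many attributes, two of which are the name and the ID. I would like to get a list of tuples in an RDD. Each tuple should contain the ID as its first element and, as its second element, the count of unique names associated with that ID, e.g. (ID, number of unique names associated with that ID).

Here is the code I wrote to accomplish this: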

IDnametuple = members.map(lambda a: (a.ID, a.name))   # extract only ID and name
idnamelist = IDnametuple.groupByKey()                 # group the IDs together 
idnameunique_count = (idnamelist
     # set(tup[1]) should extract unique elements, 
     # and len should tell the number of them
    .map(lambda tup: (tup[0], len(set(tup[1])))))

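It is very slow, and much slower than a similar operation that counts unique attributes for each member.

Is there a faster way? I tried to use as many built-ins as possible, which, as far as I understand, is the right way to speed things up.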

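Without any details we can only guess, but the obvious suspect is groupByKey. If each ID is associated with a large number of names, it can be quite expensive because of the extensive shuffling. The simplest improvement is to use aggregateByKey or combineByKey: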

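def create_combiner(x):
    # Wrap the first value in a one-element set; a bare set(x) would split a string name into characters
    return {x}

def merge_value(acc, x):
    acc.add(x)
    return acc

def merge_combiners(acc1, acc2):
    acc1.update(acc2)
    return acc1

id_name_unique_count = (id_name_tuple  # Keep consistent naming convention
    .combineByKey(create_combiner, merge_value, merge_combiners)
    .mapValues(len))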
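The aggregateByKey variant mentioned above can be sketched in the same spirit. The snippet below is only an illustration, not the answerer's code: it emulates the map-side and reduce-side steps in plain Python on made-up data (the partitions list and the helper combine_locally are hypothetical), so it runs without a cluster. Starting from an empty set as the zero value also sidesteps calling set() on a string value, which would split it into characters.

```python
# Hypothetical sample data: two "partitions" of (ID, name) pairs.
partitions = [
    [(0, "ann"), (0, "ann"), (0, "ann"), (0, "bob"), (1, "ann"), (1, "ann")],
    [(0, "bob"), (0, "bob"), (1, "cid"), (1, "ann"), (2, "zoe"), (2, "zoe")],
]

def seq_op(acc, name):
    """Fold one name into a partition-local set (runs before the shuffle)."""
    acc.add(name)
    return acc

def comb_op(acc1, acc2):
    """Merge two partial sets for the same key (runs after the shuffle)."""
    acc1.update(acc2)
    return acc1

def combine_locally(partition):
    """Emulate the map-side step: one set of names per key per partition."""
    local = {}
    for key, name in partition:
        local[key] = seq_op(local.get(key, set()), name)
    return local

partials = [combine_locally(p) for p in partitions]

# Emulate the reduce side: merge the per-partition sets key by key.
merged = {}
for partial in partials:
    for key, names in partial.items():
        merged[key] = comb_op(merged.get(key, set()), names)

unique_counts = {key: len(names) for key, names in merged.items()}

# Rough comparison of how many names would cross the shuffle on this toy input.
records_without_combining = sum(len(p) for p in partitions)  # groupByKey-style
records_with_combining = sum(
    len(names) for partial in partials for names in partial.values()
)

print(unique_counts)
print(records_without_combining, records_with_combining)

# With a real RDD the same two functions would be passed as:
# id_name_tuple.aggregateByKey(set(), seq_op, comb_op).mapValues(len)
```

The more duplicates a partition holds per key, the larger the gap between the two counts, which is where the saving over groupByKey comes from.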

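If the expected number of unique values is large, you may prefer to replace the exact approach with an approximation. One possible approach is to use a Bloom filter to track unique values instead of a set.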
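As a rough illustration of the Bloom filter idea, here is a minimal pure-Python sketch. Everything in it is an assumption made for the example: the BloomFilter class, its parameters (16384 bits, 4 hash functions), the salted SHA-256 hashing, and the sample names. The count is estimated from the fraction of set bits using the standard estimate n ≈ -(m/k) · ln(1 - X/m), where m is the number of bits, k the number of hash functions, and X the number of set bits.

```python
import hashlib
import math


class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit set (illustrative only)."""

    def __init__(self, num_bits=16384, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos
        return self  # returning self lets add() double as a merge-value function

    def merge(self, other):
        # Two filters with identical parameters merge with a bitwise OR.
        self.bits |= other.bits
        return self

    def estimate_count(self):
        set_bits = bin(self.bits).count("1")
        if set_bits >= self.num_bits:
            return float("inf")  # filter is saturated; no meaningful estimate
        fill = set_bits / self.num_bits
        return -(self.num_bits / self.num_hashes) * math.log(1 - fill)


# Two made-up "partitions" for one key, with duplicates and overlap (1000 distinct names).
part1 = [f"name{i}" for i in range(600)] * 2
part2 = [f"name{i}" for i in range(400, 1000)]

bf1, bf2 = BloomFilter(), BloomFilter()
for name in part1:
    bf1.add(name)
for name in part2:
    bf2.add(name)

approx_unique = bf1.merge(bf2).estimate_count()
print(round(approx_unique))
```

In a combineByKey pipeline, add would play the role of the merge-value function and merge that of the merge-combiners function, with each filter taking a fixed amount of memory regardless of how many names a key has. The accuracy depends on choosing num_bits well above the expected number of unique values.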

For more information on groupByKey versus aggregateByKey (reduceByKey, combineByKey), see:


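Another approach is to remove duplicate (ID, name) pairs with distinct and then count the remaining pairs per ID with reduceByKey:

from operator import add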
IDnametuple = sc.parallelize([(0, "a"),(0, "a"),(0, "b"),(1, "a"),(1, "b"),(1, "c"),(2, "z")])
idnameunique_count = (IDnametuple.distinct()
                                  .map(lambda idName : (idName[0], 1))
                                  .reduceByKey(add))

idnameunique_count.collect()

[(0, 2), (1, 3), (2, 1)]
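To sanity-check this result without a Spark context, the same logic can be mirrored in plain Python. This is only a local sketch under the assumption that the data fits in memory: a set stands in for distinct, and a Counter stands in for the map plus reduceByKey(add) step.

```python
from collections import Counter

# Same sample pairs as in the Spark snippet above.
pairs = [(0, "a"), (0, "a"), (0, "b"), (1, "a"), (1, "b"), (1, "c"), (2, "z")]

# A set of pairs drops exact duplicates, like .distinct().
distinct_pairs = set(pairs)

# Counting one per remaining pair, grouped by ID, mirrors map(...).reduceByKey(add).
counts = Counter(key for key, _name in distinct_pairs)

result = sorted(counts.items())
print(result)  # [(0, 2), (1, 3), (2, 1)]
```

Sorting is applied here only to make the output order deterministic; collect() on an RDD does not guarantee any particular key order.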


