Python PySpark: creating a histogram for each key in a pair RDD


I'm new to PySpark. I have a pair RDD of (key, value). I would like to create a histogram with n buckets for each key. The output would look like this:

[(key1, [...buckets...], [...counts...]),
 (key2, [...buckets...], [...counts...])]
I have seen examples that retrieve the max or the sum for each key, but is there a way to apply a histogram function to the values of each key?

Try:

>>> import numpy as np
>>>
>>> # Python 3 has no tuple-unpacking lambdas, so index into the (key, values) pair
>>> rdd.groupByKey().map(lambda kv: np.histogram(list(kv[1])))
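
A variation on this one-liner that keeps the key and lets the number of buckets be chosen could look like the sketch below. The helper name `to_histogram` and the `n_bins` parameter are illustrative assumptions, not part of the original answer. The values are materialized with `list(...)` before `np.histogram` is called, since `groupByKey` hands back an iterable rather than a list.

```python
import numpy as np

def to_histogram(values, n_bins=10):
    """Turn an iterable of numbers into (bucket_edges, counts) as plain lists."""
    # np.histogram returns (counts, edges); materialize the iterable first
    counts, edges = np.histogram(list(values), bins=n_bins)
    return edges.tolist(), counts.tolist()

# Local check without Spark
print(to_histogram([1, 2, 2, 3, 5], n_bins=4))
# ([1.0, 2.0, 3.0, 4.0, 5.0], [1, 2, 1, 1])

# With an existing pair RDD named `rdd` (assumed, not created here):
#   rdd.groupByKey().mapValues(lambda vs: to_histogram(vs, n_bins=4)).collect()
```

Using `mapValues` instead of `map` leaves the key untouched, so each output record has the shape (key, (buckets, counts)).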

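I know this post is quite old, but for anyone still looking for a PySpark solution, here are my two cents on the problem.

Let's consider a (key, value) pair RDD, and let "histogram" mean a plain counter of how many distinct values we have for each key, along with their respective cardinality. aggregateByKey is a good way to go. In aggregateByKey you basically declare three input values: the aggregator default value, the within-partition aggregation function, and the between-partition aggregation function.

Let's consider an RDD of the form: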
[(124, 2),
 (124, 2),
 (124, 2),
 (125, 2),
 (125, 2),
 (125, 2),
 (126, 2),
 (126, 2),
 (126, 2),
 (127, 2),
 (127, 2),
 (127, 2),
 (128, 2),
 (128, 2),
 (128, 2),
 (129, 2),
 (129, 2),
 (129, 2),
 (130, 2),
 (130, 2),
 (130, 2),
 (131, 2),
 (131, 2),
 (131, 2),
 (132, 2),
 (132, 2),
 (132, 2),
 (133, 2),
 (133, 2),
 (133, 2),
 (134, 2),
 (134, 2),
 (134, 2),
 (135, 2),
 (135, 2),
 (135, 2),
 (136, 2),
 (136, 1),
 (136, 2),
 (137, 2),
 (137, 2),
 (137, 2),
 (138, 2),
 (138, 2),
 (138, 2),
 (139, 2),
 (139, 2),
 (139, 2),
 (140, 2),
 (140, 2),
 (140, 2),
 (141, 2),
 (141, 1),
 (141, 1),
 (142, 2),
 (142, 2),
 (142, 2),
 (143, 2),
 (143, 2),
 (143, 2),
 (144, 1),
 (144, 1),
 (144, 2),
 (145, 1),
 (145, 1),
 (145, 1),
 (146, 2),
 (146, 2),
 (146, 2),
 (147, 2),
 (147, 2),
 (147, 2),
 (148, 2),
 (148, 2),
 (148, 2),
 (149, 2),
 (149, 2),
 (149, 2),
 (150, 2),
 (150, 2),
 (150, 2),
 (151, 2),
 (151, 2),
 (151, 2),
 (152, 2),
 (152, 2),
 (152, 2),
 (153, 2),
 (153, 1),
 (153, 2),
 (154, 2),
 (154, 2),
 (154, 2),
 (155, 2),
 (155, 1),
 (155, 2),
 (156, 2),
 (156, 2),
 (156, 2),
 (157, 1),
 (157, 2),
 (157, 2),
 (158, 2),
 (158, 2),
 (158, 2),
 (159, 2),
 (159, 2),
 (159, 2),
 (160, 2),
 (160, 2),
 (160, 2),
 (161, 2),
 (161, 1),
 (161, 2),
 (162, 2),
 (162, 2),
 (162, 2),
 (163, 2),
 (163, 1),
 (163, 2),
 (164, 2),
 (164, 2),
 (164, 2),
 (165, 2),
 (165, 2),
 (165, 2),
 (166, 2),
 (166, 1),
 (166, 2),
 (167, 2),
 (167, 2),
 (167, 2),
 (168, 2),
 (168, 1),
 (168, 1),
 (169, 2),
 (169, 2),
 (169, 2),
 (170, 2),
 (170, 2),
 (170, 2),
 (171, 2),
 (171, 2),
 (171, 2),
 (172, 2),
 (172, 2),
 (172, 2),
 (173, 2),
 (173, 2),
 (173, 1),
 (174, 2),
 (174, 1),
 (174, 1),
 (175, 1),
 (175, 1),
 (175, 1),
 (176, 1),
 (176, 1),
 (176, 1),
 (177, 2),
 (177, 2),
 (177, 2)]
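As far as I know, the easiest way is to aggregate the values of each key into a Python dictionary, where the dictionary keys are the RDD values and the value associated with each dictionary key is a counter of how many times that RDD value occurs. The RDD key does not need to be handled explicitly, because aggregateByKey takes care of it automatically.

The aggregation call has the form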

myRDD.aggregateByKey(dict(), withinPartition, betweenPartition)
We initialize all accumulators as empty dictionaries.

The within-partition aggregation function therefore has the following form

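def withinPartition(dictionary, record):
    # dictionary: {RDD value: count so far}; record: one RDD value for this key
    if record in dictionary:
        dictionary[record] += 1  # value seen before: increment its counter
    else:
        dictionary[record] = 1   # first occurrence: initialize the counter
    return dictionary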
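Here dictionary is the per-RDD-value counter, while record is a given RDD value (an integer in this example; see the RDD sample above). Basically, if the given RDD value already exists in the dictionary, we increment its counter by 1. Otherwise, we initialize the counter.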
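
To see what the within-partition function does on its own, the following sketch folds it over the values of a single key without Spark. The values are the three records of key 141 from the sample RDD. Using `functools.reduce` as a stand-in for Spark's per-partition loop is an assumption made purely for illustration, and the function body is a `dict.get` variant of the same logic.

```python
from functools import reduce

def withinPartition(dictionary, record):
    # Same counting logic, written with dict.get
    dictionary[record] = dictionary.get(record, 0) + 1
    return dictionary

# Values of key 141 in the sample RDD: (141, 2), (141, 1), (141, 1)
values_for_141 = [2, 1, 1]

# Start from an empty dict, mirroring the zero value passed to aggregateByKey
counter = reduce(withinPartition, values_for_141, dict())
print(counter)  # {2: 1, 1: 2}
```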

The between-partition function works in much the same way

def betweenPartition(dictionary1, dictionary2):
    return {k: dictionary1.get(k, 0) + dictionary2.get(k, 0) for k in set(dictionary1) | set(dictionary2)}
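Basically, for a given RDD key, suppose we have two dictionaries. We merge them into a single dictionary by summing the values of each key, or by adding the key if it is present in only one of the two dictionaries (a logical OR over the key sets). The dictionary-merging expression is not my own; credit goes to its original source.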
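
How the two functions work together can be checked locally with a small simulation. This is only a sketch: `simulate_aggregate_by_key` is a hypothetical helper and not a PySpark API, and the two hand-made lists merely stand in for Spark partitions. The records are those of keys 141 and 145 from the sample RDD.

```python
import copy

def withinPartition(dictionary, record):
    dictionary[record] = dictionary.get(record, 0) + 1
    return dictionary

def betweenPartition(dictionary1, dictionary2):
    return {k: dictionary1.get(k, 0) + dictionary2.get(k, 0)
            for k in set(dictionary1) | set(dictionary2)}

def simulate_aggregate_by_key(partitions, zero_value, seq_func, comb_func):
    """Hypothetical local stand-in for rdd.aggregateByKey (not a Spark API)."""
    merged = {}
    for partition in partitions:
        # Step 1: fold every record of one partition into per-key accumulators
        partial = {}
        for key, value in partition:
            acc = partial.get(key)
            if acc is None:
                acc = copy.deepcopy(zero_value)  # fresh accumulator per key
            partial[key] = seq_func(acc, value)
        # Step 2: combine this partition's accumulators with the earlier ones
        for key, acc in partial.items():
            merged[key] = comb_func(merged[key], acc) if key in merged else acc
    return merged

# Records of keys 141 and 145, split by hand into two "partitions"
partitions = [
    [(141, 2), (141, 1), (145, 1)],
    [(141, 1), (145, 1), (145, 1)],
]
result = simulate_aggregate_by_key(partitions, dict(), withinPartition, betweenPartition)
print(result[141], result[145])
```

The two dictionaries compare equal to `{1: 2, 2: 1}` and `{1: 3}`, whatever order the keys are printed in.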

The resulting RDD will have the following form

[(162, {2: 3}),
 (132, {2: 3}),
 (168, {1: 2, 2: 1}),
 (138, {2: 3}),
 (174, {1: 2, 2: 1}),
 (144, {1: 2, 2: 1}),
 (150, {2: 3}),
 (156, {2: 3}),
 (126, {2: 3}),
 (163, {1: 1, 2: 2}),
 (133, {2: 3}),
 (169, {2: 3}),
 (139, {2: 3}),
 (175, {1: 3}),
 (145, {1: 3}),
 (151, {2: 3}),
 (157, {1: 1, 2: 2}),
 (127, {2: 3}),
 (128, {2: 3}),
 (164, {2: 3}),
 (134, {2: 3}),
 (170, {2: 3}),
 (140, {2: 3}),
 (176, {1: 3}),
 (146, {2: 3}),
 (152, {2: 3}),
 (158, {2: 3}),
 (129, {2: 3}),
 (165, {2: 3}),
 (135, {2: 3}),
 (171, {2: 3}),
 (141, {1: 2, 2: 1}),
 (177, {2: 3}),
 (147, {2: 3}),
 (153, {1: 1, 2: 2}),
 (159, {2: 3}),
 (160, {2: 3}),
 (130, {2: 3}),
 (166, {1: 1, 2: 2}),
 (136, {1: 1, 2: 2}),
 (172, {2: 3}),
 (142, {2: 3}),
 (148, {2: 3}),
 (154, {2: 3}),
 (124, {2: 3}),
 (161, {1: 1, 2: 2}),
 (131, {2: 3}),
 (167, {2: 3}),
 (137, {2: 3}),
 (173, {1: 1, 2: 2}),
 (143, {2: 3}),
 (149, {2: 3}),
 (155, {1: 1, 2: 2}),
 (125, {2: 3})]

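The original RDD keys are still present in this new RDD. Each new RDD value is a dictionary. In turn, each dictionary key corresponds to one of the possible RDD values, while each dictionary value is a counter of how many times that RDD value occurs for the given RDD key.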
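
If the (key, buckets, counts) layout from the question is preferred over dictionaries, each record can be reshaped afterwards. The sketch below assumes that the "buckets" are simply the distinct values in sorted order, not numeric ranges. The helper name `to_buckets_and_counts` and the RDD name `resultRDD` are hypothetical.

```python
def to_buckets_and_counts(key, counter):
    """Split a {value: count} dict into parallel lists sorted by value."""
    buckets = sorted(counter)
    counts = [counter[b] for b in buckets]
    return (key, buckets, counts)

# A few records copied from the aggregated result
aggregated = [(162, {2: 3}), (168, {1: 2, 2: 1}), (175, {1: 3})]
reshaped = [to_buckets_and_counts(k, d) for k, d in aggregated]
print(reshaped)
# [(162, [2], [3]), (168, [1, 2], [2, 1]), (175, [1], [3])]

# On the aggregated RDD itself (assumed to be named `resultRDD`):
#   resultRDD.map(lambda kv: to_buckets_and_counts(kv[0], kv[1]))
```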

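This did not work for me. np.histogram does not accept the "ResultIterable" produced by groupByKey.

Please explain why you think this would help. Code-only answers are usually not very useful.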