Apache Spark: count distinct values by key

apache-spark pyspark


I am new to Spark and know the commands below. They give the count of values per key and the list of values per key:

dayToHostPairTuple.countByKey()
dayToHostPairTuple.groupByKey()

Apart from countByKey, is there a simple alternative that counts only the distinct values per key?

Update: the code below works for me. It is based on the answers I received.

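# Build (day, host) pairs from the parsed access logs.
dayToHostPairTuple = access_logs.map(lambda log: (log.date_time.day, log.host))
dayToHostPairTuple = dayToHostPairTuple.sortByKey()
# Drop duplicate (day, host) pairs, then count the remaining pairs per day.
print(dayToHostPairTuple.distinct().countByKey())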

Assuming the values are hashable, you can use distinct with countByKey:

dayToHostPairTuple.distinct().countByKey()

or with reduceByKey:

from operator import add

dayToHostPairTuple.distinct().keys().map(lambda x: (x, 1)).reduceByKey(add)
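To make the semantics concrete, here is a minimal plain-Python model of what distinct() followed by countByKey() computes. It runs without Spark; the sample pairs and the helper name count_distinct_by_key are made up for illustration and are not part of any Spark API.

```python
from collections import Counter

def count_distinct_by_key(pairs):
    """Local model of rdd.distinct().countByKey() for a list of (key, value) pairs."""
    # Like distinct(): collapse repeated (key, value) pairs; values must be hashable.
    unique_pairs = set(pairs)
    # Like countByKey(): count how many remaining pairs each key has.
    return Counter(key for key, _ in unique_pairs)

# Hypothetical (day, host) pairs; host "a" appears twice on day 1.
sample_pairs = [(1, "a"), (1, "a"), (1, "b"), (2, "a")]
distinct_counts = count_distinct_by_key(sample_pairs)
print(dict(distinct_counts))
```

Without the set() step, day 1 would be counted three times instead of twice, which is the difference between countByKey() alone and the distinct count asked about here.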


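I suggest:

dayToHostPairTuple.countApproxDistinctByKey(0.005)

From the documentation:

Return the approximate number of distinct values for each key in this RDD. The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm". relativeSD: relative accuracy. Smaller values create counters that require more space. It must be greater than 0.000017.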
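As a rough illustration of why smaller relativeSD values need more space: the standard error of a HyperLogLog sketch with m registers is approximately 1.04 / sqrt(m). The sketch below simply inverts that textbook relation. It is an estimate under that assumption, not Spark's exact internal computation, and the function name hll_registers_for is hypothetical.

```python
import math

def hll_registers_for(relative_sd):
    """Estimate HyperLogLog size for a target relative standard deviation.

    Assumes the textbook relation sd ~= 1.04 / sqrt(m), with m a power of two.
    Returns (index bits, number of registers).
    """
    min_registers = (1.04 / relative_sd) ** 2
    precision_bits = math.ceil(math.log2(min_registers))
    return precision_bits, 2 ** precision_bits

for sd in (0.05, 0.005):
    bits, registers = hll_registers_for(sd)
    print(f"relativeSD={sd}: about {registers} registers ({bits} index bits)")
```

Under this estimate, tightening the accuracy by a factor of ten costs roughly a hundred times more registers, which may matter when a separate counter is kept for every key.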


I'm not sure I understand what you are trying to achieve, but assuming I have it right, you can solve it by mapping each (k, v) to ((k, v), 1) and then reducing to ((k, v), count).

I tried your suggestion. My code is below:

dayToHostPairTuple = access_logs.map(lambda log: (str(log.date_time.day) + "-" + str(log.host), 1))

print(dayToHostPairTuple.reduceByKey())

How should I change the second line?

You don't want to combine them as a string; you want to combine them as a Python data structure. So something like dayToHostPairTuple.map(lambda kv: (kv, 1)).
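The map-then-reduce idea in this answer can be modelled in plain Python without Spark. This sketch pairs each (k, v) with 1, sums the ones per pair, and then counts how many distinct pairs share each key. The sample data and variable names are illustrative only.

```python
from collections import defaultdict

# Hypothetical (day, host) pairs standing in for the RDD's contents.
pairs = [(1, "a"), (1, "a"), (1, "b"), (2, "a")]

# Step 1: treat each (k, v) as ((k, v), 1) and sum the ones per pair,
# mirroring map(lambda kv: (kv, 1)) followed by reduceByKey(add).
pair_counts = defaultdict(int)
for kv in pairs:
    pair_counts[kv] += 1

# Step 2: every distinct (k, v) pair contributes 1 to its key k.
distinct_per_key = defaultdict(int)
for key, _value in pair_counts:
    distinct_per_key[key] += 1

print(dict(pair_counts))
print(dict(distinct_per_key))
```

Keeping the pair as a tuple, rather than joining day and host into a string, is what lets the second step recover the key again.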


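dayToHostPairTuple.distinct().countByKey() works. Is there a way to sort by key, so that the output is in ascending key order? I tried

dayToHostPairTuple.distinct().countByKey().sortByKey(true)

but I get an error :(

Without collecting, you can use sortByKey on the RDD. Locally (countByKey), the result is just a standard Python dict, so you can extract the items and sort them. Also, true is not a valid Python boolean (it is True), and countByKey does not return an RDD.

I recommend reduceByKey, which runs on the executors, over countByKey, which gathers the counts on the driver, especially for large datasets. To get the result as a dict, add: final_counts = dict(dayToHostPairTuple.collect()).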
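Since countByKey returns an ordinary dict on the driver, sorting it is a local Python operation. A minimal sketch, using a made-up dict in place of the real countByKey() result:

```python
# Stand-in for the dict returned by countByKey(); the values are invented.
counts_by_day = {22: 4, 3: 17, 10: 9}

# sorted() on the items orders the (key, count) tuples by key, ascending.
sorted_counts = sorted(counts_by_day.items())

for day, count in sorted_counts:
    print(day, count)
```

The same sorted(...) call would apply to final_counts.items() if the counts were collected from a reduceByKey result instead.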
