Python: most efficient way to average all elements of a list that occur at least half as often as the list's mode

python list numpy


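I have a specific task to perform in Python. Efficiency and speed matter most here, which is why I am posting this question.

I need the average of the items in a list, but only of those items that occur at least half as many times as the list's mode.

For example, if the list is

[1,2,2,3,4,4,4,4]

I need the average of 2,2,4,4,4,4. Since 4 is the mode of the list and occurs four times, the only other element that occurs at least half of four times (that is, twice) is 2. So I discard all occurrences of 1 and 3 and average what remains.

I'm not sure what the most efficient approach is. I know how to compute this by brute force, but that is obviously not the fastest implementation.

I thought numpy arrays might be the best fit, but because I will be appending to the list frequently, I don't think they are the best choice.

My other idea was a Counter-based approach using the collections module. But again, I don't know whether that is the fastest or the most sensible way to do it.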
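To get the mode of a list you have to traverse the whole list at least once (technically you could stop as soon as one element's count exceeds the number of remaining items, but the gain in efficiency is negligible).
Python's Counter provides an efficient and simple way to do this: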
from __future__ import division
from collections import Counter
from itertools import islice

data = [1,2,2,3,4,4,4,4]
c = Counter(data)

# Get the mode
mode, mode_n = c.most_common(1)[0]

# Store the cumulative sum and count so we can compute the mean
# Process the most common element (the mode) first since we
# already have that data.
cumulative_sum = mode * mode_n
cumulative_n = mode_n

# Process the remaining elements. most_common returns the remaining
# elements and their counts in descending order by the number of times
# they appear in the original list.  We can skip the first element since
# we've already processed it.  As soon as an element is less numerous
# than half the mode, we can stop processing further elements.
for val, val_n in islice(c.most_common(), 1, None):
    if val_n < mode_n / 2:
        break
    cumulative_sum += val * val_n
    cumulative_n += val_n

# Compute the Mean
avg = cumulative_sum / cumulative_n
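Since the question mentions appending to the list frequently, one possible variation is to keep the Counter itself up to date instead of recounting the list each time. The following is only a sketch under that assumption; the class name `RunningModeAverage` and its methods are hypothetical and not part of the original answer. It scans the counts directly rather than calling most_common, so no sorting is involved.

```python
from collections import Counter


class RunningModeAverage:
    """Hypothetical helper that keeps value counts current as items arrive."""

    def __init__(self, items=()):
        self.counts = Counter(items)

    def add(self, value):
        # Constant-time update; the list is never recounted.
        self.counts[value] += 1

    def average(self):
        if not self.counts:
            raise ValueError("no data to average")
        mode_n = max(self.counts.values())
        total = 0
        n = 0
        for value, value_n in self.counts.items():
            # Keep values that occur at least half as often as the mode.
            if 2 * value_n >= mode_n:
                total += value * value_n
                n += value_n
        return total / n


running = RunningModeAverage([1, 2, 2, 3, 4, 4, 4, 4])
first = running.average()   # averages 2,2,4,4,4,4
running.add(1)              # 1 now occurs twice, so it is kept as well
second = running.average()  # averages 1,1,2,2,4,4,4,4
```

Each call to `average()` costs time proportional to the number of distinct values, not to the length of the list, which may help when the list is long but contains few distinct values.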

If you decide to use numpy, here is a concise approach using numpy.unique and numpy.average:
In [54]: x = np.array([1, 2, 2, 3, 4, 4, 4, 4])

In [55]: uniqx, counts = np.unique(x, return_counts=True)

In [56]: keep = counts >= 0.5*counts.max()

In [57]: np.average(uniqx[keep], weights=counts[keep])
Out[57]: 3.3333333333333335

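Note that np.unique sorts its argument, so its time complexity is O(n*log(n)), whereas this problem can be solved with an O(n) algorithm. Before ruling out this approach on the basis of asymptotic time complexity, run some timing comparisons with arrays of a typical length.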
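A timing comparison of that kind could be sketched roughly as follows. The function names `counter_average` and `numpy_average`, the list sizes, and the value range are all assumptions chosen for illustration. The numpy timing deliberately includes the list-to-array conversion, on the assumption that the data lives in a Python list as in the question.

```python
import timeit
from collections import Counter

import numpy as np


def counter_average(data):
    # O(n) counting, then one pass over the distinct values.
    counts = Counter(data)
    mode_n = max(counts.values())
    kept = [(v, n) for v, n in counts.items() if 2 * n >= mode_n]
    return sum(v * n for v, n in kept) / sum(n for _, n in kept)


def numpy_average(data):
    # np.unique sorts internally, hence O(n*log(n)).
    uniqx, counts = np.unique(np.asarray(data), return_counts=True)
    keep = counts >= 0.5 * counts.max()
    return float(np.average(uniqx[keep], weights=counts[keep]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for size in (10, 1000, 10000):
        data = rng.integers(0, 50, size=size).tolist()
        t_counter = timeit.timeit(lambda: counter_average(data), number=20)
        t_numpy = timeit.timeit(lambda: numpy_average(data), number=20)
        print(size, t_counter, t_numpy)
```

The absolute numbers depend on the machine and on how many distinct values the data contains, so the sizes and value range should be replaced with ones typical of the real workload.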
The Counter solution is probably your best bet. Interesting question :) Have you measured the performance of some of the possibilities?

I tried both answers in my application, and yours was the fastest, although the difference is not very significant.

I tried both answers in the context of my application and timed them. Yours was about as fast as the Counter-based one, and even slightly faster when the list was small. As the list grew, however, yours became slightly slower. Not by much, perhaps 0.05 to 0.1 ms. Still, there will be cases where the list is longer, so I had to accept the other answer as correct, but I will upvote yours as well.