Summing travel times over a large list in Python 3

python performance python-3.x


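I have a very large list (~2 GB) recording travel times between different locations. Each pair of locations has multiple values listed, and the pairs repeat, like this: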

Raw_Travel_Times=[('AB',2),('BC',5),('AB',8),('BC',10),('BC',7)]

I am trying to efficiently compute the average travel time between each pair of locations, for example:

Ave_Travel_Times=[('AB',5),('BC',7.33)]

I thought using Counter would be a workable approach, but the best solution I have come up with is far too slow:

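from collections import Counter

# count how many times each Origin-Destination pair occurs
Trips = dict(Counter(link for link, _ in Raw_Travel_Times))
# {'AB': 2, 'BC': 3}

# total travel time for each Origin-Destination pair
CTime = Counter()
for t in Raw_Travel_Times:
    # builds a brand-new Counter on every iteration, which is the slow part
    CTime = CTime + Counter({t[0]: t[1]})

# average travel time for each Origin-Destination pair
Ave_Travel_Times = []
for c in CTime:
    Link = c
    Total_Time = CTime[c]
    Num_Trips = Trips[c]
    Avetime = Total_Time / Num_Trips
    Ave_Travel_Times.append((Link, Avetime))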

There must be a more efficient way to do this, but I clearly cannot see it. Any help would be much appreciated.

defaultdict is probably what you want:

location_times = [('AB',2),('BC',5),('AB',8),('BC',10),('BC',7)]

from collections import defaultdict
from statistics import mean

dd = defaultdict(list)

for location, time in location_times:
    dd[location].append(time)

result = {location: mean(times) for location, times in dd.items()}
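
If holding every individual time in memory is a concern for a list of this size, a variation on the same idea is to keep only a running total and a count per link. The following is a minimal sketch under that assumption; the function name average_times is hypothetical and is not part of the answer above.

```python
from collections import defaultdict

def average_times(pairs):
    """Return {link: mean time} in one pass, storing two numbers per link."""
    totals = defaultdict(float)  # running sum of times per link
    counts = defaultdict(int)    # number of observations per link
    for link, time in pairs:
        totals[link] += time
        counts[link] += 1
    # Divide each total by its count to get the mean
    return {link: totals[link] / counts[link] for link in totals}

location_times = [('AB', 2), ('BC', 5), ('AB', 8), ('BC', 10), ('BC', 7)]
averages = average_times(location_times)
```

Because pairs is only iterated once, it could also be a generator that reads the data from disk, so the full list would not need to be loaded first.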

Alternatively, you could look into learning the basics of the pandas library.
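
For reference, the pandas route might look roughly like the sketch below. It assumes pandas is installed, and the column names link and time are illustrative choices, not something given in the question.

```python
import pandas as pd

location_times = [('AB', 2), ('BC', 5), ('AB', 8), ('BC', 10), ('BC', 7)]

# Column names "link" and "time" are assumptions made for this example.
df = pd.DataFrame(location_times, columns=['link', 'time'])

# Group the rows by link and take the mean time within each group.
mean_times = df.groupby('link')['time'].mean()

# Convert the resulting Series to a plain dict keyed by link.
result = mean_times.to_dict()
```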


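You could try sorting the data once and then making a single pass over it to compute the averages. This requires a sort (which is extra work), but avoids appending a million items to lists (which is very slow).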
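
A minimal sketch of the sort-then-scan idea, assuming itertools.groupby is an acceptable way to do the single pass. The function name average_by_sorting is hypothetical, and this is one possible reading of the approach rather than the answerer's own code.

```python
from itertools import groupby
from operator import itemgetter

def average_by_sorting(pairs):
    """Sort by link, then average each run of equal links in one pass."""
    # sorted() makes a copy; calling pairs.sort(key=...) instead would sort
    # in place and avoid duplicating a very large list.
    ordered = sorted(pairs, key=itemgetter(0))
    result = []
    for link, group in groupby(ordered, key=itemgetter(0)):
        total = 0
        count = 0
        for _, time in group:
            total += time
            count += 1
        result.append((link, total / count))
    return result

raw_travel_times = [('AB', 2), ('BC', 5), ('AB', 8), ('BC', 10), ('BC', 7)]
sorted_averages = average_by_sorting(raw_travel_times)
```

groupby only merges adjacent items with equal keys, which is why the sort has to come first.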

Credit for pointing out statistics.mean, which I did not know about.
With data as massive and homogeneous as yours, it might be worth switching to numpy. You could see huge performance improvements, depending on what you do with the data.

@AndrasDeak I would suggest pandas is better suited to this task.

@Denziloe Agreed, I did not read everything through on my first response.