The simplest way to calculate an average in Python
I have data in a 3D dictionary that looks like this:
movieid, date,customer_id,views
0, (2011,12,22), 0, 22
0, (2011,12,22), 1, 2
0, (2011,12,22), 2, 12
.....
0, (2011,12,22), 7, 2
0, (2011,12,23), 0, 123
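...
So basically, the data represents the number of times a movie was watched per day, per customer (there are only 8 customers).
Now I want to calculate the average number of times each customer watched a movie.
So basically: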
movie_id,customer_id, avg_views
0, 0, 33.2
0, 1 , 22.3
and so on
What is the Pythonic way to solve this?
Thanks
Edit:
from collections import defaultdict
import datetime

data = defaultdict(lambda: defaultdict(dict))
date = datetime.datetime(2011, 1, 22)
data[0][date][0] = 22
print(data)
defaultdict(<function <lambda> at 0x00000000022F7CF8>,
{0: defaultdict(<type 'dict'>,
{datetime.datetime(2011, 1, 22, 0, 0): {0: 22}}))
Note: customer 1 did not watch the movie with id 0 on January 23.
So the answer is now:
movie_id,customer_id,avg_views
0 , 0 , (22+44)/2
0, 1, (23)/1
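sum makes this simple. In my original version I used dict.keys() a lot, but iterating over a dictionary gives you its keys by default.
This function computes a single row of the result: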
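def average_daily_views(movie_id, customer_id, data):
    # Only count the days on which this customer has an entry for the movie.
    daily_values = [views[customer_id] for views in data[movie_id].values()
                    if customer_id in views]
    # float() avoids integer division under Python 2.
    return sum(daily_values) / float(len(daily_values))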
Then you can loop over it and get the result in whatever form you want. Perhaps:
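def get_averages(data):
    # For each movie, gather every customer seen on any date, then average.
    return [(movie, customer, average_daily_views(movie, customer, data))
            for movie in data
            for customer in sorted({c for views in data[movie].values() for c in views})]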
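As a quick check of this approach against the example in the question's edit, here is a minimal self-contained sketch. The sample values (22 and 44 for customer 0, 23 for customer 1) come from the question; the helper name `average_views` and the use of plain nested dicts are assumptions made for illustration. It skips days on which a customer has no entry, so customer 1 is averaged over a single day.

```python
import datetime

# Sample data in the question's layout: data[movie_id][date][customer_id] = views
data = {
    0: {
        datetime.datetime(2011, 1, 22): {0: 22, 1: 23},
        datetime.datetime(2011, 1, 23): {0: 44},  # customer 1 did not watch
    }
}

def average_views(data, movie_id, customer_id):
    """Average daily views, counting only days with an entry for the customer."""
    daily = [views[customer_id]
             for views in data[movie_id].values()
             if customer_id in views]
    # Return None instead of dividing by zero when there is nothing to average.
    return sum(daily) / float(len(daily)) if daily else None

for customer_id in (0, 1):
    print(0, customer_id, average_views(data, 0, customer_id))
```

Returning None for a customer with no entries is one possible choice; raising an error or returning 0 may suit other uses better.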
My version:
from functools import reduce

pool = [
    (0, (2011,12,22), 0, 22),
    (0, (2011,12,22), 1, 2),
    (0, (2011,12,22), 2, 12),
    (0, (2011,12,22), 7, 2),
    (0, (2011,12,23), 0, 123),
]

def calc(memo, row):
    # memo maps customer_id -> (number of rows, total views)
    if row[2] in memo:
        num, value = memo[row[2]]
    else:
        num, value = 0, 0
    memo[row[2]] = (num + 1, value + row[3])
    return memo

# dict with count and sum per customer
v = reduce(calc, pool, {})
# calculate the average per customer (float() avoids integer division)
avg = map(lambda x: (x[0], x[1][1] / float(x[1][0])), v.items())
print(dict(avg))
Here avg is a dictionary with key = customer_id and value = average of views.
I think you should restructure your data a little so that it serves your purpose better:
import collections

# restructured_data[customer][movie][date] = count
restructured_data = collections.defaultdict(
    lambda: collections.defaultdict(lambda: collections.defaultdict(int)))
for movie in data:
    for date in data[movie]:
        for customer, count in data[movie][date].items():
            restructured_data[customer][movie][date] += count

averages = collections.defaultdict(dict)
for customer in restructured_data:
    for movie in restructured_data[customer]:
        views = restructured_data[customer][movie]
        # Average over the days on which this customer watched this movie.
        avg = sum(views.values()) / float(len(views))
        averages[movie][customer] = avg

for movie in averages:
    for customer, avg in averages[movie].items():
        print("%d, %d, %f" % (movie, customer, avg))
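Hope this helps.
Please post (at least one entry of) the 3D dictionary that holds this data. It would also help if you could show us what you want the result to look like…
Could you format your defaultdict so that it is easy to read? Use pprint.pprint if needed.
That is a fairly complicated defaultdict. Have you considered using Numpy?
Actually, I think you should make it data[customer_id][movie_id][date] = count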
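To illustrate the customer-first layout suggested in the last comment, here is a small sketch. It assumes the original data[movie_id][date][customer_id] nesting from the question, and the names `regroup`, `by_customer` and `averages` are made up for this example. Once the counts are keyed by customer and movie first, each average is a sum over one inner dictionary.

```python
from collections import defaultdict
import datetime

def regroup(data):
    """Turn data[movie][date][customer] into by_customer[customer][movie][date]."""
    by_customer = defaultdict(lambda: defaultdict(dict))
    for movie, per_date in data.items():
        for date, per_customer in per_date.items():
            for customer, count in per_customer.items():
                by_customer[customer][movie][date] = count
    return by_customer

def averages(by_customer):
    """Return {(movie, customer): average views over the days with an entry}."""
    return {
        (movie, customer): sum(per_date.values()) / float(len(per_date))
        for customer, per_movie in by_customer.items()
        for movie, per_date in per_movie.items()
    }

# Sample values taken from the question's edit.
data = {0: {datetime.datetime(2011, 1, 22): {0: 22, 1: 23},
            datetime.datetime(2011, 1, 23): {0: 44}}}
result = averages(regroup(data))
for (movie, customer), avg in sorted(result.items()):
    print("%d, %d, %.1f" % (movie, customer, avg))
```

Days on which a customer has no entry are simply absent from the inner dictionary, so they do not count toward the average.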