Trying to find the maximum value for each group in a Python dictionary


I have a CSV with 10+ columns that is grouped by an index number. For example:

index othercolumn othercolumn2 sample hits othercolumn3
1     cccc        bbbb         dog    4    aaaa   
1     cccc        bbbb         cat    1    aaaa   
1     cccc        bbbb         cat    2    aaaa   

2     cccc        bbbb         rat    1    aaaa   
2     cccc        bbbb         dog    1    aaaa   

3     cccc        bbbb         bird   1    aaaa   
3     cccc        bbbb         rat    42   aaaa   
3     cccc        bbbb         cat    3    aaaa  
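Is it possible to find the highest number of hits for each "group" (by index)? I'm not sure what to do when there is no single highest value, as with index 2 where two samples tie, but that doesn't matter for now. The desired output would look something like this: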

For index 1, the highest hits are 4 for sample dog.
For index 2, the highest hits are 1 for sample rat.       
For index 3, the highest hits are 42 for sample rat.
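So far I have used a defaultdict to build a dictionary of lists, one list per group (index). But I can't seem to get the highest hits out of it and print them cleanly. This is what I have so far: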

from collections import defaultdict
import csv

groups = defaultdict(list)

with open('data.csv') as inputfile:
    reader = csv.reader(inputfile)
    next(reader, None)  # skip the header row

    for row in reader:
        groups[row[1]].append([row[17], row[18]])  # row[1] is the index, row[17] is my sample column, row[18] is the hits column
        
print(groups)

Any help is much appreciated.
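The defaultdict approach from the question can be finished with the built-in max() and a key function. The following is a minimal sketch, not a drop-in solution: SAMPLE_CSV is a hypothetical stand-in for data.csv reduced to three named columns, and max_hits_per_group is an illustrative helper name. Note that the csv module yields strings, so hits has to be converted to int before comparing (as strings, "10" sorts below "9").

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for data.csv, reduced to the three relevant columns.
SAMPLE_CSV = """index,sample,hits
1,dog,4
1,cat,1
1,cat,2
2,rat,1
2,dog,1
3,bird,1
3,rat,42
3,cat,3
"""


def max_hits_per_group(lines):
    """Return {index: (sample, hits)} for the row with the most hits per index."""
    groups = defaultdict(list)
    for row in csv.DictReader(lines):
        # csv yields strings, so convert hits to int before comparing
        groups[row["index"]].append((row["sample"], int(row["hits"])))
    # max() returns the first maximal element, so a tie keeps the earliest row
    return {idx: max(rows, key=lambda r: r[1]) for idx, rows in groups.items()}


best = max_hits_per_group(io.StringIO(SAMPLE_CSV))
for idx, (sample, hits) in best.items():
    print(f"For index {idx}, the highest hits are {hits} for sample {sample}.")
```

With the real file, the io.StringIO object would be replaced by the open file handle, and the column names by whatever the header row actually contains.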

Hope this helps, even though it isn't pretty.

import pandas as pd
# assign data of lists.
data = {'idx': ['1', '1', '1', '2',"2","2","2","3","3","3","8","8"],'sample': ['dog', 'cat', 'dog', 'cat',"cat","dog","fish","dog","fish","ostrich","dog","cat"], 'hits': [1, 2, 3, 1,1,2,2,2,3,42,0,55]}
# Create DataFrame.
df = pd.DataFrame(data)
df["key"] = df["idx"].astype(str)+df["hits"].astype(str)
# Print the output.
df.head(10)

idx     sample  hits    key
0   1   dog     1   11
1   1   cat     2   12
2   1   dog     3   13
3   2   cat     1   21
4   2   cat     1   21
5   2   dog     2   22
6   2   fish    2   22
7   3   dog     2   32
8   3   fish    3   33
9   3   ostrich     42  342


# select the hits column explicitly; max("hits") would pass "hits" as the numeric_only flag
max_idx = df.groupby("idx")[["hits"]].max()
max_idx
    hits
idx     
1   3
2   2
3   42
8   55

max_idx.reset_index(level=0, inplace=True)
max_idx["key"] = max_idx["idx"].astype(str)+max_idx["hits"].astype(str)
df_max = df.loc[(df["hits"].isin(max_idx["hits"])) & (df["idx"].isin(max_idx["idx"]))& (df["key"].isin(max_idx["key"]))]
df_max
    idx     sample  hits    key
2   1   dog     3   13
5   2   dog     2   22
6   2   fish    2   22
9   3   ostrich     42  342
11  8   cat     55  855

for i, j, k in zip(df_max["idx"],df_max["hits"],df_max["sample"]):
    print("For index ", i," the highest hits are ", j," for sample", k,"")
For index  1  the highest hits are  3  for sample dog 
For index  2  the highest hits are  2  for sample dog 
For index  2  the highest hits are  2  for sample fish 
For index  3  the highest hits are  42  for sample ostrich 
For index  8  the highest hits are  55  for sample cat 
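The string key built by concatenating idx and hits can collide (idx "1" with hits 12 and idx "11" with hits 2 both give "112"). A shorter way to get the same rows, ties included, is to compare each row's hits against its group maximum. This is only a sketch that reuses the sample data from this answer; group_max is an illustrative name.

```python
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame({
    "idx": ["1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "8", "8"],
    "sample": ["dog", "cat", "dog", "cat", "cat", "dog", "fish",
               "dog", "fish", "ostrich", "dog", "cat"],
    "hits": [1, 2, 3, 1, 1, 2, 2, 2, 3, 42, 0, 55],
})

# transform("max") broadcasts each group's maximum back onto its rows,
# so the comparison keeps every row that reaches the maximum (ties included)
group_max = df.groupby("idx")["hits"].transform("max")
df_max = df[df["hits"] == group_max]

for i, j, k in zip(df_max["idx"], df_max["hits"], df_max["sample"]):
    print(f"For index {i}, the highest hits are {j} for sample {k}.")
```

Because no helper column is needed, the original DataFrame stays unchanged.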

You can use the pandas groupby method together with a max function to compute the desired output.

Hope the code below helps.

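import pandas as pd

data = pd.read_csv('data.csv')
# groupby(...).max() takes each column's maximum independently, so the sample
# could come from a different row than the hits; idxmax keeps the row intact
best = data.loc[data.groupby('index')['hits'].idxmax()]
for idx, sample, hits in zip(best['index'], best['sample'], best['hits']):
    print('For index '+str(idx)+', the highest hits are '+str(hits)+' for sample '+sample+'.')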
Output:

For index 1, the highest hits are 4 for sample dog.
For index 2, the highest hits are 1 for sample rat.
For index 3, the highest hits are 42 for sample rat.

This is very clear and simple, but for some reason I get an AssertionError with no further explanation. Do you know why?

Please share the error.