Trying to find the maximum value for each group in a Python dictionary
I have a CSV with 10+ columns that is grouped by an index number. For example,
index othercolumn othercolumn2 sample hits othercolumn3
1 cccc bbbb dog 4 aaaa
1 cccc bbbb cat 1 aaaa
1 cccc bbbb cat 2 aaaa
2 cccc bbbb rat 1 aaaa
2 cccc bbbb dog 1 aaaa
3 cccc bbbb bird 1 aaaa
3 cccc bbbb rat 42 aaaa
3 cccc bbbb cat 3 aaaa
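Is it possible to find the highest number of hits for each "group" (by index)? I'm not sure what to do when there is no single highest hit count, as with index 2, but that doesn't matter for now. For example, the desired output would look something like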
For index 1, the highest hits are 4 for sample dog.
For index 2, the highest hits are 1 for sample rat.
For index 3, the highest hits are 42 for sample rat.
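So far I've used defaultdict to build a dictionary of lists, one list per group (index). But I can't seem to get the highest hits out of it and print them cleanly. This is what I have so far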
from collections import defaultdict
import csv
groups = defaultdict(list)
with open('data.csv') as inputfile:
    reader = csv.reader(inputfile)
    next(reader, None)  # skip the header row
    for row in reader:
        # row[1] is the index, row[17] is my sample column, row[18] is the hits column
        groups[row[1]].append([row[17], row[18]])
print(groups)
Any help is much appreciated.
Hope this helps you, even though it isn't pretty.
import pandas as pd
# assign data of lists.
data = {'idx': ['1', '1', '1', '2',"2","2","2","3","3","3","8","8"],'sample': ['dog', 'cat', 'dog', 'cat',"cat","dog","fish","dog","fish","ostrich","dog","cat"], 'hits': [1, 2, 3, 1,1,2,2,2,3,42,0,55]}
# Create DataFrame.
df = pd.DataFrame(data)
df["key"] = df["idx"].astype(str)+df["hits"].astype(str)
# Print the output.
df.head(10)
idx sample hits key
0 1 dog 1 11
1 1 cat 2 12
2 1 dog 3 13
3 2 cat 1 21
4 2 cat 1 21
5 2 dog 2 22
6 2 fish 2 22
7 3 dog 2 32
8 3 fish 3 33
9 3 ostrich 42 342
# select the hits column explicitly; max() takes no column-name argument
max_idx = df.groupby("idx")["hits"].max()
max_idx = pd.DataFrame(max_idx)
max_idx
hits
idx
1 3
2 2
3 42
8 55
max_idx.reset_index(level=0, inplace=True)
max_idx["key"] = max_idx["idx"].astype(str)+max_idx["hits"].astype(str)
df_max = df.loc[(df["hits"].isin(max_idx["hits"])) & (df["idx"].isin(max_idx["idx"]))& (df["key"].isin(max_idx["key"]))]
df_max
idx sample hits key
2 1 dog 3 13
5 2 dog 2 22
6 2 fish 2 22
9 3 ostrich 42 342
11 8 cat 55 855
for i, j, k in zip(df_max["idx"],df_max["hits"],df_max["sample"]):
    print(f"For index {i} the highest hits are {j} for sample {k}")
For index 1 the highest hits are 3 for sample dog
For index 2 the highest hits are 2 for sample dog
For index 2 the highest hits are 2 for sample fish
For index 3 the highest hits are 42 for sample ostrich
For index 8 the highest hits are 55 for sample cat
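The same result, ties included, can probably be reached without the concatenated string key. The following is only a sketch on a small made-up frame that reuses the column names from the answer above: it broadcasts each group's maximum back onto its rows with `transform("max")` and keeps the rows that equal it.

```python
import pandas as pd

# Small made-up frame using the same column names as the answer above
df = pd.DataFrame({
    "idx": ["1", "1", "2", "2", "2", "3"],
    "sample": ["dog", "cat", "cat", "dog", "fish", "ostrich"],
    "hits": [3, 2, 1, 2, 2, 42],
})

# Broadcast each group's maximum onto every row of that group,
# then keep only the rows whose hits equal that maximum (ties survive)
group_max = df.groupby("idx")["hits"].transform("max")
df_max = df[df["hits"] == group_max]

lines = [
    f"For index {i} the highest hits are {h} for sample {s}"
    for i, h, s in zip(df_max["idx"], df_max["hits"], df_max["sample"])
]
print("\n".join(lines))
```

Comparing against the per-group maximum directly also avoids a weakness of string keys: an idx of "1" with 13 hits and an idx of "11" with 3 hits would both produce the key "113".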
You can use the pandas groupby method together with a max-style function to compute the desired output. Hope the code below helps.
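import pandas as pd
data = pd.read_csv('data.csv')
# idxmax gives the row label of the largest 'hits' in each group, so the sample
# and the hits come from the same row (a plain max() takes each column's maximum separately)
best = data.loc[data.groupby("index")["hits"].idxmax()]
for idx, sample, hits in zip(best["index"], best["sample"], best["hits"]):
    print(f'For index {idx}, the highest hits are {hits} for sample {sample}.')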
Output:
For index 1, the highest hits are 4 for sample dog.
For index 2, the highest hits are 1 for sample rat.
For index 3, the highest hits are 42 for sample rat.
This is very clear and simple, but for some reason I get an assertion error with no further explanation. Do you know why?
Please share the error.
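For anyone who wants to stay with the defaultdict approach from the question, here is a minimal sketch without pandas. It assumes a comma-separated file with a header row and, for brevity, uses an inline stand-in containing only the three relevant columns (the `csv_text` string and the column names are illustrative, not the asker's real file). The hits are converted to int, because the csv module yields strings and "42" would otherwise compare as smaller than "5".

```python
import csv
import io
from collections import defaultdict

# Stand-in for data.csv: assumed comma-separated, only the relevant columns
csv_text = """index,sample,hits
1,dog,4
1,cat,1
1,cat,2
2,rat,1
2,dog,1
3,bird,1
3,rat,42
3,cat,3
"""

groups = defaultdict(list)
reader = csv.DictReader(io.StringIO(csv_text))
for row in reader:
    # csv yields strings; convert hits so the comparison is numeric
    groups[row["index"]].append((row["sample"], int(row["hits"])))

results = []
for index, pairs in groups.items():
    # max() returns the first maximal pair, so the earliest row wins on ties
    sample, hits = max(pairs, key=lambda pair: pair[1])
    results.append(f"For index {index}, the highest hits are {hits} for sample {sample}.")

print("\n".join(results))
```

With a real file, `io.StringIO(csv_text)` would be replaced by the open file object. To report every tied sample instead of just the first, one option is to compute the maximum hits per group first and then collect all pairs that match it.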