Python Pandas: groupby to list
I have data like the following:
id value time
1 5 2000
1 6 2000
1 7 2000
1 5 2001
2 3 2000
2 3 2001
2 4 2005
2 5 2005
3 3 2000
3 6 2005
My end goal is to get the data into lists like the following:
[[5,6,7],[5]] (this is for id 1 grouped by the id and year)
[[3],[3],[4,5]] (this is for id 2 grouped by the id and year)
[[3],[6]] (same logic as above)
I grouped the data with df.groupby(['id','year']), but after that I could not access the groups and get the data in the format above.

You can use apply(list):
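As a rough sketch of what apply(list) gives on the question's data (assuming the columns are named id, value and time, as in the table shown in the question):

```python
import pandas as pd

# Data from the question; column names are assumed to match its table.
df = pd.DataFrame({
    "id":    [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    "value": [5, 6, 7, 5, 3, 3, 4, 5, 3, 6],
    "time":  [2000, 2000, 2000, 2001, 2000, 2001, 2005, 2005, 2000, 2005],
})

# One list of values per (id, time) pair.
# The result is a Series indexed by a two-level (id, time) index.
lists = df.groupby(["id", "time"])["value"].apply(list)
print(lists)

# A single group can be looked up by its (id, time) key.
print(lists.loc[(1, 2000)])
```

The result stays a pandas Series, so the usual indexing and filtering still work on it.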
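If you really want it in exactly the format you showed, you can group by id and apply list again, but this is not efficient, and this format is probably harder to work with: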
>>> df.groupby(['id','time'])['value'].apply(list).groupby('id').apply(list).tolist()
[[[5, 6, 7], [5]], [[3], [3], [4, 5]], [[3], [6]]]
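If the nested lists turn out to be awkward because the id and time labels are lost, one possible alternative is a nested dict keyed by id and then by time. This is only a sketch of my own suggestion, not something the question asked for, and it assumes the same column names as above:

```python
import pandas as pd

# Data from the question; column names are assumed to match its table.
df = pd.DataFrame({
    "id":    [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    "value": [5, 6, 7, 5, 3, 3, 4, 5, 3, 6],
    "time":  [2000, 2000, 2000, 2001, 2000, 2001, 2005, 2005, 2000, 2005],
})

lists = df.groupby(["id", "time"])["value"].apply(list)

# Build {id: {time: [values]}} so the labels stay attached to each list.
nested = {}
for (id_, time), values in lists.items():
    nested.setdefault(id_, {})[time] = values

print(nested[2][2005])
```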
You can do the following:
import pandas as pd

data = [[1, 5, 2000],
        [1, 6, 2000],
        [1, 7, 2000],
        [1, 5, 2001],
        [2, 3, 2000],
        [2, 3, 2001],
        [2, 4, 2005],
        [2, 5, 2005],
        [3, 3, 2000],
        [3, 6, 2005]]
df = pd.DataFrame(data=data, columns=['id', 'value', 'year'])

# For each id, collect one list of values per year.
result = []
for _, group in df.groupby('id'):
    result.append([g['value'].tolist() for _, g in group.groupby('year')])

for e in result:
    print(e)
Output:
[[5, 6, 7], [5]]
[[3], [3], [4, 5]]
[[3], [6]]
If you want to compute lists for multiple columns, you can do the following:
import pandas as pd

df = pd.DataFrame(
    {'A': [1, 1, 2, 2, 2, 2, 3],
     'B': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
     'C': ['x', 'y', 'z', 'x', 'y', 'z', 'x']})

# Aggregate each listed column into a Python list per group of A.
df.groupby('A').agg({'B': lambda x: list(x), 'C': lambda x: list(x)})
This computes the lists for both B and C at once:
B C
A
1 [a, b] [x, y]
2 [c, d, e, f] [z, x, y, z]
3 [g] [x]
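Since both columns use the same aggregation here, the lambdas can probably be dropped by passing list directly to agg. A minimal sketch, assuming a reasonably recent pandas version (by_dict and by_name are just illustrative variable names):

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1, 1, 2, 2, 2, 2, 3],
     'B': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
     'C': ['x', 'y', 'z', 'x', 'y', 'z', 'x']})

# Per-column dict of lambdas, as in the answer above.
by_dict = df.groupby('A').agg({'B': lambda x: list(x), 'C': lambda x: list(x)})

# Shorthand: apply list to every non-grouping column.
by_name = df.groupby('A').agg(list)

print(by_name)
# Compare the two results column by column.
print(by_name['B'].tolist() == by_dict['B'].tolist())
print(by_name['C'].tolist() == by_dict['C'].tolist())
```

The dict form is still the one to use when only some columns are wanted, or when different columns need different aggregations.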
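Every time I notice myself typing .apply(...), I think "you are stepping outside of pandas". Another reason is the saying that "apply() is slow and not vectorized". Still, I have to start getting used to it. Sometimes it is the best way.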