Python pandas groupby operation


python pandas


I am reading about the pandas groupby function in a book on data analysis with Python. The author gives the following example:

In [13]: df = DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
....: 'key2' : ['one', 'two', 'one', 'two', 'one'],
....: 'data1' : np.random.randn(5),
....: 'data2' : np.random.randn(5)})

In [14]: df
Out[14]:
    data1       data2   key1 key2
0   -0.204708 1.393406  a    one
1   0.478943  0.092908  a    two
2   -0.519439 0.281746  b    one
3   -0.555730 0.769023  b    two
4   1.965781  1.246435  a    one


In [21]: states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
In [22]: years = np.array([2005, 2005, 2006, 2005, 2006])

In [23]: df['data1'].groupby([states, years]).mean()
Out[23]:
California 2005 0.478943
           2006 -0.519439
Ohio       2005 -0.380219
           2006 1.965781

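My question is how df['data1'] gets grouped, given that df has no connection to states and years. I am not asking how the output values are computed. Please explain.

When performing a groupby, you can pass: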

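by : mapping, function, str, or iterable

Used to determine the groups for the groupby. If by is a function, it is called on each value of the object's index. If a dict or Series is passed, the Series or dict values are used to determine the groups (the Series' values are first aligned; see the .align() method). If an ndarray is passed, its values are used as-is to determine the groups. A str or list of strs may be passed to group by the columns in self.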
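The dict case in the description above has no example of its own, so here is a minimal sketch. The frame and the `mapping` dict are made up for illustration; the dict keys are index labels of the frame and the dict values are the group names.

```python
import pandas as pd

# Hypothetical data with fixed values so the result is easy to check by hand.
df = pd.DataFrame({"data1": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Keys are index labels of df, values are the group each label belongs to.
mapping = {0: "first", 1: "first", 2: "second", 3: "second", 4: "first"}

# Rows 0, 1 and 4 fall into "first"; rows 2 and 3 fall into "second".
totals = df["data1"].groupby(mapping).sum()
print(totals)
```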

iterable
In this case, you passed an iterable or array. When you use an array, it should have the same length as the DataFrame itself. Otherwise:

# Doesn't throw an error because the length of `df` is 5
df.groupby(list(range(5)))

# This does throw an error
df.groupby(list(range(6)))

KeyError: 0
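Applied to the case in the question, this means the arrays are matched to the rows purely by position. The sketch below uses fixed numbers in place of np.random.randn so the result is reproducible. The `labeled` frame and its `state` and `year` columns are illustrative names, not part of the original example.

```python
import numpy as np
import pandas as pd

# Fixed values instead of np.random.randn so the result is reproducible.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
})
states = np.array(["Ohio", "California", "California", "Ohio", "Ohio"])
years = np.array([2005, 2005, 2006, 2005, 2006])

# Element i of each array labels row i of df, purely by position.
by_position = df["data1"].groupby([states, years]).mean()

# Equivalent view: attach the arrays as columns first, then group by name.
labeled = df.assign(state=states, year=years)
by_column = labeled.groupby(["state", "year"])["data1"].mean()

print(by_position)
```

Rows 0 and 3 are both labeled ("Ohio", 2005), so their data1 values are averaged together even though df itself contains no state or year information.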

pd.Series
Consider the Series s, which we define with the same length and the same index as df:

s = pd.Series(list(range(len(df))), df.index)
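Then, when we perform the groupby, the indices are aligned. For a pd.Series we do not have to worry about the length, because pandas does the alignment for us: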

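# Also works: the extra entry (index label len(df)) matches no row and is ignored.
# Series.append was removed in pandas 2.0, so pd.concat is used here instead.
df.groupby(pd.concat([s, pd.Series(1, [len(df)])]))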
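To make the alignment visible, here is a small sketch with made-up data. The `labels` Series is listed in reverse order and carries one extra index label that does not exist in the frame. The group names "x", "y" and "z" are arbitrary.

```python
import pandas as pd

# Hypothetical data with fixed values.
df = pd.DataFrame({"data1": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Group labels keyed by index label, deliberately in reverse order and with
# an extra label (99) that does not exist in df.
labels = pd.Series(
    ["x", "x", "y", "y", "y", "z"],
    index=[4, 3, 2, 1, 0, 99],
)

# pandas aligns `labels` to df.index before grouping, so row 4 gets "x",
# row 0 gets "y", and the entry for label 99 is dropped.
sums = df["data1"].groupby(labels).sum()
print(sums)
```

The grouping follows the index labels rather than the order in which the Series was written, which is the difference from passing a plain array.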

str

If a string is passed to groupby, pandas looks for a column with that name in the DataFrame being grouped.
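A minimal sketch of grouping by column name, reusing the key1 and key2 columns from the question but with fixed numbers in data1 so the sums can be checked by hand:

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],  # fixed values for illustration
})

# A single string names one column of df to group by.
by_key1 = df.groupby("key1")["data1"].sum()

# A list of strings groups by several columns at once.
by_both = df.groupby(["key1", "key2"])["data1"].sum()

print(by_key1)
print(by_both)
```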

function

If a function is passed, pandas maps it over the index of df and uses the resulting iterable for the groupby.
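As a sketch of the function case, the hypothetical helper `parity` below is called once per index label. Grouping by the function gives the same totals as mapping the function over the index first and grouping by the result.

```python
import pandas as pd

# Hypothetical data with the default integer index 0..4.
df = pd.DataFrame({"data1": [1.0, 2.0, 3.0, 4.0, 5.0]})

def parity(label):
    """Return a group name for one index label."""
    return "even" if label % 2 == 0 else "odd"

# pandas calls parity() on each index label and groups by the results.
by_func = df["data1"].groupby(parity).sum()

# The same grouping, with the mapping step written out explicitly.
by_mapped = df["data1"].groupby(df.index.map(parity)).sum()

print(by_func)
```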
This is due to data alignment in pandas. Pandas prefers to perform most operations based on the index.