Python: merging groupby results directly back into a DataFrame


Suppose I have the following data:

import pandas as pd

df = pd.DataFrame(data=[[1, 1, 10], [1, 2, 20], [1, 3, 50], [2, 1, 15],
                        [2, 2, 20], [2, 3, 30], [3, 1, 40], [3, 2, 70]],
                  columns=['id1', 'id2', 'x'])


   id1  id2   x
0    1    1  10
1    1    2  20
2    1    3  50
3    2    1  15
4    2    2  20
5    2    3  30
6    3    1  40
7    3    2  70
The DataFrame is sorted along the two IDs. Suppose I want to know, for each id1 group, the x value of the first observation. The result would be

id1  id2   x  first_x
  1    1  10       10
  1    2  20       10
  1    3  50       10
  2    1  15       15
  2    2  20       15
  2    3  30       15
  3    1  40       40
  3    2  70       40
How can I achieve this kind of "broadcasting"? Ideally, the new column is filled in for every observation.

My line of thinking was

df['first_x'] = df.groupby(['id1'])[0]
Something like this:

df = pd.DataFrame(data = [[1,1,10],[1,2,20],[1,3,50],[2,1,15],[2,2,20],[2,3,30],[3,1,40],[3,2,70]],columns=['id1','id2','x'])
df = df.join(df.groupby(['id1'])['x'].first(), on='id1', how='left', lsuffix='', rsuffix='_first')

An intermediate step is needed because the whole DataFrame has to be considered when building the value for each row.
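As an aside, the same result can be obtained with merge instead of join; this is a minimal sketch (the helper frame `firsts` and the column name `first_x` are choices made here, not from the original answer), which avoids the lsuffix/rsuffix bookkeeping by renaming the column before merging:

```python
import pandas as pd

df = pd.DataFrame(data=[[1, 1, 10], [1, 2, 20], [1, 3, 50], [2, 1, 15],
                        [2, 2, 20], [2, 3, 30], [3, 1, 40], [3, 2, 70]],
                  columns=['id1', 'id2', 'x'])

# Build the per-group first values as a small frame, rename, then merge back.
firsts = (df.groupby('id1', as_index=False)['x'].first()
            .rename(columns={'x': 'first_x'}))
df = df.merge(firsts, on='id1', how='left')
```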

The following uses groupby to first collect your first values, then uses them as a mapping to add the new column:

import pandas as pd

df = pd.DataFrame(data = [[1,1,10],[1,2,20],[1,3,50],[2,1,15],[2,2,20],[2,3,30],[3,1,40],[3,2,70]],columns=['id1','id2','x'])

first_xs = df.groupby(['id1']).first().to_dict()['x']

df['first_x'] = df['id1'].map(lambda id: first_xs[id])
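The lambda in the last line above isn't strictly needed: Series.map also accepts a plain dict and does the lookup itself. A minimal variant of the same approach:

```python
import pandas as pd

df = pd.DataFrame(data=[[1, 1, 10], [1, 2, 20], [1, 3, 50], [2, 1, 15],
                        [2, 2, 20], [2, 3, 30], [3, 1, 40], [3, 2, 70]],
                  columns=['id1', 'id2', 'x'])

# Dict mapping each id1 to the x of its first row, e.g. {1: 10, 2: 15, 3: 40}
first_xs = df.groupby('id1').first().to_dict()['x']

# map accepts the dict directly and looks each id1 up
df['first_x'] = df['id1'].map(first_xs)
```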
I think the simplest is transform:
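Spelled out, this is the same one-liner that the timings below benchmark; transform broadcasts each group's aggregate back to every row of that group:

```python
import pandas as pd

df = pd.DataFrame(data=[[1, 1, 10], [1, 2, 20], [1, 3, 50], [2, 1, 15],
                        [2, 2, 20], [2, 3, 30], [3, 1, 40], [3, 2, 70]],
                  columns=['id1', 'id2', 'x'])

# transform('first') returns a Series aligned with df's index,
# repeating each group's first x for every row in the group
df['first_x'] = df.groupby('id1')['x'].transform('first')
```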

Or map by a Series created with drop_duplicates:

The first is the shortest and fastest solution. Timings:

np.random.seed(123)
N = 1000000
L = list('abcde') 
df = pd.DataFrame({'id1': np.random.randint(10000,size=N),
                   'x':np.random.randint(10000,size=N)})
df = df.sort_values('id1').reset_index(drop=True)
print (df)

In [179]: %timeit df.join(df.groupby(['id1'])['x'].first(), on='id1', how='left', lsuffix='', rsuffix='_first')
10 loops, best of 3: 125 ms per loop

In [180]: %%timeit
     ...: first_xs = df.groupby(['id1']).first().to_dict()['x']
     ...: 
     ...: df['first_x'] = df['id1'].map(lambda id: first_xs[id])
     ...: 
1 loop, best of 3: 524 ms per loop

In [181]: %timeit df['first_x'] = df.groupby('id1')['x'].transform('first')
10 loops, best of 3: 54.9 ms per loop

In [182]: %timeit df['first_x'] = df['id1'].map(df.drop_duplicates('id1').set_index('id1')['x'])
10 loops, best of 3: 142 ms per loop

Comments: "You said min, but I mean 'x', right? I want the first value (which in my case happens to coincide with the min)." "Yes, no problem! I changed the groupby call to use first() instead. Thanks!" "Is there no way to avoid the
x
column that needs renaming? I have a huge DataFrame and can't keep renaming columns." "I've got what you need: just join with a pandas Series :)"
df['first_x'] = df['id1'].map(df.drop_duplicates('id1').set_index('id1')['x'])

print (df)
   id1  id2   x  first_x
0    1    1  10       10
1    1    2  20       10
2    1    3  50       10
3    2    1  15       15
4    2    2  20       15
5    2    3  30       15
6    3    1  40       40
7    3    2  70       40