Python pandas: apply a function that returns multiple values to the rows of a pandas DataFrame


I have a DataFrame with a time index and 3 columns containing the coordinates of a 3D vector:

                         x             y             z
ts
2014-05-15 10:38         0.120117      0.987305      0.116211
2014-05-15 10:39         0.117188      0.984375      0.122070
2014-05-15 10:40         0.119141      0.987305      0.119141
2014-05-15 10:41         0.116211      0.984375      0.120117
2014-05-15 10:42         0.119141      0.983398      0.118164

I would like to apply a transformation to each row that also returns a vector:

def myfunc(a, b, c):
    # do something that computes e, f and g
    return e, f, g

But if I do:

df.apply(myfunc, axis=1)

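I end up with a pandas Series whose elements are tuples. This is because apply takes the result of myfunc without unpacking it. How can I change myfunc so that I get a new df with 3 columns?

Edit:

All of the solutions below work. The Series solution allows column names, while the list solution seems to execute faster.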

def myfunc1(args):
    e=args[0] + 2*args[1]
    f=args[1]*args[2] +1
    g=args[2] + args[0] * args[1]
    return pd.Series([e,f,g], index=['a', 'b', 'c'])

def myfunc2(args):
    e=args[0] + 2*args[1]
    f=args[1]*args[2] +1
    g=args[2] + args[0] * args[1]
    return [e,f,g]

%timeit df.apply(myfunc1 ,axis=1)

100 loops, best of 3: 4.51 ms per loop

%timeit df.apply(myfunc2 ,axis=1)

100 loops, best of 3: 2.75 ms per loop

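Return a Series and apply will put the results into a DataFrame.

def myfunc(a, b, c):
    # do something that computes e, f and g
    return pd.Series([e, f, g])

This has the bonus that you can give a label to each of the resulting columns. If you return a DataFrame, it just inserts multiple rows for the group.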
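As a minimal runnable sketch of the Series approach, the snippet below fills in the placeholder body with the arithmetic used in the question's edit. The two-row sample frame and the output labels 'e', 'f', 'g' are assumptions chosen for illustration. Note that with axis=1 the function receives the whole row as a single Series, so it takes one argument rather than three.

```python
import pandas as pd

# Small sample frame mirroring the first two rows of the question's data
df = pd.DataFrame(
    {
        "x": [0.120117, 0.117188],
        "y": [0.987305, 0.984375],
        "z": [0.116211, 0.122070],
    },
    index=pd.to_datetime(["2014-05-15 10:38", "2014-05-15 10:39"]),
)
df.index.name = "ts"


def myfunc(row):
    # With axis=1, `row` is a Series indexed by the column labels
    e = row["x"] + 2 * row["y"]
    f = row["y"] * row["z"] + 1
    g = row["z"] + row["x"] * row["y"]
    # The index of the returned Series becomes the column names of the result
    return pd.Series([e, f, g], index=["e", "f", "g"])


result = df.apply(myfunc, axis=1)
print(result)
```

The result keeps the original time index, so it can be joined back onto df if the new columns are wanted alongside x, y and z.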

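I found a possible solution: change myfunc to return an np.array like this:

import numpy as np

def myfunc(a, b, c):
    # do something that computes e, f and g
    return np.array((e, f, g))

Is there a better solution?

Just return a list instead of a tuple.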

In [81]: df
Out[81]: 
                            x         y         z
ts                                               
2014-05-15 10:38:00  0.120117  0.987305  0.116211
2014-05-15 10:39:00  0.117188  0.984375  0.122070
2014-05-15 10:40:00  0.119141  0.987305  0.119141
2014-05-15 10:41:00  0.116211  0.984375  0.120117
2014-05-15 10:42:00  0.119141  0.983398  0.118164

[5 rows x 3 columns]

In [82]: def myfunc(args):
   ....:        e=args[0] + 2*args[1]
   ....:        f=args[1]*args[2] +1
   ....:        g=args[2] + args[0] * args[1]
   ....:        return [e,f,g]
   ....: 

In [83]: df.apply(myfunc ,axis=1)
Out[83]: 
                            x         y         z
ts                                               
2014-05-15 10:38:00  2.094727  1.114736  0.234803
2014-05-15 10:39:00  2.085938  1.120163  0.237427
2014-05-15 10:40:00  2.093751  1.117629  0.236770
2014-05-15 10:41:00  2.084961  1.118240  0.234512
2014-05-15 10:42:00  2.085937  1.116202  0.235327

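Based on the excellent answer by @U2EF1, I wrote a handy function that applies a specified tuple-returning function to a DataFrame field and expands the result back into the DataFrame.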

def apply_and_concat(dataframe, field, func, column_names):
    return pd.concat((
        dataframe,
        dataframe[field].apply(
            lambda cell: pd.Series(func(cell), index=column_names))), axis=1)

Usage:

df = pd.DataFrame([1, 2, 3], index=['a', 'b', 'c'], columns=['A'])
print(df)
   A
a  1
b  2
c  3

def func(x):
    return x*x, x*x*x

print(apply_and_concat(df, 'A', func, ['x^2', 'x^3']))

   A  x^2  x^3
a  1    1    1
b  2    4    8
c  3    9   27

Hope it helps someone.

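I tried returning a tuple (I was using functions like scipy.stats.pearsonr, which return that kind of structure), but apply returned a 1D Series instead of the DataFrame I expected. Creating a Series manually made performance worse, so I fixed it with result_type, as explained in the pandas documentation:

Returning a Series inside the function is similar to passing result_type='expand'. The resulting column names will be the Series index.

So you could edit your code this way: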

def myfunc(a, b, c):
    # do something
    return (e, f, g)

df.apply(myfunc, axis=1,  result_type='expand')

Pandas 1.0.5 has DataFrame.apply with the parameter result_type, which helps here. From the docs:

These only act when axis=1 (columns): ‘expand’ : list-like results will be turned into columns. ‘reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’. ‘broadcast’ : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.
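To make the quoted options concrete, here is a small sketch comparing them on a function that returns a tuple. The toy frame, the helper name as_tuple and the labels 'e', 'f', 'g' are assumptions for illustration, and the behaviour shown is what recent pandas versions are expected to do.

```python
import pandas as pd

# Toy frame; the values are chosen so the results are easy to check by hand
df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0], "z": [5.0, 6.0]})


def as_tuple(row):
    # Hypothetical transformation returning three values per row
    return (row["x"] + row["y"], row["y"] * row["z"], row["z"] - row["x"])


# Default: the tuples are not unpacked, so the result is a Series of tuples
default = df.apply(as_tuple, axis=1)

# 'expand': each element of the tuple becomes its own column (labelled 0, 1, 2)
expanded = df.apply(as_tuple, axis=1, result_type="expand")
expanded.columns = ["e", "f", "g"]

# 'reduce': explicitly keep one object per row, again a Series
reduced = df.apply(as_tuple, axis=1, result_type="reduce")

# 'broadcast': results are written back into the original shape and labels
broadcast = df.apply(as_tuple, axis=1, result_type="broadcast")

print(default)
print(expanded)
print(broadcast)
```

With 'expand' the new columns get integer labels, so they are renamed afterwards; returning a Series from the function remains the way to name them directly.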

Some of the other answers contain mistakes, so I have summarized them below.

Prepare the dataset. The pandas version used is 1.1.5.

import numpy as np
import pandas as pd
import timeit

# check pandas version
print(pd.__version__)
# 1.1.5

# prepare DataFrame
df = pd.DataFrame({
    'x': [0.120117, 0.117188, 0.119141, 0.116211, 0.119141],
    'y': [0.987305, 0.984375, 0.987305, 0.984375, 0.983398],
    'z': [0.116211, 0.122070, 0.119141, 0.120117, 0.118164]},
    index=[
        '2014-05-15 10:38',
        '2014-05-15 10:39',
        '2014-05-15 10:40',
        '2014-05-15 10:41',
        '2014-05-15 10:42'],
    columns=['x', 'y', 'z'])
df.index.name = 'ts'
#                          x         y         z
# ts
# 2014-05-15 10:38  0.120117  0.987305  0.116211
# 2014-05-15 10:39  0.117188  0.984375  0.122070
# 2014-05-15 10:40  0.119141  0.987305  0.119141
# 2014-05-15 10:41  0.116211  0.984375  0.120117
# 2014-05-15 10:42  0.119141  0.983398  0.118164

Solution 01. Return pd.Series in the apply function

def myfunc1(args):
    e = args[0] + 2*args[1]
    f = args[1]*args[2] + 1
    g = args[2] + args[0]*args[1]
    return pd.Series([e, f, g])

df[['e', 'f', 'g']] = df.apply(myfunc1, axis=1)
#                          x         y         z         e         f         g
# ts
# 2014-05-15 10:38  0.120117  0.987305  0.116211  2.094727  1.114736  0.234803
# 2014-05-15 10:39  0.117188  0.984375  0.122070  2.085938  1.120163  0.237427
# 2014-05-15 10:40  0.119141  0.987305  0.119141  2.093751  1.117629  0.236770
# 2014-05-15 10:41  0.116211  0.984375  0.120117  2.084961  1.118240  0.234512
# 2014-05-15 10:42  0.119141  0.983398  0.118164  2.085937  1.116202  0.235327

t1 = timeit.timeit(
    'df.apply(myfunc1, axis=1)',
    globals=dict(df=df, myfunc1=myfunc1), number=10000)
print(round(t1, 3), 'seconds')
# 14.571 seconds

Solution 02. Use result_type='expand' when applying

def myfunc2(args):
    e = args[0] + 2*args[1]
    f = args[1]*args[2] + 1
    g = args[2] + args[0]*args[1]
    return [e, f, g]

df[['e', 'f', 'g']] = df.apply(myfunc2, axis=1, result_type='expand')
#                          x         y         z         e         f         g
# ts
# 2014-05-15 10:38  0.120117  0.987305  0.116211  2.094727  1.114736  0.234803
# 2014-05-15 10:39  0.117188  0.984375  0.122070  2.085938  1.120163  0.237427
# 2014-05-15 10:40  0.119141  0.987305  0.119141  2.093751  1.117629  0.236770
# 2014-05-15 10:41  0.116211  0.984375  0.120117  2.084961  1.118240  0.234512
# 2014-05-15 10:42  0.119141  0.983398  0.118164  2.085937  1.116202  0.235327

t2 = timeit.timeit(
    "df.apply(myfunc2, axis=1, result_type='expand')",
    globals=dict(df=df, myfunc2=myfunc2), number=10000)
print(round(t2, 3), 'seconds')
# 9.907 seconds

Solution 03. If you want to make it faster, use np.vectorize. Note that args cannot be a single argument when using np.vectorize; each column is passed as a separate argument.

def myfunc3(args0, args1, args2):
    e = args0 + 2*args1
    f = args1*args2 + 1
    g = args2 + args0*args1
    return [e, f, g]

df[['e', 'f', 'g']] = pd.DataFrame(np.row_stack(np.vectorize(myfunc3, otypes=['O'])(df['x'], df['y'], df['z'])), index=df.index)
#                          x         y         z         e         f         g
# ts
# 2014-05-15 10:38  0.120117  0.987305  0.116211  2.094727  1.114736  0.234803
# 2014-05-15 10:39  0.117188  0.984375  0.122070  2.085938  1.120163  0.237427
# 2014-05-15 10:40  0.119141  0.987305  0.119141  2.093751  1.117629  0.236770
# 2014-05-15 10:41  0.116211  0.984375  0.120117  2.084961  1.118240  0.234512
# 2014-05-15 10:42  0.119141  0.983398  0.118164  2.085937  1.116202  0.235327

t3 = timeit.timeit(
    "pd.DataFrame(np.row_stack(np.vectorize(myfunc3, otypes=