Python: applying a function to each group (the output is not really an aggregation)


python pandas


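I have a list of time series (= DataFrames) and want to compute the matrix profile for each time series (one per device). One option is to iterate over all devices, which seems slow. A second option is to group by device and apply a UDF. The problem is that the UDF returns rows 1:1: it does not return a single scalar value per group, but as many output rows as there are input rows.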

When the UDF returns 1:1 (or at least non-scalar) output, is it still possible to somehow vectorize the computation for each group?

import pandas as pd
from IPython.display import display  # built in when running inside Jupyter
df = pd.DataFrame({
    'foo':[1,2,3], 'baz':[1.1, 0.5, 4], 'bar':[1,2,1]
})
display(df)

print('***************************')
# slow version retaining all the rows
for g in df.bar.unique():
    print(g)
    
    # copy the slice so that adding a column does not raise SettingWithCopyWarning
    this_group = df[df.bar == g].copy()
    # perform a UDF which needs to have all the values per group
    # i.e. for real I want to calculate the matrixprofile for each time-series of a device
    this_group['result'] = this_group.baz.apply(lambda x: 1)
    display(this_group)

print('***************************')

def my_non_scalar1_1_agg_function(x):
    display(pd.DataFrame(x))
    return x

# neatly vectorized application of a non_scalar function
# but this fails as:  Must produce aggregated value
df = df.groupby(['bar']).baz.agg(my_non_scalar1_1_agg_function)
display(df)

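In fact, this (see also the link in the comments above) is one way to make it work in a faster and more desirable way. There may be even better options.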

import pandas as pd
from IPython.display import display  # built in when running inside Jupyter

df = pd.DataFrame({
    'foo':[1,2,3], 'baz':[1.1, 0.5, 4], 'bar':[1,2,1]
})
display(df)

grouped_df = df.groupby(['bar'])

altered = []
for index, subframe in grouped_df:
    display(subframe)
    # apply the real UDF here; this placeholder leaves the subframe unchanged
    subframe = subframe
    altered.append(subframe)
    print(index)

# stitch the per-group results back into a single DataFrame
result = pd.concat(altered, ignore_index=True)
display(result)
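
As one possible alternative to the explicit loop, and only under the assumption that the UDF takes a single column of one group and returns a result of the same length, pandas' groupby(...).transform can write the 1:1 output straight back into the original frame. The demean function below is a hypothetical stand-in for the real UDF, not the matrix profile computation itself.

```python
import pandas as pd

df = pd.DataFrame({
    'foo': [1, 2, 3], 'baz': [1.1, 0.5, 4], 'bar': [1, 2, 1]
})

def demean(series):
    """Hypothetical UDF: sees all values of one group, returns one value per row."""
    return series - series.mean()

# transform calls demean once per group and aligns the output with df's index
df['result'] = df.groupby('bar')['baz'].transform(demean)
print(df)
```

This only fits UDFs whose output has exactly as many rows as the group. A UDF that returns a different number of rows still needs the loop-and-concat approach.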

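For a non-aggregating function applied to each distinct group, that is, one that returns a non-scalar value, you need to iterate the method over the groups and then combine the results.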

Hence, consider a list or dict comprehension over groupby(), followed by concat. Make sure the method takes and returns a full DataFrame, Series, or ndarray.

# list comprehension
df_list = [myfunction(sub) for index, sub in df.groupby(['group_column'])]
final_df = pd.concat(df_list)

# dict comprehension
df_dict = {index: myfunction(sub) for index, sub in df.groupby(['group_column'])}
final_df = pd.concat(df_dict, ignore_index=True)
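
To make the pattern concrete, here is a minimal runnable sketch using the question's sample frame. The body of myfunction is an assumption made for illustration (a per-group cumulative sum); the real UDF would go in its place.

```python
import pandas as pd

def myfunction(sub):
    """Hypothetical UDF: takes a whole group, returns a frame with the same rows."""
    out = sub.copy()                      # avoid mutating a slice of df
    out['result'] = out['baz'].cumsum()   # needs all values of the group
    return out

df = pd.DataFrame({
    'foo': [1, 2, 3], 'baz': [1.1, 0.5, 4], 'bar': [1, 2, 1]
})

# one processed frame per group, then stitched back together
df_list = [myfunction(sub) for _, sub in df.groupby('bar')]
final_df = pd.concat(df_list).sort_index()  # restore the original row order
print(final_df)
```

Because each sub-frame keeps its original index, sort_index() is enough to put the rows back in input order. With ignore_index=True the rows would instead stay grouped and receive a fresh 0..n-1 index.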

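Comments:

For this we may need to see the details of the UDF.

Sure: see the code that generates an example dataset and uses stumpy.stump as the UDF. I guess the second (non-accepted) answer should also work here; I will give it a try.

Does stumpy.stump return a single scalar value? The docs indicate it returns an ndarray with 4 columns. Please post an example of the output of a call and the single scalar value that needs to be extracted.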
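
If the UDF returns a 2D ndarray per group, as the comment above suggests, one way to combine the results is to wrap each array in a DataFrame tagged with its group key before concatenating. The sketch below does not call stumpy. windowed_stats is a hypothetical stand-in that returns one row per length-m window, and the device and value column names are assumptions as well.

```python
import numpy as np
import pandas as pd

def windowed_stats(values, m):
    """Hypothetical stand-in for an ndarray-returning UDF.

    Returns a 2D array with one row per length-m window (min, max),
    so it has len(values) - m + 1 rows rather than len(values).
    """
    windows = np.lib.stride_tricks.sliding_window_view(values, m)
    return np.column_stack([windows.min(axis=1), windows.max(axis=1)])

df = pd.DataFrame({
    'device': ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
    'value': [1.0, 3.0, 2.0, 5.0, 4.0, 4.0, 1.0],
})

frames = []
for device, sub in df.groupby('device'):
    out = windowed_stats(sub['value'].to_numpy(), m=2)
    frame = pd.DataFrame(out, columns=['win_min', 'win_max'])
    frame.insert(0, 'device', device)   # remember which group each row came from
    frames.append(frame)

profiles = pd.concat(frames, ignore_index=True)
print(profiles)
```

Since the output here has fewer rows than the input, it cannot be assigned back as a column of df. Keeping it as a separate frame keyed by device avoids that mismatch.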