Python: vectorizing code in pandas by avoiding loops

Consider the following DataFrame:

In [448]: complex_dataframe = pd.DataFrame({'stat_A': [120, 121, 122, 123],
     ...:                                       'group_id_A': [1, 1, 1, 2],
     ...:                                       'level_A': [1, 2, 2, 1],
     ...:                                       'stat_B': [220, 221, 222, 223],
     ...:                                       'group_id_B': [1, 1, 1, 2],
     ...:                                       'level_B': [1, 1, 2, 2],
     ...:                                       'stat_C': ['aa', 'ab', 'aa', 'ab'],
     ...:                                       'measure_avg_A': [10.5, 11, 20, 12],
     ...:                                       'measure_sum_B': [10, 20, 30, 40]}
     ...:                                      )

In [449]: complex_dataframe
Out[449]: 
   stat_A  group_id_A  level_A  stat_B  group_id_B  level_B stat_C  measure_avg_A  measure_sum_B
0     120           1        1     220           1        1     aa           10.5             10
1     121           1        2     221           1        1     ab           11.0             20
2     122           1        2     222           1        2     aa           20.0             30
3     123           2        1     223           2        2     ab           12.0             40

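Here, a variable that has three columns (stat, group_id and level) is a complex column, and a variable that has only a stat column is a simple column. So A and B above are complex columns, C is a simple column, and the columns starting with measure_ are just values.

The use case: I need to group by all group_id_ columns and all simple columns. In the case above, that means a groupby on group_id_A, group_id_B and stat_C.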

Expected output:

where the measure_A column looks like this:

I have coded this using multiple loops:
from collections import Counter

import pandas as pd

# Suffix (A, B, C) of every column that is not a measure
cols_without_measures = complex_dataframe.loc[:, ~complex_dataframe.columns.str.startswith("measure_")].columns.tolist()
cols_without_measures = [i.split('_')[-1] for i in cols_without_measures]
counter = Counter(cols_without_measures)

# A suffix that occurs three times (stat, group_id, level) is a complex column
complex_cols = [k for k, v in counter.items() if v == 3]
simple_cols = list(set(list(counter.keys())).symmetric_difference(set(complex_cols)))
grouped_cols = ['group_id_' + i for i in complex_cols] + ['stat_' + i for i in simple_cols]

grp = complex_dataframe.groupby(grouped_cols)
pieces = []

for k, v in grp:
    temp = v.loc[:, ~v.columns.isin(grouped_cols)]
    stat_df = temp.loc[:, ~temp.columns.str.startswith('measure_')]
    measure_df = temp.filter(like='measure_', axis=1)
    # Copy the one-row slice so that new columns can be assigned safely
    out_df = v[grouped_cols].head(1).copy()
    for i in measure_df.columns:
        m = stat_df.copy()
        m[i] = measure_df[i]
        out_df[i] = [m.to_dict()]
    pieces.append(out_df)

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
complex_df = pd.concat(pieces)

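Is there a better way to solve this? Maybe it can be vectorized somehow.

If you are willing to access the grouped data frames through a more logical index, you can do this: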
df_grouped = complex_dataframe.groupby(['group_id_A', 'group_id_B', 'stat_C'])

You then access each group through the tuple of values it was grouped by:
df_grouped.get_group((1, 1, 'ab'))

which gives you the DataFrame
   stat_A  group_id_A  level_A  stat_B  group_id_B  level_B stat_C  measure_avg_A  measure_sum_B
1     121           1        2     221           1        1     ab           11.0             20

You can iterate over the groups with:
for key, item in df_grouped:
    print(key, "\n")
    print(df_grouped.get_group(key), "\n\n")
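
If looking groups up by key tuple is all that is needed, the same idea can be sketched as a plain dictionary instead of dicts nested inside DataFrame cells. This is only an illustration, not part of the answer above: the names `keys` and `groups` are made up here, and dropping the key columns from each group is an assumption about what is wanted.

```python
import pandas as pd

complex_dataframe = pd.DataFrame({
    'stat_A': [120, 121, 122, 123],
    'group_id_A': [1, 1, 1, 2],
    'level_A': [1, 2, 2, 1],
    'stat_B': [220, 221, 222, 223],
    'group_id_B': [1, 1, 1, 2],
    'level_B': [1, 1, 2, 2],
    'stat_C': ['aa', 'ab', 'aa', 'ab'],
    'measure_avg_A': [10.5, 11, 20, 12],
    'measure_sum_B': [10, 20, 30, 40],
})

# Hypothetical name: the grouping columns from the question
keys = ['group_id_A', 'group_id_B', 'stat_C']

# Map each key tuple to the rows of its group, without the key columns
groups = {key: part.drop(columns=keys)
          for key, part in complex_dataframe.groupby(keys)}

print(groups[(1, 1, 'aa')])
```

This still loops over the groups, so it is no more vectorized than the other approaches; it only keeps every group as an ordinary DataFrame.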

You can do this in pandas with groupby, but it relies on apply and a lambda, so it is not a fully vectorized solution:
df_complex = complex_dataframe.groupby(['group_id_A','group_id_B','stat_C']).apply(
    lambda x: pd.Series({
        'measure_avg_A': x[['stat_A','level_A','stat_B','level_B','measure_avg_A']].to_dict(),
        'measure_sum_B': x[['stat_A','level_A','stat_B','level_B','measure_sum_B']].to_dict()
    })).reset_index()

You can then query the DataFrame as needed:
pd.DataFrame(df_complex.at[0, 'measure_avg_A'])

Output:
   stat_A  level_A  stat_B  level_B  measure_avg_A
0     120        1     220        1           10.5
2     122        2     222        2           20.0

Not sure whether this is any better, but you can try it and let me know:
measure_cols = [*complex_dataframe.columns[complex_dataframe.columns
                                           .str.contains("measure")]]
u = complex_dataframe.set_index(grouped_cols)

final=pd.concat([u[u.columns.difference(measure_cols,sort=False).union([i],sort=False)]
        .groupby(grouped_cols).apply(lambda x: x.reset_index(drop=True).to_dict())
        .rename(i)   for i in measure_cols],axis=1).reset_index()


Use groupby with apply, then join the results with concat along axis 1:
def d(x):
    return x.to_dict()
g = complex_dataframe.groupby(['group_id_A','group_id_B','stat_C'])
one = g[['stat_A','level_A','stat_B','level_B','measure_avg_A']].apply(d)
two = g[['stat_A','level_A','stat_B','level_B','measure_sum_B']].apply(d)

out = pd.concat([one, two], axis=1)
out.columns = ['measure_avg_A', 'measure_sum_B']

Borrowing some code from anky :p
display = print
final = out 

                                                                  measure_avg_A                                      measure_sum_B
group_id_A group_id_B stat_C                                                                                                      
1          1          aa      {'stat_A': {0: 120, 2: 122}, 'level_A': {0: 1,...  {'stat_A': {0: 120, 2: 122}, 'level_A': {0: 1,...
                      ab      {'stat_A': {1: 121}, 'level_A': {1: 2}, 'stat_...  {'stat_A': {1: 121}, 'level_A': {1: 2}, 'stat_...
2          2          ab      {'stat_A': {3: 123}, 'level_A': {3: 1}, 'stat_...  {'stat_A': {3: 123}, 'level_A': {3: 1}, 'stat_...

   stat_A  level_A  stat_B  level_B  measure_avg_A
0     120        1     220        1           10.5
2     122        2     222        2           20.0

   stat_A  level_A  stat_B  level_B  measure_avg_A
1     121        2     221        1           11.0

   stat_A  level_A  stat_B  level_B  measure_sum_B
0     120        1     220        1             10
2     122        2     222        2             30

   stat_A  level_A  stat_B  level_B  measure_sum_B
1     121        2     221        1             20
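
The two hard-coded selections in this answer could probably be generalized to any number of measure columns, and a nested dict turned back into a frame with pd.DataFrame. The following is a rough sketch under those assumptions; `keys`, `measure_cols` and `other_cols` are illustrative names, not taken from the answer.

```python
import pandas as pd

complex_dataframe = pd.DataFrame({
    'stat_A': [120, 121, 122, 123],
    'group_id_A': [1, 1, 1, 2],
    'level_A': [1, 2, 2, 1],
    'stat_B': [220, 221, 222, 223],
    'group_id_B': [1, 1, 1, 2],
    'level_B': [1, 1, 2, 2],
    'stat_C': ['aa', 'ab', 'aa', 'ab'],
    'measure_avg_A': [10.5, 11, 20, 12],
    'measure_sum_B': [10, 20, 30, 40],
})

# Hypothetical helper names for the three kinds of columns
keys = ['group_id_A', 'group_id_B', 'stat_C']
measure_cols = [c for c in complex_dataframe.columns if c.startswith('measure_')]
other_cols = [c for c in complex_dataframe.columns
              if c not in keys and c not in measure_cols]

g = complex_dataframe.groupby(keys)

# One Series of nested dicts per measure; the dict keys become column names
out = pd.concat(
    {m: g[other_cols + [m]].apply(lambda x: x.to_dict()) for m in measure_cols},
    axis=1,
)

# Turn a single cell back into a DataFrame
frame = pd.DataFrame(out.loc[(1, 1, 'aa'), 'measure_avg_A'])
print(frame)
```

The comprehension still runs one groupby apply per measure column, so the loop is only moved, not removed.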

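While this is a possible solution, I would suggest reconsidering your data layout. A simpler data frame might achieve the same functionality.

This looks good. The only thing is that there can be many columns used in pd.Series. Do I have to loop over them?

If you have many columns to create, I would use a dict comprehension inside pd.Series. But note that this is most likely not a good solution, unless you really need to create multiple DataFrames/dicts inside one DataFrame.

I understand that. Could you please provide a solution in your answer that uses a dict comprehension in the apply statement?

Sorry, no time right now. Maybe this is a good topic for another question?

The solution works well. It looks like avoiding the for loop completely is not possible. There are too many functions in use here, which makes it a bit complicated.

@MayankPorwal thanks for confirming, I can't come up with a loop-free solution right now. :/ Sorry

@jezrael can you take a look at this question?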
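
The dict comprehension mentioned in these comments was never posted. A minimal sketch of what it might look like follows; the helper `pack` and the lists `group_cols`, `measure_cols` and `other_cols` are hypothetical names introduced here, and the columns are selected explicitly before apply so that the grouping columns are not passed to the function.

```python
import pandas as pd

complex_dataframe = pd.DataFrame({
    'stat_A': [120, 121, 122, 123],
    'group_id_A': [1, 1, 1, 2],
    'level_A': [1, 2, 2, 1],
    'stat_B': [220, 221, 222, 223],
    'group_id_B': [1, 1, 1, 2],
    'level_B': [1, 1, 2, 2],
    'stat_C': ['aa', 'ab', 'aa', 'ab'],
    'measure_avg_A': [10.5, 11, 20, 12],
    'measure_sum_B': [10, 20, 30, 40],
})

group_cols = ['group_id_A', 'group_id_B', 'stat_C']
measure_cols = [c for c in complex_dataframe.columns if c.startswith('measure_')]
# Everything that is neither a grouping key nor a measure
other_cols = [c for c in complex_dataframe.columns
              if c not in group_cols and c not in measure_cols]


def pack(group):
    """Return one nested dict per measure column for a single group."""
    return pd.Series({m: group[other_cols + [m]].to_dict() for m in measure_cols})


df_complex = (
    complex_dataframe.groupby(group_cols)[other_cols + measure_cols]
    .apply(pack)
    .reset_index()
)

print(pd.DataFrame(df_complex.at[0, 'measure_avg_A']))
```

As the comments point out, this removes the hand-written column lists but not the per-group Python call, so it should not be expected to behave like a truly vectorized operation.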