Python: double pandas groupby operation for pairwise comparisons between outer/inner loop groups


python pandas


I'm trying to do a somewhat complicated groupby operation. Below is some functional but slow code.

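import numpy as np
import pandas as pd

# Build a toy dataframe
idx1 = ["bar", "baz", "foo"]
idx2 = list(range(100, 104))
idx3 = list(range(3))
num_data = len(idx1) * len(idx2) * len(idx3)
index = pd.MultiIndex.from_product((idx1, idx2, idx3), names=["first", "third", "fourth"])
np.random.seed(0)
x = np.random.randint(low=0, high=2, size=num_data, dtype=bool)
input_df = pd.DataFrame(index=index, data={"x": x}).reset_index()
input_df["second"] = "positive"
# Use .loc rather than chained indexing so the assignment hits input_df itself
input_df.loc[input_df["third"] != 100, "second"] = "negative"
input_df.loc[input_df["third"] == 101, "third"] = 100
# Complication: not all "third" groups have the same "fourth" indices. Most indices are shared by most
# groups, but the intersection is incomplete.
mask = np.ones(num_data, dtype=bool)
mask[[17, 18]] = False
input_df = input_df[mask]
input_df = input_df.set_index(["first", "second", "third", "fourth"])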

input_df

looks like this:

                                 x
first second   third fourth
bar   positive 100   0       False
                     1       False
                     2        True
      negative 100   0        True
                     1       False
                     2        True
               102   0       False
                     1        True
                     2       False
               103   0        True
                     1       False
                     2        True
baz   positive 100   0       False
                     1       False
                     2       False
      negative 100   0       False # Notice some missing rows here
                     1        True
               102   1        True
                     2        True
               103   0        True
                     1        True
                     2       False
foo   positive 100   0       False
                     1       False
                     2        True
      negative 100   0        True
                     1       False
                     2       False
               102   0       False
                     1        True
                     2        True
               103   0        True
                     1        True
                     2        True

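Dataframe guarantees/properties:

Within each "first" group, there is always exactly one positive "third" group
Within each "first" group, there are N (variable) negative "third" groups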
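
These two properties can be checked up front. The following is only a minimal sketch on a hypothetical toy frame (the name toy and its values are made up for illustration; the layout mirrors input_df): it counts the distinct "third" values per ("first", "second") pair.

```python
import pandas as pd

# Hypothetical toy data in the same ("first", "second", "third", "fourth") layout.
idx = pd.MultiIndex.from_tuples(
    [
        ("bar", "positive", 100, 0),
        ("bar", "negative", 100, 0),
        ("bar", "negative", 102, 0),
        ("baz", "positive", 100, 0),
        ("baz", "negative", 103, 0),
    ],
    names=["first", "second", "third", "fourth"],
)
toy = pd.DataFrame({"x": [True, False, True, False, True]}, index=idx)

# Number of distinct "third" groups per ("first", "second") pair,
# with "second" spread into columns.
groups = (
    toy.index.to_frame(index=False)
    .groupby(["first", "second"])["third"]
    .nunique()
    .unstack("second")
)

# True when every "first" group has exactly one positive "third" group.
one_positive = bool((groups["positive"] == 1).all())
```

The groups["negative"] column then holds the variable N for each "first" group.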

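What I want to do efficiently is:

For each "first" group:
- Compare every negative "third" group to the single positive "third" group (see the code for what "compare" means)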

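dfs = []
# For each "first" group:
for first, first_df in input_df.groupby("first"):
    # Separate the positive and negative groups
    positive_mask = first_df.index.get_level_values("second") == "positive"
    first_df = first_df.droplevel(["first"])
    positive_df = first_df[positive_mask]
    negative_dfs = first_df[~positive_mask]
    positive_df = positive_df.droplevel(["second", "third"])
    # Do some computation on each negative "third" group and its corresponding positive group.
    for third, negative_df in negative_dfs.groupby("third"):
        negative_df = negative_df.droplevel(["second", "third"])
        # Compare the positive/negative groups based on the "fourth" index only.
        # Note that "False" is assigned for indices that are not in the intersection.
        true_true = negative_df["x"] & positive_df["x"]
        true_false = negative_df["x"] & ~positive_df["x"]
        false_false = ~negative_df["x"] & ~positive_df["x"]
        false_true = ~negative_df["x"] & positive_df["x"]
        df = pd.DataFrame({
            "true_true": true_true,
            "true_false": true_false,
            "false_false": false_false,
            "false_true": false_true
        }).reset_index()
        df["first"] = first
        df["second"] = "negative"
        df["third"] = third
        dfs.append(df)
# Output: one big dataframe of the computed values for all negative "third" groups.
output_df = pd.concat(dfs)
output_df = output_df.set_index(["first", "second", "third", "fourth"], verify_integrity=True).sort_index()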

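This means that

output_df

looks like the following. Note that wherever a "fourth" index was missing from the original dataframe, every column in that row is False.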

                             true_true  true_false  false_false  false_true
first second   third fourth
bar   negative 100   0           False        True        False       False
                     1           False       False         True       False
                     2            True       False        False       False
               102   0           False       False         True       False
                     1           False        True        False       False
                     2           False       False        False        True
               103   0           False        True        False       False
                     1           False       False         True       False
                     2            True       False        False       False
baz   negative 100   0           False       False         True       False
                     1           False        True        False       False
                     2           False       False        False       False # All false from missing data
               102   0           False       False        False       False # All false from missing data
                     1           False        True        False       False
                     2           False        True        False       False
               103   0           False        True        False       False
                     1           False        True        False       False
                     2           False       False         True       False
foo   negative 100   0           False        True        False       False
                     1           False       False         True       False
                     2           False       False        False        True
               102   0           False       False         True       False
                     1           False        True        False       False
                     2            True       False        False       False
               103   0           False        True        False       False
                     1           False        True        False       False
                     2            True       False        False       False

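Doing this in a loop is very slow :( It is not only the loop itself: profiling shows that a lot of time is spent in the inner-loop comparison operations, because the indices have to be aligned.

Is there a more efficient way to perform this kind of computation, perhaps without so much looping?

Edit: added a random seed for deterministic example data, and updated the input/output data.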
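
To make the alignment step from the question concrete, here is a minimal sketch on two hypothetical Series (neg, pos, and their values are made up for illustration). Rather than relying on implicit alignment inside `&`, it reindexes both sides onto the union of their "fourth" indices once, then compares plain NumPy arrays. The present mask is an assumption about the intended semantics: it reproduces the "all False where an index is missing" rule.

```python
import pandas as pd

# Hypothetical negative/positive groups; "fourth" == 2 is missing from neg.
neg = pd.Series([True, False], index=pd.Index([0, 1], name="fourth"))
pos = pd.Series([True, True, False], index=pd.Index([0, 1, 2], name="fourth"))

# Align once, explicitly, filling missing entries with False.
shared = neg.index.union(pos.index)
neg_a = neg.reindex(shared, fill_value=False).to_numpy(dtype=bool)
pos_a = pos.reindex(shared, fill_value=False).to_numpy(dtype=bool)

# True only where the index exists on both sides.
present = shared.isin(neg.index) & shared.isin(pos.index)

true_true = pd.Series(neg_a & pos_a & present, index=shared)
false_false = pd.Series(~neg_a & ~pos_a & present, index=shared)
```

Without the present mask, the row at "fourth" == 2 would be reported as false_false, because the fill value False is indistinguishable from a real False.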

You can try this approach, though I'm not sure whether it is more efficient:

dfi = input_df['x'].unstack(level=['second','fourth'])

dfi.update(dfi.groupby('first').ffill()[['positive']])

dfi = dfi.stack()
neg_nulls = dfi['negative'].isna()
pos_nulls = dfi['positive'].isna()    
dfi = dfi.fillna(False)
    
dfi['true_true'] = dfi["negative"] & dfi["positive"] 
dfi['true_false'] =  dfi["negative"] & ~dfi["positive"]
dfi['false_false'] =  ~dfi["negative"] & ~dfi["positive"]
dfi['false_true'] =  ~dfi["negative"] & dfi["positive"]
dfi[neg_nulls] = False
dfi[pos_nulls] = False
    
df_out = dfi.rename_axis([None], axis=1)\
   .assign(second='negative')\
   .set_index('second', append=True)\
   .reorder_levels([0,3,1,2])\
   .drop(['positive', 'negative'], axis=1)

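Output with timings (updated using np.random.seed(0)):

Timings

21.8 ms ± 549 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
50.2 ms ± 2.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Details

Reshape the input dataframe so that the positives and negatives sit side by side for each "fourth"
Forward-fill the positive data across each "third" (within each "first" group) and update the dataframe
Reshape again, stacking "fourth", so that there is a positive column next to the negative column
Apply the true_true ... false_false logic
Set all values to False where the negative or the positive is missing
Reshape to get the desired output dataframe

(No loops, but quite a lot of reshaping of the dataframe.)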

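Please add np.random.seed(some number) to your shared data so that it is reproducible.

@sammywemmy Thanks, great suggestion, done. This is really important ;-)

@cydonian Did this solution help you? Does it run twice as fast on your data?

@ScottBoston Still working on it. I may need to update the question: in my real data, the "fourth" indices only match within each "first" group, so unstacking the dataframe in the first step gives O(N_firsts) columns. I'll see whether I can adapt your answer and post the results.

@ScottBoston This answer works perfectly for the question as I asked it, so I've marked it as correct. There are some additional input data characteristics that make this answer unsuitable for my specific use case, namely that each "first" group draws its "third" and "fourth" indices from a different set. I tried to figure out a way to adapt your answer but, alas, I'm a noob. I figured it would be cleaner to re-ask the modified question, which I've done here: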

                             true_true  true_false  false_false  false_true
first second   third fourth                                                
bar   negative 100   0           False        True        False       False
                     1           False       False         True       False
                     2            True       False        False       False
               102   0           False       False         True       False
                     1           False        True        False       False
                     2           False       False        False        True
               103   0           False        True        False       False
                     1           False       False         True       False
                     2            True       False        False       False
baz   negative 100   0           False       False         True       False
                     1           False        True        False       False
                     2           False       False        False       False
               102   0           False       False        False       False
                     1           False        True        False       False
                     2           False        True        False       False
               103   0           False        True        False       False
                     1           False        True        False       False
                     2           False       False         True       False
foo   negative 100   0           False        True        False       False
                     1           False       False         True       False
                     2           False       False        False        True
               102   0           False       False         True       False
                     1           False        True        False       False
                     2            True       False        False       False
               103   0           False        True        False       False
                     1           False        True        False       False
                     2            True       False        False       False

all(df_out==output_df)
True
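
A caveat on the check above: Python's built-in all() iterates over a DataFrame's column labels, not its cells, so all(df_out==output_df) is True whenever the column labels are non-empty strings. A minimal sketch of a cell-wise check, using two hypothetical frames a and b that differ in one cell:

```python
import pandas as pd

a = pd.DataFrame({"u": [True, False], "v": [True, True]})
b = pd.DataFrame({"u": [True, True], "v": [True, True]})  # differs from a in one cell

# Iterates over the column labels "u" and "v", which are truthy strings.
labels_check = all(a == b)

# Reduces over every cell: first per column, then across columns.
cells_check = bool((a == b).all().all())
```

Here labels_check is True even though the frames differ, while cells_check is False. DataFrame.equals is a stricter option that also compares dtypes.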