Python DataFrame: create a new DataFrame, keeping duplicates based on multiple columns occurring more than 2 times (time constraint)

Here is the sample code:
import pandas as pd

data = {'Col1': [0,1,2,3,4,5,6,7,8,9,10,11,12],
        'Col2': ['A1','A1','A1','A2','A2','A3','A4','A4','A4','A1','a9','a9','A2'],
        'Col3': ['B1','B1','B1','B2','B3','B4','B5','B5','B5','B1','b9','b9','B2'],
        'Col4': ['ab','bc','cd','da','da','da','df','fd','vf','sd','asd','sda','sdf'],
        }
df2 = pd.DataFrame(data)
counts_col2 = df2.groupby("Col2")["Col2"].transform(len)
counts_col3 = df2.groupby("Col3")["Col3"].transform(len)
mask = (counts_col2 > 2) & (counts_col3 > 2)
df2[mask]
Output:
Col1 Col2 Col3 Col4
0 0 A1 B1 ab
1 1 A1 B1 bc
2 2 A1 B1 cd
6 6 A4 B5 df
7 7 A4 B5 fd
8 8 A4 B5 vf
9 9 A1 B1 sd
Instead of aggregating with len, you should aggregate with 'size', which is implemented to be fast. That alone gives you a large performance boost, and you can squeeze out a bit more by specifying sort=False in the groupby. So change the lines to:
counts_col2 = df.groupby("Col2", sort=False)["Col2"].transform('size')
counts_col3 = df.groupby("Col3", sort=False)["Col3"].transform('size')
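Putting the recommendation together, here is a minimal self-contained sketch of the sped-up filter applied to the question's sample data (the answer's `df` is the question's `df2`):

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'Col2': ['A1','A1','A1','A2','A2','A3','A4','A4','A4','A1','a9','a9','A2'],
    'Col3': ['B1','B1','B1','B2','B3','B4','B5','B5','B5','B1','b9','b9','B2'],
    'Col4': ['ab','bc','cd','da','da','da','df','fd','vf','sd','asd','sda','sdf'],
})

# 'size' is a fast built-in aggregation; sort=False skips sorting the group keys.
counts_col2 = df.groupby("Col2", sort=False)["Col2"].transform('size')
counts_col3 = df.groupby("Col3", sort=False)["Col3"].transform('size')
result = df[(counts_col2 > 2) & (counts_col3 > 2)]
print(result)  # same rows as the len-based version: indices 0, 1, 2, 6, 7, 8, 9
```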
Timings / equivalent examples:

Another good approach is value_counts with map (data taken from @aLollz's answer):
%%timeit
df2[df2['Col2'].map(df2['Col2'].value_counts()>2) &
df2['Col3'].map(df2['Col3'].value_counts()>2)]
31.2 ms ± 8.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df2[(df2.groupby("Col2")["Col2"].transform('size')>2) &
(df2.groupby("Col3")["Col3"].transform('size')>2)]
40.5 ms ± 981 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
@ZestyDragon: no, because pandas will still align on the index when using a boolean mask; with many groups, though, that alignment starts to have a noticeable effect on the result. Also note that in the benchmark above I used equality_check=np.allclose, which ensures that all the methods produce essentially the same result.
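The index-alignment point can be seen in a small sketch (the frame and names here are illustrative, not from the answer): a boolean Series mask is matched to the frame by index label, not by position.

```python
import pandas as pd

# A frame with a non-default index order.
df = pd.DataFrame({"x": [10, 20, 30]}, index=[2, 0, 1])

# The mask is a Series carrying the same index [2, 0, 1],
# so pandas aligns on labels and still selects the right rows.
mask = df["x"] > 15
selected = df[mask]
print(list(selected["x"]))  # [20, 30]
```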
import perfplot
import pandas as pd
import numpy as np

def agg_len(df):
    counts_col2 = df.groupby("Col2")["Col2"].transform(len)
    counts_col3 = df.groupby("Col3")["Col3"].transform(len)
    mask = (counts_col2 > 2) & (counts_col3 > 2)
    return df[mask]

def agg_size(df):
    counts_col2 = df.groupby("Col2")["Col2"].transform('size')
    counts_col3 = df.groupby("Col3")["Col3"].transform('size')
    mask = (counts_col2 > 2) & (counts_col3 > 2)
    return df[mask]

def agg_size_nosort(df):
    counts_col2 = df.groupby("Col2", sort=False)["Col2"].transform('size')
    counts_col3 = df.groupby("Col3", sort=False)["Col3"].transform('size')
    mask = (counts_col2 > 2) & (counts_col3 > 2)
    return df[mask]

# @ansev's solution
def map_value_counts(df):
    return df[df['Col2'].map(df['Col2'].value_counts() > 2) &
              df['Col3'].map(df['Col3'].value_counts() > 2)]

perfplot.show(
    setup=lambda N: pd.DataFrame({'Col1': range(N),
                                  'Col2': np.random.choice(np.arange(N), N),
                                  'Col3': np.random.choice(np.arange(N), N),
                                  'Col4': np.random.choice(np.arange(N), N)}),
    kernels=[
        lambda df: agg_len(df),
        lambda df: agg_size(df),
        lambda df: agg_size_nosort(df),
        lambda df: map_value_counts(df)
    ],
    labels=['Agg len', 'Agg size', 'Agg size No Sort', 'Map Value Counts'],
    n_range=[2 ** k for k in range(16)],
    equality_check=np.allclose,
    xlabel="~ Number of Groups"
)