Python: efficiently computing a DataFrame column that lists the names of columns whose values exceed a percentile

python pandas dataframe indexing

I have a DataFrame with a large number of integer columns:

df

id    col_1    col_2   ...    col_n
0     1        21             120 
1     2        42             23
2     55       16             54
3     4        48             12
4     12       100            75
5     6        52             64

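I would like to generate an additional column (for example, my_col) that lists, for each row, the names of the columns whose value is above that column's 90th percentile: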

df

id    col_1    col_2   ...    col_n    my_col
0     1        21             120      [col_n] # because 120 is above the 90th percentile of values in col_n
1     2        42             23       [] # empty list because no values are above the 90th percentile in their respective cols
2     55       16             120      [col_1, col_n] # because 55 is above the 90th percentile in col_1, and 120 is in col_n
3     4        48             12       []
4     12       100            75       [col_2]
5     6        52             64       []

First, I created a DataFrame holding the 90th percentile of every column:

#cols = my column names list
#transposing to pretty print it a bit more nicely, don't think it's strictly necessary
df_p90 = df[cols].quantile([0.90]).transpose()

Then I defined a custom function to compute the required list of columns:

def f(row, df_quantiles, in_cols):
    col_list = []
    for col in in_cols:
        if row[col] > df_quantiles.at[col, 0.90]:
            col_list.append(col)
    return col_list

and applied it to my DataFrame:

df["my_col"] = df.apply(f, args=(df_p90, cols), axis=1)

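The code works, but it is very slow on a large DataFrame (about 200,000 rows and 2,000 columns). I'm fairly sure this is because I defined f with a for loop and use at for direct lookups. I can't really "think in DataFrames" yet; I'm still very much in the "everything is a for loop, if-then-else" mindset.

How can I do this better?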

We can try dot:

df = df.set_index('id')
s = df.gt(df.quantile(.90)).dot(df.columns + ',').str[:-1].str.split(',')
df['c'] = s
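To make the one-liner easier to follow, here is a minimal sketch that splits it into named steps. The sample frame is rebuilt from the question's three visible columns, and the intermediate names (p90, mask, joined, names) are introduced only for illustration. It assumes the column labels are strings that contain no commas.

```python
import pandas as pd

# Sample data from the question; col_n stands in for the remaining columns.
df = pd.DataFrame({
    "id": [0, 1, 2, 3, 4, 5],
    "col_1": [1, 2, 55, 4, 12, 6],
    "col_2": [21, 42, 16, 48, 100, 52],
    "col_n": [120, 23, 54, 12, 75, 64],
}).set_index("id")

# Step 1: one 90th-percentile value per column (a Series indexed by column name).
p90 = df.quantile(0.90)

# Step 2: boolean DataFrame; each column is compared with its own percentile.
mask = df.gt(p90)

# Step 3: "multiply" each flag by "<column name>," and concatenate along the row.
# True contributes the label plus a comma, False contributes an empty string.
joined = mask.dot(df.columns + ",")

# Step 4: drop the trailing comma, then split the string into a list of names.
names = joined.str[:-1].str.split(",")

print(names.tolist())
```

Rows with no value above the percentile produce an empty string in step 3, and splitting an empty string gives a list holding one empty string rather than an empty list.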

This works:

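top_x_values = df.quantile(.9)
top_x_bool = df > top_x_values  # a True/False DataFrame
for col in top_x_bool:
    # Replace True with the column name and False with an empty string.
    # Assigning the result back avoids a chained in-place replace, which
    # may not propagate to top_x_bool under pandas' copy-on-write behaviour.
    top_x_bool[col] = top_x_bool[col].map({True: col, False: ""})

# Join the column names to create the list you're looking for.
df["col_list"] = top_x_bool.agg(lambda columns: ",".join(col for col in columns if col), axis=1)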

Here is the output:

    col_1   col_2   col_n   col_list
0   1       21      120 
1   2       42      23  
2   55      16      54      col_1
3   4       48      12  
4   12      100     200     col_2,col_n
5   6       52      64

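Maybe you could sort your columns. Then, as soon as you find a column that does not reach the 90th percentile, you can discard the remaining unchecked columns.

Great answer! I hadn't thought of combining gt and dot. I can't quite parse the syntax of the last part: it looks like it cuts off the last character (possibly something extra added when joining with commas) and then splits the string. However, if no column is above p90, it returns a list containing an empty string rather than an empty list. I worked that out with .apply(lambda x: len(x)). Where should .strip() go?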
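On the empty-string lists raised in the last comment: calling .strip() would not help, because splitting an empty string still yields a list with one empty string. Below is a rough sketch of two possible workarounds, using the question's sample columns. The names via_split and via_mask are hypothetical, and neither option has been benchmarked at the 200,000 x 2,000 scale mentioned in the question.

```python
import pandas as pd

# Sample data from the question; col_n stands in for the remaining columns.
df = pd.DataFrame({
    "id": [0, 1, 2, 3, 4, 5],
    "col_1": [1, 2, 55, 4, 12, 6],
    "col_2": [21, 42, 16, 48, 100, 52],
    "col_n": [120, 23, 54, 12, 75, 64],
}).set_index("id")

mask = df.gt(df.quantile(0.90))

# Option 1: keep the dot-based string, then drop empty entries after splitting.
joined = mask.dot(df.columns + ",").str[:-1]
via_split = joined.str.split(",").apply(lambda names: [n for n in names if n])

# Option 2: index the column labels with each boolean row.
# No strings are joined, so labels containing commas are not a problem.
cols = df.columns.to_numpy()
via_mask = pd.Series(
    [cols[row].tolist() for row in mask.to_numpy()],
    index=df.index,
)

df["my_col"] = via_mask
print(df)
```

Option 2 still loops over rows in Python, but each iteration is a single boolean index on a NumPy array rather than a per-cell lookup.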