Which version of duplicated feature column removal is faster in Python machine learning, and why?

python pandas

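I am taking an ML course on Udemy and am currently reading about feature engineering. I need to remove duplicated columns (features) from a dataset, and the author suggests two versions of the code.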

Dataset download

Version 1:
Version 1 transposes the DataFrame and then applies drop_duplicates(), as shown below:

data_unique = data.T.drop_duplicates(keep='first').T

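This code took my PC about 9 seconds to find 52 duplicated features out of 350. The shape of the data is (92500, 350), and my Windows PC has a dual-core i5, 16 GB of RAM and a 500 GB SSD.

Runtime:

9.71 s ± 299 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)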

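Version 2:
The instructor also suggested a second approach that compares every pair of columns in a loop, as shown below: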

# check for duplicated features in the training set
duplicated_feat = []
for i in range(0, len(X_train.columns)):
    if i % 10 == 0:  # this helps me understand how the loop is going
        print(i)

    col_1 = X_train.columns[i]

    for col_2 in X_train.columns[i + 1:]:
        if X_train[col_1].equals(X_train[col_2]):
            duplicated_feat.append(col_2)

Runtime:

2 min 16 s ± 4.97 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

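In the end, this took more than 2 minutes to find the duplicated features, yet the instructor claims it is the faster approach when the data is large. Based on my findings, I am not convinced by that claim.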

The best way is to use numpy to find the indices of the unique columns (axis=1) and then slice the original DataFrame with those indices.

import numpy as np
import pandas as pd
df = pd.read_csv('data.csv')

_, idx = np.unique(df.to_numpy(), axis=1, return_index=True)
df_uniq = df.iloc[:, np.sort(idx)]
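To see why the indices are sorted before slicing, here is a minimal sketch on a toy DataFrame (the frame and its column names are made up for illustration). np.unique returns the unique columns in sorted order of their values, so the raw indices can come back in a different order than the columns appear in the frame; sorting them restores the original column order.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "z_copy" duplicates "z" and "a_copy" duplicates "a".
toy = pd.DataFrame({
    "z": [7, 8, 9],
    "a": [1, 2, 3],
    "z_copy": [7, 8, 9],
    "b": [4, 5, 6],
    "a_copy": [1, 2, 3],
})

# idx holds the position of the first occurrence of each unique column,
# ordered by the column values rather than by position in the frame.
_, idx = np.unique(toy.to_numpy(), axis=1, return_index=True)

# Sorting the positions keeps the surviving columns in their original order.
kept = toy.iloc[:, np.sort(idx)]
print(list(kept.columns))  # ['z', 'a', 'b']
```

On this toy frame the result has the same columns as toy.T.drop_duplicates(keep='first').T.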

Some timings on my machine:

# First, a sanity check that the two results are equivalent (luckily all values are non-null)
(df_uniq == df.T.drop_duplicates(keep='first').T).all().all()
# True

%%timeit 
_, idx = np.unique(df.to_numpy(), axis=1, return_index=True)
df_uniq = df.iloc[:, np.sort(idx)]
#3.11 s ± 60.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df.T.drop_duplicates(keep='first').T
#25.9 s ± 112 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I wouldn't even bother with the loop, because it is so slow.
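The sanity check in the timings only works because the data contains no missing values. As a small illustration on made-up data, an element-wise == comparison treats NaN as unequal to NaN, whereas DataFrame.equals treats NaNs in the same position as equal:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value, compared against an exact copy.
a = pd.DataFrame({"x": [1.0, np.nan], "y": [3.0, 4.0]})
b = a.copy()

# Element-wise comparison: NaN == NaN evaluates to False, so the check fails.
elementwise_all = bool((a == b).all().all())

# DataFrame.equals considers NaNs in the same location to be equal.
equals_result = a.equals(b)

print(elementwise_all, equals_result)  # False True
```

If the data did contain nulls, a check based on equals would be the safer way to confirm that two results match.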

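How about using different input data sizes (and multiple runs) to verify your findings?

@Sparky05 means using %%timeit at the top of a cell in your notebook so that you can time it yourself.

The runtime metrics will be added to the question. Thanks, everyone.

So, if your goal is to figure out how these two versions scale, the best thing to do is to measure how they scale! Change the dataset size, making sure to avoid trivial cases (if you create a larger dataset by making N copies of a smaller one, this may produce an unrepresentative benchmark), re-test, and plot the relationship. If there really is a large constant-factor offset that makes the second version slower on small datasets, but better big-O behavior that makes it faster on large datasets, it will show up in the graph. Of course, your instructor may simply be wrong. They may even have been right when the course was created, but drop_duplicates() has since been rewritten to use a more efficient algorithm!

What are the reported times for my version 1 and version 2 code on your PC?

@Samual 25.9 s for version 1, in the timing test. I didn't have the patience to wait for version 2, because it takes too long and won't beat the NumPy version.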
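The scaling experiment suggested in the comments could be sketched roughly as follows. This is only an illustration under assumptions: make_data, the three wrapper functions, and the sizes are hypothetical and deliberately small, and the synthetic random integers stand in for the real dataset, so the absolute numbers will not match the timings reported above.

```python
import time
import numpy as np
import pandas as pd

def make_data(n_rows, n_cols, n_dupes, seed=0):
    """Hypothetical helper: random integer frame whose last n_dupes columns copy earlier ones."""
    rng = np.random.default_rng(seed)
    base = rng.integers(0, 100, size=(n_rows, n_cols - n_dupes))
    dupes = base[:, rng.integers(0, base.shape[1], size=n_dupes)]
    cols = [f"f{i}" for i in range(n_cols)]
    return pd.DataFrame(np.hstack([base, dupes]), columns=cols)

def transpose_version(df):
    # Version 1 from the question.
    return df.T.drop_duplicates(keep="first").T

def loop_version(df):
    # Version 2 from the question: compare every pair of columns.
    dupes = []
    for i, col_1 in enumerate(df.columns):
        for col_2 in df.columns[i + 1:]:
            if df[col_1].equals(df[col_2]):
                dupes.append(col_2)
    return df.drop(columns=list(set(dupes)))

def numpy_version(df):
    # The approach from the answer.
    _, idx = np.unique(df.to_numpy(), axis=1, return_index=True)
    return df.iloc[:, np.sort(idx)]

def time_once(func, df):
    start = time.perf_counter()
    func(df)
    return time.perf_counter() - start

results = []
for n_rows in (1_000, 2_000, 4_000):  # assumed sizes; scale up for a real test
    df = make_data(n_rows, n_cols=40, n_dupes=8)
    row = {"n_rows": n_rows}
    for func in (transpose_version, loop_version, numpy_version):
        row[func.__name__] = time_once(func, df)
    results.append(row)

timings = pd.DataFrame(results)
print(timings)
```

A single timed run per size is noisy, so repeating each measurement (or using %%timeit as suggested above) gives more trustworthy numbers. Varying the number of columns as well as the number of rows matters here, because the loop compares every pair of columns. If matplotlib is installed, timings.plot(x="n_rows") is one way to draw the relationship.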