Python pd.get_dummies() is slow on a column with many levels

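I'm not sure whether this is already the fastest approach or whether I'm doing it inefficiently.

I want to one-hot encode a specific categorical column that has 27k+ possible levels. The column has different values in two different datasets, so I first combined the levels before using get_dummies().

However, it has been running for more than 2 hours and is still stuck on the one-hot encoding step.

Am I doing something wrong here? Or is this just the nature of running it on a large dataset?

Df has 6.8m rows and 27 columns, and Df2 has 19990 rows and 27 columns, before one-hot encoding the column I want.

Any advice is much appreciated, thanks! :)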

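I reviewed this briefly, and I think get_dummies() may not be taking full advantage of the sparsity of your use case. The following approach may be faster, but I have not tried scaling it up to the full size of your data: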

import numpy as np
import pandas as pd
import scipy.sparse as ssp

np.random.seed(1)
N = 10000

dfa = pd.DataFrame.from_dict({
    'col1': np.random.randint(0, 27000, N),
    'col2b': np.random.choice([1, 2, 3], N),
    'target': np.random.choice([1, 2, 3], N),
})

# construct an array of the unique values of the column to be encoded
vals = np.array(dfa.col1.unique())
# extract an array of values to be encoded from the dataframe
col1 = dfa.col1.values
# construct a sparse matrix of the appropriate size and an appropriate,
# memory-efficient dtype
spmtx = ssp.dok_matrix((N, len(vals)), dtype=np.uint8)
# do the encoding. NB: This is only vectorized in one of the two dimensions.
# Finding a way to vectorize the second dimension may yield a large speed up
for idx, val in enumerate(vals):
    spmtx[np.argwhere(col1 == val), idx] = 1

# Construct a SparseDataFrame from the sparse matrix and apply the index
# from the original dataframe and column names.
# NB: pd.SparseDataFrame only exists in pandas < 1.0.
dfnew = pd.SparseDataFrame(spmtx, index=dfa.index,
                           columns=['col1_' + str(el) for el in vals])
dfnew.fillna(0, inplace=True)
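
For readers on pandas 1.0 or later, where pd.SparseDataFrame is no longer available, the wrapping step above can be sketched with pd.DataFrame.sparse.from_spmatrix instead. This is only an illustration on a tiny made-up matrix (the row/column positions and the level names 10, 20, 30 are placeholders, not data from the question), and I have not timed it at the scale discussed here.

```python
import numpy as np
import pandas as pd
import scipy.sparse as ssp

# Toy one-hot matrix: 4 rows, 3 levels (illustrative data only).
rows = np.arange(4)
cols = np.array([0, 2, 1, 2])
spmtx = ssp.csr_matrix(
    (np.ones(4, dtype=np.uint8), (rows, cols)), shape=(4, 3)
)

# Wrap the SciPy matrix without densifying it. Each column becomes a
# pandas SparseArray with fill value 0, so no fillna step is needed.
dfnew = pd.DataFrame.sparse.from_spmatrix(
    spmtx,
    index=pd.RangeIndex(4),
    columns=['col1_' + str(el) for el in (10, 20, 30)],
)
```

Because the fill value is already 0, the fillna(0) call used with SparseDataFrame has no counterpart here.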

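Update

Borrowing insights from the other answers, I was able to vectorize the solution in both dimensions. In my limited testing, I noticed that constructing the SparseDataFrame seems to increase the execution time several-fold. So if you don't need to return a DataFrame-like object, you can save a lot of time. This solution also handles the case where you need to encode 2+ DataFrames into 2-D arrays with equal numbers of columns.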

import numpy as np
import pandas as pd
import scipy.sparse as ssp

np.random.seed(1)
N1 = 10000
N2 = 100000

dfa = pd.DataFrame.from_dict({
    'col1': np.random.randint(0, 27000, N1),
    'col2a': np.random.choice([1, 2, 3], N1),
    'target': np.random.choice([1, 2, 3], N1),
})

dfb = pd.DataFrame.from_dict({
    'col1': np.random.randint(0, 27000, N2),
    'col2b': np.random.choice(['foo', 'bar', 'baz'], N2),
    'target': np.random.choice([1, 2, 3], N2),
})

# construct an array of the unique values of the column to be encoded
# taking the union of the values from both dataframes.
valsa = set(dfa.col1.unique())
valsb = set(dfb.col1.unique())
vals = np.array(list(valsa.union(valsb)), dtype=np.uint16)


def sparse_ohe(df, col, vals):
    """One-hot encode df[col] against vals; return a SparseDataFrame."""
    colarray = df[col].values
    # construct a sparse matrix of the appropriate size and an appropriate,
    # memory-efficient dtype
    spmtx = ssp.dok_matrix((df.shape[0], vals.shape[0]), dtype=np.uint8)
    # do the encoding. NB: the broadcast comparison builds a dense boolean
    # array of shape (n_rows, n_levels) before the sparse assignment.
    spmtx[np.where(colarray.reshape(-1, 1) == vals.reshape(1, -1))] = 1

    # Construct a SparseDataFrame from the sparse matrix
    # (pd.SparseDataFrame only exists in pandas < 1.0)
    dfnew = pd.SparseDataFrame(spmtx, dtype=np.uint8, index=df.index,
                               columns=[col + '_' + str(el) for el in vals])
    dfnew.fillna(0, inplace=True)
    return dfnew

dfanew = sparse_ohe(dfa, 'col1', vals)
dfbnew = sparse_ohe(dfb, 'col1', vals)
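
One caveat with the broadcast comparison in sparse_ohe: it materializes a dense boolean array with one element per (row, level) pair, which for 100000 rows and up to 27000 levels is up to about 2.7 billion elements. If only the encoded matrix is needed, a possible alternative is to map each value to an integer code and build the sparse matrix directly from the (row, code) pairs. The sketch below assumes that approach; the helper name sparse_ohe_codes and the toy arrays are hypothetical, and it returns a SciPy CSR matrix rather than a DataFrame.

```python
import numpy as np
import pandas as pd
import scipy.sparse as ssp


def sparse_ohe_codes(values, levels):
    """One-hot encode `values` against `levels` as a SciPy CSR matrix."""
    # Map every value to its integer position in `levels` (-1 if absent).
    codes = pd.Categorical(values, categories=levels).codes
    known = codes >= 0
    # Row i gets a single 1 in the column given by its code.
    rows = np.flatnonzero(known)
    cols = codes[known].astype(np.int64)
    data = np.ones(rows.shape[0], dtype=np.uint8)
    return ssp.csr_matrix((data, (rows, cols)),
                          shape=(len(values), len(levels)))


# Toy stand-ins for the column in the two dataframes.
a = np.array([3, 1, 3, 7])
b = np.array([7, 9])
levels = np.union1d(a, b)  # sorted union of the levels in both arrays

enc_a = sparse_ohe_codes(a, levels)
enc_b = sparse_ohe_codes(b, levels)
```

Both results share the same column layout because they are encoded against the same levels array, and memory use grows with the number of rows rather than rows times levels.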

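A bare "except: pass" is always wrong. I think you want "if column_name in df:" instead. As for the rest of your question, why don't you tell us which line is taking a long time?

@JohnZwinck Thanks for your input :) In this case I don't think it matters, but correct me if I'm wrong.

@JohnZwinck, as I mentioned, get_dummies() is what takes a long time.

IMO CountVectorizer is the best choice for this task. If you can provide a small reproducible dataset and the desired output, I could write a small demo…

@MaxU I'd be curious to see how to use CountVectorizer on numeric data.

Hey, thanks for the answer! :) How would this handle the issue of accounting for the categories in the second dataframe?

Hi again! :) If I understand correctly, this only returns the current column as a sparse dataframe rather than merging it into the original dataframe, right? Also, I get "ValueError: unable to coerce current fill_value nan to uint8 dtype" when trying to output.

Hmm, I can't reproduce the ValueError. I'm using pandas 0.20.1, which was only released recently. If you need to reassemble a full dataframe containing all of the original columns (excluding the one-hot encoded column), you can add the following statement at the end: dfa = pd.concat([dfanew, dfa.drop('col1', axis=1)], axis=1). If the ValueError persists, I suspect that removing the dtype=np.uint8 argument from the SparseDataFrame call will fix it. That argument is not strictly necessary. np.nan is a float, which is why it cannot be coerced to uint8.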
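
To make the two points raised in the comments concrete (keeping the columns consistent across both dataframes, and merging the encoded columns back into the original frame), here is a minimal sketch using a shared categorical dtype with pd.get_dummies(sparse=True). It assumes a reasonably recent pandas; the toy frames and the helper name encode are made up for illustration, and I have not benchmarked it at the size in the question.

```python
import pandas as pd

# Toy frames standing in for the two datasets.
dfa = pd.DataFrame({'col1': [3, 1, 3], 'target': [1, 2, 3]})
dfb = pd.DataFrame({'col1': [7, 3], 'target': [2, 1]})

# One categorical dtype holding the union of levels from both frames.
levels = sorted(set(dfa['col1']) | set(dfb['col1']))
shared = pd.CategoricalDtype(categories=levels)


def encode(df, col):
    """One-hot encode `col` with the shared levels and re-attach the rest."""
    # get_dummies emits one column per category, including levels that
    # never occur in this particular frame.
    dummies = pd.get_dummies(df[col].astype(shared), prefix=col, sparse=True)
    return pd.concat([dummies, df.drop(col, axis=1)], axis=1)


dfa_enc = encode(dfa, 'col1')
dfb_enc = encode(dfb, 'col1')
```

Since both frames are cast to the same dtype before encoding, they end up with identical dummy columns even though each one only contains some of the levels.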
