Python: How do I align pandas dummies across training and test data?

python pandas

It helps to realize that I can split the data into training and validation sets. This is the code I use to load my training data:

def load_data(datafile):
    training_data = pd.read_csv(datafile, header=0, low_memory=False)
    training_y = training_data[['job_performance']]
    training_x = training_data.drop(['job_performance'], axis=1)

    training_x.replace([np.inf, -np.inf], np.nan, inplace=True)
    training_x.fillna(training_x.mean(), inplace=True)
    training_x.fillna(0, inplace=True)
    categorical_data = training_x.select_dtypes(
        include=['category', object]).columns

    training_x = pd.get_dummies(training_x, columns=categorical_data)
    return training_x, training_y

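where datafile is my training file. I have another file, test.csv, which has the same columns as the training file, except that it may be missing some categories. How can I run get_dummies on the test file and make sure the categories are encoded the same way?

Additionally, my test data is missing the job_performance column. How do I handle that in the function?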
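When dealing with multiple columns, it is best to use sklearn.preprocessing.OneHotEncoder. It keeps track of your categories and handles unknown categories gracefully.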
sys.version
# '3.6.0 (v3.6.0:41df79263a11, Dec 22 2016, 17:23:13) \n[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]'
sklearn.__version__
# '0.20.0'
np.__version__
# '1.15.0'
pd.__version__
# '0.24.2'
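As a minimal sketch of the fitting step that the rest of this answer builds on, the encoder can be fit on the categorical columns of a small training frame. The values in df are made up for illustration; only the names ohe and categorical_columns are taken from the code further down.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training frame; the values are illustrative only.
df = pd.DataFrame({
    'data': [1, 2, 3],
    'cat1': ['a', 'b', 'c'],
    'cat2': ['dog', 'cat', 'bird'],
})
categorical_columns = ['cat1', 'cat2']

# Fit on the categorical columns only; the result is a sparse matrix by default.
ohe = OneHotEncoder()
encoded = ohe.fit_transform(df[categorical_columns])
```

After fitting, the encoder stores one sorted array of categories per input column, which is what makes the encoding reproducible later.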


You can inspect the categories and their order:
ohe.categories_
# [array(['a', 'b', 'c'], dtype=object),
#  array(['bird', 'cat', 'dog'], dtype=object)]

Now, to reproduce the same encoding on new data, we only need the categories from before. There is no need to pickle or unpickle any model here.
df2 = pd.DataFrame({
    'data': [1, 2, 1],
    'cat1': ['b', 'a', 'b'],
    'cat2': ['cat', 'dog', 'cat']
})

ohe2 = OneHotEncoder(categories=ohe.categories_)
ohe2.fit_transform(df2[categorical_columns])

dummies = pd.DataFrame(ohe2.fit_transform(df2[categorical_columns]).toarray(), 
                       index=df2.index, 
                       dtype=int)
pd.concat([df2.drop(categorical_columns, axis=1), dummies], axis=1)

   data  0  1  2  3  4  5
0     1  0  1  0  0  1  0
1     2  1  0  0  0  0  1
2     1  0  1  0  0  1  0
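The "handles unknown categories gracefully" part can be illustrated with handle_unknown='ignore'. This is a small sketch with made-up frames (train, test, and the value 'fish' are assumptions for the example): a category that was not seen during fitting is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Illustrative data: 'fish' appears in test but was never seen in training.
train = pd.DataFrame({'cat2': ['bird', 'cat', 'dog']})
test = pd.DataFrame({'cat2': ['cat', 'fish']})

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train[['cat2']])

# Columns follow the fitted order (bird, cat, dog); the unseen value
# produces an all-zero row rather than an exception.
encoded_test = enc.transform(test[['cat2']]).toarray()
```

With the default handle_unknown='error', the transform call on the test frame would raise instead.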


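So what does this mean for you? You need to change your function so that it handles both training and test data. Add an extra parameter, categories, to the function: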

def load_data(datafile, categories=None):
    data = pd.read_csv(datafile, header=0, low_memory=False)
    if 'job_performance' in data.keys():
        data_y = data[['job_performance']]
        data_x = data.drop(['job_performance'], axis=1)
    else:
        data_x = data
        data_y = None

    data_x.replace([np.inf, -np.inf], np.nan, inplace=True)
    data_x.fillna(data_x.mean(), inplace=True)
    data_x.fillna(0, inplace=True)

    ohe = OneHotEncoder(handle_unknown='ignore', 
                        categories=categories if categories else 'auto')

    categorical_data = data_x.select_dtypes(object)
    dummies = pd.DataFrame(
                ohe.fit_transform(categorical_data.astype(str)).toarray(), 
                index=data_x.index,
                dtype=int)

    data_x = pd.concat([
        data_x.drop(categorical_data.columns, axis=1), dummies], axis=1)

    return (data_x, data_y) + ((ohe.categories_, ) if not categories else ())
With that change, the code should work fine.
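If you want to use pandas get_dummies, you need to manually add columns for values that appear in train but not in test, and drop columns that appear in test but not in train.
You can do this using the dummy column names (by default "origcolumn_value"), with separate functions for train and test.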
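Something like the following (untested):

import numpy as np
import pandas as pd

def load_and_clean(datafile_path, labeled=False):
    data = pd.read_csv(datafile_path, header=0, low_memory=False)

    # Set the label aside so the cleaning steps below never touch it.
    if labeled:
        job_performance = data['job_performance']
        data = data.drop(['job_performance'], axis=1)

    data.replace([np.inf, -np.inf], np.nan, inplace=True)
    # numeric_only=True keeps mean() from failing on text columns in recent pandas.
    data.fillna(data.mean(numeric_only=True), inplace=True)
    data.fillna(0, inplace=True)

    if labeled:
        data['job_performance'] = job_performance

    return data

def dummies_train(training_data):
    training_y = training_data[['job_performance']]
    training_x = training_data.drop(['job_performance'], axis=1)
    categorical_data = training_x.select_dtypes(
        include=['category', object]).columns
    training_x = pd.get_dummies(training_x, columns=categorical_data)
    # Return the dummy column names so the test set can be aligned to them.
    return training_x, training_y, training_x.columns

def dummies_test(test_data, model_columns):
    categorical_data = test_data.select_dtypes(
        include=['category', object]).columns
    test_data = pd.get_dummies(test_data, columns=categorical_data)
    # Add a zero column for every category seen in training but not in test.
    for c in model_columns:
        if c not in test_data.columns:
            test_data[c] = 0
    # Selecting model_columns drops test-only categories and fixes the column order.
    return test_data[model_columns]

# '<train_data_path>' and '<test_data_path>' are placeholders for your CSV paths.
training_x, training_y, model_columns = dummies_train(
    load_and_clean('<train_data_path>', labeled=True))
test_x = dummies_test(load_and_clean('<test_data_path>'), model_columns)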
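The same get_dummies alignment can probably be written more compactly with DataFrame.reindex, which adds the missing dummy columns, drops test-only ones, and enforces the training column order in one call. This is only a sketch on made-up frames (train, test, and their values are assumptions), not part of the answer's own code.

```python
import pandas as pd

# Illustrative frames: 'fish' exists only in test; 'bird' and 'dog' only in train.
train = pd.DataFrame({'data': [1, 2, 3], 'cat2': ['bird', 'cat', 'dog']})
test = pd.DataFrame({'data': [4, 5], 'cat2': ['cat', 'fish']})

train_x = pd.get_dummies(train, columns=['cat2'])
test_x = pd.get_dummies(test, columns=['cat2'])

# Align test to the training columns: missing dummies are filled with 0,
# dummies unseen in training (cat2_fish) are dropped, and order is matched.
test_x = test_x.reindex(columns=train_x.columns, fill_value=0)
```

The explicit loop above and the reindex call should give the same column layout; reindex just removes the need to manage the loop by hand.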
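Comments:

Why use job_performance in training if it is not in test?

I don't want to use it in training. But I need to get the dummies aligned.

Your question is about aligning dummies, which is fine. job_performance seems like an unrelated issue? You can handle that with an if statement, right?

I've rewritten your question to emphasize the more important issue of encoding consistency.

ValueError: could not convert string to float: 'bird'

@pyd: Works for me, sorry. I've updated the answer with my module versions. Let me know what's different.

@pyd You may find this useful. You may be running an older version of sklearn, where OHE only accepted integer labels. I think that changed in later versions. Please upgrade, it's easier that way.

I'm also having a problem with OHE: TypeError: argument must be a string or number. For some reason I can't upgrade sklearn, because it reports its version as 0.0. I'm using pipenv, if that matters.

@Shamoon Take a look at the link on upgrading modules inside pipenv. You may have run into the same problem as pyd!

This still doesn't guarantee alignment, because it doesn't account for categories in test that were not seen in training. You also need to make sure the test columns are in the same order.

I think "test_data[model_columns]" in the dummies_test function should take care of that.
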
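With the load_data(datafile, categories=None) version from the first answer, the function can be called like this:

# Load training data; the fitted categories are returned as a third value.
X_train, y_train, categories = load_data('train.csv')
...
# Load test data, reusing the training categories (y_test is None if the label column is absent).
X_test, y_test = load_data('test.csv', categories=categories)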
