XGBoost Google AI model expects float values instead of using categorical values and converting them


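I am trying to run a simple XGBoost prediction on Google Cloud using this simple example.

The model builds fine, but when I try to run a prediction with the sample input JSON, it fails with the error "Could not initialize DMatrix from inputs: could not convert string to float:", as shown in the screen capture below. I understand this happens because the test input contains strings. I was hoping the Google machine learning model would have the information needed to convert the categorical values to floats. I can't expect my users to submit online prediction requests with float values.

According to the tutorial, it should work without converting the categorical values to floats. Please advise; I have attached a GIF with more details. Thanks.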


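You can use pandas to convert the categorical strings into codes for the model input. For the prediction input, you can define a dictionary for each categorical column that maps its category values to the corresponding codes. For example, for workclass: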

df['workclass_cat'] = df['workclass'].astype('category')
df['workclass_cat'] = df['workclass_cat'].cat.codes
workclass_dict = dict(zip(list(df['workclass'].values), list(df['workclass_cat'].values)))
If the prediction input is "somestring", you can look up its code as follows:

category_input = workclass_dict['somestring']
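
The same lookup can be extended to every categorical column. The following is a minimal sketch in plain Python, not part of the original answer: build_code_maps and encode_row are hypothetical helpers, the sample rows are made up, and it assumes codes are assigned in sorted order of the category strings, which is how pandas orders string categories by default. Unseen values fall back to -1, an arbitrary choice for illustration.

```python
def build_code_maps(rows, categorical_columns):
    """Map each category string to an integer code, per column."""
    code_maps = {}
    for col in categorical_columns:
        # Sorted order mirrors the default pandas category ordering for strings.
        categories = sorted({row[col] for row in rows})
        code_maps[col] = {cat: code for code, cat in enumerate(categories)}
    return code_maps


def encode_row(row, code_maps, unknown=-1):
    """Return the row as a list of floats, replacing categorical strings by codes."""
    encoded = []
    for col, value in row.items():
        if col in code_maps:
            # Hypothetical fallback: unseen categories get the `unknown` code.
            value = code_maps[col].get(value, unknown)
        encoded.append(float(value))
    return encoded


# Made-up training rows, for illustration only.
training_rows = [
    {"age": 39, "workclass": "State-gov"},
    {"age": 50, "workclass": "Private"},
]
code_maps = build_code_maps(training_rows, ["workclass"])
print(encode_row({"age": 30, "workclass": "State-gov"}, code_maps))
```

Building the maps once from the training data and reusing them for each request keeps the codes sent to the model consistent with the ones it was trained on.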

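XGBoost models take floats as input. In your training script, you converted the categorical variables to numbers. The same conversion has to be applied when you submit a prediction.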

Here is a fix. Put the input shown in the Google docs into a file named input.json, then run this. The output is input_numerical.json; if you use it in place of input.json, the prediction will succeed.

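This code simply preprocesses the categorical columns into numerical form, using the same procedure that was applied to the training and test data.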

import json

import pandas as pd
from sklearn.preprocessing import LabelEncoder

COLUMNS = (
    "age",
    "workclass",
    "fnlwgt",
    "education",
    "education-num",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "capital-gain",
    "capital-loss",
    "hours-per-week",
    "native-country",
    "income-level",
)

# categorical columns contain data that need to be turned into numerical
# values before being used by XGBoost
CATEGORICAL_COLUMNS = (
    "workclass",
    "education",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "native-country",
)

with open("./input.json", "r") as json_lines:
    rows = [json.loads(line) for line in json_lines]

prediction_features = pd.DataFrame(rows, columns=(COLUMNS[:-1]))

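# One LabelEncoder per categorical column. Note: fit_transform() derives the
# codes from the values present in input.json only, so they equal the training
# codes only if each column contains the same set of distinct values there.
encoders = {col: LabelEncoder() for col in CATEGORICAL_COLUMNS}
for col in CATEGORICAL_COLUMNS:
    prediction_features[col] = encoders[col].fit_transform(prediction_features[col])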

with open("input_numerical.json", "w") as input_numerical:
    for index, row in prediction_features.iterrows():
        input_numerical.write(row.to_json(orient="values") + "\n")

The Google documentation is missing this important step.
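
One caveat with the script above: it calls fit_transform on the prediction input, so the integer codes are derived from whatever values happen to appear in input.json. A more robust variant, sketched here under the assumption that the training values are still available, fits the encoders on the training data and only calls transform at prediction time. The train_columns data and the encode_instance helper are hypothetical and not part of the Google sample.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the categorical columns of the training data.
train_columns = {
    "workclass": ["Private", "State-gov", "Private", "Self-emp-inc"],
    "sex": ["Male", "Female", "Female", "Male"],
}

# Fit one encoder per column on the training values only.
encoders = {
    col: LabelEncoder().fit(values) for col, values in train_columns.items()
}


def encode_instance(instance, encoders):
    """Return a copy of `instance` with categorical strings replaced by training codes."""
    encoded = dict(instance)
    for col, encoder in encoders.items():
        # transform() raises ValueError for a label not seen during fit.
        encoded[col] = int(encoder.transform([instance[col]])[0])
    return encoded


request = {"age": 25, "workclass": "State-gov", "sex": "Female"}
encoded_request = encode_instance(request, encoders)
print(encoded_request)
```

In a real deployment the fitted encoders would need to be saved alongside the model so that the client or a preprocessing layer can apply them before calling the prediction service.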
