How do I make a single prediction with a model using sklearn in Python?

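I trained a machine-learning model on a dataset of companies using sklearn. The dataset has the following attributes:
name, domain, year founded, industry, size range, locality, country, linkedin_url, current employee estimate, total employee estimate

I want to train a machine-learning model that tries to predict the size_range value, which falls into one of eight categories by company size, from the name and year_founded attributes. I did this with the following training code: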

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import logistic
from tools import pickleFile
from tools import unpickleFile
from tools import cleanDataset
from tools import getPrettyTimestamp
import sklearn
import pandas as pd
import numpy as np
import datetime
import sys


def train_model(clf, X_train, y_train, epochs=10):
    """
    Trains a specific model and returns a list of results

    :param clf: sklearn model
    :param X_train: encoded training data (attributes)
    :param y_train: training data (attribute to predict)
    :param epochs: number of iterations (default=10)
    :return: result (accuracy) for this training data
    """
    scores = []
    print("Starting training...")
    for i in range(1, epochs + 1):
        print("Epoch:" + str(i) + "/" + str(epochs) + " -- " + str(datetime.datetime.now()))
        clf.fit(X_train, y_train)
        score = clf.score(X_train, y_train)
        scores.append(score)
    print("Done training.  The score(s) is/are: " + str(scores))
    return scores

def main():

    # Parse the arguments.
    userRequestedTrain, filename = parseArgs()

    # Some custom Pandas settings - TODO remove this
    pd.set_option('display.max_columns', 30)
    pd.set_option('display.max_rows', 1000)

    dataset = pd.read_csv("companies_sorted.csv", nrows=50000)


    origLen = len(dataset)
    print(origLen)

    dataset = cleanDataset(dataset)

    cleanLen = len(dataset)
    print(cleanLen)

    print("\n======= Some Dataset Info =======\n")
    print("Dataset size (original):\t" + str(origLen))
    print("Dataset size (cleaned):\t" + str(len(dataset)))
    print("\nValues of size_range:\n")
    print(dataset['size_range'].value_counts())
    print()

    # size_range is the attribute to be predicted, so we pop it from the dataset
    sizeRange = dataset.pop("size_range").values

    # We split our dataset and attribute-to-be-predicted into training and testing subsets.
    xTrain, xTest, yTrain, yTest = train_test_split(dataset, sizeRange, test_size=0.25, random_state=1)


    print(xTrain.transpose())
    le = LabelEncoder()
    ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')

    # Our feature set, i.e. the inputs to our machine-learning model.
    featureSet = ['name', 'year_founded']

    # Making a copy of test and train sets with only the columns we want.
    xTrain_sf = xTrain[featureSet].copy()
    xTest_sf = xTest[featureSet].copy()

    # Apply one-hot encoding to columns
    ohe.fit(xTrain_sf)

    print(xTrain_sf)
    print(xTest_sf)

    featureNames = ohe.get_feature_names()

    # Encoding test and train sets
    xTrain_sf_encoded = ohe.transform(xTrain_sf)
    xTest_sf_encoded = ohe.transform(xTest_sf)

    # ------ Using Logistic Regression classifier - TRAINING PHASE ------

    if userRequestedTrain:
        # We define the model we're going to use.
        lrModel = LogisticRegression(solver='lbfgs', multi_class="multinomial", max_iter=1000, random_state=1)

        # Now, let's train it.
        lrScores = train_model(lrModel, xTrain_sf_encoded, yTrain, 1)

        # Save the model as a file.
        filename = "models/Model_" + getPrettyTimestamp()
        print("Training done! Pickling model to " + str(filename) + "...")
        pickleFile(lrModel, filename)

    # Reload the model for testing.  If we didn't train the model ourselves, then it was specified as an argument.
    lrModel = unpickleFile(filename)

    PRED = lrModel.predict(xTrain_sf_encoded[0:10])

    print("Unpickled successfully from file " + str(filename))

    # ------- TESTING PHASE -------

    testLrScores = train_model(lrModel, xTest_sf_encoded, yTest, 1)

    if userRequestedTrain:
        trainScore = lrScores[0]
    else:
        trainScore = 0.9201578143173162  # Modal training score - substitute if we didn't train model ourselves

    testScore = testLrScores[0]

    scores = sorted([(trainScore, 'train'), (testScore, 'test')], key=lambda x: x[0], reverse=True)
    better_score = scores[0]  # largest score
    print(scores)

    # Which score was better?
    print("Better score: %s" % "{}".format(better_score))

    print("Pickling....")

    pickleFile(lrModel, "models/TESTING_" + getPrettyTimestamp())

This code runs successfully: the training and testing phases complete, with roughly 60% accuracy in the testing phase:

Starting training...
Epoch:1/1 -- 2019-12-18 20:03:13.462479
Done training.  The score(s) is/are: [0.8854667949951877]
Training done! Pickling model to models/Model_2019-12-18_2003...
Unpickled successfully from file models/Model_2019-12-18_2003
= = = = = = = = = = = = = = = = = = = 

First 10 predictions:

['5001 - 10000' '10001+' '1001 - 5000' '5001 - 10000' '1001 - 5000'
 '1001 - 5000' '5001 - 10000' '1001 - 5000' '1001 - 5000' '1001 - 5000']
['5001 - 10000' '10001+' '1001 - 5000' '5001 - 10000' '1001 - 5000'
 '1001 - 5000' '5001 - 10000' '1001 - 5000' '1001 - 5000' '1001 - 5000']
 = = = = = = = = = = = = = 
Starting training...
Epoch:1/1 -- 2019-12-18 20:03:20.775392
Done training.  The score(s) is/are: [0.5906466512702079]
[(0.8854667949951877, 'train'), (0.5906466512702079, 'test')]
Better score: (0.8854667949951877, 'train')
Pickling....

Process finished with exit code 0
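
However, suppose I want to use this model to make a single prediction, i.e. by passing in a company name and the year the company was founded. I did the following: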

import pickle

lrModel = pickle.load(open(filename, 'rb'))
predictedSet = lrModel.predict([["SomeRandomCompany", 2019]])

But when I do this, I get the following ValueError:

  X = check_array(X, accept_sparse='csr')
Traceback (most recent call last):
  File "/home/ivor/Documents/companySizeEstimator/companySizeEstimator.py", line 85, in <module>
    main()
  File "/home/ivor/Documents/companySizeEstimator/companySizeEstimator.py", line 58, in main
    predictions(model, reducedSetEncoded, reducedSet)
  File "/home/ivor/Documents/companySizeEstimator/companySizeEstimator.py", line 80, in predictions
    predictedSet = lrModel.predict([["SomeCompany", 2019]])
  File "/home/ivor/Documents/companySizeEstimator/venv/lib/python3.8/site-packages/sklearn/linear_model/_base.py", line 293, in predict
    scores = self.decision_function(X)
  File "/home/ivor/Documents/companySizeEstimator/venv/lib/python3.8/site-packages/sklearn/linear_model/_base.py", line 272, in decision_function
    raise ValueError("X has %d features per sample; expecting %d"
ValueError: X has 2 features per sample; expecting 54897

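It seems to want a dataset of exactly the same shape as the one used for training, i.e. a dataset with 11000 rows. It gives good predictions in the testing phase, so the model is clearly able to make good predictions. How can I get it to predict from just a single value, as shown above?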

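When you train a model on a dataset with N features, the model expects the same number of features at prediction time. The model was fitted on those N features and predicts from them, so the input must have the same dimensionality. That is why you get the error "X has 2 features per sample; expecting 54897".

One thing you can do is create a matrix or DataFrame of zeros with the required dimension (N) and fill in the values you want to predict from at the right positions.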
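
The feature count in the error most likely comes from the one-hot encoding step: the encoder creates one column per distinct name and year seen while fitting, so the model expects that many columns. Rather than filling a zero matrix by hand, the fitted encoder can build the correctly shaped row. The following is a minimal sketch, assuming the fitted OneHotEncoder is kept (for example pickled alongside the model) and reused on the single sample. The toy data and variable names are illustrative and not taken from the original post.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the company data (hypothetical values).
train = pd.DataFrame({
    "name": ["acme", "globex", "initech", "umbrella"],
    "year_founded": [1990, 2005, 1990, 2012],
})
sizes = ["1 - 10", "11 - 50", "1 - 10", "51 - 200"]

# One column per distinct value: 4 names + 3 years = 7 encoded features.
ohe = OneHotEncoder(handle_unknown="ignore")
X_train = ohe.fit_transform(train)
model = LogisticRegression(max_iter=1000).fit(X_train, sizes)

# Encode the single sample with the SAME fitted encoder before predicting.
single = pd.DataFrame([["SomeRandomCompany", 2019]],
                      columns=["name", "year_founded"])
X_single = ohe.transform(single)  # shape (1, 7), same width as X_train
prediction = model.predict(X_single)
print(X_single.shape, prediction)
```

Because handle_unknown="ignore" is set, a name or year that never appeared in training is encoded as all zeros, so the prediction for an unseen company carries little information from those features.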

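I think you should double-check the DataFrame used for training, xTrain_sf_encoded: it should be a 2-column DataFrame, but for some reason it has 54897 columns.

One more thing: why do you call train_model in the testing phase?

That retrains the model on the test set, whereas I believe you meant to test it like this: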

# Print Predictions
yPred = lrModel.predict(xTest_sf_encoded)
print(yPred)
# Print the actual values
print(yTest)
# Compare
print(yPred==yTest)
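
To get a single accuracy figure for the test set without refitting, the already fitted model can be scored directly. This is a minimal sketch on synthetic data; make_classification stands in for the encoded company data, which is an assumption made only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features and labels.
X, y = make_classification(n_samples=200, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Fit once, on the training split only.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
coef_before = model.coef_.copy()

# Evaluate on the held-out split; neither call refits the model.
test_accuracy = model.score(X_test, y_test)
y_pred = model.predict(X_test)

print(test_accuracy, accuracy_score(y_test, y_pred))
```

Both printed values are the same accuracy, and the coefficients are unchanged afterwards, which is not the case when fit is called again on the test data.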

I'm just curious what the prediction would be with only those two features and all the other features set to zero. If that works, it means all the other features are useless and there is no need to train on them.

What I don't understand is that I trained the model with 2 features, not 54897; I only used name and year_founded. Is there really no way to get a prediction from the model by passing a company name (string) and a year (int)? Do I have to create a blank DataFrame the same size as the one used for training and replace the first entry with my values?

@ivorysoap Yes, the same number of columns. The number of rows can differ. That is one way to get a prediction from the model. If name and year_founded are in the first and second columns of your training set, then in the test set make sure the first and second columns are name and year_founded and the other columns are zeros.
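
On the question of passing just a name and a year: one option not mentioned in the thread is to bundle the encoder and the classifier into a single sklearn Pipeline and pickle that object, so that raw rows can be passed to predict. The sketch below assumes toy data and illustrative variable names; it is not the original poster's code.

```python
import pickle

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the company data (hypothetical values).
train = pd.DataFrame({
    "name": ["acme", "globex", "initech", "umbrella"],
    "year_founded": [1990, 2005, 1990, 2012],
})
sizes = ["1 - 10", "11 - 50", "1 - 10", "51 - 200"]

# The pipeline applies the encoder first, then the classifier.
pipeline = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(train, sizes)

# Pickling the pipeline stores the fitted encoder together with the model.
restored = pickle.loads(pickle.dumps(pipeline))

# A raw (name, year) row can now be passed directly.
single = pd.DataFrame([["acme", 1990]], columns=["name", "year_founded"])
prediction = restored.predict(single)
print(prediction)
```

With this layout the caller never needs to know how many encoded columns exist; the restored pipeline handles the transformation before predicting.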