Python: how to see "how far off" my machine learning predictions are?

python machine-learning scikit-learn

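So I basically have a machine learning algorithm that gets an accuracy of 20%.

That is not very high, but I would like to know how close my algorithm is on average.

If it predicts a value of 69 where the actual value in the test data is 68, and the errors are similarly small throughout, then I can use it for my purpose, which is filling in missing data in a dataset.

Is there a simple way to do this?

My code snippet: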

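import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def predict_score_industry(df):

    # Feature columns used to predict score_industry
    coi = ['score_teaching',
           'score_research',
           'score_citation',
           'score_int_outlook',
           ]

    # Rows where the target is known are used for training and testing
    not_nans = df['score_industry'].notnull()
    df_notnans = df[not_nans]

    x = np.array(df_notnans[coi])
    y = np.array(df_notnans['score_industry'])

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

    clf = LinearRegression()
    clf.fit(x_train, y_train)

    # Score the fitted model on the held-out test split
    print("score_industry: ", clf.score(x_test, y_test))

    # Predict the missing values and write them back into df directly
    # (avoids a chained fillna(..., inplace=True), which may not update df)
    if (~not_nans).any():
        x_missing = np.array(df.loc[~not_nans, coi])
        df.loc[~not_nans, 'score_industry'] = clf.predict(x_missing)

    return df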

It basically takes all the unfilled values and predicts them. The DataFrame it reads looks like this:

> print(df.info())
Data columns (total 15 columns):
university_name       2884 non-null object
country               2884 non-null object
ranking               2884 non-null int64
no_student            2884 non-null int64
no_student_p_staff    2884 non-null float64
pct_intl_student      2884 non-null float64
year                  2884 non-null int64
score_overall         2884 non-null float64
score_teaching        2884 non-null float64
score_research        2884 non-null float64
score_citation        2884 non-null float64
score_industry        2884 non-null float64
score_int_outlook     2884 non-null float64
male                  2884 non-null float64
female                2884 non-null float64

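A partial answer, which will hopefully help clarify a few things.

"So I basically have a machine learning algorithm that gets an accuracy of 20%"

Since you are in a regression setting, your score cannot, by definition, be an accuracy; accuracy is meaningful only in classification problems.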
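As a small illustration of that point, the sketch below passes continuous targets to scikit-learn's `accuracy_score`. The numbers are made up for the example; the expectation, to the best of my knowledge of scikit-learn's behavior, is that the function refuses continuous targets with a `ValueError` rather than returning a number.

```python
from sklearn.metrics import accuracy_score

# Hypothetical continuous targets and predictions (not from the question's data)
y_true = [68.3, 70.1, 55.7]
y_pred = [69.0, 66.4, 58.2]

try:
    accuracy_score(y_true, y_pred)
    raised = False
except ValueError as exc:
    # Classification metrics do not accept continuous targets
    raised = True
    print("accuracy_score rejected continuous targets:", exc)
```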

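"I just printed the score of the x_test and y_test arrays"

Yes, what you are actually using is the score method of scikit-learn's LinearRegression model, which returns the coefficient of determination (R^2); from the docs:

score(X, y, sample_weight=None)

Returns the coefficient of determination R^2 of the prediction.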

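The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.0.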
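The definition above can be checked by hand. This is a minimal sketch on made-up numbers (the arrays are illustrative, not the question's data): it computes u and v as defined, and compares the result with `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true and predicted values, for illustration only
y_true = np.array([68.0, 70.0, 55.0, 61.0, 80.0])
y_pred = np.array([69.0, 66.0, 58.0, 60.0, 75.0])

u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
r2_manual = 1 - u / v

r2_sklearn = r2_score(y_true, y_pred)

# A constant model that always predicts the mean of y_true
r2_constant = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(r2_manual, r2_sklearn, r2_constant)
```

Note that R^2 is a ratio of sums of squares, so it is not a percentage of correct predictions and says nothing directly about how many units off a typical prediction is.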

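Arguably, R^2 is seldom used in machine learning contexts, where the requirement is predictive power (it is mainly used by statisticians, for whom the requirement is usually the explanatory power of the model).

"In a regression setting, what metrics are of value?"

Mean squared error (MSE) and its variants are the most commonly used metrics in regression problems. Check the metrics available for regression in scikit-learn.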
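To answer "how close on average" in the target's own units, the mean absolute error (MAE) and the root of the MSE (RMSE) are natural candidates. Below is a minimal sketch using `sklearn.metrics`; the two arrays are hypothetical stand-ins for `y_test` and `clf.predict(x_test)` from the question's code.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical stand-ins for y_test and clf.predict(x_test)
y_test = np.array([68.0, 70.0, 55.0, 61.0, 80.0])
y_pred = np.array([69.0, 66.0, 58.0, 60.0, 75.0])

mae = mean_absolute_error(y_test, y_pred)  # average absolute miss, in score units
mse = mean_squared_error(y_test, y_pred)   # average squared miss
rmse = np.sqrt(mse)                        # back in score units, penalizes large misses more

print("MAE: ", mae)
print("MSE: ", mse)
print("RMSE:", rmse)
```

Inside `predict_score_industry`, the same calls could be made with `y_test` and `clf.predict(x_test)` right after fitting. Whether the resulting error is small enough for imputing missing values depends on the scale of `score_industry` and on the tolerance of the downstream use.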

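You can't have "20% accuracy" (or any other number), since you are in a regression setting and accuracy is meaningful only in classification settings...

Oh really? I just printed the score of the x_test and y_test arrays. In a regression setting, what metrics are of value?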