Python naive Bayes heart disease prediction gives inaccurate results
Tags: python, machine-learning, scikit-learn, classification, naivebayes


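I am trying to build a heart disease prediction program using naive Bayes. When I finished the classifier, cross-validation showed a mean accuracy of 80%, but when I try to predict a given sample, the prediction is completely wrong! The dataset is the heart disease dataset from the UCI repository and contains 303 samples. There are two classes, 0: healthy and 1: ill. When I try to predict a sample taken from the dataset itself, the classifier does not return its true value, except for very few samples. Here is the code: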

import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


class Predict:
    def Read_Clean(self,dataset):
        header_row = ['Age', 'Gender', 'Chest_Pain', 'Resting_Blood_Pressure', 'Serum_Cholestrol',
                      'Fasting_Blood_Sugar', 'Resting_ECG', 'Max_Heart_Rate',
                      'Exercise_Induced_Angina', 'OldPeak',
                      'Slope', 'CA', 'Thal', 'Num']
        # '?' marks a missing value in the file; read it as NaN
        df = pd.read_csv(dataset, names=header_row, na_values='?')
        # Fill every missing value with the mean of its column
        df = pd.DataFrame(SimpleImputer(missing_values=np.nan, strategy='mean')
                          .fit_transform(df), columns=header_row)
        df = df.astype(float)
        return df

    def Train_Test_Split_data(self,dataset):
        Y = dataset['Num'].apply(lambda x: 1 if x > 0 else 0)
        X = dataset.drop('Num', axis=1)
        validation_size = 0.20
        seed = 42
        X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=validation_size, random_state=seed)
        return X_train, X_test, Y_train, Y_test

    def Scaler(self, X_train, X_test):
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)
        return X_train, X_test

    def Cross_Validate(self, clf, X_train, Y_train, cv=5):
        scores = cross_val_score(clf, X_train, Y_train, cv=cv, scoring='f1')
        score = scores.mean()
        print("CV scores mean: %.4f " % (score))
        return score, scores

    def Fit_Score(self, clf, X_train, Y_train, X_test, Y_test, label='x'):
        clf.fit(X_train, Y_train)
        fit_score = clf.score(X_train, Y_train)
        pred_score = clf.score(X_test, Y_test)
        print("%s: fit score %.5f, predict score %.5f" % (label, fit_score, pred_score))
        return pred_score

    def ReturnPredictionValue(self, clf, sample):
        y = clf.predict([sample])
        return y[0]

    def PredictionMain(self, sample, dataset_path='dataset/processed.cleveland.data'):
        data = self.Read_Clean(dataset_path)
        X_train, X_test, Y_train, Y_test = self.Train_Test_Split_data(data)
        X_train, X_test = self.Scaler(X_train, X_test)
        self.NB = GaussianNB()
        self.Fit_Score(self.NB, X_train, Y_train, X_test, Y_test, label='NB')
        self.Cross_Validate(self.NB, X_train, Y_train, 10)
        return self.ReturnPredictionValue(self.NB, sample)
When I run:

if __name__ == '__main__':
    sample = [41.0, 0.0, 2.0, 130.0, 204.0, 0.0, 2.0, 172.0, 0.0, 1.4, 1.0, 0.0, 3.0]
    p = Predict()
    print("Prediction value: {}".format(p.PredictionMain(sample)))
The result is:

NB: fit score 0.84711, predict score 0.83607
CV scores mean: 0.8000

Prediction value: 1

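I get 1 instead of 0 (this sample is already one of the samples in the dataset). I did this for several samples from the dataset and got the wrong result most of the time, as if the accuracy were not 80%.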

Any help would be appreciated. Thanks in advance.


Edit: Using a Pipeline solved the problem. The final code is:

import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline

class Predict:
    def __init__(self):
        self.X = []
        self.Y = []

    def Read_Clean(self,dataset):
        header_row = ['Age', 'Gender', 'Chest_Pain', 'Resting_Blood_Pressure', 'Serum_Cholestrol',
                      'Fasting_Blood_Sugar', 'Resting_ECG', 'Max_Heart_Rate',
                      'Exercise_Induced_Angina', 'OldPeak',
                      'Slope', 'CA', 'Thal', 'Num']
        # '?' marks a missing value in the file; read it as NaN
        df = pd.read_csv(dataset, names=header_row, na_values='?')
        # Fill every missing value with the mean of its column
        df = pd.DataFrame(SimpleImputer(missing_values=np.nan, strategy='mean')
                          .fit_transform(df), columns=header_row)
        df = df.astype(float)
        return df

    def Split_Dataset(self, df):
        self.Y = df['Num'].apply(lambda x: 1 if x > 0 else 0)
        self.X = df.drop('Num', axis=1)

    def Create_Pipeline(self):
        estimators = []
        estimators.append(('standardize', StandardScaler()))
        estimators.append(('bayes', GaussianNB()))
        model = Pipeline(estimators)
        return model

    def Cross_Validate(self, clf, cv=5):
        scores = cross_val_score(clf, self.X, self.Y, cv=cv, scoring='f1')
        score = scores.mean()
        print("CV scores mean: %.4f " % (score))

    def Fit_Score(self, clf, label='x'):
        clf.fit(self.X, self.Y)
        fit_score = clf.score(self.X, self.Y)
        print("%s: fit score %.5f" % (label, fit_score))

    def ReturnPredictionValue(self, clf, sample):
        y = clf.predict([sample])
        return y[0]

    def PredictionMain(self, sample, dataset_path='dataset/processed.cleveland.data'):
        print("dataset: " + dataset_path)
        data = self.Read_Clean(dataset_path)
        self.Split_Dataset(data)
        self.model = self.Create_Pipeline()
        self.Fit_Score(self.model, label='NB')
        self.Cross_Validate(self.model, 10)
        return self.ReturnPredictionValue(self.model, sample)
Now, predicting the same sample from the question returns [0], which is the true value. In fact, by running the following method:

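# Requires: from sklearn.model_selection import cross_val_predict
def CheckTrue(self):
    clf = self.Create_Pipeline()
    # Out-of-fold prediction for every sample in the dataset
    out = cross_val_predict(clf, self.X, self.Y)
    # Count the samples whose prediction matches the true label
    c = int((out == self.Y).sum())
    print("Samples with true values: {}".format(c))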

With the pipeline code I get 249 correctly predicted samples, compared with only 150 before.

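You are not applying StandardScaler to the sample. The classifier expects scaled data, because it was trained on the output of StandardScaler.transform, but the sample is not scaled the same way as the training data.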
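A minimal sketch of that mismatch and of the smallest possible fix, assuming synthetic two-feature data in place of the UCI file (the data, the sample values, and the names `raw_pred` and `scaled_pred` are illustrative and not from the question). The fitted scaler is kept and reused on the sample before it reaches the classifier:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart data (an assumption, not the UCI file):
# two features in raw units, class 1 more spread out than class 0.
rng = np.random.RandomState(0)
X0 = rng.normal(loc=[130.0, 1.0], scale=[5.0, 0.2], size=(100, 2))
X1 = rng.normal(loc=[160.0, 3.0], scale=[15.0, 0.6], size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

# Train on standardized features, as in the question.
scaler = StandardScaler()
clf = GaussianNB().fit(scaler.fit_transform(X), y)

sample = [130.0, 1.0]  # a typical class-0 point, still in raw units

# Mistake from the question: the raw sample goes straight to the classifier.
raw_pred = clf.predict([sample])[0]

# Fix: reuse the already fitted scaler (transform, not fit_transform).
scaled_pred = clf.predict(scaler.transform([sample]))[0]

print("raw sample ->", raw_pred, "| scaled sample ->", scaled_pred)
```

On this toy data the raw sample is assigned to class 1, because in standardized space a value such as 130 lies far outside both classes, while the properly scaled sample is assigned to class 0.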


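It is easy to make this kind of mistake when combining multiple steps (scaling, preprocessing, classification) by hand. To avoid such problems, it is a good idea to use a scikit-learn Pipeline.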
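As a rough illustration of that advice, the sketch below chains StandardScaler and GaussianNB in a Pipeline and feeds it raw, unscaled values both for cross-validation and for a single prediction. The synthetic data and the name `pipeline_pred` are assumptions made for the example, not part of the question:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data in raw units (an assumption, not the UCI file).
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=[130.0, 1.0], scale=[5.0, 0.2], size=(100, 2)),
    rng.normal(loc=[160.0, 3.0], scale=[15.0, 0.6], size=(100, 2)),
])
y = np.array([0] * 100 + [1] * 100)

model = Pipeline([
    ("standardize", StandardScaler()),
    ("bayes", GaussianNB()),
])

# cross_val_score clones the pipeline and refits both steps on each
# training fold, so the held-out fold never influences the scaler.
scores = cross_val_score(model, X, y, cv=5, scoring="f1")

# After fitting, predict() scales the raw sample before classifying it.
model.fit(X, y)
pipeline_pred = model.predict([[130.0, 1.0]])[0]

print("CV f1 mean: %.4f, prediction: %d" % (scores.mean(), pipeline_pred))
```

Because the scaler lives inside the model, there is no separate object to remember at prediction time, which is the step that went missing in the hand-written version.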

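"Most of the time the results are wrong": can you quantify that over the whole sample or over several subsets? Testing on a single data point (returning y[0] in ReturnPredictionValue) may not be enough to draw any firm conclusion about whether the classifier works. Side note: your code is nicely wrapped in a class, but the methods behave like functions, i.e. you store hardly any information in attributes of Predict. In the future, making full use of OOP might save you time.

Actually, I wrote a small piece of code that counts the correctly predicted values: 150 out of 303, which is not 80% accuracy. Thanks for the advice, you are right, I had not noticed that.

Thanks for mentioning the pipeline. I had come across it but never thought it would solve my problem, let alone shorten the code. Thanks a lot! But I still don't know what was wrong with my code. I scaled X_train and X_test before feeding them to the classifier: X_train = scaler.fit_transform(X_train), X_test = scaler.transform(X_test)

You scaled X_train and X_test (that is why the cross-validation quality is good), but not sample (==> the quality on individual samples is poor).

I see, thanks! But with the pipeline I did not scale the sample and it works fine. Does it take care of scaling the sample as well?

The pipeline makes sure that all steps are applied: you created a pipeline with StandardScaler and GaussianNB, so both are applied during training and during testing.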
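One way to convince yourself of this is to compare the pipeline's prediction with the two steps applied by hand. The sketch below does that on synthetic data; the data, the labelling rule, and the names `via_pipeline` and `by_hand` are assumptions made for the example:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features in raw units with an arbitrary labelling rule
# (assumptions for illustration only).
rng = np.random.RandomState(1)
X = rng.normal(loc=[130.0, 240.0, 150.0], scale=[15.0, 50.0, 20.0], size=(60, 3))
y = (X[:, 0] + 0.2 * X[:, 1] > 178.0).astype(int)

model = Pipeline([("standardize", StandardScaler()), ("bayes", GaussianNB())])
model.fit(X, y)

# Unseen raw samples, never scaled by the caller.
new_samples = rng.normal(loc=[130.0, 240.0, 150.0],
                         scale=[15.0, 50.0, 20.0], size=(5, 3))

# What the pipeline does in one call ...
via_pipeline = model.predict(new_samples)

# ... and the same two steps spelled out with the fitted components.
scaler = model.named_steps["standardize"]
bayes = model.named_steps["bayes"]
by_hand = bayes.predict(scaler.transform(new_samples))

print(np.array_equal(via_pipeline, by_hand))
```

Both routes give the same labels, since Pipeline.predict calls transform on every intermediate step and then predict on the final estimator.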