Machine learning logistic regression prediction failure

machine-learning, classification, logistic-regression, prediction

I have been working on the Titanic problem. I split the data into x (the passengers) and y (whether they survived). The problem is that I cannot get a meaningful y_pred (i.e. prediction result): every predicted value is 0. If someone could help resolve this, it would be a great help, since this is my first classification problem as a beginner.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


# load the training data and use PassengerId as the only feature
df = pd.read_csv('C:/Users/Umer/train.csv')
x = df['PassengerId'].values.reshape(-1,1)
y = df['Survived']


# hold out 25% of the rows as a test set
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25,
                                                    random_state=0)


# standardize the single feature
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)

# fit a logistic regression classifier
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(x_train, y_train)

#predicting the test set results


y_pred = classifier.predict(x_test)
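
A quick way to see what is driving the all-zero output is to look at the predicted probabilities rather than the hard labels. The following is a minimal sketch, reusing classifier and x_test from the code above (predict_proba and classes_ are standard scikit-learn attributes):

    # inspect the predicted probabilities instead of the hard 0/1 labels
    proba = classifier.predict_proba(x_test)
    print(proba[:5])               # column 0 = P(Survived=0), column 1 = P(Survived=1)
    print(classifier.classes_)     # order of the columns above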

I could not reproduce the same result. In fact, I copy-pasted your code and did not get all zeros as you describe; instead I got:

[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0]
However, I noticed a few points about your approach that you may want to know about:

  • The default separator in pd.read_csv is a comma (,), so if the variables in your dataset file are separated by tabs (as they are in mine), you should specify the separator like this:

    df = pd.read_csv('titanic.csv', sep='\t')
    
  • PassengerId carries no useful information that your model can learn from in order to predict who Survived; it is just a sequential number that increases with each new passenger. In general, in classification you should make use of all the features the model can learn from (unless, of course, some features are redundant and add no information), especially since your dataset is multivariate. (A short correlation sketch after this list illustrates the point.)

  • There is no need to scale PassengerId. Scaling is normally used when features vary greatly in magnitude, units, and range (for example 5 kg vs. 5000 g); as mentioned above, in your case it is just an incrementing integer that carries no real information for the model. (A small scaling sketch also follows this list.)

  • One last thing: for StandardScaler, you should pass the data as type float to avoid a warning like the following:

    DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
    
    So convert it right from the start, like this:

    x = df['PassengerId'].values.astype(float).reshape(-1,1)
    
  • Finally, if you still get the same result, please add a link to your dataset.
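
As referenced in the second point above, one quick way to see that PassengerId is uninformative is to check how each numeric column relates to Survived. A minimal sketch, assuming the standard Kaggle Titanic train.csv column names; PassengerId should show a correlation close to zero while Pclass, Sex, and Fare show a clear relationship:

    import pandas as pd

    df = pd.read_csv('train.csv')
    # encode Sex numerically so it can be included in the correlation
    df['Sex'] = df['Sex'].map({'female': 0, 'male': 1})

    # correlation of each numeric column with the target
    numeric_cols = ['PassengerId', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
    print(df[numeric_cols + ['Survived']].corr()['Survived'].sort_values())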
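
And for the scaling point: a minimal sketch of the case where StandardScaler actually helps, assuming features such as Age and Fare (which live on very different scales) are used instead of PassengerId:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv('train.csv')
    # Age (tens of years) and Fare (up to a few hundred) are on very different scales
    features = df[['Age', 'Fare']].fillna(df[['Age', 'Fare']].median()).values.astype(float)

    scaler = StandardScaler()
    scaled = scaler.fit_transform(features)
    print(scaled.mean(axis=0))  # close to 0 for both columns after scaling
    print(scaled.std(axis=0))   # close to 1 for both columns after scaling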


    Update: After the dataset was provided, it turns out the result you are getting is correct, and it is precisely because of point number 2 above (i.e. PassengerId provides no useful information to the model, so it cannot predict properly!).

    You can test this by comparing the loss function before and after adding more features from the dataset:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import log_loss

    df = pd.read_csv('train.csv', sep=',')
    x = df['PassengerId'].values.reshape(-1,1)
    y = df['Survived']
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25,
                                                        random_state=0)
    classifier = LogisticRegression()
    classifier.fit(x_train, y_train)
    y_pred_train = classifier.predict(x_train)
    # calculate and print the loss function using only the PassengerId
    print(log_loss(y_train, y_pred_train))
    # predicting the test set results
    y_pred = classifier.predict(x_test)
    print(y_pred)
    
    Output

    13.33982681120802
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
     0]
    
    
    Now, using some presumably useful features:

    from sklearn.metrics import log_loss
    df = pd.read_csv('train.csv', sep=',')
    # denote the words female and male as 0 and 1
    df['Sex'] = df['Sex'].replace(['female','male'], [0,1])
    # try three features that you think they are informative to the model
    # so it can learn from them
    x = df[['Fare', 'Pclass', 'Sex']].values.reshape(-1,3)
    y = df['Survived']
    x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.25,
    random_state = 0)
    classifier = LogisticRegression()
    classifier.fit(x_train,y_train)
    y_pred_train = classifier.predict(x_train)
    # calculate and print the loss function with the above 3 features
    print(log_loss(y_train, y_pred_train))
    #predicting the test set results
    y_pred = classifier.predict(x_test)
    print(y_pred)
    
    Output

    
    7.238735137632405
    [0 0 0 1 1 0 1 1 0 1 0 1 0 1 1 1 0 0 0 0 0 1 0 0 1 1 0 1 1 1 0 1 0 0 0 0 0
     0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 1 0 1 0 1 1 1 0 0 0
     0 1 1 0 0 0 0 0 1 0 0 1 1 1 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 1 1 1 1 0 1 0
     1 0 1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 1 1 1 0 1
     1 0 0 1 1 0 1 0 1 0 1 1 0 0 1 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0
     0 1 0 0 1 1 0 1 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1
     1]
    

    In summary:


    As you can see, the loss value is now better (smaller than before), and the predictions are much more reasonable.
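
    If you want to put a single number on "more reasonable", a minimal sketch is to also compare test-set accuracy, reusing classifier, x_test, and y_test from the three-feature snippet above (accuracy_score is part of sklearn.metrics):

    import numpy as np
    from sklearn.metrics import accuracy_score

    # accuracy of the three-feature model on the held-out test set
    print(accuracy_score(y_test, classifier.predict(x_test)))

    # baseline: always predict "did not survive", which is effectively
    # what the PassengerId-only model ends up doing
    print(accuracy_score(y_test, np.zeros_like(y_test)))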

    Thanks for your answer. I still get the same zeros as before. Here is a link to the dataset, so you can share your results.
    @UmerSalman I have updated my answer; if it helps you, please accept it.
    @Yahya This is really helpful for a beginner. :) Thank you very much.