Machine learning Logistic回归预测故障
我一直在努力解决泰坦尼克号的这个问题。我把x分成乘客,y分成幸存者。但问题是我无法得到y_pred(ie)预测结果。因为所有值都是0。我得到0作为预测值。如果有人能解决这个问题,对我会有帮助的。因为这是我作为初学者的第一个分类器问题Machine learning Logistic回归预测故障,machine-learning,classification,logistic-regression,prediction,Machine Learning,Classification,Logistic Regression,Prediction,我一直在努力解决泰坦尼克号的这个问题。我把x分成乘客,y分成幸存者。但问题是我无法得到y_pred(ie)预测结果。因为所有值都是0。我得到0作为预测值。如果有人能解决这个问题,对我会有帮助的。因为这是我作为初学者的第一个分类器问题 import numpy as np import pandas as pd import matplotlib.pyplot as plt df = pd.read_csv('C:/Users/Umer/train.csv') x = df['Passenge
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('C:/Users/Umer/train.csv')
x = df['PassengerId'].values.reshape(-1,1)
y = df['Survived']
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.25,
random_state = 0)
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(x_train,y_train)
#predicting the test set results
y_pred = classifier.predict(x_test)
我无法重现相同的结果,事实上,我复制粘贴了您的代码,并没有像您描述的那样将它们全部置为零,相反,我得到了:
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0]
然而,我注意到您的方法中有几点您可能想知道:
,
,因此,如果数据集变量由选项卡隔开(与我的选项卡相同),则应指定如下分隔符:
df = pd.read_csv('titanic.csv', sep='\t')
PassengerId
没有有用的信息,您的模型可以从中学习,以预测幸存的
人,它只是一个连续的数字,随着每个新乘客的增加而增加。一般来说,在分类中,您需要利用使您的模型从中学习的所有特征(当然,除非有冗余特征不向模型添加任何信息),尤其是在您的数据集中,它是一个多变量数据集StandardScaler
,您应该将数据获取为typefloat
,以避免出现如下警告:
DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
所以你从一开始就这样转换:
x = df['PassengerId'].values.astype(float).reshape(-1,1)
更新 在提供数据集之后,您得到的结果是正确的,这也是因为我上面提到的原因编号
2
(即PassengerId
没有为模型提供有用的信息,因此它无法正确预测!)
您可以通过比较从数据集中添加更多功能之前和之后的数据来测试它:
from sklearn.metrics import log_loss
df = pd.read_csv('train.csv', sep=',')
x = df['PassengerId'].values.reshape(-1,1)
y = df['Survived']
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.25,
random_state = 0)
classifier = LogisticRegression()
classifier.fit(x_train,y_train)
y_pred_train = classifier.predict(x_train)
# calculate and print the loss function using only the PassengerId
print(log_loss(y_train, y_pred_train))
#predicting the test set results
y_pred = classifier.predict(x_test)
print(y_pred)
输出
13.33982681120802
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0]
7.238735137632405
[0 0 0 1 1 0 1 1 0 1 0 1 0 1 1 1 0 0 0 0 0 1 0 0 1 1 0 1 1 1 0 1 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 1 0 1 0 1 1 1 0 0 0
0 1 1 0 0 0 0 0 1 0 0 1 1 1 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 1 1 1 1 0 1 0
1 0 1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 1 1 1 0 1
1 0 0 1 1 0 1 0 1 0 1 1 0 0 1 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0
0 1 0 0 1 1 0 1 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1
1]
现在,通过使用许多“假定有用”的信息:
from sklearn.metrics import log_loss
df = pd.read_csv('train.csv', sep=',')
# denote the words female and male as 0 and 1
df['Sex'].replace(['female','male'], [0,1], inplace=True)
# try three features that you think they are informative to the model
# so it can learn from them
x = df[['Fare', 'Pclass', 'Sex']].values.reshape(-1,3)
y = df['Survived']
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.25,
random_state = 0)
classifier = LogisticRegression()
classifier.fit(x_train,y_train)
y_pred_train = classifier.predict(x_train)
# calculate and print the loss function with the above 3 features
print(log_loss(y_train, y_pred_train))
#predicting the test set results
y_pred = classifier.predict(x_test)
print(y_pred)
输出
13.33982681120802
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0]
7.238735137632405
[0 0 0 1 1 0 1 1 0 1 0 1 0 1 1 1 0 0 0 0 0 1 0 0 1 1 0 1 1 1 0 1 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 1 0 1 0 1 1 1 0 0 0
0 1 1 0 0 0 0 0 1 0 0 1 1 1 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 1 1 1 1 0 1 0
1 0 1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 1 1 1 0 1
1 0 0 1 1 0 1 0 1 0 1 1 0 0 1 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0
0 1 0 0 1 1 0 1 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1
1]
总之:
正如你所看到的,损失给出了更好的价值(比以前小),预测现在更合理 谢谢你的回答。我仍然得到了和我预测的一样的零。。这里有一个数据集链接,您可以分享您的结果@UmerSalman我更新了我的答案,如果它对您有帮助,请接受它。@Yahya这对初学者来说真的很有帮助。:)非常感谢。