Python 3.x: how do I predict test data from a trained model?
I built a logistic regression model to predict the loan status (Y or N) of entries. I think it works on the training data, but when I apply it to the test data it fails. I believe I have narrowed the problem down to the Married column: the training data has missing values in Married, while the test set is complete. This is the function I use for imputation:
def impute_married(cols):
    Married = cols[0]
    if pd.isnull(Married):
        return 'unknownMarriedStatus'
    else:
        return Married
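The mismatch this causes can be reproduced on miniature made-up data (not the poster's dataset): the sentinel category only ever appears where the input actually contains missing values, so `get_dummies` produces different column sets for the two frames. Here `fillna` stands in for the row-wise imputer above, which does the same job:

```python
import pandas as pd

# Hypothetical miniature columns: the training set has a missing Married
# value, while the test set is complete.
train_married = pd.Series(['Yes', None, 'No'])
test_married = pd.Series(['Yes', 'No'])

# fillna performs the same imputation as impute_married above.
train_imputed = train_married.fillna('unknownMarriedStatus')
test_imputed = test_married.fillna('unknownMarriedStatus')

# get_dummies therefore yields different dummy columns for the two frames.
print(pd.get_dummies(train_imputed, drop_first=True).columns.tolist())
# → ['Yes', 'unknownMarriedStatus']
print(pd.get_dummies(test_imputed, drop_first=True).columns.tolist())
# → ['Yes']
```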
So when this is applied to the test data, the 'unknownMarriedStatus' value never occurs, and that is why it fails on the test set.
The next section converts the data for the model:
my_dict = {'0':'zero','1':'one','2':'two','3':'three','3+':'threePlus',
           np.nan: 'missing'}

def convert_data(dataset):
    temp_data = dataset.copy()
    temp_data.Dependents = temp_data.Dependents.map(my_dict)
    temp_data['Gender'] = temp_data[['Gender']].apply(impute_gender, axis=1)
    temp_data['Married'] = temp_data[['Married']].apply(impute_married, axis=1)
    temp_data['Self_Employed'] = temp_data[['Self_Employed']].apply(impute_self_employed, axis=1)
    temp_data['Credit_History'] = temp_data[['Credit_History']].apply(impute_Credit_History, axis=1)
    dependents = pd.get_dummies(temp_data['Dependents'], drop_first=True)
    gender = pd.get_dummies(temp_data['Gender'], drop_first=True)
    married = pd.get_dummies(temp_data['Married'], drop_first=True)
    education = pd.get_dummies(temp_data['Education'], drop_first=True)
    self_employed = pd.get_dummies(temp_data['Self_Employed'], drop_first=True)
    credit_history = pd.get_dummies(temp_data['Credit_History'], drop_first=True)
    property_area = pd.get_dummies(temp_data['Property_Area'], drop_first=True)
    #loan_status = pd.get_dummies(temp_data['Loan_Status'], drop_first=True)
    loan_band = pd.get_dummies(temp_data['Loan_Band'], drop_first=True)
    temp_data.drop(['Loan_ID', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
                    'Loan_Amount_Term', 'Gender', 'Married', 'Dependents', 'Education',
                    'Self_Employed', 'Credit_History', 'Property_Area', 'Loan_Band'],
                   axis=1, inplace=True)
    temp_data = pd.concat([temp_data, dependents, gender, married, education, self_employed,
                           credit_history, property_area, loan_band], axis=1)
    temp_data.dropna(inplace=True)
    return temp_data
train_dataset = convert_data(train)
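One common way to make `convert_data` deterministic across train and test (a sketch, not the poster's code) is to pin the category set before calling `get_dummies`, so that identical dummy columns are emitted even when a level such as 'unknownMarriedStatus' never occurs in the test set:

```python
import pandas as pd

# Assumed, fixed category list covering every level seen during training.
categories = ['No', 'Yes', 'unknownMarriedStatus']

# The test set is complete, so the sentinel value never appears in the data.
test_married = pd.Series(['Yes', 'No'])
test_married = pd.Categorical(test_married, categories=categories)

# get_dummies on a Categorical emits one column per declared category,
# whether or not it occurs, so train and test stay aligned.
dummies = pd.get_dummies(test_married, drop_first=True)
print(dummies.columns.tolist())
# → ['Yes', 'unknownMarriedStatus']
```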
Then the logistic regression:
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression(solver="lbfgs")
logmodel.fit(X_train,y_train)
y_pred = logmodel.predict(X_test)
predictions = logmodel.predict(train_dataset.drop('Loan_Status', axis = 1))
This gives me the report:
from sklearn.metrics import classification_report
print(classification_report(train_dataset['Loan_Status'],predictions))
precision recall f1-score support
N 0.86 0.45 0.59 192
Y 0.80 0.97 0.87 422
avg / total 0.82 0.81 0.79 614
Next, the accuracy:
print('Accuracy of logistic regression classifier on test set: '
      '{:.2f}'.format(logmodel.score(X_test, y_test)))
pd.crosstab(y_test, y_pred, rownames=['Actual Result'],
            colnames=['Predicted Result'])
test_dataset = convert_data(test)
predictions = logmodel.predict(test_dataset.drop('Loan_Status', axis = 1))
This gives me an error:
ValueError Traceback (most recent call last)
<ipython-input-144-474a34c1dffa> in <module>()
1 #predictions = logmodel.predict(test_dataset)
----> 2 predictions = logmodel.predict(test_dataset.drop('Loan_Status', axis = 1))
~/anaconda3_420/lib/python3.5/site-packages/sklearn/linear_model/base.py in predict(self, X)
322 Predicted class label per sample.
323 """
--> 324 scores = self.decision_function(X)
325 if len(scores.shape) == 1:
326 indices = (scores > 0).astype(np.int)
~/anaconda3_420/lib/python3.5/site-packages/sklearn/linear_model/base.py in decision_function(self, X)
303 if X.shape[1] != n_features:
304 raise ValueError("X has %d features per sample; expecting %d"
--> 305 % (X.shape[1], n_features))
306
307 scores = safe_sparse_dot(X, self.coef_.T,
ValueError: X has 23 features per sample; expecting 24
(614, 17)
(614, 25)
(367, 16)
(367, 23)
Index(['Loan_Status', 'ApplicantIncome_log', 'CoapplicantIncome_exp',
       'LoanAmount_log', 'one', 'threePlus', 'two', 'zero', 'Male',
       'unknownGender', 'Yes', 'unknownMarriedStatus', 'Not Graduate', 'Yes',
       'unknownEmploymentStatus', 'Yes', 'unknownEmploymentHist', 'Semiurban',
       'Urban', 'B', 'C', 'D', 'E', 'F', 'H'],
      dtype='object')
Index(['ApplicantIncome_log', 'CoapplicantIncome_exp', 'LoanAmount_log', 'one',
       'threePlus', 'two', 'zero', 'Male', 'unknownGender', 'Yes',
       'Not Graduate', 'Yes', 'unknownEmploymentStatus', 'Yes',
       'unknownEmploymentHist', 'Semiurban', 'Urban', 'B', 'C', 'D', 'E', 'F', 'H'],
      dtype='object')
I want to get a logistic regression model working on the training data that I can then apply to the test data.
I have been told that my summary statistics are based on the whole test.csv dataset, but that I trained on this dataset, so my scores will be inflated. I don't understand what that means, though, or how to fix it.

Comments: If you use the second line, predictions = logmodel.predict(test_dataset.drop('Loan_Status', axis=1)), it should work. Your error traceback and code are inconsistent. Please recheck. @VenkatachalamN sorry for the slow reply; I think I have found the error but don't know how to fix it. It is about the Married column, and I have updated the post to show this. Basically, when data is missing a new column gets added, but how do I handle the missing data in the training set when the test set is complete and the imputed training set gains an extra column? @VenkatachalamN I have just added some shapes and columns to make it easier to understand:
train_dataset.shape
test.shape
test_dataset.shape
train_dataset.columns
test_dataset.columns
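The 23-vs-24 feature mismatch above can be repaired by aligning the test columns to the training columns. A minimal sketch with made-up miniature data (in the real code, `train_features` would be `train_dataset.drop('Loan_Status', axis=1).columns`):

```python
import pandas as pd

# Assumed miniature training feature list and a test frame that lacks the
# 'unknownMarriedStatus' dummy column.
train_features = ['LoanAmount_log', 'Yes', 'unknownMarriedStatus']
X_test_raw = pd.DataFrame({'LoanAmount_log': [4.7], 'Yes': [1]})

# reindex creates any missing columns and fills them with 0, so the test
# design matrix has exactly the columns the fitted model expects.
X_test_aligned = X_test_raw.reindex(columns=train_features, fill_value=0)
print(X_test_aligned.columns.tolist())
# → ['LoanAmount_log', 'Yes', 'unknownMarriedStatus']
print(X_test_aligned['unknownMarriedStatus'].tolist())
# → [0]
```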