Python: My linear regression model shows a score of 10%. How can I improve it?

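Goal: predict the price per square foot of 4 houses from their 2 features (Feature1 and Feature2). I have 7 houses with Feature1, Feature2 and the price per square foot. The last 4 houses only have Feature1 and Feature2. I know what the values there should be, and when I compare them with my predicted values, they are completely different.

My code: I have a CSV file, which I read and convert into a pandas DataFrame. I then use linear regression to train and test a model on it.

Data: this is a snapshot of the data I am using. I need to predict the last 4 "Pricepersqrft" values.

Problem: I cannot get an accuracy above 10%, which means I am not getting the correct "Pricepersqrft" for the last 4 houses.

Here is my code: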

import numpy as np
import pandas as pd
import scipy 
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.linear_model import LinearRegression
from sklearn import datasets

csvfileData = THE DATA SHOWN IN THE SNAPSHOT
dataRead = pd.read_csv(csvfileData)
dfCreated = pd.DataFrame(dataRead) #creating a pandas dataframe
print(dfCreated)
# print(dfCreated.head()) #shows first 5 rows of data frame

dfCreated.drop(dfCreated.columns[[0]], axis=1, inplace = True)
print(dfCreated)

# where_are_NaNs = numpy.isnan(dfCreated) #previous line displayed Nan where no value was present for "Pricepersqrft column"
# dfCreated[where_are_NaNs] = 0 #use numpy's isnan and set all Nan to 0
# print(dfCreated)
dfCreated.hist(bins = 10, figsize=(20,15)) #plotting histograms using matplotlib
plt.show()

#creating scatter plots 
dfCreated.plot(kind="scatter", x= "Feature1", y="Feature2", alpha=0.5)
correlationMatrix = dfCreated.corr() #computes correlation between 2 columns 
print(correlationMatrix["Feature1"].sort_values(ascending=False))

#value that needs to be predicted
Y= dfCreated['Pricepersqrft']
print(Y)  

#training the model and testing, train_test_split expects both dataframes to be of same length
X_train, X_test, Y_train, Y_test = train_test_split(dfCreated, Y, test_size=0.20, random_state=0)
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)

reg = LinearRegression()
reg.fit(X_train, Y_train)
#predictions = reg.predict(X_test)
#print(predictions)
reg.score(X_test, Y_test)

The actual values of the last four "Pricepersqrft" entries are 105.22, 142.68, 132.94 and 129.71.

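You are using pd.read_csv, which already returns a pandas DataFrame, so there is no need to wrap the result in pd.DataFrame.

You are doing a random train-test split on the whole data set. How do you make sure it picks the last observations as the test data?

Take all the observations you want to predict as the test data and the rest as the training data. Also, if the data you have shown here is all you have, the predictions may be poor because there are so few observations.

Use iloc, which indexes by integer position, to split the n rows: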

train_data = data.iloc[0:m]
test_data = data.iloc[m:n+1]
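
The positional split above can be sketched end to end as follows. This is only an illustration under assumptions: the numbers are made-up placeholders (the real values are in the question's snapshot), m = 7 and the column names are taken from the question, and the feature list deliberately excludes the target column. Because the end of an iloc slice is exclusive, `m:n` already covers the last row.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up placeholder values, NOT the data from the question's snapshot.
data = pd.DataFrame({
    "Feature1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0],
    "Feature2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0, 10.0, 9.0, 12.0],
    # The last 4 prices are unknown, so they are NaN.
    "Pricepersqrft": [100.0, 110.0, 120.0, 125.0, 140.0, 150.0, 155.0] + [np.nan] * 4,
})

m = 7            # number of rows with a known price (assumed from the question)
n = len(data)    # total number of rows

train_data = data.iloc[:m]     # rows 0..m-1: price is known
test_data = data.iloc[m:n]     # rows m..n-1: price has to be predicted

# Only the two features go into X; the target column must not be part of it.
features = ["Feature1", "Feature2"]

reg = LinearRegression()
reg.fit(train_data[features], train_data["Pricepersqrft"])

# One prediction per unlabelled house.
predictions = reg.predict(test_data[features])
print(predictions)
```

Since the last rows have no known price, reg.score cannot be computed on them from the file alone; the predictions have to be compared with the known true values by hand.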

Note that for regression, score does not return accuracy. It returns the coefficient of determination R^2 of the prediction. The best possible score is 1.0. So a score of 0.1 may indeed be poor, but that could be because there are so few data samples.

Actually the score is 1.0, which I assumed converts to a percentage, meaning accuracy. Thanks much @SruthiV. Apart from that, my concern is that the predicted values for the 4 houses are around 139, 132, 137 and 129, whereas they should be around 105.22, 142.68, 132.94 and 129.71 respectively.

Thank you so much. I changed how the data frame is built. Also, in the train-test split I set test_size to 0.2. I checked by printing the shapes of the data: I get 7 as the shape of the training data and 4 as the shape of the test data. Yes, my data is just 11 observations, and now I know that the data provided is far too little for any calculation to be performed. Thanks a lot.
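
To illustrate the point that score returns R^2 rather than accuracy, here is a minimal sketch that computes R^2 by hand as 1 - SS_res / SS_tot and compares it with reg.score and sklearn.metrics.r2_score. The numbers are made up for illustration and have nothing to do with the housing data in the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Made-up numbers, for illustration only.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 6.1])

reg = LinearRegression().fit(X, y)
y_pred = reg.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# All three values agree: score() is R^2, not a fraction of correct answers.
print(r2_manual, reg.score(X, y), r2_score(y, y_pred))

# Always predicting the mean of y gives R^2 = 0.
r2_mean = r2_score(y, np.full_like(y, y.mean()))

# Predictions worse than the mean give a negative R^2.
r2_bad = r2_score(y, y[::-1])
print(r2_mean, r2_bad)
```

Read this way, a score of 0.1 means the model explains only a little more of the variation in the target than always predicting the mean would. It does not mean that 10% of the predictions are correct.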