Python: Handling missing values with linear regression


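I am trying to handle the missing values in one of my columns with linear regression.

The column is named "Landsize", and I am trying to predict its NaN values with a linear regression on several other variables.

Here is the linear regression code: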

# Importing the dataset
import pandas as pd
dataset = pd.read_csv('real_estate.csv')

from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
data = dataset[['Price','Rooms','Distance','Landsize']]
#Step-1: Split the data: rows with a known Landsize are the training set, rows with a missing Landsize are the test set.
x_train = data[data['Landsize'].notnull()].drop(columns='Landsize')
y_train = data[data['Landsize'].notnull()]['Landsize']
x_test = data[data['Landsize'].isnull()].drop(columns='Landsize')
y_test = data[data['Landsize'].isnull()]['Landsize']
#Step-2: Train the machine learning algorithm
linreg.fit(x_train, y_train)
#Step-3: Predict the missing values in the attribute of the test data.
predicted = linreg.predict(x_test)
#Step-4: Write the predictions back into the missing Landsize entries (.loc avoids chained assignment).
dataset.loc[dataset['Landsize'].isnull(), 'Landsize'] = predicted
dataset.info()
When I try to check the regression results, I get the following error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
The accuracy check that raises it:

accuracy = linreg.score(x_test, y_test)
print(accuracy*100,'%')

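I think the error here comes from passing NaN values to the algorithm; handling NaN values is one of the main steps in preprocessing data. In your code, y_test is taken from the rows where Landsize is null, so it is entirely NaN, and linreg.score(x_test, y_test) has no true values to compare against. One option is to convert the NaN values to 0 and then predict the rows where Landsize = 0 (logically the same as using NaN, since Landsize cannot be 0).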
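To locate where the NaN values enter, it can help to count them per column and to inspect the target passed to score(). The sketch below uses a small made-up DataFrame as a stand-in for real_estate.csv; the values are hypothetical, and only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for real_estate.csv (values are made up for illustration).
data = pd.DataFrame({
    'Price': [500000.0, 720000.0, 650000.0, 810000.0, 430000.0],
    'Rooms': [2, 3, 3, 4, 2],
    'Distance': [5.0, 8.5, 7.2, 12.0, 3.1],
    'Landsize': [200.0, np.nan, 350.0, np.nan, 150.0],
})

# Count missing values per column; the feature columns must be NaN-free
# before they are passed to fit() or predict().
nan_counts = data.isnull().sum()
print(nan_counts)

# y_test built as in the question: it is selected from the rows where
# Landsize is missing, so every entry is NaN.
y_test = data[data['Landsize'].isnull()]['Landsize']
print(y_test.isnull().all())
```

If a feature column such as Price or Distance also reports missing values, those rows would need to be dropped or filled before fitting.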

Another thing I think you are doing wrong is this:

x_train = data[data['Landsize'].notnull()].drop(columns='Landsize') 
y_train = data[data['Landsize'].notnull()]['Landsize']
x_test = data[data['Landsize'].isnull()].drop(columns='Landsize')
y_test = data[data['Landsize'].isnull()]['Landsize']
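Your test set consists of the rows whose Landsize is missing, so it has no true target values to score against. To measure how well the model does, split the rows with a known Landsize instead: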

from sklearn.model_selection import train_test_split
X = data[data['Landsize'].notnull()].drop(columns='Landsize')
y = data[data['Landsize'].notnull()]['Landsize']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
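Putting the pieces together, one possible end-to-end sketch is shown below: evaluate on held-out rows that have a known Landsize, then fill the missing entries with predictions. The data here is synthetic (generated with an assumed relationship between the columns) because real_estate.csv is not available; only the column names are taken from the question. Note that LinearRegression.score returns the R^2 coefficient, not a classification accuracy.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real_estate.csv (assumed, not the real data).
rng = np.random.default_rng(0)
n = 200
dataset = pd.DataFrame({
    'Price': rng.uniform(300000, 900000, n),
    'Rooms': rng.integers(1, 6, n),
    'Distance': rng.uniform(1, 30, n),
})
dataset['Landsize'] = (0.0005 * dataset['Price'] + 40 * dataset['Rooms']
                       + rng.normal(0, 10, n))

# Blank out 30 Landsize values to imitate the missing entries.
missing_idx = rng.choice(n, size=30, replace=False)
dataset.loc[missing_idx, 'Landsize'] = np.nan

features = ['Price', 'Rooms', 'Distance']
known = dataset[dataset['Landsize'].notnull()]
unknown = dataset[dataset['Landsize'].isnull()]

# Evaluate only on rows where the true Landsize is known.
X_train, X_test, y_train, y_test = train_test_split(
    known[features], known['Landsize'], test_size=0.33, random_state=42)

linreg = LinearRegression()
linreg.fit(X_train, y_train)
r2 = linreg.score(X_test, y_test)  # R^2 on the held-out rows
print('R^2 on held-out rows:', r2)

# Fill the missing Landsize values with the model's predictions.
dataset.loc[unknown.index, 'Landsize'] = linreg.predict(unknown[features])
print('Remaining NaN in Landsize:', dataset['Landsize'].isnull().sum())
```

On real data the R^2 will likely be much lower than on this synthetic example, and refitting on all known rows before imputing is a reasonable variation.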

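Did you convert the "Nan" entries to numeric NaN values? You don't have to change the algorithm: your problem is a regression problem, so a regression algorithm can solve it. You just need to adapt your data to the problem ;) 80% of machine learning is data science and getting your data into a suitable format.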
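If the missing entries were read in as text rather than as real NaN values, one way to convert them is pd.to_numeric with errors='coerce'. This is a minimal sketch on a made-up Series; whether the actual file stores missing values as strings is an assumption.

```python
import pandas as pd

# Hypothetical column where missing values arrived as the text 'Nan'.
raw = pd.Series(['200', 'Nan', '350', 'Nan', '150'], name='Landsize')

# errors='coerce' turns anything that cannot be parsed as a number into NaN.
landsize = pd.to_numeric(raw, errors='coerce')
print(landsize.dtype)
print(landsize.isnull().sum())
```

After the conversion, isnull() and notnull() select the rows as expected.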