Linear regression in Python to predict chance of admission from GRE scores


I am learning linear regression, and I am trying to write a simple linear regression program in Python in a Jupyter notebook. I am using data from Kaggle, here is the link.
When I fit the relationship between GRE score and chance of admission, I keep getting a negative slope, even though the two are positively correlated.

Here is the code I am running:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12.0, 9.0)

# Preprocessing Input data
data = pd.read_csv('gre.csv')
X = data.iloc[:, 1]
Y = data.iloc[:, 8]
plt.scatter(X, Y)
plt.xlabel('GRE score')
plt.ylabel('Chance of getting into university %')
plt.show()

m = 0
c = 0

L = 0.0001  # The learning Rate
epochs = 10000  # The number of iterations to perform gradient descent

n = float(len(X)) # Number of elements in X

# Performing Gradient Descent 
for i in range(epochs): 
    Y_pred = m*X + c  # The current predicted value of Y
    D_m = (-2/n) * sum(X * (Y - Y_pred))  # Derivative wrt m
    D_c = (-2/n) * sum(Y - Y_pred)  # Derivative wrt c
    m = m - L * D_m  # Update m
    c = c - L * D_c  # Update c
    
print (m, c)


When I print m and c, I get 'nan' and 'nan' as output. What am I doing wrong?

The problem here is the learning rate. With raw GRE scores (values in the hundreds), the gradient term X * (Y - Y_pred) is large, so with L = 0.0001 each update overshoots, the parameters grow in magnitude, and they eventually overflow to nan. If you lower the learning rate, you get a good fit:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12.0, 9.0)

# Preprocessing Input data
data = pd.read_csv('gre.csv')
X = data.iloc[:, 1]
Y = data.iloc[:, 8]

m = 0
c = 0
L = 0.0000001  # The learning Rate
epochs = 100  # The number of iterations to perform gradient descent
n = float(len(X))  # Number of elements in X

# Performing Gradient Descent
for i in range(epochs):    
    Y_pred = m*X + c  # The current predicted value of Y
    D_m = (-2/n) * sum(X * (Y - Y_pred))  # Derivative wrt m
    D_c = (-2/n) * sum(Y - Y_pred)  # Derivative wrt c
    m = m - L * D_m  # Update m
    c = c - L * D_c  # Update c    

print("Slope, Intercept:", m, c)

plt.xlabel('GRE score')
plt.ylabel('Chance of getting into university %')
Y_preds = c + m * X
plt.scatter(X, Y)
plt.plot(X, Y_preds, '--')
plt.show()
Output:

Slope, Intercept: 0.0019885000304672488 6.212311206699001e-06
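
An alternative to shrinking the learning rate is to rescale the feature. Below is a minimal sketch (assuming the same gre.csv column layout as above) that standardizes the GRE scores so gradient descent stays stable with a much larger learning rate, and then maps the fitted slope and intercept back to the original scale.

import numpy as np
import pandas as pd

data = pd.read_csv('gre.csv')
X = data.iloc[:, 1].values.astype(float)   # GRE score
Y = data.iloc[:, 8].values.astype(float)   # Chance of admission

# Standardize X so its values are O(1); the gradient no longer explodes.
X_mean, X_std = X.mean(), X.std()
Xs = (X - X_mean) / X_std

m, c = 0.0, 0.0
L = 0.01          # learning rate on the standardized scale (assumption)
epochs = 10000
n = float(len(Xs))

for _ in range(epochs):
    Y_pred = m * Xs + c
    D_m = (-2 / n) * np.sum(Xs * (Y - Y_pred))  # Derivative wrt m
    D_c = (-2 / n) * np.sum(Y - Y_pred)         # Derivative wrt c
    m = m - L * D_m
    c = c - L * D_c

# Convert the fitted line back to the original GRE scale: Y = slope * X + intercept
slope = m / X_std
intercept = c - m * X_mean / X_std
print("Slope, Intercept:", slope, intercept)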

If you use the scikit-learn implementation, you get a better fit, because it computes an ordinary least-squares estimate in closed form instead of running gradient descent:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

data = pd.read_csv('gre.csv')

X, y = data.iloc[:, 1], data.iloc[:, 8].values
X = X.values.reshape(-1, 1)

regr = linear_model.LinearRegression()
regr.fit(X, y)

y_pred = regr.predict(X)

# The coefficients
print('Coefficients: \n', regr.coef_)

# The mean squared error
print('Mean squared error: %.2f' % mean_squared_error(y, y_pred))

# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f' % r2_score(y, y_pred))

# Plot outputs
plt.rcParams['figure.figsize'] = (12.0, 9.0)
plt.xlabel('GRE score')
plt.ylabel('Chance of getting into university %')
plt.scatter(X, y,  color='black')
plt.plot(X, y_pred, color='blue', linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()
Output:

Coefficients: 
 [0.01012587]
Mean squared error: 0.01
Coefficient of determination: 0.66
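
For comparison, the closed-form least-squares fit that LinearRegression solves can be reproduced directly with NumPy. This is a minimal sketch, assuming the same gre.csv column layout, that uses np.linalg.lstsq on a design matrix with an intercept column.

import numpy as np
import pandas as pd

data = pd.read_csv('gre.csv')
x = data.iloc[:, 1].values.astype(float)   # GRE score
y = data.iloc[:, 8].values.astype(float)   # Chance of admission

# Design matrix [x, 1] so the solution vector is [slope, intercept].
A = np.column_stack([x, np.ones_like(x)])

# Solve min ||A @ [slope, intercept] - y||^2 in closed form.
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
print("Slope, Intercept:", slope, intercept)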