Linear regression in Python to predict chance of admission from GRE scores


I am learning linear regression, and I am trying to write a simple linear regression program in Python in a Jupyter notebook. I am using data from Kaggle, here is the link.
When I fit the relationship between GRE score and chance of admission, I keep getting a negative slope, even though the two are positively correlated.

Here is the code I am running:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12.0, 9.0)

# Preprocessing Input data
data = pd.read_csv('gre.csv')
X = data.iloc[:, 1]
Y = data.iloc[:, 8]
plt.scatter(X, Y)
plt.xlabel('GRE score')
plt.ylabel('Chance of getting into university %')
plt.show()

m = 0
c = 0

L = 0.0001  # The learning Rate
epochs = 10000  # The number of iterations to perform gradient descent

n = float(len(X)) # Number of elements in X

# Performing Gradient Descent 
for i in range(epochs): 
    Y_pred = m*X + c  # The current predicted value of Y
    D_m = (-2/n) * sum(X * (Y - Y_pred))  # Derivative wrt m
    D_c = (-2/n) * sum(Y - Y_pred)  # Derivative wrt c
    m = m - L * D_m  # Update m
    c = c - L * D_c  # Update c
    
print (m, c)


When I print m and c, I get 'nan' and 'nan' as output. What am I doing wrong?

The problem here is the learning rate. With raw GRE scores (values in the hundreds), the gradient term X * (Y - Y_pred) is large, so with L = 0.0001 each update overshoots, the parameters grow in magnitude, and they eventually overflow to nan. If you lower the learning rate, you get a good fit:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12.0, 9.0)

# Preprocessing Input data
data = pd.read_csv('gre.csv')
X = data.iloc[:, 1]
Y = data.iloc[:, 8]

m = 0
c = 0
L = 0.0000001  # The learning Rate
epochs = 100  # The number of iterations to perform gradient descent
n = float(len(X))  # Number of elements in X

# Performing Gradient Descent
for i in range(epochs):    
    Y_pred = m*X + c  # The current predicted value of Y
    D_m = (-2/n) * sum(X * (Y - Y_pred))  # Derivative wrt m
    D_c = (-2/n) * sum(Y - Y_pred)  # Derivative wrt c
    m = m - L * D_m  # Update m
    c = c - L * D_c  # Update c    

print("Slope, Intercept:", m, c)

plt.xlabel('GRE score')
plt.ylabel('Chance of getting into university %')
Y_preds = c + m * X
plt.scatter(X, Y)
plt.plot(X, Y_preds, '--')
plt.show()
Output:

Slope, Intercept: 0.0019885000304672488 6.212311206699001e-06
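
An alternative to shrinking the learning rate is to rescale the feature. Below is a minimal sketch (assuming the same gre.csv column layout as above) that standardizes the GRE scores so gradient descent stays stable with a much larger learning rate, and then maps the fitted slope and intercept back to the original scale.

import numpy as np
import pandas as pd

data = pd.read_csv('gre.csv')
X = data.iloc[:, 1].values.astype(float)   # GRE score
Y = data.iloc[:, 8].values.astype(float)   # Chance of admission

# Standardize X so its values are O(1); the gradient no longer explodes.
X_mean, X_std = X.mean(), X.std()
Xs = (X - X_mean) / X_std

m, c = 0.0, 0.0
L = 0.01          # learning rate on the standardized scale (assumption)
epochs = 10000
n = float(len(Xs))

for _ in range(epochs):
    Y_pred = m * Xs + c
    D_m = (-2 / n) * np.sum(Xs * (Y - Y_pred))  # Derivative wrt m
    D_c = (-2 / n) * np.sum(Y - Y_pred)         # Derivative wrt c
    m = m - L * D_m
    c = c - L * D_c

# Convert the fitted line back to the original GRE scale: Y = slope * X + intercept
slope = m / X_std
intercept = c - m * X_mean / X_std
print("Slope, Intercept:", slope, intercept)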

If you use the scikit-learn implementation, you get a better fit, because it computes an ordinary least-squares estimate in closed form instead of running gradient descent:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

data = pd.read_csv('gre.csv')

X, y = data.iloc[:, 1], data.iloc[:, 8].values
X = X.values.reshape(-1, 1)

regr = linear_model.LinearRegression()
regr.fit(X, y)

y_pred = regr.predict(X)

# The coefficients
print('Coefficients: \n', regr.coef_)

# The mean squared error
print('Mean squared error: %.2f' % mean_squared_error(y, y_pred))

# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f' % r2_score(y, y_pred))

# Plot outputs
plt.rcParams['figure.figsize'] = (12.0, 9.0)
plt.xlabel('GRE score')
plt.ylabel('Chance of getting into university %')
plt.scatter(X, y,  color='black')
plt.plot(X, y_pred, color='blue', linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()
Output:

Coefficients: 
 [0.01012587]
Mean squared error: 0.01
Coefficient of determination: 0.66
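
For comparison, the closed-form least-squares fit that LinearRegression solves can be reproduced directly with NumPy. This is a minimal sketch, assuming the same gre.csv column layout, that uses np.linalg.lstsq on a design matrix with an intercept column.

import numpy as np
import pandas as pd

data = pd.read_csv('gre.csv')
x = data.iloc[:, 1].values.astype(float)   # GRE score
y = data.iloc[:, 8].values.astype(float)   # Chance of admission

# Design matrix [x, 1] so the solution vector is [slope, intercept].
A = np.column_stack([x, np.ones_like(x)])

# Solve min ||A @ [slope, intercept] - y||^2 in closed form.
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
print("Slope, Intercept:", slope, intercept)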