Python: predicting chance of admission from GRE score with linear regression
I am learning linear regression and trying to write a simple linear regression program in Python in a Jupyter notebook. I am using a dataset from Kaggle.
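I want to model the relationship between GRE score and chance of admission, but I keep getting a negative slope even though the two are positively correlated. This is the code I am running: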
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12.0, 9.0)
# Preprocessing Input data
data = pd.read_csv('gre.csv')
X = data.iloc[:, 1]
Y = data.iloc[:, 8]
plt.scatter(X, Y)
plt.xlabel('GRE score')
plt.ylabel('Chance of getting into university %')
plt.show()
m = 0
c = 0
L = 0.0001 # The learning Rate
epochs = 10000 # The number of iterations to perform gradient descent
n = float(len(X)) # Number of elements in X
# Performing Gradient Descent
for i in range(epochs):
    Y_pred = m*X + c  # The current predicted value of Y
    D_m = (-2/n) * sum(X * (Y - Y_pred))  # Derivative wrt m
    D_c = (-2/n) * sum(Y - Y_pred)  # Derivative wrt c
    m = m - L * D_m  # Update m
    c = c - L * D_c  # Update c
print(m, c)
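When I print m and c, I get 'nan' and 'nan' as output. What am I doing wrong?
The problem here is the learning rate. GRE scores are in the hundreds, so the gradient with respect to m is very large, and with L = 0.0001 every update overshoots until m and c overflow to nan. If you lower the learning rate, the descent stops diverging and you can get a reasonable fit: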
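As a rough check on why 0.0001 is too large, the sketch below estimates the largest learning rate for which gradient descent on the mean squared error stays stable. It uses synthetic scores between 290 and 340 as a stand-in for the GRE column; that range is an assumption, not the Kaggle data. For a quadratic loss, the updates only converge when L is below 2 divided by the largest eigenvalue of the Hessian.

```python
import numpy as np

# Synthetic stand-in for the GRE column (assumption: scores roughly 290-340).
rng = np.random.default_rng(0)
X = rng.uniform(290, 340, size=400)
n = len(X)

# Hessian of MSE = (1/n) * sum((Y - (m*X + c))**2) with respect to (m, c).
# It depends only on X, not on Y.
H = (2.0 / n) * np.array([[np.sum(X * X), np.sum(X)],
                          [np.sum(X), n]])

# Gradient descent on a quadratic is stable only if L < 2 / lambda_max.
lambda_max = np.linalg.eigvalsh(H).max()
max_stable_L = 2.0 / lambda_max

print("largest stable learning rate:", max_stable_L)
print("0.0001 is stable:   ", 0.0001 < max_stable_L)
print("0.0000001 is stable:", 0.0000001 < max_stable_L)
```

Under these assumptions the bound comes out near 1e-5, which is consistent with 0.0001 diverging and the 0.0000001 used in the code below staying stable.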
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12.0, 9.0)
# Preprocessing Input data
data = pd.read_csv('gre.csv')
X = data.iloc[:, 1]
Y = data.iloc[:, 8]
m = 0
c = 0
L = 0.0000001 # The learning Rate
epochs = 100 # The number of iterations to perform gradient descent
n = float(len(X)) # Number of elements in X
# Performing Gradient Descent
for i in range(epochs):
    Y_pred = m*X + c  # The current predicted value of Y
    D_m = (-2/n) * sum(X * (Y - Y_pred))  # Derivative wrt m
    D_c = (-2/n) * sum(Y - Y_pred)  # Derivative wrt c
    m = m - L * D_m  # Update m
    c = c - L * D_c  # Update c
print("Slope, Intercept:", m, c)
plt.xlabel('GRE score')
plt.ylabel('Chance of getting into university %')
axes = plt.gca()
Y_preds = c + m * X
plt.scatter(X, Y)
plt.plot(X, Y_preds, '--')
plt.show()
Output:
Slope, Intercept: 0.0019885000304672488 6.212311206699001e-06
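With such a small learning rate and only 100 epochs, the intercept barely moves from 0, so the line is effectively forced through the origin. One common way to let gradient descent converge fully is to standardize X first and convert the coefficients back afterwards. The sketch below is a minimal illustration of that idea on synthetic data; the data-generating line `0.01 * X - 2.4` and the noise level are assumptions made for the example, not values from gre.csv.

```python
import numpy as np

# Synthetic stand-in for gre.csv (assumption: not the real Kaggle data).
rng = np.random.default_rng(0)
X = rng.uniform(290, 340, size=400)
Y = 0.01 * X - 2.4 + rng.normal(0.0, 0.05, size=400)

# Standardize X so the gradient with respect to the slope is well scaled.
x_mean, x_std = X.mean(), X.std()
Z = (X - x_mean) / x_std

m_z, c_z = 0.0, 0.0
L = 0.1          # a much larger learning rate is stable on standardized data
epochs = 1000
n = float(len(Z))

for _ in range(epochs):
    Y_pred = m_z * Z + c_z
    D_m = (-2 / n) * np.sum(Z * (Y - Y_pred))  # Derivative wrt m_z
    D_c = (-2 / n) * np.sum(Y - Y_pred)        # Derivative wrt c_z
    m_z -= L * D_m
    c_z -= L * D_c

# Convert the coefficients back to the original GRE scale.
m = m_z / x_std
c = c_z - m * x_mean

# Reference: direct least-squares fit of a degree-1 polynomial.
m_ref, c_ref = np.polyfit(X, Y, 1)
print("gradient descent:", m, c)
print("np.polyfit:      ", m_ref, c_ref)
```

On the standardized feature both coefficients converge at the same rate, so the two printed lines should agree closely.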
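If you use the scikit-learn implementation, you get a better fit, because it fits the line by least-squares estimation rather than gradient descent, so the result does not depend on a learning rate or on the scale of X.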
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
data = pd.read_csv('gre.csv')
X, y = data.iloc[:, 1], data.iloc[:, 8].values
X = X.values.reshape(-1, 1)
regr = linear_model.LinearRegression()
regr.fit(X, y)
y_pred = regr.predict(X)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print('Mean squared error: %.2f' % mean_squared_error(y, y_pred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f' % r2_score(y, y_pred))
# Plot outputs
plt.rcParams['figure.figsize'] = (12.0, 9.0)
plt.xlabel('GRE score')
plt.ylabel('Chance of getting into university %')
plt.scatter(X, y, color='black')
plt.plot(X, y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
Output:
Coefficients:
[0.01012587]
Mean squared error: 0.01
Coefficient of determination: 0.66
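For a single feature, the least-squares line can also be written in closed form, which may help show what the fit above is estimating. The sketch below uses the textbook formulas (slope = covariance of X and Y divided by variance of X) on synthetic data; the data and variable names are assumptions for illustration, and this is not a description of how scikit-learn computes the solution internally.

```python
import numpy as np

# Synthetic stand-in for gre.csv (assumption: not the real Kaggle data).
rng = np.random.default_rng(0)
X = rng.uniform(290, 340, size=400)
Y = 0.01 * X - 2.4 + rng.normal(0.0, 0.05, size=400)

# Closed-form simple linear regression.
x_mean, y_mean = X.mean(), Y.mean()
m = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)
c = y_mean - m * x_mean

# The same two metrics the scikit-learn example prints.
Y_pred = m * X + c
mse = np.mean((Y - Y_pred) ** 2)
r2 = 1.0 - np.sum((Y - Y_pred) ** 2) / np.sum((Y - y_mean) ** 2)

print("Slope, Intercept:", m, c)
print("Mean squared error: %.4f" % mse)
print("Coefficient of determination: %.2f" % r2)
```

No learning rate or epoch count is involved, which is why this approach avoids the divergence seen with gradient descent.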