Python: Why is my linear regression performing poorly?


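I built a linear regression model to predict Power Generation from Temperature and Solar Irradiance values. The correlations between Power Generation and the other variables are as follows: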

Power Generation    1.000000
Solar Irradiance    0.437181
Temperature         0.571839
TimestampDay       -0.239430

Here are the scatter plots:

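However, the results are really bad. The R² score of the predictions is -0.339, and the model does not even pick up the trend correctly. I find this strange, because the variables have fairly good correlation values. Is correlation alone not enough? Could the seasonality in my data be related to the poor performance of the linear regression?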

Here is the complete code:

import pandas as pd
import numpy as np
from sklearn import metrics
from osisoft.pidevclub.piwebapi.pi_web_api_client import PIWebApiClient
from pandas.plotting import register_matplotlib_converters
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import warnings
warnings.filterwarnings("ignore")

client = PIWebApiClient("https://localhost/piwebapi", useKerberos=False, username="svc_tcc", password="9Dw#7gbb", verifySsl=False)

plantaID = 1

if plantaID == 1: planta = "Karnak"
elif plantaID == 2: planta = "Piney Woods"
elif plantaID == 3: planta = "Bryce Canyon"

# Backslashes are escaped explicitly; the resulting strings are unchanged.
path1 = "pi:\\EC2AMAZ-8VHQOQJ\\" + planta + ".Total Power Generation Actual"
path2 = "pi:\\EC2AMAZ-8VHQOQJ\\" + planta + ".Solar Irradiance Actual"
path3 = "pi:\\EC2AMAZ-8VHQOQJ\\" + planta + ".Temperature Actual"

df1 = client.data.get_recorded_values(path=path1,
                                     start_time='1-24mo',
                                     end_time='1',
                                     max_count=100000)
df2 = client.data.get_recorded_values(path=path2,
                                     start_time='1-24mo',
                                     end_time='1',
                                     max_count=100000)
df3 = client.data.get_recorded_values(path=path3,
                                     start_time='1-24mo',
                                     end_time='1',
                                     max_count=100000)

data = [df1.Value, df2.Value, df3.Value, df1.Timestamp]
headers = ["Power Generation", "Solar Irradiance", "Temperature", "Timestamp"]
df = pd.concat(data, axis=1, keys=headers)
df["Timestamp"] = pd.to_datetime(df["Timestamp"])
register_matplotlib_converters()
df['TimestampDay'] = (df['Timestamp'] - df['Timestamp'].min())  / np.timedelta64(1,'D')

df_new = df.query('`Solar Irradiance` != 0')

pct_train = 0.9

y = df_new['Power Generation']
X_multiplo = df_new[['Solar Irradiance','Temperature']]

size_train = int(len(y)*pct_train)

X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(X_multiplo, y, test_size=(1-pct_train), shuffle=False)

model = LinearRegression()
model.fit(X_train_m, y_train_m)

print('R² = {}'.format(round(model.score(X_train_m, y_train_m), 3))) # Fit on the training set

y_treino_previsto_m = model.predict(X_train_m)
y_teste_previsto_m = model.predict(X_test_m)

print('R² = %s' % round(metrics.r2_score(y_test_m, y_teste_previsto_m), 3)) # Prediction score on the test set

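I would suggest experimenting more.

Try running a random forest regressor and see whether the performance changes.

I would also suggest changing your train size to 0.7 or 0.8 and checking the performance.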
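As a rough sketch of the train-size experiment, the loop below fits a linear regression on the first 70%, 80%, and 90% of a dataset and reports the R² on the remainder. The data is synthetic and the coefficients are made up for illustration; `score_for_train_size` is a hypothetical helper, not part of the question's code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the question's data (an assumption, not real plant data).
rng = np.random.default_rng(0)
n = 1000
irradiance = rng.uniform(100, 1000, n)
temperature = rng.uniform(5, 35, n)
power = 0.5 * irradiance + 2.0 * temperature + rng.normal(0, 20, n)
X = np.column_stack([irradiance, temperature])


def score_for_train_size(X, y, train_size):
    """Fit on the first `train_size` fraction of rows and return the test R²."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_size, shuffle=False  # keep the time order, as in the question
    )
    model = LinearRegression().fit(X_tr, y_tr)
    return model.score(X_te, y_te)


scores = {size: score_for_train_size(X, power, size) for size in (0.7, 0.8, 0.9)}
for size, r2 in scores.items():
    print(f"train_size={size}: test R² = {r2:.3f}")
```

With real, seasonal data the scores may differ a lot between split points, which is itself a useful diagnostic.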

R² is not the only evaluation metric; you can also try MSE and MAPE.
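It may also help to see why a good correlation does not rule out a negative R². R² compares the model's squared error with that of always predicting the mean of the test targets, so predictions that follow the trend but sit at the wrong level can score below zero. The toy numbers below are invented for illustration and are not taken from the question's data.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0])
y_pred = y_true + 3.0  # same trend as y_true, shifted by a constant offset

corr = np.corrcoef(y_true, y_pred)[0, 1]  # Pearson correlation ignores the offset
r2 = r2_score(y_true, y_pred)             # R² penalizes it: 1 - 27 / 2

print(f"correlation = {corr:.3f}")  # 1.000
print(f"R² = {r2:.3f}")             # -12.500
```

Whether an offset of this kind explains the -0.339 in the question is only a guess; plotting predictions against actual values on the test period would show it.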

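Depending on what you want from the target variable, MAPE (mean absolute percentage error) may be a better evaluation metric. If you want to account for time, consider building a time series model.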
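Short of a full time series model, one modest first step is to evaluate with time-aware cross-validation, so that every fold trains on the past and tests on the future. The sketch below uses scikit-learn's `TimeSeriesSplit` on synthetic daily data with a yearly cycle; the data and its coefficients are assumptions made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Two years of synthetic daily data with a yearly seasonal cycle (assumption).
rng = np.random.default_rng(0)
days = np.arange(730)
season = np.sin(2 * np.pi * days / 365.25)
irradiance = 500 + 300 * season + rng.normal(0, 50, days.size)
temperature = 20 + 10 * season + rng.normal(0, 3, days.size)
power = 0.5 * irradiance + 2.0 * temperature + rng.normal(0, 20, days.size)
X = np.column_stack([irradiance, temperature])

# Each split trains on an initial stretch of days and tests on the stretch right after it.
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(LinearRegression(), X, power, cv=cv, scoring="r2")
print(scores.round(3))
```

A large spread between folds would suggest that the relationship changes with the season, which a single 90/10 split cannot reveal.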

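from sklearn.ensemble import RandomForestRegressor
regr = RandomForestRegressor(max_depth=10, random_state=0)
regr.fit(X_train_m, y_train_m)  # training split from the question's code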
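To judge whether the random forest actually helps, it should be scored on the same held-out data as the linear model. The following is a self-contained sketch on synthetic data with a deliberately nonlinear (saturating) response; the data, the saturation point, and the variable names are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
irradiance = rng.uniform(0, 1000, n)
temperature = rng.uniform(5, 35, n)
# Hypothetical response: output stops growing once irradiance passes 600.
power = 0.5 * np.minimum(irradiance, 600) + rng.normal(0, 10, n)
X = np.column_stack([irradiance, temperature])

X_train, X_test, y_train, y_test = train_test_split(
    X, power, test_size=0.2, shuffle=False
)

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(max_depth=10, random_state=0).fit(X_train, y_train)

linear_r2 = linear.score(X_test, y_test)
forest_r2 = forest.score(X_test, y_test)
print(f"linear regression test R² = {linear_r2:.3f}")
print(f"random forest test R²     = {forest_r2:.3f}")
```

If the forest scores clearly higher on real data, that points to a nonlinear relationship; if both score poorly, the issue is more likely in the split or in missing features.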

MAPE
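A minimal sketch of how MAPE could be computed with NumPy. This is not the answerer's original code; the `mape` helper and the choice to skip zero targets are assumptions.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent, ignoring zero targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0  # a zero target would cause a division by zero
    return float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100)


# Errors of 10%, 10% and 0% average to about 6.67%.
print(f"MAPE = {mape([100, 200, 400], [110, 180, 400]):.2f}%")
```

Because MAPE divides by the actual value, it can blow up when power generation is close to zero, for example around sunrise and sunset.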

RMSE
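A minimal sketch of RMSE with NumPy, again not the answerer's original code; the `rmse` helper is a hypothetical name.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Squared errors are 100, 400 and 0, so RMSE = sqrt(500 / 3).
print(f"RMSE = {rmse([100, 200, 400], [110, 180, 400]):.3f}")
```

Unlike R², RMSE is expressed in the units of Power Generation, which makes it easier to judge whether the error is acceptable in practice.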

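You need to show us your code. I'm not sure how you are doing linear regression with two independent variables. Shouldn't you be showing us power vs. solar irradiance or power vs. temperature, rather than power vs. time? That would tell us whether a linear model is the right approach.

@TimRoberts Sorry, I forgot to add the scatter plots. I just added them in an edit.

What about the code?

This isn't what you asked, but I would discourage using linear regression for this problem.

@DMeneses Just added the complete code! :)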