Python: fitting a DataFrame to a linear regression

python pandas scikit-learn


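I am doing a project for a class, and I am trying to predict NFL football games using linear regression and sklearn's predict function. My problem comes up when I try to pass the training data to the fit function. My code is as follows: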

onehotdata_x1 = pd.get_dummies(goal_model_data,columns=['team','opponent'])

# Create the linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(onehotdata_x1[['home','team','opponent']], onehotdata_x1['goals'])

This is the structure of the dataframe (goal_model_data):

This is the error I get when I run the program:

Traceback (most recent call last):
  File "predictnflgames.py", line 76, in <module>
    regr.fit(onehotdata_x1[['home','team','opponent']], onehotdata_x1['goals'])
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2133, in __getitem__
    return self._getitem_array(key)
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2177, in _getitem_array
    indexer = self.loc._convert_to_indexer(key, axis=1)
  File "C:\Python27\lib\site-packages\pandas\core\indexing.py", line 1269, in _convert_to_indexer
    .format(mask=objarr[mask]))
KeyError: "['team' 'opponent'] not in index"


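When you use pd.get_dummies(goal_model_data, columns=['team', 'opponent']), the team and opponent columns are removed from your dataframe and replaced by dummy columns, so onehotdata_x1 no longer contains those two columns.

Then, when you execute onehotdata_x1[['home', 'team', 'opponent']], you get a KeyError, simply because the team and opponent columns do not exist in the onehotdata_x1 dataframe.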

With a toy dataframe, this is what happens:
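The toy example itself did not survive on this page, so here is a minimal sketch of the behaviour described above. The dataframe and its values are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical toy data: values are invented, column names follow the question.
toy = pd.DataFrame({
    'team': ['BUF', 'CHI', 'NE'],
    'opponent': ['NYJ', 'ATL', 'PIT'],
    'home': [1, 0, 1],
    'goals': [21, 17, 30],
})

encoded = pd.get_dummies(toy, columns=['team', 'opponent'])

# 'team' and 'opponent' are gone; indicator columns such as 'team_BUF' replace them.
print(list(encoded.columns))

try:
    encoded[['home', 'team', 'opponent']]
except KeyError as err:
    # Same kind of failure as in the question's traceback.
    print('KeyError:', err)
```

Printing the column list before indexing is usually the quickest way to spot this kind of mismatch.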

The problem is that after pd.get_dummies there are no team and opponent columns any more.

I used data in txt format as an example (the same as yours):

Try this:

import pandas as pd
from sklearn.linear_model import LinearRegression

# Read the whitespace-separated text file
goal_model_data = pd.read_table('goal_model_data.txt', sep=r'\s+')

# One-hot encode the categorical columns
onehotdata_x1 = pd.get_dummies(goal_model_data, columns=['team', 'opponent'])

regr = LinearRegression()

# See the columns in onehotdata_x1
print(onehotdata_x1.columns)

# See the data (only 2 rows of the data for the example)
print(onehotdata_x1.head(2))

Result:

Index([u'goals', u'home', u'team_BUF', u'team_CHI', u'team_CIN', u'team_CLE',
       u'team_DET', u'team_HOU', u'team_NE', u'team_TEN', u'opponent_ARI',
       u'opponent_ATL', u'opponent_BAL', u'opponent_JAX', u'opponent_KC',
       u'opponent_NYJ', u'opponent_OAK', u'opponent_PIT'],
       dtype='object')

Edit 1

Based on your original code, you probably want to do something like this:

import pandas as pd
from sklearn.linear_model import LinearRegression

# Read the whitespace-separated text file
data = pd.read_table('data.txt', sep=r'\s+')

# One-hot encode the categorical columns
onehotdata = pd.get_dummies(data, columns=['team', 'opponent'])

regr = LinearRegression()

# Features: every column except the goals column
x = onehotdata.loc[:, onehotdata.columns != 'goals']

# Target: the goals column
y = onehotdata['goals']

regr.fit(x, y)
regr.predict(x)
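The code above predicts on the training rows. For a new matchup, the encoded row needs the same dummy columns, in the same order, as the training features. The following is only a sketch under assumptions: the data is invented, and using reindex with fill_value=0 to align the columns is one possible approach, not something taken from the original answer.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data: values are invented for illustration.
train = pd.DataFrame({
    'team': ['BUF', 'CHI', 'NE', 'BUF'],
    'opponent': ['NYJ', 'ATL', 'PIT', 'PIT'],
    'home': [1, 0, 1, 0],
    'goals': [21, 17, 30, 14],
})
encoded = pd.get_dummies(train, columns=['team', 'opponent'], dtype=int)
x = encoded.drop(columns='goals')
y = encoded['goals']

regr = LinearRegression().fit(x, y)

# A new, unseen matchup (also hypothetical).
new_game = pd.DataFrame({'team': ['NE'], 'opponent': ['NYJ'], 'home': [0]})
new_encoded = pd.get_dummies(new_game, columns=['team', 'opponent'], dtype=int)

# Align with the training columns: dummies missing from the new row become 0.
new_x = new_encoded.reindex(columns=x.columns, fill_value=0)

prediction = regr.predict(new_x)
print(prediction)
```

Without the alignment step, the new row would contain only team_NE and opponent_NYJ, and the model would reject it because the feature set differs from the one it was fitted on.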

Hope this helps.

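Can you add the output of onehotdata_x1.head()?

You are trying to access columns that no longer exist after using pd.get_dummies. See my answer for more details.

As a side note, when fitting a linear regression you should not use the team and opponent columns as independent variables, because they are categorical and linear regression only works with numeric values. You should use the dummy variables that were created.

But how do I use the dummy variables that were created?

@xtrios You should print onehotdata_x1 to understand what is happening here. You will see new columns created compared with the original goal_model_data dataframe (one set for team and one for opponent). These new columns are the dummy variables. You want to select these, together with the home variable, as features, and use the goals column as the target.

Actually, pd.get_dummies() only removes the team and opponent columns. goals and home are kept. You can also see that the KeyError is raised only for team and opponent.

After pd.get_dummies there are no team and opponent columns. Cheers

You should correct your first sentence if that is not what you meant. This is not just a silly mistake. It stands for something bigger. A mistake of the mind that makes you who you really are. (Just kidding, I took a look at what you wrote in your profile :))

So how can I use the dummies you made?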
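To illustrate the advice in the comments about selecting the dummy columns plus home as features, here is a minimal sketch. It assumes the dummy columns keep the default team_ and opponent_ prefixes that pd.get_dummies generates; the data is invented for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: values are invented, column names follow the question.
goal_model_data = pd.DataFrame({
    'team': ['BUF', 'CHI', 'NE', 'BUF'],
    'opponent': ['NYJ', 'ATL', 'PIT', 'PIT'],
    'home': [1, 0, 1, 0],
    'goals': [21, 17, 30, 14],
})
onehotdata_x1 = pd.get_dummies(
    goal_model_data, columns=['team', 'opponent'], dtype=int
)

# Pick the generated dummy columns by prefix, then add 'home'.
dummy_cols = [
    c for c in onehotdata_x1.columns
    if c.startswith(('team_', 'opponent_'))
]
feature_cols = ['home'] + dummy_cols

regr = LinearRegression()
regr.fit(onehotdata_x1[feature_cols], onehotdata_x1['goals'])

# One fitted coefficient per feature column.
coefs = pd.Series(regr.coef_, index=feature_cols)
print(coefs)
```

This replaces the failing onehotdata_x1[['home', 'team', 'opponent']] selection with one that names only columns that actually exist after encoding.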