Python Machine Learning and Linear Regression - Expected 2D array, reshape data


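I'm new to machine learning, so I decided to write a test program using a Pokémon dataset that predicts the "Catch Rate" from the "Total" value. I want to use linear regression on my training data. But when I run my program, I get the following error: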

Expected 2D array, got 1D array instead:
array=['190' '90' '45' '125' '190' '75' '45' '120' '200' '45' ... '190' '255' '125' '120' '75' '60' '90' '140'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

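To fix the error, I tried reshaping my x_train list, since it seems to be the array mentioned above, but I still get the same error. Maybe my syntax is incorrect? Following another suggestion, I tried
x_train.reshape(-1, 1)
and
x_train = x_train.reshape(-1, 1)
but neither worked.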

Here is the (rough) code I have written so far:

from sklearn import cross_validation
from sklearn import svm
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn import linear_model
import numpy as np
import matplotlib.pyplot as plt
import csv

# Create linear regression object
regr = linear_model.LinearRegression()

# Create lists and append data -- we want to predict the catch rate!
total = []
catch_rate = []

with open("pokemon.csv") as f:
    reader = csv.reader(f)
    next(reader) # skip header
    for row in reader:
        total.append(row[5])
        catch_rate.append(row[21])

x_train, x_test, y_train, y_test = cross_validation.train_test_split(
    catch_rate, total, test_size=0.25, random_state=0)


# Train the model using the training sets
regr.fit(x_train, y_train)

# Make predictions using the testing set
pokemon_y_pred = regr.predict(x_test)

# Plot outputs
plt.scatter(x_test, y_test,  color='black')
plt.plot(x_test, pokemon_y_pred, color='blue', linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

Maybe I'm overlooking something else in my understanding of the code? Again, I'm teaching myself, so I'd really appreciate any help.
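For reference, the reshape hint in the error message can be illustrated with a small sketch. The four values below are a hypothetical sample standing in for the catch-rate strings that csv.reader produces; they are not the full dataset. The sketch assumes only NumPy and shows two points from the question: a plain list has no reshape method, and reshape returns a reshaped array rather than changing the original in place.

```python
import numpy as np

# Hypothetical sample: a few catch-rate values as a plain Python list of
# strings, mirroring what csv.reader returns.
x_train = ['190', '90', '45', '125']

# A list has no .reshape method, so convert it to a NumPy array first.
# dtype=float also turns the strings into numbers.
x_arr = np.array(x_train, dtype=float)
print(x_arr.shape)            # 1D array

# reshape does not modify x_arr in place; the result must be assigned.
x_col = x_arr.reshape(-1, 1)  # many samples, one feature each
x_row = x_arr.reshape(1, -1)  # one sample with many features
print(x_col.shape)
print(x_row.shape)
```

With one feature per Pokémon, the column form (x_col) is the shape an estimator's fit method expects for its feature argument.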

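Use a pandas DataFrame instead of the list type. Also note that the first argument to train_test_split, the feature data that is later passed to fit, needs to be two-dimensional, with one row per sample and one column per feature. A DataFrame satisfies this.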
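The difference between one-dimensional and two-dimensional data in pandas can be sketched as follows. The three-row CSV text is a hypothetical excerpt containing only the two columns of interest, loaded from a string so the example is self-contained. Selecting a column with single brackets gives a 1D Series, while double brackets give a 2D DataFrame with one column.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt with only the two relevant columns.
csv_text = """Total,Catch_Rate
318,45
405,45
525,45
"""
df = pd.read_csv(io.StringIO(csv_text))

as_series = df["Total"]    # single brackets: 1D Series
as_frame = df[["Total"]]   # double brackets: 2D DataFrame with one column

print(as_series.shape)
print(as_frame.shape)
```

The two-dimensional form is the one suited to the feature argument of fit; the target can stay one-dimensional.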


So, assuming your csv file looks like this:

Id,Name,Type_1,Type_2,Total,HP,Attack,Defense,Sp_Atk,Sp_Def,Speed,Generation,isLegendary,Color,hasGender,Pr_Male,Egg_Group_1,Egg_Group_2,hasMegaEvolution,Height_m,Weight_kg,Catch_Rate
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False,Green,True,0.875,Monster,Grass,False,0.71,6.9,45
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False,Green,True,0.875,Monster,Grass,False,0.99,13,45
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False,Green,True,0.875,Monster,Grass,True,2.01,100,45
4,Charmander,Fire,,309,39,52,43,60,50,65,1,False,Red,True,0.875,Monster,Dragon,False,0.61,8.5,45
5,Charmeleon,Fire,,405,58,64,58,80,65,80,1,False,Red,True,0.875,Monster,Dragon,False,1.09,19,45
and using the following code:

from sklearn import svm
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn import linear_model
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd  # import pandas

# Create linear regression object
regr = linear_model.LinearRegression()

# Load the csv file with pandas
df = pd.read_csv("pokemon.csv")
# Remove all string columns
df = df.drop(['Name', 'Type_1', 'Type_2', 'Color', 'Egg_Group_1', 'Egg_Group_2'], axis=1)

# Target column (note: df still contains Catch_Rate among the features)
y = df.Catch_Rate

x_train, x_test, y_train, y_test = train_test_split(df, y, test_size=0.25, random_state=0)


# Train the model using the training sets
regr.fit(x_train, y_train)

# Make predictions using the testing set
pokemon_y_pred = regr.predict(x_test)

print(pokemon_y_pred)


# [ code continuation ...]
you will get:

[ 45.  45.]
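Since the question asks about predicting the catch rate from the total alone, a single-feature variant may be useful. The sketch below is an assumption about how that could look, not the answerer's code. It uses synthetic, made-up numbers with an exactly linear relation (not values from pokemon.csv) so that the fitted slope and intercept are easy to check, and it keeps the target column out of the feature matrix.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, made-up data (not from pokemon.csv): an exactly linear relation
# so the fitted coefficients are easy to verify.
total = np.arange(300, 700, 50)  # 300, 350, ..., 650 (8 values)
df = pd.DataFrame({"Total": total, "Catch_Rate": 300 - 0.4 * total})

X = df[["Total"]]      # 2D feature matrix, one column
y = df["Catch_Rate"]   # 1D target

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

regr = LinearRegression()
regr.fit(x_train, y_train)
pred = regr.predict(x_test)

print(regr.coef_, regr.intercept_)
print(pred)
```

On real data the relation will not be exactly linear, so predictions will differ from the held-out targets.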

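Stack trace of the exception?

The error says the model cannot tell whether you are providing a dataset with one instance that has multiple features, or a dataset with multiple instances that each have only one feature. That is why it asks you to reshape. (I assume you have many rows with one feature, so .reshape(-1, 1) should work.) If it still does not work after you reshape X into a two-dimensional matrix, you should show the error you are facing.

@SeljukGülcan Maybe I am using .reshape in the wrong place in my code? After splitting the test data, but before training the model with the training sets, I added x_train.reshape(-1, 1) and got a new error: AttributeError: 'list' object has no attribute 'reshape'

@GarrettMcClure, the position seems correct. Try x_train = np.array(x_train).reshape(-1, 1). You should do the same for x_test.

@SeljukGülcan Now I get a new error: TypeError: Cannot cast array data from dtype('float64') to dtype('...

Weird, I get another ValueError: ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). I also removed the True/False columns:
df = df.drop(['Name', 'Type_1', 'Type_2', 'Type_2', 'isLegendary', 'Color', 'hasGender', 'Egg_Group_1', 'Egg_Group_2', 'hasMegaEvolution', 'Body_Style'], axis=1)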
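The two follow-up errors in the comments plausibly come from the data itself: values read by csv.reader are strings, and a file may contain missing entries. The sketch below shows one possible cleanup under that assumption. The raw values are hypothetical, including one empty string standing in for a missing entry. It converts the columns to numbers with pandas.to_numeric and drops incomplete rows before building the arrays.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values as csv.reader would yield them: strings, one empty.
raw_total = ["318", "405", "", "309"]
raw_catch = ["45", "45", "45", "190"]

df = pd.DataFrame({"Total": raw_total, "Catch_Rate": raw_catch})

# Convert strings to numbers; anything unparsable (here the empty string)
# becomes NaN instead of raising an error.
df = df.apply(pd.to_numeric, errors="coerce")

# Drop rows with missing values so the estimator never sees NaN.
clean = df.dropna()

X = clean[["Total"]].to_numpy()      # 2D numeric feature matrix
y = clean["Catch_Rate"].to_numpy()   # 1D numeric target

print(X.shape, X.dtype)
print(np.isnan(X).any())
```

Dropping rows is only one option; filling missing values is another, and which is appropriate depends on the dataset.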