Predictive analysis of a large dataset in Python

Tags: python, scikit-learn, regression

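I've been able to use SVR successfully to predict values on a dataset when I feed it only one data entry per row. However, my dataset has 47 entries per row (or per item, whatever you want to call it). I've uploaded my dataset csv, and in my code I've commented out the other 46 entries in the get_data function.

All 47 data entries are related and affect x, the player's salary. I'm trying to predict a player's future salary using only the statistics from before that salary was known. However, as I mentioned, many statistics determine the salary, and at the moment I can only make a prediction from 1 stat entry.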

import csv
import numpy as np
from sklearn.svm import SVR
import matplotlib.pyplot as plt

salary = []
stats = []

def get_data(filename):
    with open(filename, 'r', encoding='utf8', errors='ignore') as csvfile:
        csvFileReader = csv.reader(csvfile)
        for row in csvFileReader:
#            stats.append(float(row[4]))   # 
#            stats.append(int(row[5]))         #
            salary.append(float(row[6]))
#            stats.append(int(row[8]))        #
#            stats.append(int(row[9]))        #
#            stats.append(int(row[10]))         #
            stats.append(int(row[11]))      #
#            stats.append(int(row[12]))        #
#            stats.append(int(row[13]))        #
#            stats.append(float(row[14]))      #
#            stats.append(int(row[15]))        #
#            stats.append(int(row[16]))       #
#            stats.append(int(row[17]))       #
#            stats.append(int(row[18]))        #
#            stats.append(int(row[19]))           #
#            stats.append(int(row[20]))           #
#            stats.append(int(row[21]))             #
#            stats.append(int(row[22]))            #
#            stats.append(int(row[23]))            #
#            stats.append(int(row[24]))            #
#            stats.append(float(row[25]))          #
#            stats.append(int(row[26]))            #
#            stats.append(int(row[27]))           #
#            stats.append(int(row[28]))           #
#            stats.append(int(row[29]))            #
#            stats.append(int(row[30]))            #
#            stats.append(int(row[31]))            #
#            stats.append(int(row[32]))              #
#            stats.append(int(row[33]))             #
#            stats.append(int(row[34]))             #
#            stats.append(int(row[35]))             #
#            stats.append(float(row[36]))           #
#            stats.append(int(row[37]))             #
#            stats.append(int(row[38]))            #
#            stats.append(int(row[39]))            #
#            stats.append(int(row[40]))             #
#            stats.append(int(row[41]))            #
#            stats.append(int(row[42]))            #
#            stats.append(int(row[43]))              #
#            stats.append(int(row[44]))             #
#            stats.append(int(row[45]))             #
#            stats.append(int(row[46]))             #
#            stats.append(float(row[47]))           #
#            stats.append(int(row[48]))             #
#            stats.append(int(row[49]))             #
#            stats.append(int(row[50]))            #
#            stats.append(int(row[51]))            #
#            stats.append(int(row[52]))            #
    return

get_data('dataset.csv')

def predict_salary(stats, salary, x):
    stats = np.reshape(stats,(len(salary), int(len(stats)/len(salary))))

    svr_lin = SVR(kernel='linear', C=1e3, epsilon=0.2, cache_size=7000)
    svr_rbf = SVR(kernel= 'rbf', C=1e3, gamma=0.1, cache_size=7000)
    svr_poly = SVR(kernel='poly', C=1e3, degree=2, cache_size=7000)
    svr_lin.fit(stats, salary)
    svr_rbf.fit(stats, salary)
    svr_poly.fit(stats, salary)

    plt.scatter(stats, salary, color='black', label='Data')
    plt.plot(stats, svr_lin.predict(stats), color='green', label='Linear model')
    plt.plot(stats, svr_rbf.predict(stats), color='red', label='RBF model')
    plt.plot(stats, svr_poly.predict(stats), color='blue', label='Polynomial model')
    plt.xlabel('Stats')
    plt.ylabel('Salary')
    plt.title('Support Vector Regression')
    plt.legend()
    plt.show()

    return svr_lin.predict(x)[0], svr_rbf.predict(x)[0], svr_poly.predict(x)[0]


projected_salary = predict_salary(stats, salary, 1)

print (projected_salary)
Here is dataset.csv. I've included only 10 rows, but the full file has up to 200 rows:

N/A,N/A,player 1,team,3,26,1350000,508500,22,31,32,8,361,3,0.217,0,0,0,0,25,33,48,11,390,13,0.256,0,0,0,0,9,18,22,1,225,4,0.215,0,0,0,0,22,27,37,8,313,9,0.192,0,0,0,0,0
N/A,N/A,player 2,team,3,27,805000,508500,15,26,17,4,176,1,0.242,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,1,1,2,0,13,0,0.231,0,0,0,0,10,10,17,1,168,1,0.201,0,0,0,0,0
N/A,N/A,player 3,team,3,25,2625000,508500,25,17,69,3,460,58,0.26,0,0,0,0,15,28,56,4,454,57,0.226,0,0,0,0,39,48,72,6,611,56,0.25,0,0,0,0,2,1,9,0,22,13,0.368,2,0,0,0,0
N/A,N/A,player 4,team,3,26,3575000,508500,65,81,73,30,601,6,0.243,0,0,0,0,37,46,44,11,497,13,0.258,0,0,0,0,29,36,47,10,411,4,0.221,0,0,0,1,25,36,41,8,335,5,0.265,0,0,0,0,0
N/A,N/A,player 5,team,3,28,1950000,508500,23,34,45,7,324,4,0.255,0,0,0,0,35,45,56,2,509,8,0.28,1,0,0,0,32,29,68,4,492,12,0.281,0,0,0,0,5,14,15,0,144,1,0.25,0,0,0,0,0
N/A,N/A,player 6,team,2.5,30,700000,508500,3,0,7,0,141,0,0.174,0,0,0,0,28,49,38,11,355,0,0.234,0,0,0,0,18,28,22,9,275,0,0.207,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
N/A,N/A,player 7,team,2.5,26,2550000,508500,31,39,67,6,622,17,0.294,1,0,0,0,25,35,57,1,452,19,0.272,0,0,0,0,3,4,13,1,125,1,0.237,0,0,0,0,5,10,17,0,131,0,0.289,0,0,0,0,0
N/A,N/A,player 8,team,3,28,938000,508500,15,28,21,6,166,4,0.284,0,0,0,0,8,10,13,2,113,0,0.146,0,0,0,0,3,4,8,0,79,1,0.213,0,0,0,0,11,19,16,4,197,0,0.189,0,0,0,0,0
N/A,N/A,player 9,team,3,24,2300000,508500,40,49,52,5,466,21,0.277,0,0,0,0,36,43,59,4,552,16,0.227,0,0,0,0,27,26,34,6,332,8,0.261,0,0,0,0,5,5,5,0,61,2,0.291,0,0,0,0,0
N/A,N/A,player 10,team,3,27,3025000,508500,63,70,57,24,548,0,0.245,0,0,0,0,30,31,30,10,234,0,0.304,0,0,0,0,57,76,74,24,478,8,0.312,0,0,0,0,23,17,32,5,213,2,0.263,0,0,0,0,0
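It took me days to get this working with even one of the 47 entries, and I've spent even more time trying to figure out how to make it analyze the whole set for each player. I'm a beginner in Python and have no statistics background, so I'm completely lost right now! Any help and guidance is appreciated, thanks.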

I would use pandas, since the comment-out-lines approach you've taken is painful, to say the least.

import pandas

# list of columns (features) you'd like to use
columns_of_interest = [11, 15, 20, 26] # features you'd like to use (stats). You only used 11 but you could use many more

df = pandas.read_csv(filename, header=None)
stats = df[columns_of_interest].values # select columns of interest

salary = df[6].values   # salary column, which is in column 6
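To make the column selection concrete, here is a minimal sketch run on two shortened rows laid out like dataset.csv. The inline sample, the chosen column numbers, and the names X and y are assumptions for illustration only. The point is that each row becomes one sample and each selected column becomes one feature of that sample.

```python
import io

import pandas as pd

# Two shortened rows in the same layout as dataset.csv (first 10 columns only).
sample_csv = io.StringIO(
    "N/A,N/A,player 1,team,3,26,1350000,508500,22,31\n"
    "N/A,N/A,player 2,team,3,27,805000,508500,15,26\n"
)
df = pd.read_csv(sample_csv, header=None)

feature_columns = [4, 5, 8, 9]        # hypothetical choice of stat columns
X = df[feature_columns].to_numpy()    # one row per player, one column per stat
y = df[6].to_numpy()                  # salary is column 6

print(X.shape)  # (n_players, n_features)
print(y.shape)  # (n_players,)
```

Adding more stats only widens X; the number of rows, and therefore the number of salaries being modelled, stays the same.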
Then you can use sklearn's train_test_split. This lets you split the data into a training set and a test set.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(stats, salary)
You can then pass these to the prediction function:

pred_lin, pred_rbf, pred_poly = predict_salary(x_train, y_train, x_test)
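I used three variables on the left-hand side, since the function returns three sets of predictions, one from each SVR model.

Also, I would simply change the function's return to: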

return svr_lin.predict(x), svr_rbf.predict(x), svr_poly.predict(x)
This returns the whole set of predictions for the test set.
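Once full prediction arrays come back, they can be compared with the held-out salaries. Below is a rough sketch on synthetic data. The generated values, the C setting, and the choice of mean absolute error as the metric are all assumptions and not part of the original answer.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in for the real data: 40 "players", 3 stats each.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=40)

# Default split keeps 25% of the rows for testing.
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = SVR(kernel="linear", C=10.0)
model.fit(x_train, y_train)

pred = model.predict(x_test)              # one prediction per test row
mae = mean_absolute_error(y_test, pred)   # average absolute miss, in units of y
print(pred.shape, y_test.shape)
print("mean absolute error:", mae)
```

The same comparison could be made for each of the three SVR models to see which kernel generalizes best on the test rows.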

Use the code below; it should work.
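By the way, 200 rows is far from "huge" as datasets go. Huge datasets these days are measured in terabytes.

So the pandas approach ends up with the same array as the method I'm using now, but it's obviously cleaner. Thank you for that. I commented out the np.reshape of stats in predict_salary, removed the [0] indexing from its return line, and added train_test_split. With 1 column of data I was able to make it work. But if I add more than 1, the poly model locks up and the linear and rbf models produce an error: raise ValueError("x and y must be the same size"). Ironically, for some reason those 4 columns are among the few that do work. If I add all the columns I need (4, 5, 9-52), the script just hangs. Also, forgive me, I still don't understand how this works. If I can pass in all the needed columns, will I essentially be able to predict column 6, the salary, for a new player for whom I have "features" in columns 4, 5, 9-52 but no value in column 6?

You need to pass all the columns except column 6 into columns_of_interest. One way to do that is:
columns_of_interest = [i for i in df.columns if i != 6]
This needs to be declared after calling
df = pandas.read_csv(filename, header=None)
If the answer serves your purpose, I'd appreciate an upvote/accept. Thanks.

I upvoted, and I really appreciate the help because I'm lost right now. But it doesn't do what I need. If I add more features, the script currently hangs. Also, I think what it does is predict a salary for each feature. What I actually need is to predict the salary for the whole row, not for each feature of a row. If that makes sense, I want to treat a row's features as a single entry?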
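One caveat about taking every column except 6: in the sample rows shown in the question, columns 0 to 3 hold text or N/A, which SVR cannot fit on directly. A possible way to keep only usable numeric columns is sketched below. The inline sample and the variable name numeric are assumptions; note that pandas reads the literal N/A as a missing value by default, so those columns come back numeric but entirely empty.

```python
import io

import pandas as pd

# Two shortened rows in the same layout as dataset.csv (first 10 columns only).
sample_csv = io.StringIO(
    "N/A,N/A,player 1,team,3,26,1350000,508500,22,31\n"
    "N/A,N/A,player 2,team,3,27,805000,508500,15,26\n"
)
df = pd.read_csv(sample_csv, header=None)

# Keep numeric columns, then drop those that are entirely missing (the N/A columns).
numeric = df.select_dtypes(include="number").dropna(axis=1, how="all")

# Every remaining column except the salary (column 6) is a feature.
columns_of_interest = [c for c in numeric.columns if c != 6]
stats = numeric[columns_of_interest].to_numpy()
salary = numeric[6].to_numpy()

print(columns_of_interest)
print(stats.shape, salary.shape)
```

Each row of stats is still a single entry made up of all of that player's features, which is the shape SVR expects.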
import csv
import numpy as np
from sklearn.svm import SVR
import matplotlib.pyplot as plt
import pandas
from sklearn.model_selection import train_test_split



def predict_salary(stats, salary, x):

    svr_lin = SVR(kernel='linear', C=1e3, epsilon=0.2, cache_size=7000)
    svr_rbf = SVR(kernel= 'rbf', C=1e3, gamma=0.1, cache_size=7000)
    svr_poly = SVR(kernel='poly', C=1e3, degree=2, cache_size=7000)
    svr_lin.fit(stats, salary)
    svr_rbf.fit(stats, salary)
    svr_poly.fit(stats, salary)

    # plt.scatter(stats, salary, color='black', label='Data')
    plt.scatter(salary, svr_lin.predict(stats), color='green', label='Linear model')
    plt.scatter(salary, svr_rbf.predict(stats), color='red', label='RBF model')
    plt.scatter(salary, svr_poly.predict(stats), color='blue', label='Polynomial model')
    plt.xlabel('Actual Salary')
    plt.ylabel('Salary Predictions')
    plt.title('Support Vector Regression')
    plt.legend()
    plt.show()

    return svr_lin.predict(x), svr_rbf.predict(x), svr_poly.predict(x)



filename = '/Users/carlomazzaferro/Desktop/p.csv'

columns_of_interest = [11, 15, 20, 26]

df = pandas.read_csv(filename, header=None)
stats = df[columns_of_interest].values # select columns of interest

salary = df[6].values   # salary column, which is column 6

x_train, x_test, y_train, y_test = train_test_split(stats, salary)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)


pred_lin, pred_rbf, pred_poly = predict_salary(x_train, y_train, x_test)
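On the hang reported in the comments: one plausible cause, not confirmed anywhere in the thread, is that SVR with a large C is being fitted on unscaled features and on salaries in the millions. A common remedy is to standardize both the features and the target before fitting. The sketch below uses synthetic data; the data generation, C=1.0, and the pipeline layout are assumptions rather than the answerer's method. It also shows that a new player is passed as a single row holding all of the features, and that one salary comes back for that row.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in: 60 players, 45 count-like stats plus 1 average-like stat.
rng = np.random.default_rng(0)
n_players, n_features = 60, 46
X = np.hstack([
    rng.integers(0, 600, size=(n_players, n_features - 1)),
    rng.uniform(0.15, 0.35, size=(n_players, 1)),
]).astype(float)
# Made-up salaries that depend on the first stat, plus noise.
y = 500_000 + 4_000 * X[:, 0] + rng.normal(0, 50_000, size=n_players)

# Scale the features inside the pipeline and the salary via the target transformer.
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel="linear", C=1.0)),
    transformer=StandardScaler(),
)
model.fit(X, y)

# A new player is ONE row with all 46 features; here the column means stand in for real stats.
new_player = X.mean(axis=0, keepdims=True)
predicted_salary = model.predict(new_player)   # array holding a single salary
print(new_player.shape, predicted_salary.shape)
```

Because the scaling lives inside the model object, the same transformation is applied automatically to any new row passed to predict.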