Finding correlated pairs with Python


python pandas machine-learning


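Suppose I have sales statistics. I would like to know:

How do price and the other columns affect sales? Which features are the most influential? Which price would maximise sales?

Which Python libraries can help here? Any examples would be great.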

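scikit-learn, the Python machine-learning library, is well suited to your case. Its feature_selection sub-module does exactly what you need. An example is given further down.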

NAME      PRICE   SALES  VIEWS  AVG_RATING  VOTES  COMMENTS
Module 1  $12.00     69  12048           5      3        26
Module 2  $24.99     12  52858           5      1        14
Module 3  $10.00      1   1381          -1      0         0
Module 4  $22.99     46  57841           5      8        24
...
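
Since the question is about finding correlated pairs, a first look with pandas could be a plain correlation matrix. The following is a minimal sketch, assuming the data is loaded with the column names shown in the table above; only the four visible rows are used, and the elided rows are not reproduced.

```python
import pandas as pd

# The four sample rows from the table above (the elided rows are omitted).
rows = [
    ("Module 1", "$12.00", 69, 12048, 5, 3, 26),
    ("Module 2", "$24.99", 12, 52858, 5, 1, 14),
    ("Module 3", "$10.00", 1, 1381, -1, 0, 0),
    ("Module 4", "$22.99", 46, 57841, 5, 8, 24),
]
columns = ["NAME", "PRICE", "SALES", "VIEWS", "AVG_RATING", "VOTES", "COMMENTS"]
df = pd.DataFrame(rows, columns=columns)

# Strip the currency symbol so that PRICE becomes a numeric column.
df["PRICE"] = df["PRICE"].str.replace("$", "", regex=False).astype(float)

# Pairwise Pearson correlations between all numeric columns.
corr = df.drop(columns="NAME").corr()

# Correlation of every other column with SALES, strongest (by magnitude) first.
sales_corr = corr["SALES"].drop("SALES")
sales_corr = sales_corr.reindex(sales_corr.abs().sort_values(ascending=False).index)
print(sales_corr)
```

With only four rows the numbers mean very little; the point is the shape of the workflow. Correlation also measures linear association only, so it does not show by itself that changing the price would change sales.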

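From the plot produced by the code below, we can see that the model does fairly well: it captures most of the variation in 'Sales'. Note that the errors are measured on the same data the model was trained on.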


from sklearn.datasets import make_regression

# Simulate a dataset with 500 factors, of which only 5 are truly informative;
# the other 495 are noise. Assume y is your response variable 'Sales' and X
# holds your candidate factors.
X, y = make_regression(n_samples=1000, n_features=500, n_informative=5, noise=5)

print(X.shape)  # (1000, 500)
print(y.shape)  # (1000,)

from sklearn.feature_selection import f_regression
# Regress Sales on each factor individually and get the p-values
_, p_values = f_regression(X, y)
# Keep the factors that are significant at p < 0.05
mask = p_values < 0.05
X_informative = X[:, mask]

print(X_informative.shape)  # (1000, 38) in the original run; the count varies with the random data
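
The p < 0.05 cut above kept 38 columns even though only 5 are informative, which is expected: with 495 noise columns, roughly 5% of them pass the threshold by chance. One alternative is to keep a fixed number of top-scoring features with `SelectKBest`. This is a sketch, not part of the original answer; `k=5` and `random_state=0` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Same kind of simulated data as above; random_state is fixed here (an
# assumption, not in the original) only to make the run repeatable.
X, y = make_regression(n_samples=1000, n_features=500, n_informative=5,
                       noise=5, random_state=0)

# Keep the k features with the highest univariate F-scores.
selector = SelectKBest(score_func=f_regression, k=5)
X_top = selector.fit_transform(X, y)

# Column indices of the selected features in the original X.
selected = selector.get_support(indices=True)
print(X_top.shape, selected)
```

On real sales data the right `k` is not known in advance, so it would normally be tuned, for example by cross-validation.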

from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(n_estimators=100)
# fit our model
gbr.fit(X_informative, y)
# generate predictions
gbr_preds = gbr.predict(X_informative)

import matplotlib.pyplot as plt

# Calculate the prediction errors and plot them next to y
gbr_error = y - gbr_preds

fig, ax = plt.subplots()
ax.hist(y, label='y', alpha=0.5)
ax.hist(gbr_error, label='errors in predictions', alpha=0.4)
ax.legend(loc='best')
plt.show()
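
The errors plotted above come from the data the model was fitted on, which tends to flatter the model. A possible refinement, sketched here under assumptions that are not in the original answer (a 75/25 split and `random_state=0`), is to score the model on held-out rows and to read `feature_importances_` to see which of the kept factors the model relies on most.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import f_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Simulated data as in the example above; the seed is an assumption.
X, y = make_regression(n_samples=1000, n_features=500, n_informative=5,
                       noise=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Select factors using the training rows only, so the test rows stay unseen.
_, p_values = f_regression(X_train, y_train)
mask = p_values < 0.05

gbr = GradientBoostingRegressor(n_estimators=100, random_state=0)
gbr.fit(X_train[:, mask], y_train)

# R^2 on the held-out rows: the share of variance in y explained there.
test_r2 = r2_score(y_test, gbr.predict(X_test[:, mask]))
print("held-out R^2:", round(test_r2, 3))

# Map the model's importances back to column indices of the original X.
kept = np.flatnonzero(mask)
order = np.argsort(gbr.feature_importances_)[::-1][:5]
top_features = kept[order]
print("most influential columns:", top_features)
```

With a real table, `top_features` would index into the column names (PRICE, VIEWS and so on), giving a rough ranking of what drives 'Sales'. Importances describe what the fitted model uses, not causal effects.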