
Python: error when converting a PySpark DataFrame to a NumPy array to run LightGBM


I want to run lightgbm.LGBMRegressor with PySpark on a Databricks cluster.

My code was developed based on the following:

Here, x_train and y_train are a PySpark DataFrame and a list, respectively. According to the link above, the fit() API requires:

  X (array-like or sparse matrix of shape = [n_samples, n_features]) – Input feature matrix.
  y (array-like of shape = [n_samples]) – The target values (class labels in classification, real numbers in regression).
So I have to convert X and y into NumPy arrays:

  new_x_train = np.array(x_train.select(x_train.columns).collect())
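
For completeness, here is a minimal sketch of the full pipeline I am attempting (new_y_train and the default LGBMRegressor() are illustrative additions, assuming y_train is the list mentioned above):

  import numpy as np
  from lightgbm import LGBMRegressor

  # Pull every row of the PySpark DataFrame onto the driver and build
  # a 2-D array of shape [n_samples, n_features], as fit() expects.
  new_x_train = np.array(x_train.select(x_train.columns).collect())
  # y_train is a plain Python list, so this conversion is cheap.
  new_y_train = np.array(y_train)

  model = LGBMRegressor()
  model.fit(new_x_train, new_y_train)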
However, even though I had allocated enough memory for the collect, it took a very long time, and eventually I got this error:

 Error while obtaining a new communication channel
 ConnectException error: This is often caused by an OOM error that causes the connection to the Python REPL to be closed. Check your query's memory usage.
My x_train data can be up to 40 GB (about 30 million rows), and the driver node has 128 GB of memory.

Can anyone help me figure out why I'm getting this error?

Thanks.

This looks like an out-of-memory exception, after which Spark cannot establish a new connection to the driver's Python REPL. Try the following options:

  • Increase the driver-side memory and retry.
  • Look at the Spark jobs UI; it gives more insight into how the data flows.
Spark operations are materialized lazily, so all of the processing happens only once an action is called. If you see intermittent OOM errors, I would suggest checking heap utilization and the GC pattern on the driver. Bear in mind that collecting ~40 GB of rows materializes the data several times over (JVM objects, Python Row objects, and then the NumPy copy), so it can easily exceed 128 GB of driver memory.
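
To make the lazy-evaluation point concrete, a small sketch (using the same x_train as in the question):

  # Transformations are lazy: select() only builds a query plan.
  projected = x_train.select(x_train.columns)

  # collect() is an action: only now does Spark execute the plan and
  # ship all ~30 million rows to the driver, which is where the OOM hits.
  rows = projected.collect()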

Also check **spark.driver.maxResultSize**.
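
A sketch of how these settings could be raised, assuming they are applied when the session (or, on Databricks, the cluster) is created; the values are placeholders, not tuned recommendations:

  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder
      # Driver heap size; on Databricks this normally goes in the
      # cluster's Spark config rather than in notebook code.
      .config("spark.driver.memory", "100g")
      # Cap on the total size of serialized results that actions such
      # as collect() may return to the driver; "0" removes the limit.
      .config("spark.driver.maxResultSize", "50g")
      .getOrCreate()
  )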

Additional reference:

Hope this helps.
