Python: error converting a PySpark DataFrame to a numpy array to run LightGBM
I want to run lightgbm.LGBMRegressor on a Databricks cluster with PySpark. My code was developed based on the following: Here, x_train and y_train are a PySpark DataFrame and a list, respectively. According to the link above, the fit() API requires:
X (array-like or sparse matrix of shape = [n_samples, n_features]) – Input feature matrix.
y (array-like of shape = [n_samples]) – The target values (class labels in classification, real numbers in regression).
So I have to use numpy to convert X and y into arrays:
new_x_train = np.array(x_train.select(x_train.columns).collect())
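For reference, `np.array` over the collected rows yields a 2-D array of shape `[n_samples, n_features]`, which is what `fit()` expects. A minimal in-memory sketch, with plain tuples standing in for the collected `Row` objects (the sample values are made up for illustration):

```python
import numpy as np

# Collected PySpark rows behave like tuples; simulate a tiny result of
# x_train.select(x_train.columns).collect() with 3 rows and 2 feature columns.
collected_rows = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]

new_x_train = np.array(collected_rows)
print(new_x_train.shape)  # (3, 2) -> [n_samples, n_features]
```

The conversion itself is straightforward; the cost is that `collect()` must first ship every row to the driver as Python objects before the dense array is built.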
However, the collect took a very long time, even though I had allocated plenty of memory for it.
Finally, I got the following error:
Error while obtaining a new communication channel
ConnectException error: This is often caused by an OOM error that causes the connection to the Python REPL to be closed. Check your query's memory usage.
My x_train data can be up to 40 GB (about 30 million rows), and the driver node has 128 GB of memory.
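A back-of-the-envelope estimate of why 128 GB can still be too little: `collect()` first materializes all rows as Python `Row` objects on the driver, and only then does `np.array` build the dense array, so both copies coexist for a while. With the 30 million rows from the question and, say, 150 float columns (the column count is an assumption for illustration):

```python
# Rough driver-memory estimate for the collected data.
n_rows = 30_000_000          # from the question
n_cols = 150                 # assumed column count, for illustration only
bytes_per_float = 8          # float64

array_gb = n_rows * n_cols * bytes_per_float / 1024**3
print(f"numpy array alone: ~{array_gb:.0f} GB")
# Python Row/tuple objects carry per-object overhead, so the intermediate
# collect() result can easily be several times larger than the raw data,
# which is enough to exhaust a 128 GB driver.
```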
Can anyone help me figure out why I am getting this error?
Thanks.

This appears to be caused by an out-of-memory exception, after which a new connection to the Python REPL on the driver cannot be established. Try the following options:
- Increase the driver-side memory and retry.
- Check the Spark jobs UI, which provides more information about the data flow.
- Adjust **spark.driver.maxResultSize**.
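`spark.driver.maxResultSize` caps the total serialized size of results that a single action such as `collect()` may return to the driver, and it must be set before the driver starts, e.g. in the cluster's Spark config on Databricks. A sketch with an illustrative value (pick a limit that fits your driver):

```
spark.driver.maxResultSize 64g
```

Raising this limit only removes the hard cap on result size; the driver still needs enough physical memory to hold the collected data.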
Additional reference:
Hope this helps.