Python: error converting a PySpark DataFrame to a numpy array to run LightGBM
I want to run lightgbm.LGBMRegressor on a Databricks cluster with PySpark. My code was developed based on the following: Here, x_train and y_train are a PySpark DataFrame and a list, respectively. According to the link above, the fit() API requires:
X (array-like or sparse matrix of shape = [n_samples, n_features]) – Input feature matrix.
y (array-like of shape = [n_samples]) – The target values (class labels in classification, real numbers in regression).
So I have to use numpy to convert X and y into arrays:
new_x_train = np.array(x_train.select(x_train.columns).collect())
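For reference, `np.array` over the collected rows yields a 2-D array of shape `[n_samples, n_features]`, which is what `fit()` expects. A minimal in-memory sketch, with plain tuples standing in for the collected `Row` objects (the sample values are made up for illustration):

```python
import numpy as np

# Collected PySpark rows behave like tuples; simulate a tiny result of
# x_train.select(x_train.columns).collect() with 3 rows and 2 feature columns.
collected_rows = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]

new_x_train = np.array(collected_rows)
print(new_x_train.shape)  # (3, 2) -> [n_samples, n_features]
```

The conversion itself is straightforward; the cost is that `collect()` must first ship every row to the driver as Python objects before the dense array is built.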
However, the collect took a very long time, even though I had allocated plenty of memory for it.
Finally, I got the following error:
Error while obtaining a new communication channel
ConnectException error: This is often caused by an OOM error that causes the connection to the Python REPL to be closed. Check your query's memory usage.
My x_train data can be up to 40 GB (about 30 million rows), and the driver node has 128 GB of memory.
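A back-of-the-envelope estimate of why 128 GB can still be too little: `collect()` first materializes all rows as Python `Row` objects on the driver, and only then does `np.array` build the dense array, so both copies coexist for a while. With the 30 million rows from the question and, say, 150 float columns (the column count is an assumption for illustration):

```python
# Rough driver-memory estimate for the collected data.
n_rows = 30_000_000          # from the question
n_cols = 150                 # assumed column count, for illustration only
bytes_per_float = 8          # float64

array_gb = n_rows * n_cols * bytes_per_float / 1024**3
print(f"numpy array alone: ~{array_gb:.0f} GB")
# Python Row/tuple objects carry per-object overhead, so the intermediate
# collect() result can easily be several times larger than the raw data,
# which is enough to exhaust a 128 GB driver.
```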
Can anyone help me figure out why I am getting this error?
Thanks.

This appears to be caused by an out-of-memory exception, after which a new connection to the Python REPL on the driver cannot be established. Try the following options:
- Increase the driver-side memory and retry.
- Check the Spark jobs UI, which provides more information about the data flow.
- Adjust **spark.driver.maxResultSize**.
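`spark.driver.maxResultSize` caps the total serialized size of results that a single action such as `collect()` may return to the driver, and it must be set before the driver starts, e.g. in the cluster's Spark config on Databricks. A sketch with an illustrative value (pick a limit that fits your driver):

```
spark.driver.maxResultSize 64g
```

Raising this limit only removes the hard cap on result size; the driver still needs enough physical memory to hold the collected data.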
Additional reference:
Hope this helps.