Warning: file_get_contents(/data/phpspider/zhask/data//catemap/3/apache-spark/5.json): failed to open stream: No such file or directory in /data/phpspider/zhask/libs/function.php on line 167

Warning: Invalid argument supplied for foreach() in /data/phpspider/zhask/libs/tag.function.php on line 1116

Notice: Undefined index: in /data/phpspider/zhask/libs/function.php on line 180

Warning: array_chunk() expects parameter 1 to be array, null given in /data/phpspider/zhask/libs/function.php on line 181

Warning: file_get_contents(/data/phpspider/zhask/data//catemap/6/cplusplus/147.json): failed to open stream: No such file or directory in /data/phpspider/zhask/libs/function.php on line 167

Warning: Invalid argument supplied for foreach() in /data/phpspider/zhask/libs/tag.function.php on line 1116

Notice: Undefined index: in /data/phpspider/zhask/libs/function.php on line 180

Warning: array_chunk() expects parameter 1 to be array, null given in /data/phpspider/zhask/libs/function.php on line 181
Scala SparkException:添加向量列后,要汇编的值不能为null_Scala_Apache Spark - Fatal编程技术网

Scala SparkException:添加向量列后,要汇编的值不能为null

Scala SparkException:添加向量列后,要汇编的值不能为null,scala,apache-spark,Scala,Apache Spark,Windows上的Spark 2.1(独立版)。添加VectorAssembler列后无法将spark数据框保存到拼花地板文件。 在矢量列之前保存数据帧没有问题,所有“功能”都不为空(使用NVL) printSchema和show的输出(5): 例外情况: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$3: (struct<c9003_double_vecAssembler

Windows上的Spark 2.1(独立版)。添加VectorAssembler列后无法将spark数据框保存到拼花地板文件。 在矢量列之前保存数据帧没有问题,所有“功能”都不为空(使用NVL)

printSchema和show的输出(5):

例外情况:

org.apache.spark.SparkException: Failed to execute user defined function($anonfun$3: (struct<c9003_double_vecAssembler_41f4486b7bab:double,c0022_double_vecAssembler_41f4486b7bab:double,c9014_double_vecAssembler_41f4486b7bab:double,c9008_double_vecAssembler_41f4486b7bab:double,a8401_double_vecAssembler_41f4486b7bab:double,c0021:double,d1417_double_vecAssembler_41f4486b7bab:double,d0006_double_vecAssembler_41f4486b7bab:double,c0023_double_vecAssembler_41f4486b7bab:double,d1501_double_vecAssembler_41f4486b7bab:double,c0020_double_vecAssembler_41f4486b7bab:double,d0007_double_vecAssembler_41f4486b7bab:double,c0024_double_vecAssembler_41f4486b7bab:double,c4018_double_vecAssembler_41f4486b7bab:double,at180_double_vecAssembler_41f4486b7bab:double,c1421_double_vecAssembler_41f4486b7bab:double>) => vector)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Values to assemble cannot be null.
    at org.apache.spark.ml.feature.VectorAssembler$$anonfun$assemble$1.apply(VectorAssembler.scala:160)
    at org.apache.spark.ml.feature.VectorAssembler$$anonfun$assemble$1.apply(VectorAssembler.scala:143)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
    at org.apache.spark.ml.feature.VectorAssembler$.assemble(VectorAssembler.scala:143)
    at org.apache.spark.ml.feature.VectorAssembler$$anonfun$3.apply(VectorAssembler.scala:99)
    at org.apache.spark.ml.feature.VectorAssembler$$anonfun$3.apply(VectorAssembler.scala:98)
    ... 16 more
添加向量列后:

root
 |-- label: string (nullable = false)
 |-- c9003: integer (nullable = true)
 |-- c9014: integer (nullable = true)
 |-- features: vector (nullable = true)
UPDATE2:看起来内存/数据卷中存在问题。我已尝试在SQL中添加筛选器:

  • 强制转换(NVL(C90149999)为int)>1000->工作正常

  • 强制转换(NVL(C90149999)为int)工作正常

  • c9014上没有过滤器->引发异常


  • 关于内存调优的任何提示?

    问题已解决,SQL查询“cast(NVL(c9014,0)作为int)作为c9014”中的问题
    这段代码可能会产生空值,CAST()应该在NVL()之前使用。

    问题已解决,SQL查询中的问题“CAST(NVL(c9014,0)as int)as c9014”
    这段代码可以生成空值,在NVL()之前应该使用CAST()。

    你可以共享几行你的数据集吗?我已经更新了主帖,看起来像是c9014列中的问题。你可以共享几行你的数据集吗?我已经更新了主帖,看起来像是c9014列中的问题
    org.apache.spark.SparkException: Failed to execute user defined function($anonfun$3: (struct<c9003_double_vecAssembler_41f4486b7bab:double,c0022_double_vecAssembler_41f4486b7bab:double,c9014_double_vecAssembler_41f4486b7bab:double,c9008_double_vecAssembler_41f4486b7bab:double,a8401_double_vecAssembler_41f4486b7bab:double,c0021:double,d1417_double_vecAssembler_41f4486b7bab:double,d0006_double_vecAssembler_41f4486b7bab:double,c0023_double_vecAssembler_41f4486b7bab:double,d1501_double_vecAssembler_41f4486b7bab:double,c0020_double_vecAssembler_41f4486b7bab:double,d0007_double_vecAssembler_41f4486b7bab:double,c0024_double_vecAssembler_41f4486b7bab:double,c4018_double_vecAssembler_41f4486b7bab:double,at180_double_vecAssembler_41f4486b7bab:double,c1421_double_vecAssembler_41f4486b7bab:double>) => vector)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: org.apache.spark.SparkException: Values to assemble cannot be null.
        at org.apache.spark.ml.feature.VectorAssembler$$anonfun$assemble$1.apply(VectorAssembler.scala:160)
        at org.apache.spark.ml.feature.VectorAssembler$$anonfun$assemble$1.apply(VectorAssembler.scala:143)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
        at org.apache.spark.ml.feature.VectorAssembler$.assemble(VectorAssembler.scala:143)
        at org.apache.spark.ml.feature.VectorAssembler$$anonfun$3.apply(VectorAssembler.scala:99)
        at org.apache.spark.ml.feature.VectorAssembler$$anonfun$3.apply(VectorAssembler.scala:98)
        ... 16 more
    
    var data = sparkSession.sql("select NVL(target,0) as target, cast(NVL(c9003,0) as int) as c9003, cast(NVL(c9014,0) as int) as c9014 from features where c9014 is not null")
    data.show(20)
    +------+-----+-----+
    |target|c9003|c9014|
    +------+-----+-----+
    |     0|   10|    4|
    |     0|   10|    3|
    |     0|  100|    4|
    |     0|  100|    5|
    |     0|   10|    2|
    |     0|   10|    6|
    |     0|   10|    2|
    |     0|   90|    4|
    |     0|   80|    4|
    |     0|   80|    5|
    |     0|   10|    2|
    |     0|   90|    8|
    |     0|   90|    8|
    |     0|   90|    8|
    |     0|   90|    4|
    |     0|   80|    5|
    |     0|   80|    2|
    |     0|   80|    2|
    |     0|   90|    7|
    |     0|   90|    8|
    +------+-----+-----+
    only showing top 20 rows
    
    root
     |-- label: string (nullable = false)
     |-- c9003: integer (nullable = true)
     |-- c9014: integer (nullable = true)
     |-- features: vector (nullable = true)