Python PySpark:如何评估ML推荐算法的AUC?
我有一个Spark数据框,如下所示:Python PySpark:如何评估ML推荐算法的AUC?,python,apache-spark,pyspark,apache-spark-mllib,apache-spark-ml,Python,Apache Spark,Pyspark,Apache Spark Mllib,Apache Spark Ml,我有一个Spark数据框,如下所示: predictions.show(5) +------+----+------+-----------+ | user|item|rating| prediction| +------+----+------+-----------+ |379433| 31| 1| 0.08203495| | 1834| 31| 1| 0.4854447| |422635| 31| 1|0.017672742| | 839| 31|
predictions.show(5)
+------+----+------+-----------+
| user|item|rating| prediction|
+------+----+------+-----------+
|379433| 31| 1| 0.08203495|
| 1834| 31| 1| 0.4854447|
|422635| 31| 1|0.017672742|
| 839| 31| 1| 0.39273006|
| 51444| 31| 1| 0.09795039|
+------+----+------+-----------+
only showing top 5 rows
scoresnLabels.take(5)
Out[81]:
[(0.08203495293855667, 1),
(0.48544469475746155, 1),
(0.017672741785645485, 1),
(0.39273005723953247, 1),
(0.09795039147138596, 1)]
预测是预测评级,评级是隐含评级(计数)
现在我想检查我的推荐算法的AUC
我首先尝试了pyspark.ml.BinaryClassificationEvaluator
,因为它直接在数据帧上工作
# getting the evaluationa metric
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction")
print evaluator.evaluate(predictions)
这给了我以下错误:
---------------------------------------------------------------------------
IllegalArgumentException Traceback (most recent call last)
<ipython-input-65-c642ea9c2cf5> in <module>()
4
5 evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction")
----> 6 print evaluator.evaluate(predictions)
7
8 #print evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderPR"})
/Users/i854319/spark/python/pyspark/ml/evaluation.py in evaluate(self, dataset, params)
67 return self.copy(params)._evaluate(dataset)
68 else:
---> 69 return self._evaluate(dataset)
70 else:
71 raise ValueError("Params must be a param map but got %s." % type(params))
/Users/i854319/spark/python/pyspark/ml/evaluation.py in _evaluate(self, dataset)
97 """
98 self._transfer_params_to_java()
---> 99 return self._java_obj.evaluate(dataset._jdf)
100
101 def isLargerBetter(self):
/Users/i854319/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
811 answer = self.gateway_client.send_command(command)
812 return_value = get_return_value(
--> 813 answer, self.gateway_client, self.target_id, self.name)
814
815 for temp_arg in temp_args:
/Users/i854319/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
51 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
52 if s.startswith('java.lang.IllegalArgumentException: '):
---> 53 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
54 raise
55 return deco
IllegalArgumentException: u'requirement failed: Column prediction must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually FloatType.'
Exception AttributeError: "'BinaryClassificationMetrics' object has no attribute '_sc'" in <bound method BinaryClassificationMetrics.__del__ of <pyspark.mllib.evaluation.BinaryClassificationMetrics object at 0x126483d50>> ignored
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-82-81c08d4e6f1d> in <module>()
3 from pyspark.mllib.evaluation import BinaryClassificationMetrics
4 metrics=BinaryClassificationMetrics(scoresnLabels)
----> 5 metrics.areaUnderROC
/Users/i854319/spark/python/pyspark/mllib/evaluation.py in areaUnderROC(self)
60 (ROC) curve.
61 """
---> 62 return self.call("areaUnderROC")
63
64 @property
/Users/i854319/spark/python/pyspark/mllib/common.pyc in call(self, name, *a)
144 def call(self, name, *a):
145 """Call method of java_model"""
--> 146 return callJavaFunc(self._sc, getattr(self._java_model, name), *a)
147
148
/Users/i854319/spark/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func, *args)
121 """ Call Java Function """
122 args = [_py2java(sc, a) for a in args]
--> 123 return _java2py(sc, func(*args))
124
125
/Users/i854319/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
811 answer = self.gateway_client.send_command(command)
812 return_value = get_return_value(
--> 813 answer, self.gateway_client, self.target_id, self.name)
814
815 for temp_arg in temp_args:
/Users/i854319/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
43 def deco(*a, **kw):
44 try:
---> 45 return f(*a, **kw)
46 except py4j.protocol.Py4JJavaError as e:
47 s = e.java_exception.toString()
/Users/i854319/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
306 raise Py4JJavaError(
307 "An error occurred while calling {0}{1}{2}.\n".
--> 308 format(target_id, ".", name), value)
309 else:
310 raise Py4JError(
Py4JJavaError: An error occurred while calling o562.areaUnderROC.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1505.0 failed 1 times, most recent failure: Lost task 2.0 in stage 1505.0 (TID 9224, localhost): java.lang.NullPointerException: Value at index 1 in null
at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:475)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:243)
at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:192)
at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics$$anonfun$$init$$1.apply(BinaryClassificationMetrics.scala:61)
at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics$$anonfun$$init$$1.apply(BinaryClassificationMetrics.scala:61)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:264)
at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:126)
at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.x$4$lzycompute(BinaryClassificationMetrics.scala:153)
at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.x$4(BinaryClassificationMetrics.scala:144)
at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.confusions$lzycompute(BinaryClassificationMetrics.scala:146)
at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.confusions(BinaryClassificationMetrics.scala:146)
at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.createCurve(BinaryClassificationMetrics.scala:222)
at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.roc(BinaryClassificationMetrics.scala:85)
at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.areaUnderROC(BinaryClassificationMetrics.scala:96)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException: Value at index 1 in null
at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:475)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:243)
at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:192)
at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics$$anonfun$$init$$1.apply(BinaryClassificationMetrics.scala:61)
at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics$$anonfun$$init$$1.apply(BinaryClassificationMetrics.scala:61)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
如下所示:
predictions.show(5)
+------+----+------+-----------+
| user|item|rating| prediction|
+------+----+------+-----------+
|379433| 31| 1| 0.08203495|
| 1834| 31| 1| 0.4854447|
|422635| 31| 1|0.017672742|
| 839| 31| 1| 0.39273006|
| 51444| 31| 1| 0.09795039|
+------+----+------+-----------+
only showing top 5 rows
scoresnLabels.take(5)
Out[81]:
[(0.08203495293855667, 1),
(0.48544469475746155, 1),
(0.017672741785645485, 1),
(0.39273005723953247, 1),
(0.09795039147138596, 1)]
然后我在下面的评估器中使用它
### Using the mllib evaluation metric
from pyspark.mllib.evaluation import BinaryClassificationMetrics
metrics=BinaryClassificationMetrics(scoresnLabels)
metrics.areaUnderROC
但我得到了以下错误:
---------------------------------------------------------------------------
IllegalArgumentException Traceback (most recent call last)
<ipython-input-65-c642ea9c2cf5> in <module>()
4
5 evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction")
----> 6 print evaluator.evaluate(predictions)
7
8 #print evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderPR"})
/Users/i854319/spark/python/pyspark/ml/evaluation.py in evaluate(self, dataset, params)
67 return self.copy(params)._evaluate(dataset)
68 else:
---> 69 return self._evaluate(dataset)
70 else:
71 raise ValueError("Params must be a param map but got %s." % type(params))
/Users/i854319/spark/python/pyspark/ml/evaluation.py in _evaluate(self, dataset)
97 """
98 self._transfer_params_to_java()
---> 99 return self._java_obj.evaluate(dataset._jdf)
100
101 def isLargerBetter(self):
/Users/i854319/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
811 answer = self.gateway_client.send_command(command)
812 return_value = get_return_value(
--> 813 answer, self.gateway_client, self.target_id, self.name)
814
815 for temp_arg in temp_args:
/Users/i854319/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
51 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
52 if s.startswith('java.lang.IllegalArgumentException: '):
---> 53 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
54 raise
55 return deco
IllegalArgumentException: u'requirement failed: Column prediction must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually FloatType.'
Exception AttributeError: "'BinaryClassificationMetrics' object has no attribute '_sc'" in <bound method BinaryClassificationMetrics.__del__ of <pyspark.mllib.evaluation.BinaryClassificationMetrics object at 0x126483d50>> ignored
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-82-81c08d4e6f1d> in <module>()
3 from pyspark.mllib.evaluation import BinaryClassificationMetrics
4 metrics=BinaryClassificationMetrics(scoresnLabels)
----> 5 metrics.areaUnderROC
/Users/i854319/spark/python/pyspark/mllib/evaluation.py in areaUnderROC(self)
60 (ROC) curve.
61 """
---> 62 return self.call("areaUnderROC")
63
64 @property
/Users/i854319/spark/python/pyspark/mllib/common.pyc in call(self, name, *a)
144 def call(self, name, *a):
145 """Call method of java_model"""
--> 146 return callJavaFunc(self._sc, getattr(self._java_model, name), *a)
147
148
/Users/i854319/spark/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func, *args)
121 """ Call Java Function """
122 args = [_py2java(sc, a) for a in args]
--> 123 return _java2py(sc, func(*args))
124
125
/Users/i854319/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
811 answer = self.gateway_client.send_command(command)
812 return_value = get_return_value(
--> 813 answer, self.gateway_client, self.target_id, self.name)
814
815 for temp_arg in temp_args:
/Users/i854319/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
43 def deco(*a, **kw):
44 try:
---> 45 return f(*a, **kw)
46 except py4j.protocol.Py4JJavaError as e:
47 s = e.java_exception.toString()
/Users/i854319/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
306 raise Py4JJavaError(
307 "An error occurred while calling {0}{1}{2}.\n".
--> 308 format(target_id, ".", name), value)
309 else:
310 raise Py4JError(
Py4JJavaError: An error occurred while calling o562.areaUnderROC.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1505.0 failed 1 times, most recent failure: Lost task 2.0 in stage 1505.0 (TID 9224, localhost): java.lang.NullPointerException: Value at index 1 in null
at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:475)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:243)
at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:192)
at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics$$anonfun$$init$$1.apply(BinaryClassificationMetrics.scala:61)
at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics$$anonfun$$init$$1.apply(BinaryClassificationMetrics.scala:61)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:264)
at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:126)
at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.x$4$lzycompute(BinaryClassificationMetrics.scala:153)
at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.x$4(BinaryClassificationMetrics.scala:144)
at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.confusions$lzycompute(BinaryClassificationMetrics.scala:146)
at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.confusions(BinaryClassificationMetrics.scala:146)
at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.createCurve(BinaryClassificationMetrics.scala:222)
at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.roc(BinaryClassificationMetrics.scala:85)
at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.areaUnderROC(BinaryClassificationMetrics.scala:96)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException: Value at index 1 in null
at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:475)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:243)
at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:192)
at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics$$anonfun$$init$$1.apply(BinaryClassificationMetrics.scala:61)
at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics$$anonfun$$init$$1.apply(BinaryClassificationMetrics.scala:61)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
2) 其次,如果我尝试使用这两个包,为什么会出现错误。***
好吧,很多Spark版本后来
…对于Spark 2.4.5,您的第一个示例仍然会抛出一个错误,但现在它更有帮助:
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Column prediction must be of type equal to one of the following types: [double, struct<type:tinyint,size:int,indices:array<int>,values:array<double>>] but was actually of type float.'
。。。API的ML版本最好用于数据帧,而API的MLLIB版本最好用于RDD数据结构。这两种方法在不同的情况下都很有用,可能重复@LostInOverflow,我得到了一些。但是,应该能够使用BinaryClassificationEvaluator并获得AUC。我使用SKlearn在Python中实现了这一点。我知道隐式反馈既没有正面反馈也没有负面反馈,但人们仍然可以从AUC中获得一些意义。您认为代码有什么问题,没有给出正确的输出。我应该将隐式计数二进制化,然后进行比较吗?。我是用python做的,但我认为spark会自动处理这个问题。我不认为这可以解释这一点。这篇文章没有提到任何错误。只是没有给出正确的答案。我的是我在运行时遇到的一个错误。你如何计算你的预测?你愿意和我们分享这部分代码吗?@eilasah为ALS添加了代码。