Resolving a "Task not serializable" error when extracting names with Stanford NLP in Spark/Scala


I am trying to extract names from text with the Stanford NER package in Spark/Scala. I added the following to build.sbt:

libraryDependencies ++= Seq(
  "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
  "org.scalatest" %% "scalatest" % "3.0.0-M9"
)
I also created an RDD in which each element is a text (a set of strings). I then defined a function called "ner" that takes a text as input and returns the names found in it. Here is the relevant part of the code:

val serializedclassifier = "/home/hadoopuser/stanfordner/stanford-ner-2016-10-31/classifiers/german.conll.hgc_175m_600.crf.ser.gz"
val classifier = CRFClassifier.getClassifierNoExceptions(serializedclassifier)

def ner(a: String): String = {
  val out = classifier.classify(a)  // `a` is already a String, so no `._2` is needed
  ...
  ...
  ...
}
The code returns names correctly when I run

rdd.take(10).foreach(x => println(ner(x)))

but when I run

val rdd2 = rdd.map(x => ner(x))

it throws the following error:

Loading classifier from /home/hadoopuser/stanfordner/stanford-ner-2016-10-31/classifiers/german.conll.hgc_175m_600.crf.ser.gz ... done [0.8 sec].
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2039)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:366)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:365)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
    at org.apache.spark.rdd.RDD.map(RDD.scala:365)
    at org.inno.redistagger.redistagger$.main(correcttags.scala:220)
    at org.inno.redistagger.redistagger.main(correcttags.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: edu.stanford.nlp.ie.crf.CRFClassifier
Serialization stack:
    - object not serializable (class: edu.stanford.nlp.ie.crf.CRFClassifier, value: edu.stanford.nlp.ie.crf.CRFClassifier@56dd6efa)
    - field (class: org.inno.redistagger.redistagger$$anonfun$9, name: classifier$1, type: class edu.stanford.nlp.ie.crf.CRFClassifier)
    - object (class org.inno.redistagger.redistagger$$anonfun$9, <function1>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
    ... 20 more

Moving the classifier creation inside "ner" (shown at the end) resolves the error above, but introduces a new problem: the function now loads a classifier for every element of the RDD, which has about 5 million elements. Since loading the classifier takes roughly 0.5 seconds, the job takes far too long to finish. How can I fix this, or serialize the classifier without defining it inside "ner"?

Have you tried making the CRFClassifier an object, or making it extend Serializable? What I mean is that

CRFClassifier.getClassifierNoExceptions(serializedclassifier)

should either be inside the closure (i.e. inside the ner function), or you need to serialize the CRFClassifier and keep it outside the closure; otherwise a Task not serializable exception will be thrown. If you want the classifier inside the closure, use mapPartitions instead of rdd.map, so the classifier is initialized once per partition rather than once per element.

That is exactly my question: how do I serialize the CRFClassifier outside the function?
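The mapPartitions suggestion can be sketched as follows. `FakeClassifier` is a hypothetical stand-in for the Stanford `CRFClassifier`, so the snippet runs without Spark or the NER model; in the real code the expensive line would be `CRFClassifier.getClassifierNoExceptions(serializedclassifier)` and the transformation would be `rdd.mapPartitions(nerPartition)`.

```scala
// Hypothetical stand-in for the expensive classifier load (~0.5 s in the question).
final class FakeClassifier {
  def classify(s: String): String = s.toUpperCase
}

// The per-partition function you would pass to rdd.mapPartitions:
//   val rdd2 = rdd.mapPartitions(nerPartition)
// The classifier is created once per partition, not once per element,
// and nothing non-serializable is captured by the closure.
def nerPartition(rows: Iterator[String]): Iterator[String] = {
  val classifier = new FakeClassifier()   // loaded once for the whole partition
  rows.map(r => classifier.classify(r))   // applied lazily to every element
}

// Local demonstration of the same pattern without a Spark cluster:
val result = nerPartition(Iterator("anna", "bob")).toList
```

With ~5 million elements spread over, say, a few hundred partitions, this reduces the number of classifier loads from millions to a few hundred.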
val serializedclassifier = "/home/hadoopuser/stanfordner/stanford-ner-2016-10-31/classifiers/german.conll.hgc_175m_600.crf.ser.gz"

def ner(a: String): String = {
  val classifier = CRFClassifier.getClassifierNoExceptions(serializedclassifier)  // reloaded on every call
  val out = classifier.classify(a)  // `a` is a String, so no `._2` is needed
  ...
  ...
  ...
}
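A common way to avoid shipping the classifier at all is to hold it in a Scala `object` with a `lazy val`: the object itself is never serialized into the closure, and each executor JVM initializes the field once on first use. A minimal sketch, again with a hypothetical stand-in where the real code would call `CRFClassifier.getClassifierNoExceptions(serializedclassifier)`:

```scala
// Hypothetical singleton holder: Spark serializes only a reference to `ner`,
// not the classifier, because objects are re-created on each executor JVM.
object NerModel {
  var loads = 0  // counts initializations, for demonstration only

  lazy val classifier: String => String = {
    loads += 1
    // Real code: CRFClassifier.getClassifierNoExceptions(serializedclassifier)
    s => s.toUpperCase
  }
}

// The mapped function touches the model only through the singleton,
// so `rdd.map(ner)` no longer captures anything non-serializable:
def ner(a: String): String = NerModel.classifier(a)
```

Calling `ner` any number of times on one JVM loads the model exactly once; on a cluster it loads once per executor, which is even cheaper than once per partition.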