Scala Spark: SolrJ serialization exception
I am trying to use Spark as a "pipeline" to import some CSV files into Solr, like this:
trait SparkConnector {
  def connectionWrapper[A](f: SparkContext => A): A = {
    val sc: SparkContext = getSparkContext()
    val res: A = f(sc)
    sc.stop()
    res
  }
}

object SolrIndexBuilder extends SparkConnector {
  val solr = new ConcurrentUpdateSolrClient("http://some-solr-url", 10000, 5)

  def run(implicit fileSystem: FileSystem) = connectionWrapper[Unit] { sparkContext =>
    def build(): Unit = {
      def toSolrDocument(person: Map[String, String], fieldNames: Array[String]): SolrInputDocument = {
        val doc = new SolrInputDocument()
        // add the values to the solr doc
        doc
      }

      def rddToMapWithHeaderUnchecked(rdd: RDD[String], header: Array[String]): RDD[Map[String, String]] =
        rdd.map { line =>
          val splits = new CSVParser().parseLine(line)
          header.zip(splits).toMap
        }

      val indexCsv = sparkContext.textFile("/somewhere/filename.csv")
      val fieldNames = Array("field1", "field2", "field3", ...)
      val indexRows = rddToMapWithHeaderUnchecked(indexCsv, fieldNames)

      indexRows.map(row => toSolrDocument(row, fieldNames)).foreach { doc => solr.add(doc) }

      if (!indexRows.isEmpty()) {
        solr.blockUntilFinished()
        solr.commit()
      }
    }
    build()
  }
}
When I run the code on the cluster (Spark cluster mode), I get the following error:
Caused by: java.io.NotSerializableException: org.apache.http.impl.client.SystemDefaultHttpClient
Serialization stack:
- object not serializable (class: org.apache.http.impl.client.SystemDefaultHttpClient, value: org.apache.http.impl.client.SystemDefaultHttpClient@3bdd79c7)
- field (class: org.apache.solr.client.solrj.impl.HttpSolrClient, name: httpClient, type: interface org.apache.http.client.HttpClient)
- object (class org.apache.solr.client.solrj.impl.HttpSolrClient, org.apache.solr.client.solrj.impl.HttpSolrClient@5d2c576a)
- field (class: org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient, name: client, type: class org.apache.solr.client.solrj.impl.HttpSolrClient)
- object (class org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient, org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient@5da7593d)
- field (class: com.oneninetwo.andromeda.solr.SolrIndexBuilder$$anonfun$run$1$$anonfun$build$1$2, name: solr$1, type: class org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient)
- object (class com.oneninetwo.andromeda.solr.SolrIndexBuilder$$anonfun$run$1$$anonfun$build$1$2, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
... 26 more
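The stack trace shows the chain Spark's SerializationDebugger walks: the `foreach` closure captures the `ConcurrentUpdateSolrClient`, which holds an `HttpSolrClient`, which holds a non-serializable `SystemDefaultHttpClient`. The same failure can be reproduced without Spark by Java-serializing a closure that captures a non-serializable object. Below is a minimal sketch; `FakeSolrClient` is a hypothetical stand-in for the SolrJ client:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Stand-in for ConcurrentUpdateSolrClient: holds live resources, not Serializable.
class FakeSolrClient {
  def add(doc: String): Unit = ()
}

object ClosureDemo {
  def main(args: Array[String]): Unit = {
    val client = new FakeSolrClient
    // The lambda captures `client`, just like the foreach closure captures `solr`.
    val closure: String => Unit = doc => client.add(doc)

    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      // This is essentially what Spark does before shipping the closure to executors.
      out.writeObject(closure)
      println("serialized OK")
    } catch {
      case _: NotSerializableException =>
        println("NotSerializableException: the captured client cannot be serialized")
    }
  }
}
```

Spark has to serialize every closure it ships to executors, so any non-serializable field reachable from the closure triggers exactly this exception.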
This is not a solution to your problem, but an alternative: if you look at it, you will see that the Solr add is done on the driver (master), not on the workers.

Do you mean the "solr client" instance is built outside of "foreachPartition"? In that case, I tried doing that, but it did not work for me. A bit more explanation would be nice; I have the same problem.
indexRows.map(row => toSolrDocument(row, fieldNames)).foreachPartition { partition =>
  val solr = new ConcurrentUpdateSolrClient(...)
  partition.foreach { doc =>
    solr.add(doc)
  }
  solr.commit()
}
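Another common pattern, not shown in this thread, is to hold the client in a singleton object with a `lazy val`, so it is constructed once per executor JVM at first use and never goes through serialization at all. A sketch, assuming SolrJ's `ConcurrentUpdateSolrClient` constructor from the question and that one client per executor is acceptable:

```scala
object SolrClientHolder {
  // Initialized lazily on each executor JVM at first use; the client itself
  // is never captured by a closure, so it is never serialized.
  lazy val client: ConcurrentUpdateSolrClient =
    new ConcurrentUpdateSolrClient("http://some-solr-url", 10000, 5)
}

indexRows
  .map(row => toSolrDocument(row, fieldNames))
  .foreachPartition { partition =>
    val client = SolrClientHolder.client
    partition.foreach(doc => client.add(doc))
    // Flush this executor's pending queue before the task finishes.
    client.blockUntilFinished()
  }

// commit is a server-side operation, so the driver can issue it once with
// its own client after the job completes.
new ConcurrentUpdateSolrClient("http://some-solr-url", 10000, 5).commit()
```

Compared to creating a client inside `foreachPartition`, this avoids building and tearing down a connection pool for every partition, at the cost of a client that lives for the executor's lifetime.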