Elasticsearch-Spark fails to index a type after setting its parent

Tags: elasticsearch, apache-spark

I first set up the types manually through the REST API with the following command:
curl -XPUT localhost:9200/myIndex/ -d '{
  "mappings" : {
    "company": {},
    "people": {
      "_parent" : {
        "type" : "company"
      }
    }
  }
}'
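For comparison, when a child document is indexed directly through the REST index API, the parent is passed via the parent query parameter (the document ID and company link below are made-up placeholders, not values from my data):

curl -XPUT 'localhost:9200/myIndex/people/1?parent=some-company-link' -d '{
  "name" : "Jane Doe"
}'

Elasticsearch uses the parent value to route the child onto the same shard as its parent, which is why it has to be supplied at index time.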
However, indexing from the Spark layer with the following code fails. Here is the job for the people type:
import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._

object PeopleDataCleaner {
  def main(args: Array[String]): Unit = {
    val liftedArgs = args.lift
    val mongoURL = liftedArgs(0).getOrElse("mongodb://127.0.0.1/mg_test.lc_data_test")
    val elasticsearchHost = liftedArgs(1).getOrElse("52.35.155.55")
    val elasticsearchPort = liftedArgs(2).getOrElse("9200")
    val mongoReadPreferences = liftedArgs(3).getOrElse("primary")

    val spark = SparkSession.builder()
      .appName("Mongo Data Cleaner")
      .master("local[*]")
      .config("spark.mongodb.input.uri", mongoURL)
      .config("mongo.input.query", "{currentCompanies : {$exists: true, $ne: []}}")
      .config("mongo.readPreference.name", mongoReadPreferences)
      .config("es.nodes", elasticsearchHost)
      .config("es.port", elasticsearchPort)
      .getOrCreate()

    import spark.implicits._

    MongoSpark.load[LCDataRecord](spark)
      .as[LCDataRecord]
      .filter { record =>
        record.currentCompanies != null &&
        record.currentCompanies.nonEmpty &&
        record.linkedinId != null
      }
      .map { record =>
        // Drop companies without a usable link, since the link is the parent key
        val moddedCurrentCompanies = record.currentCompanies
          .filter { currentCompany => currentCompany.link != null && currentCompany.link != "" }
        record.copy(currentCompanies = moddedCurrentCompanies)
      }
      .flatMap { record =>
        record.currentCompanies.map { currentCompany =>
          currentCompanyToFlatPerson(record, currentCompany)
        }
      }
      .saveToEs("myIndex/people", Map(
        "es.mapping.id" -> "idField",
        "es.mapping.parent" -> "companyLink"
      ))
  }
}
And here is the job for the company type:
import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._

object CompanyDataCleaner {
  def main(args: Array[String]): Unit = {
    val liftedArgs = args.lift
    val mongoURL = liftedArgs(0).getOrElse("mongodb://127.0.0.1/mg_test.lc_data_test")
    val elasticsearchHost = liftedArgs(1).getOrElse("localhost")
    val elasticsearchPort = liftedArgs(2).getOrElse("9200")
    val mongoReadPreferences = liftedArgs(3).getOrElse("primary")

    val spark = SparkSession.builder()
      .appName("Mongo Data Cleaner")
      .master("local[*]")
      .config("spark.mongodb.input.uri", mongoURL)
      .config("mongo.input.query", "{currentCompanies : {$exists: true, $ne: []}}")
      .config("mongo.readPreference.name", mongoReadPreferences)
      .config("es.index.auto.create", "true")
      .config("es.nodes", elasticsearchHost)
      .config("es.port", elasticsearchPort)
      .getOrCreate()

    import spark.implicits._

    MongoSpark
      .load[LCDataRecord](spark)
      .as[LCDataRecord]
      .filter { record => record.currentCompanies != null && record.currentCompanies.nonEmpty }
      .flatMap(record => record.currentCompanies)
      .filter { record => record.link != null }
      .dropDuplicates("link")
      .map(formatCompanySizes)
      .map(companyToFlatCompany)
      .saveToEs("myIndex/company", Map("es.mapping.id" -> "link"))
  }
}
The people job fails with:

org.apache.spark.util.TaskCompletionListenerException: can't specify parent if no parent field has been configured

Indexing the companies into Elasticsearch first works without any problem, and my understanding is that the mapping above should already define the parent/child relationship.
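To double-check that the parent field really is configured on the type, the mapping can be read back from the cluster (a sketch against the index created above):

curl -XGET localhost:9200/myIndex/_mapping/people

If the PUT at the top took effect, the people mapping in the response should contain "_parent" : { "type" : "company" }.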
EDIT: Indexing through the REST bulk API or the plain REST index API does not hit this problem.

Changing

.config("es.index.auto.create", "true")

to

.config("es.index.auto.create", "false")

solved it for me. It seems that even when the index and type already exist, EsSpark still tries to create them, and doing so with a parent field set is not a legal operation.
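In other words, the pattern that works here is to create the index and its mappings once up front through the REST API (as at the top of the question) and tell the connector never to create them itself. A minimal sketch of the relevant builder line, with the other settings unchanged from the jobs above:

val spark = SparkSession.builder()
  // ... appName, master, Mongo and es.nodes/es.port settings as above ...
  .config("es.index.auto.create", "false") // index and mappings are managed via the REST API, never by EsSpark
  .getOrCreate()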