
Elasticsearch: indexing a type from Spark fails after setting a parent

I first set up the types manually through the REST API with the following command:

curl -XPUT localhost:9200/myIndex/ -d '{
  "mappings" : {
    "company": {},
    "people": {
      "_parent" : {
        "type" : "company"
      }
    }
  }
}'
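To double-check that the mapping actually took, a GET on the mapping endpoint (standard Elasticsearch API) should echo the _parent setting back under myIndex:

# Verify the parent/child mapping was applied; the response should
# contain "people": { "_parent": { "type": "company" } }
curl -XGET 'localhost:9200/myIndex/_mapping?pretty'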
However, at the Spark layer, the following code fails.

Here is the people job:

import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._

object PeopleDataCleaner {
  def main(args: Array[String]): Unit = {
    val liftedArgs = args.lift
    val mongoURL = liftedArgs(0).getOrElse("mongodb://127.0.0.1/mg_test.lc_data_test")
    val elasticsearchHost = liftedArgs(1).getOrElse("52.35.155.55")
    val elasticsearchPort = liftedArgs(2).getOrElse("9200")
    val mongoReadPreferences = liftedArgs(3).getOrElse("primary")
    val spark = SparkSession.builder()
      .appName("Mongo Data Cleaner")
      .master("local[*]")
      .config("spark.mongodb.input.uri", mongoURL)
      .config("mongo.input.query", "{currentCompanies : {$exists: true, $ne: []}}")
      .config("mongo.readPreference.name", mongoReadPreferences)
      .config("es.nodes", elasticsearchHost)
      .config("es.port", elasticsearchPort)
      .getOrCreate()
    import spark.implicits._
    MongoSpark.load[LCDataRecord](spark)
      .as[LCDataRecord]
      .filter { record =>
        record.currentCompanies != null &&
        record.currentCompanies.nonEmpty &&
        record.linkedinId != null
      }
      // Drop current companies without a usable link (the parent key)
      .map { record =>
        val moddedCurrentCompanies = record.currentCompanies
          .filter { currentCompany => currentCompany.link != null && currentCompany.link != "" }
        record.copy(currentCompanies = moddedCurrentCompanies)
      }
      // Emit one flat person document per current company
      .flatMap { record =>
        record.currentCompanies.map { currentCompany =>
          currentCompanyToFlatPerson(record, currentCompany)
        }
      }
      // Route each document to its parent company via es.mapping.parent
      .saveToEs("myIndex/people", Map(
        "es.mapping.id" -> "idField",
        "es.mapping.parent" -> "companyLink"
      ))
  }
}
And here is the company job:

import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._

object CompanyDataCleaner {
  def main(args: Array[String]): Unit = {
    val liftedArgs = args.lift
    val mongoURL = liftedArgs(0).getOrElse("mongodb://127.0.0.1/mg_test.lc_data_test")
    val elasticsearchHost = liftedArgs(1).getOrElse("localhost")
    val elasticsearchPort = liftedArgs(2).getOrElse("9200")
    val mongoReadPreferences = liftedArgs(3).getOrElse("primary")
    val spark = SparkSession.builder()
      .appName("Mongo Data Cleaner")
      .master("local[*]")
      .config("spark.mongodb.input.uri", mongoURL)
      .config("mongo.input.query", "{currentCompanies : {$exists: true, $ne: []}}")
      .config("mongo.readPreference.name", mongoReadPreferences)
      .config("es.index.auto.create", "true")
      .config("es.nodes", elasticsearchHost)
      .config("es.port", elasticsearchPort)
      .getOrCreate()

    import spark.implicits._
    MongoSpark
      .load[LCDataRecord](spark)
      .as[LCDataRecord]
      .filter { record => record.currentCompanies != null && record.currentCompanies.nonEmpty }
      .flatMap(record => record.currentCompanies)
      .filter { record => record.link != null }
      .dropDuplicates("link")
      .map(formatCompanySizes)
      .map(companyToFlatCompany)
      // Use the company link as the document id so people docs can reference it as parent
      .saveToEs("myIndex/company", Map("es.mapping.id" -> "link"))
  }
}
The people job fails with a message stating:

org.apache.spark.util.TaskCompletionListenerException: Can't specify parent if no parent field has been configured

Indexing the companies into Elasticsearch first works without a problem, and my understanding is that the mapping above should already define the parent/child relationship.

Edit: indexing over REST, whether through the bulk API or the plain index API, does not run into this problem.
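For comparison, over plain REST the parent is passed as a URL parameter on the index request; a minimal sketch against the mapping above (the document id, parent id, and body here are hypothetical):

# Index a child 'people' document under an existing company (hypothetical values)
curl -XPUT 'localhost:9200/myIndex/people/1?parent=some-company-link' -d '{
  "name" : "Jane Doe"
}'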

Changing

.config("es.index.auto.create", "true")

to

.config("es.index.auto.create", "false")

solved the problem for me. It appears that even when the index and type already exist, EsSpark still tries to create them, and that is not a legal operation for a type with a _parent field set.
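Applied to the jobs above, only the builder changes; a minimal sketch, with everything else as in the original code:

val spark = SparkSession.builder()
  .appName("Mongo Data Cleaner")
  .master("local[*]")
  .config("es.nodes", elasticsearchHost)
  .config("es.port", elasticsearchPort)
  // The index and its _parent mapping were created up front via curl,
  // so the connector must not try to create them again.
  .config("es.index.auto.create", "false")
  .getOrCreate()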