
Python: indexing nested fields in Elasticsearch from a Spark DataFrame

Tags: python, apache-spark, elasticsearch, pyspark

Suppose I have a table like this:

field1 field2 field3 id
a0     a030   a040   0  
a0     a031   a041   0
a0     a032   a042   0
a1     a130   a040   1
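
For reference, a minimal sketch that recreates this sample as a DataFrame in memory (the question reads it from Parquet; the schema here is an assumption):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('nested-es-example').getOrCreate()

# Four rows, matching the table above; id is assumed to be an integer
df = spark.createDataFrame(
    [('a0', 'a030', 'a040', 0),
     ('a0', 'a031', 'a041', 0),
     ('a0', 'a032', 'a042', 0),
     ('a1', 'a130', 'a040', 1)],
    ['field1', 'field2', 'field3', 'id'],
)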
It is stored as Parquet. I need to read the table in Spark, group on "field1", and then store a nested field in ES (say, called "agg_fields") that holds a list of dictionaries with the values of field2 and field3, so that the documents look like this:

{
  "_id": "0",
  "field1": "a0",
  "agg_fields": [
    {
      "field2": "a030",
      "field3": "a040"
    },
    {
      "field2": "a031",
      "field3": "a041"
    },
    {
      "field2": "a032",
      "field3": "a042"
    }
  ]
}
...
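
One thing worth noting: Elasticsearch does not treat an array of objects as nested by default; it flattens them unless the index mapping declares the field as type "nested". A sketch of creating such a mapping over the REST API (the index name, host, and the 7.x-style typeless mapping format are assumptions):

import requests

# Hypothetical index name and host
mapping = {
    'mappings': {
        'properties': {
            'field1': {'type': 'keyword'},
            'agg_fields': {
                'type': 'nested',  # without this, ES flattens the object array
                'properties': {
                    'field2': {'type': 'keyword'},
                    'field3': {'type': 'keyword'},
                },
            },
        }
    }
}
requests.put('http://localhost:9200/my_index', json=mapping)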
I can read the table in:

df = sqlContext.read.parquet('some-table')
and I can group on 'field1', do some aggregation, and send the result to ES:

from pyspark.sql.functions import collect_set, concat, first, lit

df.withColumn(
    'aggregated', concat('field2', lit('|'), 'field3')
).groupBy(
    'field1'
).agg(
    collect_set('aggregated').alias('agg_fields'),
    first('id').alias('id')  # keep the id column around for es.mapping.id
).write.format(
    'org.elasticsearch.spark.sql'
).mode(
    'append'
).option(
    'es.mapping.id', 'id'
).options(
    **es_config
).option(
    'es.resource', my_resource
).save()
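
For context, es_config here would be a plain dict of elasticsearch-hadoop connector options; a minimal sketch with placeholder values (es.nodes and es.port are standard option keys, the values are assumptions):

# Placeholder connection settings for the ES-Hadoop connector
es_config = {
    'es.nodes': 'localhost',
    'es.port': '9200',
}
my_resource = 'my_index'  # assumed target index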
But I'm not sure how to turn that aggregation into a nested "agg_fields" column that Elasticsearch will interpret as a nested field. How can I do that?

import pyspark.sql.functions as f

# Read the example document back in, flatten the nested array, then
# re-aggregate field2/field3 as pipe-joined strings per field1
df = spark.read.load('file:///path/to/your/example.json', format='json')
df = df.withColumn('agg_fields', f.explode(df['agg_fields']))
df = df.groupBy(df['field1']).agg(
    f.collect_set(
        f.concat_ws('|', df['agg_fields']['field2'], df['agg_fields']['field3'])
    ).alias('agg_fields')
)
Output:

+------+---------------------------------+                                      
|field1|agg_fields                       |
+------+---------------------------------+
|a0    |[a030|a040, a032|a042, a031|a041]|
+------+---------------------------------+
Is this what you mean?
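
If the goal is an actual array of objects rather than pipe-joined strings, a possible alternative (a sketch, assuming df is the original four-column table) is to aggregate structs, which the elasticsearch-hadoop connector writes out as a JSON array of objects:

import pyspark.sql.functions as f

# Collect (field2, field3) pairs into an array<struct> column; the connector
# serializes this as a list of objects, matching the desired document shape.
# The index mapping must still declare agg_fields as "nested" (see above).
nested = df.groupBy('field1').agg(
    f.collect_list(f.struct('field2', 'field3')).alias('agg_fields'),
    f.first('id').alias('id'),
)

nested can then be written with the same org.elasticsearch.spark.sql pipeline shown in the question.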