PySpark: grouping and structuring data

I have the following data in Spark 2.4.5:

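# `spark` is assumed to be an active SparkSession (e.g. from the pyspark shell)
from pyspark.sql import functions as func

data = [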
    ('1234', '203957', '2010', 'London', 'CHEM'),
    ('1234', '203957', '2010', 'London', 'BIOL'),
    ('1234', '288400', '2012', 'Berlin', 'MATH'),
    ('1234', '288400', '2012', 'Berlin', 'CHEM'),
]
d = spark.createDataFrame(data, ['auid', 'eid', 'year', 'city', 'subject'])
d.show()

+----+------+----+------+-------+
|auid|   eid|year|  city|subject|
+----+------+----+------+-------+
|1234|203957|2010|London|   CHEM|
|1234|203957|2010|London|   BIOL|
|1234|288400|2012|Berlin|   MATH|
|1234|288400|2012|Berlin|   CHEM|
+----+------+----+------+-------+

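From this, I need to group the DataFrame by auid and list the cities in chronological order, i.e. London, Berlin, and also as [[London, 2010], [Berlin, 2012]] in another column. In addition, I need the subjects sorted by descending frequency: [CHEM, 2], [BIOL, 1], [MATH, 1], or simply CHEM, BIOL, MATH.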

I tried this:

d.groupBy('auid').agg(func.collect_set(func.struct('city', 'year')).alias('city_set')).show(10, False)

This results in:

+----+--------------------------------+
|auid|city_set                        |
+----+--------------------------------+
|1234|[[Berlin, 2012], [London, 2010]]|
+----+--------------------------------+

I am stuck here and need help. (Please give a hint on how to sort the values in city_set.)

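You can aggregate with collect_set on struct('year', 'city'), sort the resulting array, and then use the transform function to adjust the order of the fields. Putting year first matters because sort_array compares structs field by field, so the array ends up ordered by year. The subjects are handled similarly: create an array of structs with two fields, cnt and subject, sort that array in descending order, and then retrieve only the subject field: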

df_new = d.groupBy('auid').agg(
      func.sort_array(func.collect_set(func.struct('year', 'city'))).alias('city_set'),
      func.collect_list('subject').alias('subjects')
    ).withColumn('city_set', func.expr("transform(city_set, x -> (x.city as city, x.year as year))")) \
    .withColumn('subjects', func.expr("""
        sort_array(
          transform(array_distinct(subjects), x -> (size(filter(subjects, y -> y=x)) as cnt, x as subject)),
          False
        ).subject
      """))

df_new.show(truncate=False) 
+----+--------------------------------+------------------+
|auid|city_set                        |subjects          |
+----+--------------------------------+------------------+
|1234|[[London, 2010], [Berlin, 2012]]|[CHEM, MATH, BIOL]|
+----+--------------------------------+------------------+
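To make the two SQL expressions easier to follow, here is a minimal plain-Python sketch of the same logic, without Spark. The function names order_cities and rank_subjects are hypothetical and only serve this illustration; tuples stand in for Spark structs, which are also compared field by field.

```python
def order_cities(rows):
    """Mimic sort_array(collect_set(struct(year, city))) followed by transform.

    rows: iterable of (year, city) tuples for one auid.
    """
    # A set removes exact duplicates; sorting tuples compares year first, then city.
    pairs = sorted({(year, city) for year, city in rows})
    # Swap the fields so each entry reads (city, year), like the transform step.
    return [(city, year) for year, city in pairs]


def rank_subjects(subjects):
    """Mimic the cnt/subject struct array sorted in descending order."""
    # array_distinct: keep the first occurrence of each subject.
    distinct = list(dict.fromkeys(subjects))
    # One (cnt, subject) pair per distinct subject.
    pairs = [(subjects.count(s), s) for s in distinct]
    # Descending sort: by cnt first, ties broken by subject in descending order.
    pairs.sort(reverse=True)
    return [subject for _, subject in pairs]


rows = [('2010', 'London'), ('2010', 'London'), ('2012', 'Berlin'), ('2012', 'Berlin')]
print(order_cities(rows))
print(rank_subjects(['CHEM', 'BIOL', 'MATH', 'CHEM']))
```

The tie between BIOL and MATH is broken by the subject name in descending order, which is why MATH comes before BIOL in the output above.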

Edit: there are a couple of ways to remove duplicate city entries from the city_set array:

Use a window function to set the year of each city to min(year), then repeat the procedure above:

from pyspark.sql import Window

d = d.withColumn('year', func.min('year').over(Window.partitionBy('auid','city')))
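As a rough illustration of why this removes the duplicates, the same idea can be sketched in plain Python: once every row of a city carries that city's earliest year, the (year, city) pairs for that city become identical and collapse in the set. The sample rows (including the extra London entry for 2011) and the helper name cities_by_first_year are assumptions made for this example, not part of the original data.

```python
def cities_by_first_year(rows):
    """rows: iterable of (auid, year, city); returns {auid: [city, ...]}."""
    # Equivalent of min('year') over a window partitioned by (auid, city).
    first_year = {}
    for auid, year, city in rows:
        key = (auid, city)
        first_year[key] = min(year, first_year.get(key, year))

    # Replace each year with the per-city minimum; the set drops the duplicates.
    adjusted = {(auid, first_year[(auid, city)], city) for auid, _, city in rows}

    # Sort by auid, then year, and keep only the city names per auid.
    result = {}
    for auid, _, city in sorted(adjusted):
        result.setdefault(auid, []).append(city)
    return result


rows = [
    ('1234', '2010', 'London'),
    ('1234', '2011', 'London'),  # hypothetical second London document
    ('1234', '2012', 'Berlin'),
]
print(cities_by_first_year(rows))
```

Years are compared as strings here, as in the original DataFrame, which works as long as they are all four-digit values.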


Use the aggregate function to remove duplicates from the city_set array (the result is an array of city names only):
df_new = d.groupBy('auid').agg(
    func.sort_array(func.collect_set(func.struct('year', 'city'))).alias('city_set')     
).withColumn("city_set", func.expr("""         
    aggregate(        
      /* expr: take slice of city_set array from the 2nd element to the last */
      slice(city_set,2,size(city_set)-1),           
      /* start: initialize `acc` as an array with a single entry city_set[0].city */
      array(city_set[0].city),
      /* merge: iterate through `expr`, if x.city exists in `acc`, keep as-is
       *        , otherwise add an entry to `acc` using concat function */
      (acc,x) -> IF(array_contains(acc,x.city), acc, concat(acc, array(x.city)))                     
    )                              
"""))
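The aggregate call is a left fold over the sorted array. A minimal plain-Python sketch of the same fold, using functools.reduce, may help when reading the SQL; the function name dedupe_cities and the sample input are hypothetical, and (year, city) tuples stand in for the structs.

```python
from functools import reduce


def dedupe_cities(city_set):
    """city_set: non-empty list of (year, city) tuples, already sorted by year."""
    first, rest = city_set[0], city_set[1:]  # rest corresponds to slice(city_set, 2, size - 1)
    return reduce(
        # merge: keep acc unchanged if the city was already seen, otherwise append it
        lambda acc, x: acc if x[1] in acc else acc + [x[1]],
        rest,
        [first[1]],  # start: an array holding only the first city
    )


print(dedupe_cities([('2010', 'London'), ('2011', 'London'), ('2012', 'Berlin')]))
```

Because the input is sorted by year before folding, each city stays at the position of its earliest year.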


Note: sorting is much easier with Spark 3.0+, though:
df_new = d.groupBy('auid').agg(func.expr("array_sort(collect_set((city,year)), (l,r) -> int(l.year-r.year)) as city_set"))

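Great, thanks! I still don't see how to get a column containing only the cities, in order, i.e. [London, Berlin]?

@ande, if you only need one field from the array of structs, just skip the transform function and use dot notation:

d.groupBy('auid').agg(func.sort_array(func.collect_set(func.struct('year', 'city'))).city.alias('city_set'))

See also what we did for the subjects column. BTW, a more robust way is to use .getItem('city') or ['city'], in case the field name contains special characters such as spaces or dots.

Thanks again! I do see a limitation with the sample data, though. London may have two different documents, so in that case I see [London, London, Berlin], while I would like to see [London, Berlin]. Do you think this is possible?

@ande, you are very welcome. For the issue you just mentioned, I think a simple fix is to use a window function to create a temporary column (or just overwrite the existing year column), for example:

d = d.withColumn('year_min', func.min('year').over(Window.partitionBy('auid', 'city')))

and then use year_min in place of year in the code.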