Apache Spark: how to do the opposite of explode in PySpark?


Suppose I have a DataFrame with one column of users and another column of words they have written:

Row(user='Bob', word='hello')
Row(user='Bob', word='world')
Row(user='Mary', word='Have')
Row(user='Mary', word='a')
Row(user='Mary', word='nice')
Row(user='Mary', word='day')
I would like to aggregate the word column into a vector:

Row(user='Bob', words=['hello', 'world'])
Row(user='Mary', words=['Have', 'a', 'nice', 'day'])

It seems I can't use any of Spark's grouping functions, because they expect a subsequent aggregation step. My use case is that I want to feed this data into Word2Vec without applying any other Spark aggregation.
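
For context, once the words are collected into an array column, the result could be fed to pyspark.ml.feature.Word2Vec roughly as in this sketch (the grouped_df name and the vectorSize/minCount values here are only illustrative assumptions):

from pyspark.ml.feature import Word2Vec

# grouped_df is assumed to hold one row per user with a 'words' column of type array<string>
word2vec = Word2Vec(vectorSize=50, minCount=1, inputCol='words', outputCol='vecs')
model = word2vec.fit(grouped_df)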

Here is a solution using rdd:

from pyspark.sql import Row
rdd = spark.sparkContext.parallelize([Row(user='Bob', word='hello'),
                                      Row(user='Bob', word='world'),
                                      Row(user='Mary', word='Have'),
                                      Row(user='Mary', word='a'),
                                      Row(user='Mary', word='nice'),
                                      Row(user='Mary', word='day')])
# group the rows by user, then collapse each group into a single Row holding the list of words
group_user = rdd.groupBy(lambda x: x.user)
group_agg = group_user.map(lambda x: Row(**{'user': x[0], 'word': [t.word for t in x[1]]}))
Output of group_agg.collect():

[Row(user='Bob', word=['hello', 'world']),
 Row(user='Mary', word=['Have', 'a', 'nice', 'day'])]


Thanks to @titipat for the RDD solution. Shortly after my post I realized that there is actually a DataFrame solution using collect_set (or collect_list):
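
A minimal sketch of that DataFrame approach, assuming the same rdd of Row objects as above (collect_list keeps duplicates, collect_set drops them):

from pyspark.sql import functions as F

df = spark.createDataFrame(rdd)
words_df = df.groupBy('user').agg(F.collect_list('word').alias('words'))
words_df.show(truncate=False)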


Starting with Spark 2.3, we now have Pandas UDFs (a.k.a. vectorized UDFs). The function below will accomplish the OP's task... A nice benefit of using this function is that order is guaranteed to be preserved. Order is essential in many cases, such as time series analysis.

import pandas as pd
import findspark

findspark.init()
import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, ArrayType

spark = SparkSession.builder.appName('test_collect_array_grouped').getOrCreate()

def collect_array_grouped(df, groupbyCols, aggregateCol, outputCol):
    """
    Aggregate function: returns a new :class:`DataFrame` such that for a given column, aggregateCol,
    in a DataFrame, df, collect into an array the elements for each grouping defined by the groupbyCols list.
    The new DataFrame will have, for each row, the grouping columns and an array of the grouped
    values from aggregateCol in the outputCol.

    :param groupbyCols: list of columns to group by.
            Each element should be a column name (string) or an expression (:class:`Column`).
    :param aggregateCol: the column name of the column of values to aggregate into an array
            for each grouping.
    :param outputCol: the column name of the column to output the aggregated array to.
    """
    groupbyCols = [] if groupbyCols is None else groupbyCols
    df = df.select(groupbyCols + [aggregateCol])
    schema = df.select(groupbyCols).schema
    aggSchema = df.select(aggregateCol).schema
    arrayField = StructField(name=outputCol, dataType=ArrayType(aggSchema[0].dataType, False))
    schema = schema.add(arrayField)
    @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
    def _get_array(pd_df):
        vals = pd_df[groupbyCols].iloc[0].tolist()
        vals.append(pd_df[aggregateCol].values)
        return pd.DataFrame([vals])
    return df.groupby(groupbyCols).apply(_get_array)

rdd = spark.sparkContext.parallelize([Row(user='Bob', word='hello'),
                                      Row(user='Bob', word='world'),
                                      Row(user='Mary', word='Have'),
                                      Row(user='Mary', word='a'),
                                      Row(user='Mary', word='nice'),
                                      Row(user='Mary', word='day')])
df = spark.createDataFrame(rdd)

collect_array_grouped(df, ['user'], 'word', 'users_words').show()

+----+--------------------+
|user|         users_words|
+----+--------------------+
|Mary|[Have, a, nice, day]|
| Bob|      [hello, world]|
+----+--------------------+
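
On Spark 3.x the GROUPED_MAP Pandas UDF type is deprecated in favor of GroupedData.applyInPandas; a sketch of how the last lines of collect_array_grouped could be written instead (same _get_array body, just as a plain function inside collect_array_grouped):

    # replaces the @pandas_udf-decorated _get_array and the final return above
    def _get_array(pd_df):
        vals = pd_df[groupbyCols].iloc[0].tolist()
        vals.append(pd_df[aggregateCol].values)
        return pd.DataFrame([vals])
    return df.groupby(groupbyCols).applyInPandas(_get_array, schema=schema)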

You have a native aggregate function for this, collect_set (docs).

Then, you can use:

from pyspark.sql import functions as F
df.groupby("user").agg(F.collect_set("word"))
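
To get the output column named words as in the question, the aggregate can simply be aliased:

df.groupby("user").agg(F.collect_set("word").alias("words"))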

Nice solution, Evan! I was going to post the PySpark DataFrame solution as well, but you already thought of it :)
Does using collect_list preserve order?
@Evan I know that doing an orderBy followed by collect_list does not preserve the order.
@Evan It's a different situation: the orderBy ordering is not respected. I know because it bit me once, but I have never been able to determine whether collect_list keeps the original order. What happens when the list is built from data spread across partitions? The behavior is not well documented.
In my case the order of the words does not matter, but I'm sure there are applications where it could. As a rule, I would not assume order is preserved unless the documentation explicitly says so.
@lfvv collect_set removes duplicates.
from pyspark.sql import functions as F

df.groupby("user").agg(F.collect_list("word"))
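
If the order of the collected words matters, a common workaround (related to the discussion in the comments above) is to collect structs carrying an explicit ordering column and sort the array; a sketch assuming a pos column that records each word's original position, which the example data here does not have:

from pyspark.sql import functions as F

# collect (pos, word) pairs, sort them by pos, then keep only the words
ordered = (df.groupBy("user")
             .agg(F.sort_array(F.collect_list(F.struct("pos", "word"))).alias("pairs"))
             .select("user", F.col("pairs.word").alias("words")))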