Python: Uniformly partition a PySpark DataFrame by the count of non-null elements in each row

python performance machine-learning pyspark

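I know there are a thousand questions about how best to partition DataFrames or RDDs by salting keys, but I think this situation is different enough to warrant its own question.

I am building a collaborative filtering recommendation engine in PySpark, which means each user's (row's) unique item ratings need to be compared with one another. So, for a DataFrame of dimensions M (rows) x N (columns), this means the dataset becomes M x (K choose 2), where K ...

What you can do is get a list of users sorted by their number of ratings, then divide each user's index in that sorted list by the number of partitions. Take the remainder of the division as a column, and then repartition on that column with partitionBy(). This way, each partition gets an almost equal representation of all user rating counts.

For 3 partitions, this gives you: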
[1000, 800, 700, 600, 200, 30, 10, 5] - number of ratings
[   0,   1,   2,   3,   4,  5,  6, 7] - position in sorted index
[   0,   1,   2,   0,   1,  2,  0, 1] - group to partition by
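
The assignment in the table above can be checked without Spark. The following is a minimal plain-Python sketch, not the answerer's code; the helper names `assign_groups` and `totals_per_group` are hypothetical and used only for illustration. It sorts the rating counts in descending order, assigns each position to `position % n_partitions`, and sums the ratings that land in each group.

```python
from collections import defaultdict


def assign_groups(rating_counts, n_partitions):
    """Sort counts descending and pair each with `position % n_partitions`."""
    ordered = sorted(rating_counts, reverse=True)
    return [(count, pos % n_partitions) for pos, count in enumerate(ordered)]


def totals_per_group(assignments):
    """Sum the rating counts that fall into each group."""
    totals = defaultdict(int)
    for count, group in assignments:
        totals[group] += count
    return dict(totals)


counts = [1000, 800, 700, 600, 200, 30, 10, 5]  # example counts from the table
assignments = assign_groups(counts, 3)
groups = [group for _, group in assignments]
totals = totals_per_group(assignments)

print(groups)  # [0, 1, 2, 0, 1, 2, 0, 1]
print(totals)  # {0: 1610, 1: 1005, 2: 730}
```

Each group receives a mix of heavy and light users, although for this small example the totals are still not equal, because the first group always receives the heaviest user of every round.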

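Hey, thanks a lot for your answer. I implemented it (I added a function in an edit to my question). I think it is a very elegant solution, but for whatever reason I am not seeing any improvement.

You're welcome. Try comparing histograms of the rating distribution across the different partitions. If they look similar, you have hit the floor. What you can still do is virtually split the "power voters" into multiple users and then merge the results back (if applicable). By the way, users' votes should be distributed according to Zipf's law, so having "power voters" is normal.

The histograms look roughly like the shuffle example. Good to know re: the Zipfian voter distribution; I didn't know that. Even though it produced roughly similar results, I think this is the correct answer.
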
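
One way to act on the suggestion to compare partitions is to total the pairwise-comparison work per group, using the same K choose 2 count that the question's code computes per row. This is a plain-Python sketch under assumptions: the `partitions` dictionary is hypothetical and simply holds the example rating counts from the answer, grouped by their modulo key, and `pairwise_work` is an illustrative helper rather than part of any library.

```python
def n_choose_2(k):
    """Number of unordered pairs among k rated items."""
    return k * (k - 1) // 2


def pairwise_work(partitions):
    """Total K-choose-2 work per partition id."""
    return {pid: sum(n_choose_2(k) for k in counts)
            for pid, counts in partitions.items()}


# Hypothetical grouping: the answer's example counts, keyed by modulo group.
partitions = {0: [1000, 600, 10], 1: [800, 200, 5], 2: [700, 30]}
work = pairwise_work(partitions)

print(work)  # {0: 679245, 1: 339510, 2: 245085}
```

Because the pairwise work grows roughly with the square of K, the group holding the heaviest user still dominates in this example even though the rating counts are spread round-robin.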
def _make_ratings(row):
    import numpy as np
    non_null_mask = ~np.isnan(row)
    idcs = np.where(non_null_mask)[0]  # extract the non-null index mask

    # zip the non-null idcs with the corresponding ratings
    rtgs = row[non_null_mask]
    return list(zip(idcs, rtgs))


def as_array(partition):
    import numpy as np
    for row in partition:
        yield _make_ratings(np.asarray(row, dtype=np.float32))


# drop the id column, get the RDD, and convert each row into a list of
# (column index, rating) pairs for its non-null entries
ratings = R.drop('id').rdd\
           .mapPartitions(as_array)\
           .cache()

n_choose_2 = (lambda itrbl: (len(itrbl) * (len(itrbl) - 1)) / 2.)
sorted(ratings.map(n_choose_2).glom().map(sum).collect(), reverse=True)

def shuffle_partition(X, n_partitions, col_name='shuffle'):
    from pyspark.sql.functions import rand
    X2 = X.withColumn(col_name, rand())
    return X2.repartition(n_partitions, col_name).drop(col_name)

def partition_by_rating_density(X, id_col_name, n_partitions,
                                partition_col_name='partition'):
    """Segment partitions by rating density. Partitions will be more
    evenly distributed based on the number of ratings for each user.

    Parameters
    ----------
    X : PySpark DataFrame
        The ratings matrix

    id_col_name : str
        The ID column name

    n_partitions : int
        The number of partitions in the new DataFrame.

    partition_col_name : str
        The name of the partitioning column

    Returns
    -------
    with_partition_key : PySpark DataFrame
        The partitioned DataFrame
    """
    ididx = X.columns.index(id_col_name)

    def count_non_null(row):
        sm = sum(1 if v is not None else 0
                 for i, v in enumerate(row) if i != ididx)
        return row[ididx], sm

    # add the count as the last element and id as the first
    counted = X.rdd.map(count_non_null)\
               .sortBy(lambda r: r[-1], ascending=False)

    # pair each (id, count) with its position in the sorted order, then
    # map that position to a partition key via modulo n_partitions
    indexed = counted.zipWithIndex()\
                     .map(lambda ti: (ti[0][0], ti[1] % n_partitions))\
                     .toDF([id_col_name, partition_col_name])

    # join back with indexed, which now has the partition column
    counted_indexed = X.join(indexed, on=id_col_name, how='inner')

    # repartition on the partition key, then drop the helper column
    return counted_indexed.repartition(n_partitions, partition_col_name)\
        .drop(partition_col_name)
