PySpark: find all combinations by group


I want to find all possible combinations within an iterable object.

My input is:

Object1|DrDre|1.0
Object1|Plane and a Disaster|2.0
Object1|Tikk Takk Tikk|3.5
Object1|Tennis Dope|5.0
Object2|DrDre|11.0
Object2|Plane and a Disaster|14.0
Object2|Just My Luck|2.0
Object2|Tennis Dope|45.0

The expected output looks like this:

[(('DrDre', 'Plane and a Disaster'), (11.0, 14.0, 1.0, 2.0)),
(('DrDre', 'Tikk Takk Tikk'), (1.0, 3.5)),
(('DrDre', 'Tennis Dope'), (11.0, 45.0, 1.0, 5.0)),
(('Plane and a Disaster', 'Tikk Takk Tikk'), (2.0, 3.5)),
(('Plane and a Disaster', 'Tennis Dope'), (14.0, 45.0, 2.0, 5.0)),
(('Tikk Takk Tikk', 'Tennis Dope'), (3.5, 5.0)),
(('DrDre', 'Just My Luck'), (11.0, 2.0)),
(('Plane and a Disaster', 'Just My Luck'), (14.0, 2.0)),
(('Just My Luck', 'Tennis Dope'), (2.0, 45.0))]

This is my current code, which does not end up giving the correct combinations:

def iterate(iterable):
    r = []
    for v1_iterable in iterable:
        for v2 in v1_iterable:
            r.append(v2)

    return tuple(r)

def parseVector(line):
    '''
    Parse each line of the specified data file, assuming a "|" delimiter.
    Converts each rating to a float
    '''
    line = line.split("|")
    return line[0],(line[1],float(line[2]))

def FindPairs(object_id,items_with_usage):
    '''
    For each objects, find all item-item pairs combos. (i.e. items with the same user) 
    '''
    for item1,item2 in combinations(items_with_usage,2):
        return (item1[0],item2[0]),(item1[1],item2[1])


''' 
Obtain the sparse object-item matrix:
    user_id -> [(object_id_1, rating_1),
               [(object_id_2, rating_2),
                ...]
'''
object_item_pairs = lines.map(parseVector).groupByKey().map(
    lambda p: sampleInteractions(p[0],p[1],500)).cache()


'''
Get all item-item pair combos:
    (item1,item2) ->    [(item1_rating,item2_rating),
                         (item1_rating,item2_rating),
                         ...]
'''

pairwise_objects = object_item_pairs.filter(
    lambda p: len(p[1]) > 1).map(
    lambda p: FindPairs(p[0],p[1])).groupByKey()



x = pairwise_objects.mapValues(iterate)
x.collect()

This only gives me the first pair and nothing else:

[(('DrDre', 'Plane and a Disaster'), (11.0, 14.0, 1.0, 2.0))]

Am I misunderstanding what the combinations() function does?

Thanks for your input.

I think you can change your FindPairs function like this:

from itertools import combinations

def FindPairs(object_id,items_with_usage):
    '''
    For each object, find all item-item pair combos (i.e. items with the same object).
    '''
    t = []
    for item1,item2 in combinations(items_with_usage,2):
        t.append(((item1[0],item2[0]),(item1[1],item2[1])))
    return t

Now the function returns a list containing all of the pairs.
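To see the difference on the Object1 rows from the question, here is a small sketch in plain Python (no Spark needed). The names `find_pairs_buggy` and `find_pairs_fixed` are made up for this illustration; they mirror the question's and the answer's versions of `FindPairs`.

```python
from itertools import combinations

# Items for Object1, in the (item, usage) shape produced by parseVector.
items_with_usage = [
    ("DrDre", 1.0),
    ("Plane and a Disaster", 2.0),
    ("Tikk Takk Tikk", 3.5),
    ("Tennis Dope", 5.0),
]

def find_pairs_buggy(object_id, items_with_usage):
    # Hypothetical copy of the question's version: the return sits inside
    # the loop, so the function exits on the first iteration.
    for item1, item2 in combinations(items_with_usage, 2):
        return (item1[0], item2[0]), (item1[1], item2[1])

def find_pairs_fixed(object_id, items_with_usage):
    # Hypothetical copy of the answer's version: collect every pair and
    # return only after the loop has finished.
    pairs = []
    for item1, item2 in combinations(items_with_usage, 2):
        pairs.append(((item1[0], item2[0]), (item1[1], item2[1])))
    return pairs

first_only = find_pairs_buggy("Object1", items_with_usage)
all_pairs = find_pairs_fixed("Object1", items_with_usage)

print(first_only)      # (('DrDre', 'Plane and a Disaster'), (1.0, 2.0))
print(len(all_pairs))  # 6, one pair for each 2-item combination of 4 items
```

So `combinations()` itself does produce every pair; the early `return` is what throws the rest away.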

Then, use flatMap before grouping the RDD and applying the function (this way every pair becomes its own row):

pairwise_objects=pairwise_objects.flatMap(lambda p: p).groupByKey().mapValues(iterate)
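For readers who want to trace the data flow without a Spark session, the sketch below emulates the same steps in plain Python: group by object, flatten the per-object pair lists (the `flatMap` step), group by pair, then flatten the usage tuples as `iterate` does. The helpers `parse_vector`, `find_pairs`, and `group_by_key` are stand-ins written for this example, not Spark APIs, and the `sampleInteractions` step from the question is left out.

```python
from collections import defaultdict
from itertools import chain, combinations

lines = [
    "Object1|DrDre|1.0",
    "Object1|Plane and a Disaster|2.0",
    "Object1|Tikk Takk Tikk|3.5",
    "Object1|Tennis Dope|5.0",
    "Object2|DrDre|11.0",
    "Object2|Plane and a Disaster|14.0",
    "Object2|Just My Luck|2.0",
    "Object2|Tennis Dope|45.0",
]

def parse_vector(line):
    # "object|item|usage" -> (object, (item, usage))
    obj, item, usage = line.split("|")
    return obj, (item, float(usage))

def find_pairs(object_id, items_with_usage):
    # All item-item pairs for one object, as ((item1, item2), (usage1, usage2)).
    return [((a[0], b[0]), (a[1], b[1]))
            for a, b in combinations(items_with_usage, 2)]

def group_by_key(records):
    # Stand-in for RDD.groupByKey(): key -> list of values.
    grouped = defaultdict(list)
    for key, value in records:
        grouped[key].append(value)
    return grouped

# Step 1: group the parsed rows by object id.
items_by_object = group_by_key(parse_vector(line) for line in lines)

# Step 2 (the flatMap step): one record per pair instead of one list per object.
pair_records = chain.from_iterable(
    find_pairs(obj, items)
    for obj, items in items_by_object.items()
    if len(items) > 1
)

# Step 3: group by pair, then flatten the usage tuples (what iterate does).
pairwise = {
    pair: tuple(chain.from_iterable(usages))
    for pair, usages in group_by_key(pair_records).items()
}

for pair, usages in pairwise.items():
    print(pair, usages)
```

One caveat: the pair key depends on the order of items within each group. If that order can differ between objects, sorting `items_with_usage` before calling `combinations` keeps (A, B) and (B, A) from ending up as separate keys.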

Final output:

[(('DrDre', 'Tennis Dope'), (1.0, 5.0, 11.0, 45.0)),
(('DrDre', 'Plane and a Disaster'), (1.0, 2.0, 11.0, 14.0)), 
(('Plane and a Disaster', 'Tennis Dope'), (2.0, 5.0, 14.0, 45.0)), 
(('Plane and a Disaster', 'Just My Luck'), (14.0, 2.0)),
(('Plane and a Disaster', 'Tikk Takk Tikk'), (2.0, 3.5)),
(('DrDre', 'Tikk Takk Tikk'), (1.0, 3.5)),
(('Tikk Takk Tikk', 'Tennis Dope'), (3.5, 5.0)), 
(('DrDre', 'Just My Luck'), (11.0, 2.0)), 
(('Just My Luck', 'Tennis Dope'), (2.0, 45.0))]

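You put the return statement inside the for loop, which means the function exits on the first iteration. That is why you only get the first pair: you are not storing all the elements of combinations(items_with_usage, 2), you only return the first pair of items.

Ah, thank you so much titiro89!! :)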
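As a side note, a generator is another way to avoid the early `return`: `yield` hands back one pair and then resumes the loop where it left off. This is a sketch of that alternative, not the code from the answer, and `find_pairs_gen` is a hypothetical name. Since `flatMap` consumes whatever iterable the function returns, a generator like this should work there as well.

```python
from itertools import combinations

def find_pairs_gen(object_id, items_with_usage):
    # Hypothetical generator variant: yield each pair instead of building a list.
    for item1, item2 in combinations(items_with_usage, 2):
        yield (item1[0], item2[0]), (item1[1], item2[1])

# A three-item subset of Object2, used only to keep the example short.
items = [("DrDre", 11.0), ("Just My Luck", 2.0), ("Tennis Dope", 45.0)]

pairs = list(find_pairs_gen("Object2", items))
for pair in pairs:
    print(pair)
```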
