PySpark nested loop in the same dataframe. How do I iterate?

EDIT / TL;DR: I am trying to implement a nested loop over a PySpark dataframe. As you can see, I want the nested loop to start, on each iteration, from the row right after the current row of the outer loop, to reduce unnecessary iterations. In Python I can use df[row.Index+1:]. Normally, in plain Python, I can achieve this with the following code:

import numpy as np

# 'class' is a Python keyword, so itertuples() cannot expose it as an attribute;
# rename that column first
df = df.rename(columns={'class': 'cls'})

r = 1000
joins = []
for row in df.itertuples():
    for k in df[row.Index + 1:].itertuples():
        a = np.array((row.x, row.y))
        b = np.array((k.x, k.y))
        if row.cls != k.cls:
            if row.x < k.x + r:
                if np.linalg.norm(a - b) < r:
                    joins.append((row.name, k.name))
print(joins)
I converted it into a Spark dataframe, sorted by 'y' in descending order:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# 'people' is the list of (name, x, y, class) tuples shown further below
schema = StructType([
    StructField('name', StringType(), True),
    StructField('x', IntegerType(), True),
    StructField('y', IntegerType(), True),
    StructField('class', StringType(), True),
])
rdd = spark.sparkContext.parallelize(people)
df = spark.createDataFrame(rdd, schema)
df.orderBy('y', ascending=False).show()
What I want is to join a 'class A' name with a 'class B' name when a condition is met, e.g. when the Euclidean distance (A item - B item) < 10. I don't think a cross join is the best idea, because it could be very time-consuming on a large dataset.
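For reference, a rough sketch of the straightforward DataFrame-API formulation that this question is trying to avoid, i.e. a conditional join between the two classes (the aliases a_name/b_name and the radius value are illustrative, not from the original post):

from pyspark.sql import functions as F

r = 10
a = df.where(F.col('class') == 'A').select(
    F.col('name').alias('a_name'), F.col('x').alias('ax'), F.col('y').alias('ay'))
b = df.where(F.col('class') == 'B').select(
    F.col('name').alias('b_name'), F.col('x').alias('bx'), F.col('y').alias('by'))

cond = (
    (F.abs(a['ay'] - b['by']) < r) &                               # cheap band check on y first
    ((a['ax'] - b['bx']) ** 2 + (a['ay'] - b['by']) ** 2 < r * r)  # full squared-distance check
)
pairs = a.join(b, cond).select('a_name', 'b_name')
pairs.show()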

I thought it would be good if I could iterate somehow. Pseudocode (a concrete version is sketched right after the block):

start with row1
if row1[class] != row2[class]:
    if row1['y'] - row2['y'] < 10:
        if Euclidean distance(row1.item - row2.item) < 10:
            join row1.name, row2.name
        end
    else break

it keeps iterating until row1['y'] - row2['y'] >= 10

then a new iteration starts from row2
if row2[class] != row3[class]:
    etc etc
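A minimal Python sketch of the pseudocode above, assuming `rows` is a list of dicts with keys name/x/y/class, already sorted by 'y' in descending order (the function name and threshold value are illustrative only):

def pair_names(rows, r=10):
    joins = []
    for i, row1 in enumerate(rows):
        for row2 in rows[i + 1:]:
            # rows are sorted by y descending, so once the y-gap reaches r,
            # no later row2 can be within distance r of row1
            if row1['y'] - row2['y'] >= r:
                break
            if row1['class'] != row2['class']:
                dx = row1['x'] - row2['x']
                dy = row1['y'] - row2['y']
                if (dx * dx + dy * dy) ** 0.5 < r:
                    joins.append((row1['name'], row2['name']))
    return joins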

Desired output: brad, jerry (the only pair whose items belong to different classes and whose Euclidean distance is 7, which is less than 10).

If I understand the question correctly, you can split the dataframe into two dataframes based on the class column and then join them based on the specified join clause (using an outer join):

from pyspark.sql.functions import col, collect_list, struct
A_df = df.where(col('class') == 'A').withColumnRenamed('name', 'A.name')
B_df = df.where(col('class') == 'B').withColumnRenamed('name', 'B.name')

join_clause = A_df.y - B_df.y <= 10
result = A_df.join(B_df, join_clause, 'outer')
# collect the matched name pairs into a single list
result = result.agg(collect_list(struct(col('`A.name`'), col('`B.name`'))))

Update

Here is an implementation that uses mapPartitions, with no joins and no conversion to a DataFrame:

import math

from pyspark.sql import SparkSession


def process_data(rows):
    # rows is an iterator over the records of a single RDD partition
    r = 1000
    joins = []
    for row1 in rows:
        for row2 in rows:
            if row1['class'] != row2['class']:
                if row1['x'] < row2['x'] + r:
                    if math.sqrt((row1['x'] - row2['x']) ** 2 + (row1['y'] - row2['y']) ** 2) <= r:
                        joins.append((row1['name'], row2['name']))
    return joins


spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

people = [('john', 35, 54, 'A'),
          ('george', 94, 84, 'B'),
          ('nicolas', 7, 9, 'B'),
          ('tom', 86, 93, 'A'),
          ('jason', 62, 73, 'B'),
          ('bill', 15, 58, 'A'),
          ('william', 9, 3, 'A'),
          ('brad', 73, 37, 'B'),
          ('cosmo', 52, 67, 'B'),
          ('jerry', 73, 30, 'A')]

fields = ('name', 'x', 'y', 'class')

data = [dict(zip(fields, person)) for person in people]

rdd = spark.sparkContext.parallelize(data)

result = rdd.mapPartitions(process_data).collect()
print(result)
Result:

[('tom', 'jason'), ('cosmo', 'jerry')]

Update 2

Added an initial sorting step on the 'y' field, repartitioned to make sure all of the data is on a single partition (so that all records can be compared with each other), and changed the nested loop:

import math

from pyspark.sql import SparkSession


def process_data(rows):
    r = 1000
    joins = []
    # materialize the partition iterator so it can be sliced and re-iterated
    rows = list(rows)
    for i, row1 in enumerate(rows):
        for row2 in rows[i:]:
            if row1['class'] != row2['class']:
                if row1['x'] < row2['x'] + r:
                    if math.sqrt((row1['x'] - row2['x']) ** 2 + (row1['y'] - row2['y']) ** 2) < r:
                        joins.append((row1['name'], row2['name']))
    return joins


spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

people = [('john', 35, 54, 'A'),
          ('george', 94, 84, 'B'),
          ('nicolas', 7, 9, 'B'),
          ('tom', 86, 93, 'A'),
          ('jason', 62, 73, 'B'),
          ('bill', 15, 58, 'A'),
          ('william', 9, 3, 'A'),
          ('brad', 73, 37, 'B'),
          ('cosmo', 52, 67, 'B'),
          ('jerry', 73, 30, 'A')]

fields = ('name', 'x', 'y', 'class')

data = [dict(zip(fields, person)) for person in people]

rdd = spark.sparkContext.parallelize(data)

result = rdd.sortBy(lambda x: x['y'], ascending=False).repartition(1).mapPartitions(process_data).collect()
print(result)

Result:

[('william', 'nicolas'), ('william', 'brad'), ('william', 'cosmo'), ('william', 'jason'), ('william', 'george'), ('nicolas', 'jerry'), ('nicolas', 'john'), ('nicolas', 'bill'), ('nicolas', 'tom'), ('jerry', 'brad'), ('jerry', 'cosmo'), ('jerry', 'jason'), ('jerry', 'george'), ('brad', 'john'), ('brad', 'bill'), ('brad', 'tom'), ('john', 'cosmo'), ('john', 'jason'), ('john', 'george'), ('bill', 'cosmo'), ('bill', 'jason'), ('bill', 'george'), ('cosmo', 'tom'), ('jason', 'tom'), ('george', 'tom')]
As I mentioned, I don't want to use joins, because they check every A item against every B item. I'm looking for a solution similar to the pseudocode I posted, in order to reduce the number of checks and the execution time. I want to iterate row1 against the following rows until the "row1.y - row2.y < 10" clause fails, then iterate row2 against the following rows, and so on. Likewise, the second check (the full Euclidean distance) is only computed when that clause holds. You can find this at the beginning of my post, where I gave the Python code for the iteration. I just want to achieve the same thing in Spark.

Check out my update - is this a better approach / more in line with what you are looking for?

This is better, but it misses one point: what matters is that the nested loop starts, on every iteration, from the row after the current row of the first loop. In Python we can do that with for k in df[row.Index+1:].itertuples(). How can we do that here? Also, there must be another error somewhere: with r = 1000, every pair should be joined based on the Euclidean distance.
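For what the comments ask, a possible adjustment of the Update 2 process_data (a sketch under that assumption, not the answerer's code) is to start the inner loop one row after the current outer row, mirroring df[row.Index+1:].itertuples() from the pandas version:

import math

def process_data(rows):
    r = 1000
    joins = []
    rows = list(rows)
    for i, row1 in enumerate(rows):
        # start from i + 1 so that each unordered pair is checked exactly once
        for row2 in rows[i + 1:]:
            if row1['class'] != row2['class']:
                if row1['x'] < row2['x'] + r:
                    if math.sqrt((row1['x'] - row2['x']) ** 2 + (row1['y'] - row2['y']) ** 2) < r:
                        joins.append((row1['name'], row2['name']))
    return joins

This is a drop-in replacement for the function in Update 2 and can be passed to mapPartitions in the same way.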