Apache Spark: join the n elements that follow an element in a list with the list itself


apache-spark pyspark


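Using PySpark.

Follow-up: I think all I need to know is how to select the n elements that come after a given element in a list, and join them with the list itself.

For example, take the list 'a', 'b', 'c', 'd', 'e', 'f', 'g':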

+-------+-----+
| _index| item|
+-------+-----+
|   0   |   a |
|   1   |   b |
|   2   |   c |
|   3   |   d |
|   4   |   e |
|   5   |   f |
|   6   |   g |
+-------+-----+

The indices run from 0 to 6. Say we want to join the n=3 elements after 'c' with the list itself; we get:

+--------+-------+-------+
| _index | item1 | item2 |
+--------+-------+-------+
|   3    |   d   |   d   |
|   4    |   e   |   e   |
|   5    |   f   |   f   |
+--------+-------+-------+

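Below is a related piece of code. Can it be modified to pick up the elements that come after A, within distance n, and join them with the list that contains A? I am new to Spark and would appreciate some help. Thanks!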

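Suppose we have many lists. We first find an element in these lists that satisfies some condition, condition1, and give it the alias A. If we then randomly pick another element after the index of A (within a certain index distance, say 1-3) and join it with the list that contains A, we can do the following: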

import random
from pyspark.sql.functions import col

df.where(
    col('condition1') == 0  # find an element satisfying some condition; call it 'A'
).alias('A').join(
    df.alias('B'),
    # pick another element 1 to 3 positions after 'A' in the same list;
    # random.randint includes both ends and runs once on the driver,
    # so every row shares the same offset
    (col('A.ListId') == col('B.ListId'))
    & ((random.randint(1, 3) + col('A._index')) == col('B._index'))
)

Here is an example of a possible solution you can apply:

import random

l = [(0,"a"), (1,"b"), (2,"c"), (3,"d"), (4,"e"), (5,"f"), (6,"g")]
df = spark.createDataFrame(l, schema=["_index", "item"])

# get the index of the anchor element out of the first matching row
start = df.filter(df.item == "c").select("_index").first()[0]
# keep a random number (1 to 3) of the elements after it; randint includes both ends
df.filter((df._index > start) & (df._index <= random.randint(start + 1, start + 3))).show()

This is all very abstract. It would be very helpful if you could provide a small, representative input DataFrame and the desired output.

I don't think there is a way to do this without a Cartesian product.