
Python: how to identify the N last unique cells?

Tags: python, apache-spark, pyspark, apache-spark-sql

I have a column of IDs in a dataframe (the IDs are captured from people's movements). Using collect_list I created a new column that holds, for each row, all IDs seen so far as a list. How can I get the last N=5 unique elements of the created array?

I solved this with a UDF, but the performance is poor and I had to split the dataframe into many chunks. Can built-in Spark functions or some other technique improve the performance? Here is the sample data with the expected window column:

index;collect_list;window
0;[F];[F]
1;[F, B];[B, F]
2;[F, B, A];[A, B, F]
3;[F, B, A, F];[A, B, F]
4;[F, B, A, F, B];[A, B, F]
5;[F, B, A, F, B, G];[A, B, F, G]
6;[F, B, A, F, B, G, E];[A, B, E, F, G]
7;[F, B, A, F, B, G, E, F];[A, B, E, F, G]
8;[F, B, A, F, B, G, E, F, E];[A, B, E, F, G]
9;[F, B, A, F, B, G, E, F, E, B];[A, B, E, F, G]
10;[F, B, A, F, B, G, E, F, E, B, A];[A, B, E, F, G]
11;[F, B, A, F, B, G, E, F, E, B, A, D];[A, B, D, E, F]
12;[F, B, A, F, B, G, E, F, E, B, A, D, F];[A, B, D, E, F]
13;[F, B, A, F, B, G, E, F, E, B, A, D, F, E];[A, B, D, E, F]
14;[F, B, A, F, B, G, E, F, E, B, A, D, F, E, E];[A, B, D, E, F]
15;[F, B, A, F, B, G, E, F, E, B, A, D, F, E, E, D];[A, B, D, E, F]
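For context, a running collect_list column like the one above is typically built with a window over the ordered events; a minimal sketch, with the column names index and ID assumed from the sample data:

from pyspark.sql import Window
import pyspark.sql.functions as F

# Running list of all IDs seen so far, ordered by the event index
# (column names are assumed, not from the original thread).
w = Window.orderBy('index').rowsBetween(Window.unboundedPreceding, 0)
df = df.withColumn('collect_list', F.collect_list('ID').over(w))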

This is the function I use in Python, for both Pandas and Spark:

def getLast(arr, n=5):
    """Return the last n unique elements of arr, sorted."""
    addedSet = set()
    result = []
    lenArr = len(arr)

    # Walk the array from the end and collect unseen elements.
    for i in range(lenArr):
        element = arr[lenArr - i - 1]
        if element not in addedSet:
            result.append(element)
            addedSet.add(element)
            # Stop as soon as n unique elements have been found.
            if len(addedSet) == n:
                break

    return sorted(result)
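For Spark, the same function can be wrapped in a UDF; a minimal sketch, assuming string IDs and the collect_list column from the sample data (the dataframe name df is an assumption):

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

# Hypothetical registration of getLast as a PySpark UDF.
getLast_udf = F.udf(getLast, ArrayType(StringType()))
df = df.withColumn('window', getLast_udf(F.col('collect_list')))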
I also implemented the function in Scala hoping for better performance, but it crashes with an out-of-memory error:

  def list_array_V(myArray: Array[String], n: Int = 5): Array[String] = {

    @tailrec
    def window(myArray: Array[String], currentSet: Set[String] = Set()): Array[String] = {
      // Likely cause of the crash: there is no base case for an empty array,
      // so with fewer than n unique elements this recurses forever, and every
      // step copies the whole array via slice.
      if (currentSet.size >= n) currentSet.toArray
      else window(myArray.slice(0, myArray.length - 1), myArray.takeRight(1).toSet ++ currentSet)
    }

    window(myArray)
  }


My Spark version is 2.3.

Could you try this UDF solution:

val df = spark.sql(""" with t1 as (
select split("F", ",") as x  union all
select split("F,B", ",") as x  union all
select split("F,B,A", ",") as x  union all
select split("F,B,A,F", ",") as x  union all
select split("F,B,A,F,B", ",") as x  union all
select split("F,B,A,F,B,G", ",") as x  union all
select split("F,B,A,F,B,G,E", ",") as x  union all
select split("F,B,A,F,B,G,E,F", ",") as x  union all
select split("F,B,A,F,B,G,E,F,E", ",") as x  union all
select split("F,B,A,F,B,G,E,F,E,B", ",") as x  union all
select split("F,B,A,F,B,G,E,F,E,B,A", ",") as x  union all
select split("F,B,A,F,B,G,E,F,E,B,A,D", ",") as x  union all
select split("F,B,A,F,B,G,E,F,E,B,A,D,F", ",") as x  union all
select split("F,B,A,F,B,G,E,F,E,B,A,D,F,E", ",") as x  union all
select split("F,B,A,F,B,G,E,F,E,B,A,D,F,E,E", ",") as x  union all
select split("F,B,A,F,B,G,E,F,E,B,A,D,F,E,E,D", ",") as x
)
select x as arr from t1
""")


def last_5(arr: Seq[String]): Seq[String] = {
  val x1 = scala.collection.mutable.Set[String]()
  // Walk the array from the end; the set keeps the last 5 unique IDs.
  arr.reverse.foreach(x => if (x1.size < 5) x1.add(x))
  x1.toSeq
}

val udf_last_5 = udf(last_5(_: Seq[String]))

df.withColumn("result", udf_last_5(col("arr"))).show(false)

+------------------------------------------------+---------------+
|arr                                             |result         |
+------------------------------------------------+---------------+
|[F]                                             |[F]            |
|[F, B]                                          |[B, F]         |
|[F, B, A]                                       |[B, F, A]      |
|[F, B, A, F]                                    |[B, F, A]      |
|[F, B, A, F, B]                                 |[B, F, A]      |
|[F, B, A, F, B, G]                              |[B, F, G, A]   |
|[F, B, A, F, B, G, E]                           |[B, F, G, E, A]|
|[F, B, A, F, B, G, E, F]                        |[B, F, G, E, A]|
|[F, B, A, F, B, G, E, F, E]                     |[B, F, G, E, A]|
|[F, B, A, F, B, G, E, F, E, B]                  |[B, F, G, E, A]|
|[F, B, A, F, B, G, E, F, E, B, A]               |[B, F, G, E, A]|
|[F, B, A, F, B, G, E, F, E, B, A, D]            |[B, F, D, E, A]|
|[F, B, A, F, B, G, E, F, E, B, A, D, F]         |[B, F, D, E, A]|
|[F, B, A, F, B, G, E, F, E, B, A, D, F, E]      |[B, F, D, E, A]|
|[F, B, A, F, B, G, E, F, E, B, A, D, F, E, E]   |[B, F, D, E, A]|
|[F, B, A, F, B, G, E, F, E, B, A, D, F, E, E, D]|[B, F, D, E, A]|
+------------------------------------------------+---------------+

Try the code below? No UDF needed ;)


Yes, sorry, please run

spark.sql("set spark.sql.mapKeyDedupPolicy=LAST_WIN")

first: map_from_arrays rejects duplicate map keys by default, and LAST_WIN keeps the last value per key, which is exactly the last-seen position the map trick below relies on.

Could you explain why it is faster than my Scala function?

The array traversal stops as soon as the last five unique items have been found: we use break to exit the for loop, whereas the other solutions keep iterating to the end:
def last_5(arr :Seq[String]): Seq[String] = {
    import scala.util.control.Breaks._
    val x1=scala.collection.mutable.Set[String]()
    breakable { 
        for ( x <- arr.reverse )
        { 
            if(x1.size == 5 )
            {
                break
            }   
            x1.add(x)
        }
    }
    x1.toSeq
}

val udf_last_5 = udf( last_5(_:Seq[String]) )

df.withColumn("result",udf_last_5(col("arr"))).show(false)
import pyspark.sql.functions as F

df2 = df.withColumn(
    'seq',
    # Position index for every element of the array.
    F.sequence(F.lit(0), F.size('collect_list') - 1)
).select('*',
    F.array_sort(
        F.arrays_zip(
            # Map ID -> position; with LAST_WIN dedup each ID keeps its
            # last-seen position, so sorting by position orders the
            # unique IDs by recency.
            F.map_values(F.map_from_arrays('collect_list', 'seq')),
            F.map_keys(F.map_from_arrays('collect_list', 'seq'))
        )
    ).alias('list')
).select('*',
    F.when(
        F.size('list') >= 5,
        F.slice('list', -5, 5)  # keep the five most recent unique IDs
    ).otherwise(
        F.col('list')
    ).alias('list2')
).select('*',
    F.expr('transform(list2, x -> x["1"]) as window')  # drop the positions
).select(
    'collect_list',
    F.array_sort('window').alias('window')
)

df2.show(truncate=False)
+------------------------------------------------+---------------+
|collect_list                                    |window         |
+------------------------------------------------+---------------+
|[F]                                             |[F]            |
|[F, B]                                          |[B, F]         |
|[F, B, A]                                       |[A, B, F]      |
|[F, B, A, F]                                    |[A, B, F]      |
|[F, B, A, F, B]                                 |[A, B, F]      |
|[F, B, A, F, B, G]                              |[A, B, F, G]   |
|[F, B, A, F, B, G, E]                           |[A, B, E, F, G]|
|[F, B, A, F, B, G, E, F]                        |[A, B, E, F, G]|
|[F, B, A, F, B, G, E, F, E]                     |[A, B, E, F, G]|
|[F, B, A, F, B, G, E, F, E, B]                  |[A, B, E, F, G]|
|[F, B, A, F, B, G, E, F, E, B, A]               |[A, B, E, F, G]|
|[F, B, A, F, B, G, E, F, E, B, A, D]            |[A, B, D, E, F]|
|[F, B, A, F, B, G, E, F, E, B, A, D, F]         |[A, B, D, E, F]|
|[F, B, A, F, B, G, E, F, E, B, A, D, F, E]      |[A, B, D, E, F]|
|[F, B, A, F, B, G, E, F, E, B, A, D, F, E, E]   |[A, B, D, E, F]|
|[F, B, A, F, B, G, E, F, E, B, A, D, F, E, E, D]|[A, B, D, E, F]|
+------------------------------------------------+---------------+
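For reference, on Spark 2.4+ the same last-N-unique idea can also be written with built-in array functions alone; this is a sketch, not from the thread: reverse puts the most recent IDs first, array_distinct keeps each ID's first (i.e. most recent) occurrence, slice takes the first five, and array_sort matches the expected output.

import pyspark.sql.functions as F

# Sketch only: assumes Spark 2.4+ and a dataframe df with the
# collect_list column from the question.
df3 = df.select(
    'collect_list',
    F.array_sort(
        F.slice(F.array_distinct(F.reverse(F.col('collect_list'))), 1, 5)
    ).alias('window')
)
df3.show(truncate=False)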