Python: how to identify the last N unique cells?
Tags: python, apache-spark, pyspark, apache-spark-sql

I have a column of IDs in a dataframe (all IDs are captured from a person's movements). I created a new column with collect_list to get all IDs seen so far as a list. How can I get the last N=5 unique elements of the resulting array?

I solved this with a UDF, but the performance is poor and I need to split the dataframe into many chunks. Is there a Spark function or some other technique that performs better?
index;collect_list;window
0;[F];[F]
1;[F, B];[B, F]
2;[F, B, A];[A, B, F]
3;[F, B, A, F];[A, B, F]
4;[F, B, A, F, B];[A, B, F]
5;[F, B, A, F, B, G];[A, B, F, G]
6;[F, B, A, F, B, G, E];[A, B, E, F, G]
7;[F, B, A, F, B, G, E, F];[A, B, E, F, G]
8;[F, B, A, F, B, G, E, F, E];[A, B, E, F, G]
9;[F, B, A, F, B, G, E, F, E, B];[A, B, E, F, G]
10;[F, B, A, F, B, G, E, F, E, B, A];[A, B, E, F, G]
11;[F, B, A, F, B, G, E, F, E, B, A, D];[A, B, D, E, F]
12;[F, B, A, F, B, G, E, F, E, B, A, D, F];[A, B, D, E, F]
13;[F, B, A, F, B, G, E, F, E, B, A, D, F, E];[A, B, D, E, F]
14;[F, B, A, F, B, G, E, F, E, B, A, D, F, E, E];[A, B, D, E, F]
15;[F, B, A, F, B, G, E, F, E, B, A, D, F, E, E, D];[A, B, D, E, F]
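For context, here is a minimal sketch of how a cumulative collect_list column like the one above could be built in PySpark; the index and id column names and the tiny sample data are assumptions, not part of the original code:

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: one row per captured ID, ordered by an index column.
df = spark.createDataFrame(
    [(0, "F"), (1, "B"), (2, "A"), (3, "F"), (4, "B")],
    ["index", "id"],
)

# Cumulative list of every ID seen up to and including the current row.
w = Window.orderBy("index").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df = df.withColumn("collect_list", F.collect_list("id").over(w))
df.show(truncate=False)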
This is the function I use in Python, for both Pandas and Spark:
def getLast(arr, n=5):
    addedSet = set()
    result = []
    lenArr = len(arr)
    for i in range(lenArr):
        element = arr[lenArr - i - 1]
        if element not in addedSet:
            result.append(element)
            addedSet.add(element)
            if len(addedSet) == n:
                break
    return sorted(result)
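For reference, a minimal sketch of how this getLast function might be applied both in Pandas and as a Spark UDF; the DataFrame names pdf and df and the column name collect_list are assumptions:

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# Pandas: apply the function row by row to the list column.
pdf = pd.DataFrame({"collect_list": [["F"], ["F", "B"], ["F", "B", "A", "F"]]})
pdf["window"] = pdf["collect_list"].apply(getLast)

# Spark: wrap the same function in a UDF and apply it to the array column.
getLast_udf = F.udf(getLast, ArrayType(StringType()))
df = df.withColumn("window", getLast_udf("collect_list"))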
I also wrote the function in Scala to improve performance, but it crashes with an out-of-memory error:
def list_array_V(myArray: Array[String], n: Int = 5): Array[String] = {
  @tailrec
  def window(myArray: Array[String], currentSet: Set[String] = Set()): Array[String] = {
    if (currentSet.size >= n) currentSet.toArray
    else window(myArray.slice(0, myArray.length - 1), myArray.takeRight(1).toSet ++ currentSet)
  }
  window(myArray)
}
My Spark version is 2.3. Can you try this UDF solution:
val df = spark.sql(""" with t1 as (
  select split("F", ",") as x union all
  select split("F,B", ",") as x union all
  select split("F,B,A", ",") as x union all
  select split("F,B,A,F", ",") as x union all
  select split("F,B,A,F,B", ",") as x union all
  select split("F,B,A,F,B,G", ",") as x union all
  select split("F,B,A,F,B,G,E", ",") as x union all
  select split("F,B,A,F,B,G,E,F", ",") as x union all
  select split("F,B,A,F,B,G,E,F,E", ",") as x union all
  select split("F,B,A,F,B,G,E,F,E,B", ",") as x union all
  select split("F,B,A,F,B,G,E,F,E,B,A", ",") as x union all
  select split("F,B,A,F,B,G,E,F,E,B,A,D", ",") as x union all
  select split("F,B,A,F,B,G,E,F,E,B,A,D,F", ",") as x union all
  select split("F,B,A,F,B,G,E,F,E,B,A,D,F,E", ",") as x union all
  select split("F,B,A,F,B,G,E,F,E,B,A,D,F,E,E", ",") as x union all
  select split("F,B,A,F,B,G,E,F,E,B,A,D,F,E,E,D", ",") as x
)
select x as arr from t1
""")
def last_5(arr: Seq[String]): Seq[String] = {
  val x1 = scala.collection.mutable.Set[String]()
  arr.reverse.foreach(x => if (x1.size < 5) x1.add(x))
  x1.toSeq
}
val udf_last_5 = udf( last_5(_:Seq[String]) )
df.withColumn("result",udf_last_5(col("arr"))).show(false)
+------------------------------------------------+---------------+
|arr |result |
+------------------------------------------------+---------------+
|[F] |[F] |
|[F, B] |[B, F] |
|[F, B, A] |[B, F, A] |
|[F, B, A, F] |[B, F, A] |
|[F, B, A, F, B] |[B, F, A] |
|[F, B, A, F, B, G] |[B, F, G, A] |
|[F, B, A, F, B, G, E] |[B, F, G, E, A]|
|[F, B, A, F, B, G, E, F] |[B, F, G, E, A]|
|[F, B, A, F, B, G, E, F, E] |[B, F, G, E, A]|
|[F, B, A, F, B, G, E, F, E, B] |[B, F, G, E, A]|
|[F, B, A, F, B, G, E, F, E, B, A] |[B, F, G, E, A]|
|[F, B, A, F, B, G, E, F, E, B, A, D] |[B, F, D, E, A]|
|[F, B, A, F, B, G, E, F, E, B, A, D, F] |[B, F, D, E, A]|
|[F, B, A, F, B, G, E, F, E, B, A, D, F, E] |[B, F, D, E, A]|
|[F, B, A, F, B, G, E, F, E, B, A, D, F, E, E] |[B, F, D, E, A]|
|[F, B, A, F, B, G, E, F, E, B, A, D, F, E, E, D]|[B, F, D, E, A]|
+------------------------------------------------+---------------+
Can you try the code below? No UDF needed ;)
Yes, sorry, please run spark.sql('set spark.sql.mapKeyDedupPolicy=LAST_WIN') first.
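That setting matters because the no-UDF solution below uses map_from_arrays on a list with duplicate keys (the same ID occurs many times). A minimal sketch of the effect, assuming a Spark version (3.0+) where spark.sql.mapKeyDedupPolicy exists:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# By default duplicate map keys raise an error; LAST_WIN keeps the last value,
# so every ID ends up mapped to the index of its most recent occurrence.
spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")

spark.sql(
    "select map_from_arrays(array('F', 'B', 'F'), array(0, 1, 2)) as m"
).show(truncate=False)
# m is {F -> 2, B -> 1}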
Can you explain why this is faster than my Scala function? The array traversal stops as soon as we have the last five unique items: we use break and exit the for loop, while the other solutions keep iterating to the end.
def last_5(arr: Seq[String]): Seq[String] = {
  import scala.util.control.Breaks._
  val x1 = scala.collection.mutable.Set[String]()
  breakable {
    for (x <- arr.reverse) {
      if (x1.size == 5) {
        break
      }
      x1.add(x)
    }
  }
  x1.toSeq
}
val udf_last_5 = udf( last_5(_:Seq[String]) )
df.withColumn("result",udf_last_5(col("arr"))).show(false)
from pyspark.sql import functions as F

df2 = df.withColumn(
    'seq',
    F.sequence(F.lit(0), F.size('collect_list') - 1)
).select(
    '*',
    # map_from_arrays with LAST_WIN maps every ID to the index of its last occurrence;
    # zipping the values (last index) with the keys (ID) and sorting orders the IDs
    # by where they were last seen.
    F.array_sort(
        F.arrays_zip(
            F.map_values(F.map_from_arrays('collect_list', 'seq')),
            F.map_keys(F.map_from_arrays('collect_list', 'seq'))
        )
    ).alias('list')
).select(
    '*',
    # keep only the 5 most recently seen unique IDs
    F.when(
        F.size('list') >= 5,
        F.slice('list', -5, 5)
    ).otherwise(
        F.col('list')
    ).alias('list2')
).select(
    '*',
    # extract field "1" (the ID) from each (last index, ID) struct
    F.expr('transform(list2, x -> x["1"]) as window')
).select(
    'collect_list',
    F.array_sort('window').alias('window')
)
df2.show(truncate=False)
+------------------------------------------------+---------------+
|collect_list |window |
+------------------------------------------------+---------------+
|[F] |[F] |
|[F, B] |[B, F] |
|[F, B, A] |[A, B, F] |
|[F, B, A, F] |[A, B, F] |
|[F, B, A, F, B] |[A, B, F] |
|[F, B, A, F, B, G] |[A, B, F, G] |
|[F, B, A, F, B, G, E] |[A, B, E, F, G]|
|[F, B, A, F, B, G, E, F] |[A, B, E, F, G]|
|[F, B, A, F, B, G, E, F, E] |[A, B, E, F, G]|
|[F, B, A, F, B, G, E, F, E, B] |[A, B, E, F, G]|
|[F, B, A, F, B, G, E, F, E, B, A] |[A, B, E, F, G]|
|[F, B, A, F, B, G, E, F, E, B, A, D] |[A, B, D, E, F]|
|[F, B, A, F, B, G, E, F, E, B, A, D, F] |[A, B, D, E, F]|
|[F, B, A, F, B, G, E, F, E, B, A, D, F, E] |[A, B, D, E, F]|
|[F, B, A, F, B, G, E, F, E, B, A, D, F, E, E] |[A, B, D, E, F]|
|[F, B, A, F, B, G, E, F, E, B, A, D, F, E, E, D]|[A, B, D, E, F]|
+------------------------------------------------+---------------+