
PySpark Window.partitionBy() treats identical cell values as distinct


I have a dataframe with columns like this:

df_lp.show()                               
+--------------------+-------------+--------------------+------+
|                  ts|          uid|                 pid|action|
+--------------------+-------------+--------------------+------+
|2017-03-28 09:34:...|1950269663250|IST334-0149064968...|    <L|
|2017-03-31 05:50:...|S578448696405|IST334-0149089179...|    <L|
|2017-03-28 09:38:...|1950269663250|IST334-0149064968...|    <L|
|2017-03-30 09:26:...| 412310802992|IST334-1212011845...|    <L|
I want to find the first and last timestamp for a given pid and action, so I do this:

df_lp_pB = df_lp.filter(df_lp['action'] == '+B')\
                .select('pid', first_lp.alias('tsf'), last_lp.alias('tsl'))\
                .distinct().sort('pid')
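
The first_lp and last_lp window aggregates are not shown above; presumably (an assumption, based on the descending variants ts_w_lp2/first_lp2/last_lp2 later in the question) they were defined with ascending order, something like:

from pyspark.sql import Window
import pyspark.sql.functions as fns

# Presumed ascending-order definitions, mirroring the descending
# variants shown further down.
ts_w_lp = Window.partitionBy(df_lp['pid']).orderBy(df_lp['ts'])
first_lp = fns.min(df_lp['ts']).over(ts_w_lp)
last_lp = fns.max(df_lp['ts']).over(ts_w_lp)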
However, the dataframe I get back has extra rows, where one of the window functions seems to produce intermediate values:

df_lp_pB.filter('pid = "BAG26723881"').toPandas()
           pid                     tsf                     tsl
0  BAG26723881 2017-04-11 15:10:35.674 2017-04-11 15:10:35.674
1  BAG26723881 2017-04-11 15:10:35.674 2017-04-11 15:10:35.736
If I do the same thing in Spark SQL, it works as expected:

df_lp.createOrReplaceTempView('scans_lp')
df_sql = spark.sql("SELECT pid , min(ts) AS tsf, max(ts) AS tsl FROM scans_lp \
                    WHERE action='+B' GROUP BY pid ORDER BY pid")
df_sql.filter('pid = "BAG26723881"').toPandas()
           pid                     tsf                     tsl
0  BAG26723881 2017-04-11 15:10:35.674 2017-04-11 15:10:35.736
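
(For comparison, the same aggregation can also be written with the DataFrame API's groupBy, which involves no window frame at all; a sketch, with df_gb a name introduced here for illustration:)

import pyspark.sql.functions as fns

# Equivalent DataFrame-API aggregation: one row per pid by construction.
df_gb = df_lp.filter(df_lp['action'] == '+B')\
             .groupBy('pid')\
             .agg(fns.min('ts').alias('tsf'), fns.max('ts').alias('tsl'))\
             .sort('pid')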
In fact, if I flip the ordering of the timestamp column to descending, it is the other window function that misbehaves:

ts_w_lp2 = Window.partitionBy(df_lp['pid']).orderBy(df_lp['ts'].desc())
first_lp2 = fns.min(df_lp['ts']).over(ts_w_lp2)
last_lp2 = fns.max(df_lp['ts']).over(ts_w_lp2)
df_lp_pB2 = df_lp.filter(df_lp['action'] == '+B')\
                 .select('pid', first_lp2.alias('tsf'), last_lp2.alias('tsl'))\
                 .distinct().sort('pid')
df_lp_pB2.filter('pid = "BAG26723881"').toPandas()
           pid                     tsf                     tsl
0  BAG26723881 2017-04-11 15:10:35.736 2017-04-11 15:10:35.736
1  BAG26723881 2017-04-11 15:10:35.674 2017-04-11 15:10:35.736
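
(To see where the extra rows come from, one can inspect the per-row window values before distinct() is applied; a diagnostic sketch reusing first_lp2/last_lp2 from above:)

# Show every row for one pid with its window values attached,
# before distinct() collapses anything.
df_lp.filter(df_lp['action'] == '+B')\
     .filter('pid = "BAG26723881"')\
     .select('ts', 'pid', first_lp2.alias('tsf'), last_lp2.alias('tsl'))\
     .show(truncate=False)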
If I investigate further, I find that all my pids are being treated as distinct, even on rows where they are identical! See this:

df_lp.filter('action = "+B"').select('pid').distinct().count()
6382
df_sql.count()
6382
df_lp.filter('action = "+B"').select('pid').count()
120303
df_lp_pB.count()
120303
What is going on? Am I misunderstanding what Window.partitionBy() is supposed to do?
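
A likely explanation, sketched below (not part of the original question): when a window specification includes orderBy, Spark's default frame is rangeBetween(Window.unboundedPreceding, Window.currentRow), so min/max over that window are running aggregates up to the current row rather than partition-wide values, and distinct() then keeps every distinct intermediate result. Dropping the orderBy makes the frame cover the whole partition (ts_w_full, first_full, last_full, and df_fix are names introduced here for illustration):

from pyspark.sql import Window
import pyspark.sql.functions as fns

# Without orderBy, the frame spans the whole partition, so min/max
# yield a single value per pid.
ts_w_full = Window.partitionBy(df_lp['pid'])
first_full = fns.min(df_lp['ts']).over(ts_w_full)
last_full = fns.max(df_lp['ts']).over(ts_w_full)

df_fix = df_lp.filter(df_lp['action'] == '+B')\
              .select('pid', first_full.alias('tsf'), last_full.alias('tsl'))\
              .distinct().sort('pid')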
