PySpark: for each row of a dataframe, get the rows where the first column equals its id and the second column lies between two values

Tags: pyspark, apache-spark-sql, amazon-emr

I have a PySpark dataframe like this, call it dataframe a:
+-------------------+---------------+----------------+
|                reg|           val1|            val2|
+-------------------+---------------+----------------+
|             N110WA|     1590030660|   1590038340000|
|             N876LF|     1590037200|   1590038880000|
|             N135MH|     1590039060|   1590040080000|
+-------------------+---------------+----------------+
And another, similar one, call it dataframe b:
+-----+-------------+-----+-----+---------+----------+---+----+
|  reg|      postime|  alt| galt|      lat|      long|spd| vsi|
+-----+-------------+-----+-----+---------+----------+---+----+
|XY679|1590070078549|   50|  130|18.567169|-69.986343|132|1152|
|HI949|1590070091707|  375|  455|  18.5594|-69.987804|148|1344|
|JX784|1590070110666|  825|  905|18.544968|-69.990414|170|1216|
+-----+-------------+-----+-----+---------+----------+---+----+
Is there a way to create a NumPy array or PySpark dataframe that, for each row of dataframe a, contains all the rows of dataframe b with the same reg and a postime between that row's val1 and val2?

Yes. Assuming df_a and df_b are both PySpark dataframes, you can use an inner join:
delta = val  # optional tolerance, in the same units as postime; use 0 for an exact window
df = df_a.join(df_b, [
    df_a.reg == df_b.reg,
    df_b.postime >= df_a.val1 - delta,
    df_b.postime <= df_a.val2 + delta
], "inner")
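The semantics of that join can be sketched in plain Python (the function name below is illustrative, not part of the Spark API): for each (reg, val1, val2) row of a, keep the rows of b with a matching reg and a postime inside the window, widened on both sides by delta.

```python
# Plain-Python sketch of the interval join; delta widens the [val1, val2] window.
def interval_join(rows_a, rows_b, delta=0):
    """rows_a: (reg, val1, val2) tuples; rows_b: (reg, postime) tuples."""
    out = []
    for reg_a, val1, val2 in rows_a:
        for reg_b, postime in rows_b:
            if reg_a == reg_b and val1 - delta <= postime <= val2 + delta:
                out.append((reg_a, val1, val2, postime))
    return out

a = [('N110WA', 1590030660, 1590038340000)]
b = [('N110WA', 1590070078549), ('N110WA', 1590038340)]
# With delta=0 only the second postime lies inside [val1, val2]:
print(interval_join(a, b))
# A large enough delta pulls the first postime into the widened window too:
print(len(interval_join(a, b, delta=40_000_000)))
```

This also answers what delta does: with delta = 0 the match is exact; a positive delta tolerates postime values slightly outside [val1, val2].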
This filters the result down to only the matching rows.

You can try the solution below and let us know whether it works or you need something else. To demonstrate a working solution, I modified the inputs a bit:
from pyspark.sql import functions as F

df_a = spark.createDataFrame(
    [('N110WA', 1590030660, 1590038340000),
     ('N110WA', 1590070078549, 1590070078559)],
    ["reg", "val1", "val2"])
df_b = spark.createDataFrame([('N110WA', 1590070078549)], ["reg", "postime"])
df_a.show()

# Bring postime alongside val1/val2 by joining on reg
df_a = df_a.join(df_b, 'reg', 'left')
# Flag rows whose postime falls inside [val1, val2], then keep only those
df_a = df_a.withColumn('condition_col',
                       F.when((F.col('postime') >= F.col('val1')) &
                              (F.col('postime') <= F.col('val2')), '1')
                        .otherwise('0'))
df_a = df_a.filter(F.col('condition_col') == 1).drop('condition_col')
df_a.show()
df_a
+------+-------------+-------------+
| reg| val1| val2|
+------+-------------+-------------+
|N110WA| 1590030660|1590038340000|
|N110WA|1590070078549|1590070078559|
+------+-------------+-------------+
df_b
+------+-------------+
| reg| postime|
+------+-------------+
|N110WA|1590070078549|
+------+-------------+
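As a quick sanity check outside Spark: the left join on reg pairs both (val1, val2) rows of df_a with the single postime 1590070078549 from df_b, and only the second row's window contains it, so only that row survives the filter. In plain Python:

```python
# Reproduce the filter condition on the demo data above.
postime = 1590070078549
rows = [('N110WA', 1590030660, 1590038340000),
        ('N110WA', 1590070078549, 1590070078559)]
kept = [(reg, v1, v2) for reg, v1, v2 in rows if v1 <= postime <= v2]
print(kept)  # only the row whose [val1, val2] window contains postime
```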
Can you check and let me know whether this is what you are looking for? What is the meaning of delta?