Filter PySpark dataframe rows based on a column condition from another dataframe
I have two dataframes, and I want to select the rows of the first dataframe whose timestamp field is greater (more recent) than the maximum timestamp of the second dataframe. I tried this:
df1 = sqlContext.table("db.table1") # FIRST DATAFRAME
max_timestamp = sqlContext.sql("select max(timestamp) as max from db.table2") # MAX TIMESTAMP IN THE SECOND DATAFRAME
df1.where(df1.timestamp > max_timestamp.max).show(10,False)
But it fails with: AttributeError: 'DataFrame' object has no attribute '_get_object_id'
Any ideas / solutions?

Your problem is that you are comparing your dataframe against a column of another dataframe (max_timestamp.max). You need to either collect the result as a string, or cross join it onto the first dataframe as a new column and compare against that.
Please provide a minimal reproducible example (with the expected output).

Reproducible example:
data1 = [("1", "2020-01-01 00:00:00"), ("2", "2020-02-01 23:59:59")]
data2 = [("1", "2020-01-15 00:00:00"), ("2", "2020-01-16 23:59:59")]
df1 = spark.createDataFrame(data1, ["id", "timestamp"])
df2 = spark.createDataFrame(data2, ["id", "timestamp"])
Collect as a string:

from pyspark.sql.functions import col, max

# Aggregate to a single value and pull it back to the driver as a Python string.
max_timestamp = df2.select(max(col("timestamp")).alias("max")).collect()[0][0]
max_timestamp
# '2020-01-16 23:59:59'
df1.where(col("timestamp") > max_timestamp).show(10, truncate=False)
# +---+-------------------+
# |id |timestamp |
# +---+-------------------+
# |2 |2020-02-01 23:59:59|
# +---+-------------------+
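A note on why the collected-string comparison is safe: the value pulled back by collect() is a plain Python string, and comparing it with where() still gives chronological results, because zero-padded ISO-8601 timestamps (YYYY-MM-DD HH:MM:SS) sort lexicographically in the same order as chronologically:

```python
# Lexicographic order of zero-padded ISO-8601 strings equals chronological order,
# so Spark's string comparison in where() behaves like a timestamp comparison.
earlier = "2020-01-16 23:59:59"
later = "2020-02-01 23:59:59"

assert later > earlier           # plain string comparison
print(sorted([later, earlier]))  # ['2020-01-16 23:59:59', '2020-02-01 23:59:59']
```

This only holds for this fixed-width format; timestamps with varying widths or mixed formats would need a real timestamp type.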
Cross join as a new column:

from pyspark.sql.functions import col, max

# Aggregate the second dataframe down to a single row holding the max timestamp.
intermediate = df2.agg(max(col("timestamp")).alias("start_date_filter"))
intermediate.show(1, truncate=False)
# +-------------------+
# |start_date_filter |
# +-------------------+
# |2020-01-16 23:59:59|
# +-------------------+
(
    df1
    .crossJoin(intermediate)
    .where(col("timestamp") > col("start_date_filter"))
    .show(10, truncate=False)
)
# +---+-------------------+-------------------+
# |id |timestamp |start_date_filter |
# +---+-------------------+-------------------+
# |2 |2020-02-01 23:59:59|2020-01-16 23:59:59|
# +---+-------------------+-------------------+
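For intuition, both approaches compute the same thing; the whole job reduces to this plain-Python equivalent (illustrative only, not Spark API):

```python
data1 = [("1", "2020-01-01 00:00:00"), ("2", "2020-02-01 23:59:59")]
data2 = [("1", "2020-01-15 00:00:00"), ("2", "2020-01-16 23:59:59")]

# Max timestamp of the second dataset (ISO-8601 strings sort chronologically).
start_date_filter = max(ts for _, ts in data2)

# Keep only rows of the first dataset that are strictly more recent.
kept = [(i, ts) for i, ts in data1 if ts > start_date_filter]
print(kept)  # [('2', '2020-02-01 23:59:59')]
```

The difference in Spark is only where the scalar lives: the collect variant brings it to the driver as a Python value, while the cross join keeps it in a one-row dataframe on the executors.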