Dataframe: filter pyspark dataframe rows based on a column condition


I have two dataframes, and I want to select the rows of the first dataframe whose timestamp field is greater (more recent) than the maximum timestamp of the second dataframe. I tried this:

df1 = sqlContext.table("db.table1")   # FIRST DATAFRAME
max_timestamp = sqlContext.sql("select max(timestamp) as max from db.table2") # MAX TIMESTAMP IN THE SECOND DATAFRAME
df1.where(df1.timestamp > max_timestamp.max).show(10,False)
But it fails with: AttributeError: 'DataFrame' object has no attribute '_get_object_id'
Any ideas / solutions?

Your problem is that you are comparing a DataFrame against a column of another DataFrame (max_timestamp.max). You need to either collect the result as a string, or cross join it and compare against it as a new column.

Reproducible example

Here is a minimal reproducible example (with the expected output):
data1 = [("1", "2020-01-01 00:00:00"), ("2", "2020-02-01 23:59:59")]
data2 = [("1", "2020-01-15 00:00:00"), ("2", "2020-01-16 23:59:59")]
df1 = spark.createDataFrame(data1, ["id", "timestamp"])
df2 = spark.createDataFrame(data2, ["id", "timestamp"])
Option 1: collect the result as a string

from pyspark.sql.functions import col, max
# The aggregate already yields a single row, so no distinct() is needed.
max_timestamp = df2.select(max(col("timestamp")).alias("max")).collect()[0][0]
max_timestamp
# '2020-01-16 23:59:59'
df1.where(col("timestamp") > max_timestamp).show(10, truncate=False)
# +---+-------------------+
# |id |timestamp          |
# +---+-------------------+
# |2  |2020-02-01 23:59:59|
# +---+-------------------+
Option 2: cross join and compare as a new column

from pyspark.sql.functions import col, max
intermediate = (
    df2.
        agg(max(col("timestamp")).alias("start_date_filter"))
)
intermediate.show(1, truncate=False)
# +-------------------+                                                           
# |start_date_filter  |
# +-------------------+
# |2020-01-16 23:59:59|
# +-------------------+
(
    df1.
        crossJoin(intermediate).
        where(col("timestamp") > col("start_date_filter")).
        show(10, truncate=False)
)
# +---+-------------------+-------------------+
# |id |timestamp          |start_date_filter  |
# +---+-------------------+-------------------+
# |2  |2020-02-01 23:59:59|2020-01-16 23:59:59|
# +---+-------------------+-------------------+