PySpark: Get the last 3 days of data


Given a dataframe with dates, I only want to keep the rows for the 3 most recent dates present in the dataframe:

|id |date      |
|---|----------|
|1  |2019-12-01|
|2  |2019-11-30|
|3  |2019-11-29|
|4  |2019-11-28|
|5  |2019-12-01|
It should return:

|id |date      |
|---|----------|
|1  |2019-12-01|
|2  |2019-11-30|
|3  |2019-11-29|
|5  |2019-12-01|
I am trying the following code, but I get an error:

import pyspark.sql.functions as F

df = sqlContext.createDataFrame([
    (1, '/raw/gsec/qradar/flows/dt=2019-12-01/hour=00/1585218406613_flows_20191201_00.jsonl'),
    (2, '/raw/gsec/qradar/flows/dt=2019-11-30/hour=00/1585218406613_flows_20191201_00.jsonl'),
    (3, '/raw/gsec/qradar/flows/dt=2019-11-29/hour=00/1585218406613_flows_20191201_00.jsonl'),
    (4, '/raw/gsec/qradar/flows/dt=2019-11-28/hour=00/1585218406613_flows_20191201_00.jsonl'),
    (5, '/raw/gsec/qradar/flows/dt=2019-11-27/hour=00/1585218406613_flows_20191201_00.jsonl')
], ['id','partition'])

df = df.withColumn('date', F.regexp_extract('partition', '[0-9]{4}-[0-9]{2}-[0-9]{2}', 0))
dates = df.select('date').orderBy(F.desc('date')).distinct().limit(3).collect()

df.filter(df.date.isin(F.lit(dates))).show(10,False)
The error I get is:

Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.lit.
: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [[2019-12-01], [2019-11-30], [2019-11-29]]
    at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
    at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
    at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
    at scala.util.Try.getOrElse(Try.scala:79)
    at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:163)
    at org.apache.spark.sql.functions$.typedLit(functions.scala:127)
    at org.apache.spark.sql.functions$.lit(functions.scala:110)
    at org.apache.spark.sql.functions.lit(functions.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
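
The error comes from handing the list returned by collect() to F.lit: collect() gives a list of Row objects, and lit cannot build a Spark literal from that (hence "Unsupported literal type class java.util.ArrayList"). As a minimal sketch only, assuming the same df as above, one workaround is to unpack the rows into plain strings and pass them straight to isin, which accepts ordinary Python values:

import pyspark.sql.functions as F

# collect() returns Row objects, so extract the plain date strings first
rows = df.select('date').distinct().orderBy(F.desc('date')).limit(3).collect()
last_dates = [row['date'] for row in rows]  # e.g. ['2019-12-01', '2019-11-30', '2019-11-29']

# Column.isin takes plain Python values, so no lit() is needed
df.filter(F.col('date').isin(last_dates)).show(10, False)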

You can use collect_list instead of collect to turn dates into a single array and pass it to isin without the lit function:

dates = df.select('date').distinct().orderBy(F.desc('date')).limit(3).select(F.collect_list('date').alias('dates')).first()[0]

# ['2019-12-01', '2019-11-30', '2019-11-29']

df.filter(df.date.isin(dates)).show(10,False)
+---+----------------------------------------------------------------------------------+----------+
|id |partition                                                                         |date      |
+---+----------------------------------------------------------------------------------+----------+
|1  |/raw/gsec/qradar/flows/dt=2019-12-01/hour=00/1585218406613_flows_20191201_00.jsonl|2019-12-01|
|2  |/raw/gsec/qradar/flows/dt=2019-11-30/hour=00/1585218406613_flows_20191201_00.jsonl|2019-11-30|
|3  |/raw/gsec/qradar/flows/dt=2019-11-29/hour=00/1585218406613_flows_20191201_00.jsonl|2019-11-29|
+---+----------------------------------------------------------------------------------+----------+
A function to get the top n:

def get_top(df, column, n):
    dates = df.select(column).distinct().orderBy(F.desc(column)).limit(n).select(F.collect_list(column)).first()[0]
    return dates

dates = get_top(df, 'date', 3)
# ['2019-12-01', '2019-11-30', '2019-11-29']

Thanks. Is this an efficient approach if the dataset has more than 100,000,000 records?

@Chris: I generally dislike splitting/forking the dataset, because Spark has to recompute the whole DAG from the start or from the last cache point. I am not at a machine right now, but I will post an approach I think is more efficient.

Thanks. I am running it against 8'875'430'358 records, let's see how long it takes. At the moment only a single executor is running and Spark is spilling to disk because it is running out of memory. I need to find a better way; I have read that manually repartitioning would help.

@Chris: I don't think adding a dummy lit column will help get the data into one partition. Let me see if I can find another approach. Question: how many distinct years, months and dates do you have?