Pyspark:根据regex筛选最近3天的数据
我有一个带有日期的数据框,希望筛选最近3天的数据(不是基于当前时间,而是数据集中可用的最新时间) 应该回来Pyspark:根据regex筛选最近3天的数据,pyspark,Pyspark,我有一个带有日期的数据框,希望筛选最近3天的数据(不是基于当前时间,而是数据集中可用的最新时间) 应该回来 +---+----------------------------------------------------------------------------------+----------+ |id |partition |date
+---+----------------------------------------------------------------------------------+----------+
|id |partition |date |
+---+----------------------------------------------------------------------------------+----------+
|1 |/raw/gsec/qradar/flows/dt=2019-12-01/hour=00/1585218406613_flows_20191201_00.jsonl|2019-12-01|
|2 |/raw/gsec/qradar/flows/dt=2019-11-30/hour=00/1585218406613_flows_20191201_00.jsonl|2019-11-30|
|3 |/raw/gsec/qradar/flows/dt=2019-11-29/hour=00/1585218406613_flows_20191201_00.jsonl|2019-11-29|
+---+----------------------------------------------------------------------------------+----------+
编辑:我使用@Lamanus answer从分区字符串中提取日期
df = sqlContext.createDataFrame([
(1, '/raw/gsec/qradar/flows/dt=2019-12-01/hour=00/1585218406613_flows_20191201_00.jsonl'),
(2, '/raw/gsec/qradar/flows/dt=2019-11-30/hour=00/1585218406613_flows_20191201_00.jsonl'),
(3, '/raw/gsec/qradar/flows/dt=2019-11-29/hour=00/1585218406613_flows_20191201_00.jsonl'),
(4, '/raw/gsec/qradar/flows/dt=2019-11-28/hour=00/1585218406613_flows_20191201_00.jsonl'),
(5, '/raw/gsec/qradar/flows/dt=2019-11-27/hour=00/1585218406613_flows_20191201_00.jsonl')
], ['id','partition'])
df.withColumn('date', F.regexp_extract('partition', '[0-9]{4}-[0-9]{2}-[0-9]{2}', 0)) \
.show(10, False)
出于您最初的目的,我认为您不需要日期特定的文件夹。因为文件夹结构已经被
dt
分区,所以将它们全部进行筛选
df = spark.createDataFrame([('1', '/raw/gsec/qradar/flows/dt=2019-12-01/hour=00/1585218406613_flows_20191201_00.jsonl')]).toDF('id', 'value')
from pyspark.sql.functions import *
dates = df.withColumn('date', regexp_extract('value', '[0-9]{4}-[0-9]{2}-[0-9]{2}', 0)) \
.withColumn('date', explode(sequence(to_date('date'), date_sub('date', 2)))) \
.select('date').rdd.map(lambda x: str(x[0])).collect()
path = df.withColumn('value', split('value', '/dt')[0]) \
.select('value').rdd.map(lambda x: str(x[0])).collect()
newDF = spark.read.json(path).filter(col(dt).isin(dates))
这是我的尝试
df = spark.createDataFrame([('1', '/raw/gsec/qradar/flows/dt=2019-12-01/hour=00/1585218406613_flows_20191201_00.jsonl')]).toDF('id', 'value')
from pyspark.sql.functions import *
df.withColumn('date', regexp_extract('value', '[0-9]{4}-[0-9]{2}-[0-9]{2}', 0)) \
.withColumn('date', explode(sequence(to_date('date'), date_sub('date', 2)))) \
.withColumn('value', concat(lit('.*/'), col('date'), lit('/.*'))).show(10, False)
+---+----------------+----------+
|id |value |date |
+---+----------------+----------+
|1 |.*/2019-12-01/.*|2019-12-01|
|1 |.*/2019-11-30/.*|2019-11-30|
|1 |.*/2019-11-29/.*|2019-11-29|
+---+----------------+----------+
分区路径是dataframe的数据路径还是列的字符串?很抱歉,这只是一串专栏嗨,我已经编辑了我的问题,很抱歉搞混了。我想你已经添加了我需要的一些很棒的组件,但也只需要选择最后3天的数据。另一种方法。我不需要爆炸。我所要做的就是从分区字符串中提取日期(使用regexp\u extract完成),然后选择最后3天的数据。试图编辑我的问题来说明它
df = spark.createDataFrame([('1', '/raw/gsec/qradar/flows/dt=2019-12-01/hour=00/1585218406613_flows_20191201_00.jsonl')]).toDF('id', 'value')
from pyspark.sql.functions import *
df.withColumn('date', regexp_extract('value', '[0-9]{4}-[0-9]{2}-[0-9]{2}', 0)) \
.withColumn('date', explode(sequence(to_date('date'), date_sub('date', 2)))) \
.withColumn('value', concat(lit('.*/'), col('date'), lit('/.*'))).show(10, False)
+---+----------------+----------+
|id |value |date |
+---+----------------+----------+
|1 |.*/2019-12-01/.*|2019-12-01|
|1 |.*/2019-11-30/.*|2019-11-30|
|1 |.*/2019-11-29/.*|2019-11-29|
+---+----------------+----------+