Apache Spark: loading files between timestamps in PySpark

apache-spark pyspark

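I have a list of XML files whose names contain a timestamp. I need to load these files conditionally, based on the timestamp value, and I am trying to do that with a wildcard.

Here is the code I am using, which does not work: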

spark.read \
  .format("com.databricks.spark.xml") \
  .load("/path/file_[1533804409548-1533873609934]*")

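I don't think you can use a wildcard for this, because you want to load the files that fall within a time range. In a glob pattern, square brackets match a single character from a set, not a numeric interval, so [1533804409548-1533873609934] does not express a range of timestamps. Since a DataFrame can be loaded from multiple locations, you can simply build an array of the file paths that fall within the time range and load those paths. Here is the sample code I tried: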

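import os

target_files = []
st = 123  # start of the timestamp range (inclusive)
et = 321  # end of the timestamp range (inclusive)
path = "<files_base_path>"
# Note: os.listdir only lists the local filesystem.
for file in os.listdir(path):
    try:
        # In this toy example, characters 5..7 of the name hold the timestamp.
        ts = int(file[5:8])
    except ValueError:
        # Skip names that have no numeric timestamp at that position.
        continue
    if st <= ts <= et:
        target_files.append(os.path.join(path, file))
# The reader accepts several paths at once; parquet is what this sample used,
# so swap in the reader that matches your file format.
df = spark.read.parquet(*target_files)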
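
The fixed slice in the loop above only works when the timestamp always sits at the same character positions. As a minimal sketch, assuming the names look like file_<timestamp> followed by an extension (the helper name select_files_in_range and the default pattern are hypothetical, not part of the original answer), the timestamp can be pulled out with a regular expression instead:

```python
import re


def select_files_in_range(file_names, start_ts, end_ts, pattern=r"^file_(\d+)"):
    """Return the names whose embedded timestamp lies in [start_ts, end_ts].

    Assumption: the first capture group of `pattern` is the numeric timestamp.
    """
    regex = re.compile(pattern)
    selected = []
    for name in file_names:
        match = regex.search(name)
        if match is None:
            # Name does not follow the assumed file_<timestamp> layout.
            continue
        ts = int(match.group(1))
        if start_ts <= ts <= end_ts:
            selected.append(name)
    return selected


# Hypothetical file names, using the bounds from the question.
names = [
    "file_1533804409547.xml",
    "file_1533804409548.xml",
    "file_1533850000000.xml",
    "file_1533873609935.xml",
    "notes.txt",
]
in_range = select_files_in_range(names, 1533804409548, 1533873609934)
print(in_range)
```

The selected names can then be joined with the base path and handed to the reader as a list of paths. Both bounds are treated as inclusive here, which matches the comparison in the sample loop.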
