
Python: Filtering a DynamicFrame Using AWS Glue or PySpark


I have a table named "mytable" in my AWS Glue Data Catalog. The table comes from an on-premises Oracle database connection, "mydb".

I just want to filter the resulting DynamicFrame down to rows where the X_DATETIME_INSERT column (a timestamp) is greater than a certain time (in this case, '2018-05-07 04:00:00'). After that, I try to count the rows to make sure the count is low (the table has about 40,000 rows, but only a handful of rows should meet the filter criteria).

Here is my current code:

import boto3
from datetime import datetime
import logging
import os
import pg8000
import pytz
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from base64 import b64decode
from pyspark.context import SparkContext
from pyspark.sql.functions import lit
## @params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the full table from the Glue Data Catalog (backed by the Oracle JDBC connection)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "mydb", table_name = "mytable", transformation_ctx = "datasource0")

# Try Glue native filtering
filtered_df = Filter.apply(frame = datasource0, f = lambda x: x["X_DATETIME_INSERT"] > '2018-05-07 04:00:00')
filtered_df.count()
This code runs for 20 minutes and then times out. I have tried other variations:

df = datasource0.toDF()
df.where(df.X_DATETIME_INSERT > '2018-05-07 04:00:00').collect()


These all fail. What am I doing wrong? I'm experienced with Python but new to Glue and PySpark.

AWS Glue loads the entire dataset from a JDBC source into a temporary S3 folder and applies the filtering afterwards. If your data were in S3 instead of Oracle, and partitioned by some keys (i.e. /year/month/day), then you could use a pushdown predicate to load only a subset of the data:

val partitionPredicate = s"to_date(concat(year, '-', month, '-', day)) BETWEEN '${fromDate}' AND '${toDate}'"

val df = glueContext.getCatalogSource(
   database = "githubarchive_month",
   tableName = "data",
   pushDownPredicate = partitionPredicate).getDynamicFrame()
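
Since the question is in Python, here is a rough Python equivalent of the Scala snippet above. It is a sketch only: the database and table names are taken from the example, the date range is a placeholder, and push_down_predicate only takes effect on partitioned, S3-backed catalog tables:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Partitions whose year/month/day fall outside the range are pruned
# before any data is read from S3.
partition_predicate = (
    "to_date(concat(year, '-', month, '-', day)) "
    "BETWEEN '2018-05-01' AND '2018-05-07'"
)

df = glueContext.create_dynamic_frame.from_catalog(
    database = "githubarchive_month",
    table_name = "data",
    push_down_predicate = partition_predicate)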

Unfortunately, this doesn't work for JDBC data sources yet.
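
As a workaround (not part of the answer above, just a common pattern), the filter can be pushed into Oracle itself by bypassing the catalog and using Spark's plain JDBC reader with a subquery as the table. The URL, credentials, and driver below are placeholders; spark is the SparkSession from the question's script:

# Placeholder connection details; substitute the real Oracle endpoint and credentials.
jdbc_url = "jdbc:oracle:thin:@//myhost:1521/myservice"

# Oracle executes the WHERE clause, so only the filtered rows leave the database.
query = """(SELECT * FROM mytable
            WHERE X_DATETIME_INSERT > TIMESTAMP '2018-05-07 04:00:00') t"""

df = spark.read \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", query) \
    .option("user", "myuser") \
    .option("password", "mypassword") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()

df.count()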

Can you post supporting documentation for this?