Connecting to AWS Athena via JDBC with PyCharm: fetchSize issue

Tags: pycharm, jetbrains-ide, amazon-athena, datagrip


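I have connected my PyCharm Pro edition to AWS Athena. The connection succeeds, but whenever I run a query I get:

The requested fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try again. Refer to the Athena documentation for valid fetchSize values.

I have already downloaded the Athena JDBC driver.

What could be the problem?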

I think you should set an appropriate fetch size value in the corresponding DataGrip setting.

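There is one thing to consider regarding fetch size, JDBC, and AWS Athena: there appears to be a limit on the fetch size. As far as I know, the popular PyAthenaJDBC library uses that limit as its default, so this may be part of your problem.

The fetch size error can be reproduced when I try to fetch more than 1000 rows at a time: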

from pyathenajdbc import connect

conn = connect(s3_staging_dir='s3://SOMEBUCKET/',
               region_name='us-east-1')
cur = conn.cursor()
cur.execute('SELECT * FROM SOMEDATABASE.big_table LIMIT 5000')
results = cur.fetchall()
print(len(results))
# Note: The cursor class actually has a setter method to
#       keep users from setting illegal fetch sizes, so the
#       private attribute is written directly here to bypass it
cur._arraysize = 1001  # Set array size one greater than the default
cur.execute('SELECT * FROM athena_test.big_table LIMIT 5000')
results = cur.fetchall()  # Generate an error

java.sql.SQLExceptionPyRaisable: java.sql.SQLException: The requested fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try again. Refer to the Athena documentation for valid fetchSize values.
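One way to stay under the limit from Python is to pull rows in batches with the standard DB-API `fetchmany` call rather than raising the cursor's array size. The sketch below is only an illustration: `iter_batches` is a hypothetical helper, the 1000-row cap is an assumption taken from the behaviour described above, and an in-memory `sqlite3` database stands in for Athena so that the snippet runs without AWS credentials.

```python
import sqlite3

MAX_FETCH_SIZE = 1000  # assumption: the cap implied by the error described above


def iter_batches(cursor, batch_size=MAX_FETCH_SIZE):
    """Yield lists of rows from an already-executed DB-API cursor."""
    if not 1 <= batch_size <= MAX_FETCH_SIZE:
        raise ValueError("batch_size must be between 1 and %d" % MAX_FETCH_SIZE)
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


# Stand-in data source; with Athena the cursor would come from pyathenajdbc.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER)")
conn.executemany("INSERT INTO big_table VALUES (?)", [(i,) for i in range(2500)])

cur = conn.cursor()
cur.execute("SELECT id FROM big_table")
batch_sizes = [len(batch) for batch in iter_batches(cur)]
print(batch_sizes)  # [1000, 1000, 500]
```

Because the helper only relies on `fetchmany`, it should work with any DB-API cursor, but it has not been verified here against a live Athena connection.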

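Possible solutions include:

Run the query in the web GUI and then download the result set manually.

Develop the query in your editor/IDE of choice (Athena Web GUI, etc.) and pass the query string to Athena through the Python SDK. You can then wait for the query to finish and fetch the result set.

Execute the query and paginate the results.

If you are calling the SQL from Python (I inferred this from the PyCharm tag), you can use a library such as PyAthenaJDBC, which handles page sizing for you (see the example above). For many of my Python scripts, I use a workflow similar to the following: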

import time

import boto3

sql = 'SELECT * FROM athena_test.big_table'

database = 'SOMEDATABASE'
bucket_name = 'SOMEBUCKET'
output_path = '/home/zerodf/temp/somedata.csv'

client = boto3.client('athena')
# Athena writes the result set as a CSV file to this S3 location.
config = {'OutputLocation': 's3://' + bucket_name + '/',
          'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'}}

execution_results = client.start_query_execution(
    QueryString=sql,
    QueryExecutionContext={'Database': database},
    ResultConfiguration=config)

execution_id = execution_results['QueryExecutionId']
remote_file = execution_id + '.csv'  # The result file is named after the query ID

# Poll until the query reaches a terminal state.
while True:
    query_execution_results = client.get_query_execution(
        QueryExecutionId=execution_id)
    state = query_execution_results['QueryExecution']['Status']['State']
    if state == 'SUCCEEDED':
        break
    if state in ('FAILED', 'CANCELLED'):
        # Without this check the loop would never end for a failed query.
        raise RuntimeError('Athena query ended in state ' + state)
    time.sleep(60)

# Download the CSV result file from the staging bucket.
s3 = boto3.resource('s3')
s3.Bucket(bucket_name).download_file(remote_file, output_path)

Obviously, production code is more complicated.
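The "execute the query and paginate the results" option listed earlier could look roughly like the following. This is a sketch, not a tested production routine: `iter_result_rows` is a hypothetical helper name, `client` is assumed to be a `boto3.client('athena')` object, and `execution_id` is assumed to belong to a query that has already succeeded. It uses the `get_query_results` call with `MaxResults` and `NextToken`, keeping each page at 1000 rows or fewer.

```python
def iter_result_rows(client, execution_id, page_size=1000):
    """Yield each result row as a list of strings, one API page at a time."""
    kwargs = {"QueryExecutionId": execution_id, "MaxResults": page_size}
    while True:
        page = client.get_query_results(**kwargs)
        for row in page["ResultSet"]["Rows"]:
            # Each cell is a dict; cells without 'VarCharValue' become None.
            yield [cell.get("VarCharValue") for cell in row["Data"]]
        token = page.get("NextToken")
        if not token:
            break
        # Ask for the next page on the following loop iteration.
        kwargs["NextToken"] = token


# Usage (requires AWS credentials and a finished query):
#   import boto3
#   client = boto3.client('athena')
#   for row in iter_result_rows(client, execution_id):
#       print(row)
```

For a SELECT query, the first yielded row is normally the column header, so callers may want to treat it separately.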

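So what should I do if I cannot change the driver's parameters?

Based on the tags on your post, I assume you are a Python developer. If you want to run a query and pull down the results, you can use either of the examples given above. When developing complex queries, I often add the following to the end of the query in my IDE:

LIMIT 100

That way, I don't have to worry about fetching a large amount of temporary data or slowing down the IDE.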
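If the query string is built in Python, the same habit can be automated. The helper below is a hypothetical convenience (`with_dev_limit` is not part of any library) and it only checks the very end of the string with a regular expression instead of parsing the SQL, so a trailing comment or a LIMIT inside a subquery would not be detected.

```python
import re

# Matches a LIMIT clause at the very end of the statement.
_LIMIT_RE = re.compile(r"\blimit\s+\d+\s*$", re.IGNORECASE)


def with_dev_limit(sql, limit=100):
    """Return sql with 'LIMIT <limit>' appended unless it already ends with a LIMIT."""
    stripped = sql.strip().rstrip(";").rstrip()
    if _LIMIT_RE.search(stripped):
        return stripped
    return "%s LIMIT %d" % (stripped, limit)


print(with_dev_limit("SELECT * FROM athena_test.big_table"))
# SELECT * FROM athena_test.big_table LIMIT 100
```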