How to read a CSV file from an S3 bucket using Pandas in Python
Tags: python, amazon-web-services, pandas, amazon-s3

I am trying to read a CSV file located in an AWS S3 bucket into memory as a pandas DataFrame using the following code:
import pandas as pd
import boto
data = pd.read_csv('s3:/example_bucket.s3-website-ap-southeast-2.amazonaws.com/data_1.csv')
In order to give complete access, I have set the bucket policy on the S3 bucket as follows:
{
    "Version": "2012-10-17",
    "Id": "statement1",
    "Statement": [
        {
            "Sid": "statement1",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::example_bucket"
        }
    ]
}
Unfortunately, I still get the following error in Python:
boto.exception.S3ResponseError: S3ResponseError: 405 Method Not Allowed
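I am wondering if someone could help explain how to either correctly set the permissions in AWS S3 or configure pandas correctly to import the file. Thanks!

I eventually realised that you also need to set the permissions on each individual object within the bucket in order to extract it, using the following code: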
from boto.s3.key import Key

# `bucket` is assumed to be an existing boto Bucket object for example_bucket,
# for example one returned by conn.get_bucket(...)
k = Key(bucket)
k.key = 'data_1.csv'
k.set_canned_acl('public-read')
I also had to modify the address of the bucket in the pd.read_csv command, as follows:
data = pd.read_csv('https://s3-ap-southeast-2.amazonaws.com/example_bucket/data_1.csv')
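For illustration, the address form used above can be assembled from its parts. This is only a minimal sketch: the helper name `s3_path_style_url` is hypothetical, and it simply mirrors the URL shape shown in this answer (S3 supports other URL forms as well).

```python
from urllib.parse import quote


def s3_path_style_url(region, bucket, key):
    """Build a URL of the form https://s3-<region>.amazonaws.com/<bucket>/<key>."""
    # quote() percent-encodes unsafe characters in the key but keeps '/' separators
    return 'https://s3-{}.amazonaws.com/{}/{}'.format(region, bucket, quote(key))


url = s3_path_style_url('ap-southeast-2', 'example_bucket', 'data_1.csv')
print(url)
```

The resulting string can be passed to pd.read_csv, provided the object itself is publicly readable as described above.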
You don't need pandas. You can just use Python's built-in csv library:
import csv

import boto.s3
from boto.s3.key import Key


def read_file(bucket_name, region, remote_file_name, aws_access_key_id, aws_secret_access_key):
    # Reads a CSV from AWS S3 and returns its data rows (header excluded).
    # First, establish a connection using your credentials and region id.
    conn = boto.s3.connect_to_region(
        region,
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key)
    # Next, obtain the key of the CSV you want to read;
    # you will need the bucket name and the CSV file name.
    bucket = conn.get_bucket(bucket_name, validate=False)
    key = Key(bucket)
    key.key = remote_file_name
    raw = key.get_contents_as_string()
    key.close()
    # The contents arrive as one string, so they must be split into lines.
    # The line separator is usually '\r\n'; if not, read the file normally
    # and find out what it is.
    reader = csv.reader(raw.split('\r\n'))
    header = next(reader)  # skip the header row
    data = []
    for row in reader:
        data.append(row)
    return data
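Splitting the downloaded text into lines by hand can break when a quoted field itself contains a line break. One hedged alternative is to wrap the decoded text in io.StringIO and let the csv module deal with line endings. In the sketch below, the function name `parse_csv_bytes` is hypothetical and the sample bytes merely stand in for whatever the S3 download returns.

```python
import csv
import io


def parse_csv_bytes(raw, encoding='utf-8'):
    """Return (header, rows) parsed from raw CSV bytes."""
    # newline='' leaves line endings untouched so the csv module interprets them
    text = io.StringIO(raw.decode(encoding), newline='')
    reader = csv.reader(text)
    header = next(reader)
    rows = [row for row in reader]
    return header, rows


# Stand-in for downloaded bytes; the second field of row 1 spans two lines
sample = b'id,note\r\n1,"first line\r\nsecond line"\r\n2,plain\r\n'
header, rows = parse_csv_bytes(sample)
print(header)
print(rows)
```

The same StringIO wrapper can be handed to csv.DictReader if rows keyed by column name are preferred.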
Hope it solves your problem,
good luck :)

Using pandas 0.20.3:
import os
import boto3
import pandas as pd
import sys

if sys.version_info[0] < 3:
    from StringIO import StringIO  # Python 2.x
else:
    from io import StringIO  # Python 3.x

# get your credentials from environment variables
aws_id = os.environ['AWS_ID']
aws_secret = os.environ['AWS_SECRET']

client = boto3.client('s3', aws_access_key_id=aws_id,
                      aws_secret_access_key=aws_secret)

bucket_name = 'my_bucket'
object_key = 'my_file.csv'

csv_obj = client.get_object(Bucket=bucket_name, Key=object_key)
body = csv_obj['Body']
csv_string = body.read().decode('utf-8')

df = pd.read_csv(StringIO(csv_string))
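The os.environ[...] lookups in this snippet raise a bare KeyError when the variables have not been set in the shell that launches Python. A small sketch that fails with a clearer message follows; the helper name `require_env` is hypothetical, and AWS_ID / AWS_SECRET are simply the variable names this answer happens to use.

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable or raise a descriptive error."""
    # Fall back to the real process environment unless a mapping is supplied
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if not value:
        raise RuntimeError(
            'Environment variable {} is not set; export it before running '
            'this script'.format(name))
    return value


# Demonstration with a stand-in mapping instead of the real environment
fake_env = {'AWS_ID': 'example-id', 'AWS_SECRET': 'example-secret'}
aws_id = require_env('AWS_ID', fake_env)
aws_secret = require_env('AWS_SECRET', fake_env)
```

In real use the second argument would be omitted so that the values come from the process environment.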
Based on a suggestion to use smart_open for reading from S3, this is how I used it with pandas:
import os
import pandas as pd
from smart_open import smart_open
aws_key = os.environ['AWS_ACCESS_KEY']
aws_secret = os.environ['AWS_SECRET_ACCESS_KEY']
bucket_name = 'my_bucket'
object_key = 'my_file.csv'
path = 's3://{}:{}@{}/{}'.format(aws_key, aws_secret, bucket_name, object_key)
df = pd.read_csv(smart_open(path))
You can also try using pandas read_sql together with pyathena:
from pyathena import connect
import pandas as pd
conn = connect(s3_staging_dir='s3://bucket/folder',region_name='region')
df = pd.read_sql('select * from database.table', conn) #don't change the "database.table"
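Comments:

Shouldn't there be a double slash after s3?

Yes, you are right, there should be. I also had to change the location of the bucket and file: tripData = pd.read_csv('https://s3-ap-southeast-2.amazonaws.com/example_bucket/data.csv'). I also had to update the permissions on the individual file. But it works now. Cheers.

Please add your solution as an answer to help other Stack Overflow users.

When using read_csv to read a file from S3, does pandas first download it locally to disk and then load it into memory? Or does it stream directly from the network into memory?

How do I modify the address so that it becomes a URL pandas can read?

You have made this file readable by anyone in the world, which most people should probably avoid doing. The answer above by @jpobst supplies proper credentials to read the file, which is what most people should do.

When I import the file this way, the columns of the df don't appear?

I'm trying this and getting an error on the os.environ calls for the ID and key. Is this something I have to set in the terminal or what?

@ZachOakes Yes, that is something you need to set. Those two lines assume your ID and SECRET were previously saved as environment variables, but you don't have to pull them from environment variables. Instead, you can replace those two lines with whatever method you prefer for getting your ID and secret into the code.

Also works with DictReader: reader = csv.DictReader(io.StringIO(body), fieldnames=fieldnames)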