Python: How to read a single Parquet file in S3 into a pandas DataFrame using boto3?

python pandas dataframe amazon-s3

I am trying to read a single Parquet file stored in an S3 bucket and convert it into a pandas DataFrame using boto3.

I found a simple way to read the Parquet file into a DataFrame using the boto3 package:

import boto3
import io
import pandas as pd

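# Download the Parquet file from S3 into an in-memory buffer
buffer = io.BytesIO()
s3 = boto3.resource('s3')
s3_object = s3.Object('my-bucket-name', 'path/to/parquet/file')
s3_object.download_fileobj(buffer)
buffer.seek(0)  # rewind so the reader starts at the beginning of the data
df = pd.read_parquet(buffer)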

print(df.head())
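
If the file location arrives as a single `s3://bucket/key` string, the download step can be wrapped in a small helper. This is only a sketch: the names `split_s3_uri` and `download_to_buffer` are hypothetical and not part of boto3, and the helper assumes it is handed an object that behaves like `boto3.resource('s3')`.

```python
import io


def split_s3_uri(uri):
    """Split 's3://bucket/key' into a (bucket, key) pair."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI needs both a bucket and a key: {uri!r}")
    return bucket, key


def download_to_buffer(s3_resource, uri):
    """Download one S3 object into memory and rewind the buffer.

    `s3_resource` is assumed to behave like boto3.resource('s3').
    """
    bucket, key = split_s3_uri(uri)
    buffer = io.BytesIO()
    s3_resource.Object(bucket, key).download_fileobj(buffer)
    buffer.seek(0)  # rewind so a reader starts at the first byte
    return buffer
```

With a real resource this would be used as `pd.read_parquet(download_to_buffer(boto3.resource('s3'), 's3://my-bucket-name/path/to/parquet/file'))`. Passing the resource in as an argument keeps the helper easy to exercise without network access.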

PyArrow can also be used to read Parquet files from an S3 bucket into a pandas DataFrame:

import pyarrow.parquet as pq
import s3fs

dataset = pq.ParquetDataset('s3://',
    filesystem=s3fs.S3FileSystem(), filters=[('colA', '=', 'some_value'), ('colB', '>=', some_number)])
table = dataset.read()
df = table.to_pandas()

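I prefer this way of reading Parquet from S3 because it encourages the use of Parquet partitions through the filters parameter, but there is a bug that affects this approach.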
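
To make the filters parameter more concrete: a flat list of `(column, op, value)` tuples is combined with AND. The sketch below only mimics that semantics on in-memory rows. It is a toy illustration, not PyArrow's implementation, and `row_matches`, the sample rows, and the value given to `some_number` are all hypothetical.

```python
import operator

# Comparison operators allowed in filter tuples (a subset, for illustration).
_OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, options: value in options,
    "not in": lambda value, options: value not in options,
}


def row_matches(row, filters):
    """Return True if every (column, op, value) predicate holds for the row."""
    return all(_OPS[op](row[column], value) for column, op, value in filters)


some_number = 10  # hypothetical threshold
filters = [("colA", "=", "some_value"), ("colB", ">=", some_number)]

rows = [
    {"colA": "some_value", "colB": 12},
    {"colA": "some_value", "colB": 3},
    {"colA": "other", "colB": 50},
]

# Only rows satisfying both predicates survive.
kept = [row for row in rows if row_matches(row, filters)]
print(kept)
```

When the same predicates are handed to PyArrow, they are evaluated during the read, so data that cannot match does not have to be loaded into the DataFrame first.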
A possibly simpler way to read a single file:

import pyarrow.parquet as pq
import s3fs

s3 = s3fs.S3FileSystem()
df = pq.read_table('s3://blah/blah.parquet', filesystem=s3).to_pandas()

For Python 3.6+, AWS has a library called awswrangler that helps with the integration between Pandas/S3/Parquet.
To install it, run:

pip install awswrangler
To read a single Parquet file from S3 using awswrangler 1.x.x and later, do:

import awswrangler as wr

df = wr.s3.read_parquet(path="s3://my_bucket/path/to/data_folder/my-file.parquet")

Comment on the `pq.ParquetDataset` approach: that reads a Parquet dataset, which I believe is a folder rather than a single file.