Python: How to write a JSON file to S3 as Parquet
I am unable to create a Parquet file in AWS S3, although I can create one locally. Please suggest a better approach. When I run the code I can create a JSON file in S3, but when I try to create the Parquet file I get the following error:
"errorMessage": "Invalid file path or buffer object type: ", "errorType": "ValueError", "stackTrace": [["/var/task/lambda_function.py", 80, "lambda_handler", "df.to_parquet(location, engine='auto', compression='snappy', index=None)"]]

Make sure your s3_obj is an S3 URL string. It must look like this:
"s3://my_bucket/path/to/data_folder/my-file.parquet"
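If you stay with pandas, one way to satisfy this is to build the URL string from the bucket and key instead of creating an s3.Object. The sketch below is an illustration, not the asker's code: build_s3_uri and write_parquet_with_pandas are hypothetical helpers, the .parquet suffix and the colon-free time format are my own additions, and writing to an s3:// path with pandas assumes the s3fs package is installed.

```python
import datetime


def build_s3_uri(bucket, prefix, now=None):
    """Build an s3:// URL string for the Parquet file (hypothetical helper)."""
    now = now or datetime.datetime.now()
    datestr = now.strftime("%Y%m%d")
    # Assumption: colon-free time and a .parquet suffix, unlike the question's fname.
    timestr = now.strftime("%H%M%S")
    fname = f"noaa_hourly_measurements_{timestr}.parquet"
    return f"s3://{bucket}/" + "/".join([prefix, datestr, fname])


def write_parquet_with_pandas(df, uri):
    # pandas accepts a URL string here; s3:// paths are handled through s3fs.
    df.to_parquet(uri, engine="auto", compression="snappy", index=None)
```

In the handler this would be called as write_parquet_with_pandas(df, build_s3_uri(s3_bucket, s3_prefix)).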
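Apart from that, writing DataFrames to S3 as Parquet with pandas is not recommended. For Python 3.6+, AWS has a library called AWS Data Wrangler (awswrangler) that helps with the integration between pandas, S3, and Parquet.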
The code from the question:
import json
import requests
import datetime
import boto3
import parquet
import pyarrow
import pandas as pd
from pandas import DataFrame
noaa_codes = [
    'KAST',
    'KBDN',
    'KCVO',
    'KEUG',
    'KHIO',
    'KHRI',
    'KMMV',
    'KONP',
    'KPDX',
    'KRDM',
    'KSLE',
    'KSPB',
    'KTMK',
    'KTTD',
    'KUAO'
]
urls = [f"https://api.weather.gov/stations/{x}/observations/latest" for x in noaa_codes]
s3_bucket = "XXXXXX"
s3_prefix = "XXXXX/parquetfiles"
s3 = boto3.resource("s3")
def get_datetime():
    # Return the current date (YYYYMMDD) and time (HH:MM:SS) as strings.
    dt = datetime.datetime.now()
    return dt.strftime("%Y%m%d"), dt.strftime("%H:%M:%S")
def reshape(r):
    # Flatten one observation response into a single record.
    props = r["properties"]
    res = {
        "stn": props["station"].split("/")[-1],
        "temp": props["temperature"]["value"],
        "dewp": props["dewpoint"]["value"],
        "slp": props["seaLevelPressure"]["value"],
        "stp": props["barometricPressure"]["value"],
        "visib": props["visibility"]["value"],
        "wdsp": props["windSpeed"]["value"],
        "gust": props["windGust"]["value"],
        "max": props["maxTemperatureLast24Hours"]["value"],
        "min": props["minTemperatureLast24Hours"]["value"],
        "prcp": props["precipitationLast6Hours"]["value"]
    }
    return res
def lambda_handler(event, context):
    # Fetch the latest observation for every station.
    responses = []
    for url in urls:
        r = requests.get(url)
        responses.append(reshape(r.json()))
    datestr, timestr = get_datetime()
    fname = f"noaa_hourly_measurements_{timestr}"
    file_prefix = "/".join([s3_prefix, datestr, fname])
    s3_obj = s3.Object(s3_bucket, file_prefix)
    # Serialize the records as JSON lines and load them into a DataFrame.
    serialized = []
    for r in responses:
        serialized.append(json.dumps(r))
    jsonlines_doc = "\n".join(serialized)
    df = pd.read_json(jsonlines_doc, lines=True)
    # This call raises the ValueError: s3_obj is an s3.Object, not a path string or buffer.
    df.to_parquet(s3_obj, engine='auto', compression='snappy', index=None)
    print("created")
I think the following lines need to change:
df = pd.read_json(jsonlines_doc, lines=True)
df.to_parquet(s3_obj, engine='auto', compression='snappy', index=None)

To install the library, run:
pip install awswrangler
To write df to S3, do the following:
import awswrangler as wr
wr.s3.to_parquet(df=df, path="s3://my_bucket/path/to/data_folder/my-file.parquet")
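As a rough sketch of how this could slot into the end of the question's lambda_handler: the helper name records_to_parquet is hypothetical, and building the DataFrame directly from the list of dicts (rather than round-tripping through JSON lines as the question does) is my own simplification. It assumes awswrangler and pandas are available in the Lambda environment and that the function's role may write to the bucket.

```python
def records_to_parquet(records, path):
    """Write a list of flat dicts to S3 as one Parquet file (hypothetical helper).

    `path` must be an s3:// URL string, not a boto3 s3.Object.
    """
    # Imported lazily so the module still loads where these packages are absent.
    import pandas as pd
    import awswrangler as wr

    # One row per record; the dict keys become the column names.
    df = pd.DataFrame(records)
    wr.s3.to_parquet(df=df, path=path)
```

Inside the handler, the call would look like records_to_parquet(responses, "s3://my_bucket/path/to/data_folder/my-file.parquet"), replacing both the read_json and the to_parquet lines.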