Python Pyspark: Convert CSV to nested JSON
I am new to pyspark. I need to convert a large CSV file at an HDFS location into multiple nested JSON files, split by the distinct PrimaryId values.

Sample input: data.csv
PrimaryId,FirstName,LastName,City,CarName,DogName
100,John,Smith,NewYork,Toyota,Spike
100,John,Smith,NewYork,BMW,Spike
100,John,Smith,NewYork,Toyota,Rusty
100,John,Smith,NewYork,BMW,Rusty
101,Ben,Swan,Sydney,Volkswagen,Buddy
101,Ben,Swan,Sydney,Ford,Buddy
101,Ben,Swan,Sydney,Audi,Buddy
101,Ben,Swan,Sydney,Volkswagen,Max
101,Ben,Swan,Sydney,Ford,Max
101,Ben,Swan,Sydney,Audi,Max
102,Julia,Brown,London,Mini,Lucy
Sample output files:
File 1: Output_100.json
{
  "100": [
    {
      "City": "NewYork",
      "FirstName": "John",
      "LastName": "Smith",
      "CarName": [
        "Toyota",
        "BMW"
      ],
      "DogName": [
        "Spike",
        "Rusty"
      ]
    }
  ]
}
File 2: Output_101.json
{
  "101": [
    {
      "City": "Sydney",
      "FirstName": "Ben",
      "LastName": "Swan",
      "CarName": [
        "Volkswagen",
        "Ford",
        "Audi"
      ],
      "DogName": [
        "Buddy",
        "Max"
      ]
    }
  ]
}
File 3: Output_102.json
{
  "102": [
    {
      "City": "London",
      "FirstName": "Julia",
      "LastName": "Brown",
      "CarName": [
        "Mini"
      ],
      "DogName": [
        "Lucy"
      ]
    }
  ]
}
Any quick help would be greatly appreciated.

It looks like you need to group by the Id and collect the cars and dogs as sets:

from pyspark.sql.functions import collect_set
df = spark.read.format("csv").option("header", "true").load("cars.csv")
df2 = (
    df
    .groupBy("PrimaryId", "FirstName", "LastName")
    .agg(collect_set("CarName").alias("CarName"), collect_set("DogName").alias("DogName"))
)
df2.write.format("json").save("cars.json", mode="overwrite")
Resulting file:
{"PrimaryId":"100","FirstName":"John","LastName":"Smith","CarName":["Toyota","BMW"],"DogName":["Spike","Rusty"]}
{"PrimaryId":"101","FirstName":"Ben","LastName":"Swan","CarName":["Ford","Volkswagen","Audi"],"DogName":["Max","Buddy"]}
{"PrimaryId":"102","FirstName":"Julia","LastName":"Brown","CarName":["Mini"],"DogName":["Lucy"]}
Let me know if this is what you are looking for.

You can use pandas.groupby() to group by Id, then iterate over the DataFrameGroupBy object to build a record for each group and write it to a file. You will need to install pandas into your virtualenv with $ pip install pandas.
# coding: utf-8
import json

import pandas as pd


def group_csv_columns(csv_file):
    df = pd.read_csv(csv_file)
    # One group, and therefore one output file, per distinct PrimaryId
    for primary_id, data_frame in df.groupby('PrimaryId'):
        data = {}
        data[primary_id] = [{
            "City": data_frame['City'].unique().tolist()[0],
            "FirstName": data_frame['FirstName'].unique().tolist()[0],
            "CarName": data_frame['CarName'].unique().tolist(),
            "DogName": data_frame['DogName'].unique().tolist(),
            "LastName": data_frame['LastName'].unique().tolist()[0],
        }]

        # Write each group to its own file, e.g. Output_100.json
        file_name = 'Output_' + str(primary_id) + '.json'
        with open(file_name, 'w') as fh:
            fh.write(json.dumps(data))


group_csv_columns('/tmp/sample.csv')
Call group_csv_columns() with the name of the file containing the csv content.
Have you tried anything already? You could generate a set of PrimaryIds and then iterate over each entry to build a set of dictionaries.
I don't really come from a programming background; I tried some solutions from Google, but they did not meet my requirement. That is why I am asking for help!
Can we use an HDFS path as input/output instead of the local file system?
You can try with hd.open("/home/file.csv") as f: df = pd.read_csv(f), check this answer.
Sure, I will try it, but I think this code should work in a Hadoop environment. Thanks for your help. This is just a sample file; I have to run this code on the actual file, which has 40-50 fields and millions of rows.
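On the HDFS question from the comments: below is a minimal sketch of reading the CSV straight from HDFS into pandas, assuming the hdfs (WebHDFS) Python package is installed; the namenode URL, user, and path are placeholders, and the pydoop-style hd.open mentioned above would work in a similar way.

import pandas as pd
from hdfs import InsecureClient

# Placeholder: point the client at your WebHDFS endpoint
client = InsecureClient("http://namenode:9870", user="hadoop")

# client.read() yields a file-like object that pandas can consume directly
with client.read("/data/sample.csv", encoding="utf-8") as reader:
    df = pd.read_csv(reader)

# df can now go through the same groupby logic shown above; results can be
# written back to HDFS with client.write() in a similar fashion.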