Serializing data to JSON with Python
I have 70 GB of files, mostly TXT, CSV, and logs, taken from publicly disclosed information and used for research, training neural networks, and so on. I want to serialize every line of each file to JSON and push it into Elasticsearch so I can make use of it. Lines may contain special characters that the JSON encoder should escape, such as Russian letters, Korean, etc. Because of Apache Lucene's file size limits, I cannot encode a 10 GB file as a single object and push it to Elastic. Most entries look like this:
9:username:someemail@gstuff:eafff17afbef485a894][;'.f6d39c56b79:
254:Starcius:someemail@gstuff:09160da290bcd1f83fssf0bd260e13d4f:
2:username:someemail@gstuff:104b77708bb7c19b9f913449c923a898:8
2:username:someemail@gstuff:efc38fca88d8e58089adccce3e05f93
254:username:someemail@gstuff:880896502dd68b546258\][;.'54cca34
2:username:someemail@gstuff:647b61ba8f0965e762c579e5b3da9eca:hUr
2:username:someemail@gstuff::3e9478fcecb4e90266art87g8fiuba90c6ed5473c:\^c
2:username:someemail@gstuff:9df5783228asdasddas796e18cb12e44da:,M|
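I would like to take each line of a file (separated by newlines) and produce something like (with illegal JSON characters escaped):
What is the best way to solve this?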
import json

# Open the file; the example from the question was pasted into my_file.txt
with open("my_file.txt", "r") as read_my_file:
    lines = read_my_file.readlines()  # read every line into a list

my_list = []  # will hold one dictionary per line
for line in lines:
    my_list.append({"data": line})  # wrap each line in a dictionary

print(my_list)  # print the list to check that it is correct
my_json = json.dumps(my_list)  # convert the list to a JSON string
print(my_json)  # print the JSON
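Let me know if you need more details ;)
The code below does not read everything into memory. Since you are talking about 10 GB files, that may be important. I would do it like this: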
#!/usr/bin/env python3
import json
import sys


def convert2json(filename):
    # Iterating over the file object reads one line at a time
    with open(filename) as infile:
        for line in infile:
            d = {"data": line}
            print(json.dumps(d))


if __name__ == "__main__":
    convert2json(sys.argv[1])
% python scriptname.py yourfile
{"data": "9:username:someemail@gstuff:eafff17afbef485a894][;'.f6d39c56b79:\n"}
{"data": "254:Starcius:someemail@gstuff:09160da290bcd1f83fssf0bd260e13d4f:\n"}
{"data": "2:username:someemail@gstuff:104b77708bb7c19b9f913449c923a898:8\n"}
{"data": "2:username:someemail@gstuff:efc38fca88d8e58089adccce3e05f93\n"}
{"data": "254:username:someemail@gstuff:880896502dd68b546258\\][;.'54cca34\n"}
{"data": "2:username:someemail@gstuff:647b61ba8f0965e762c579e5b3da9eca:hUr\n"}
{"data": "2:username:someemail@gstuff::3e9478fcecb4e90266art87g8fiuba90c6ed5473c:\\^c\n"}
{"data": "2:username:someemail@gstuff:9df5783228asdasddas796e18cb12e44da:,M|\n"}
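Note that the output above keeps the trailing \n inside each value, and that json.dumps escapes non-ASCII characters as \uXXXX sequences by default. A minimal sketch, assuming you want to drop the line terminator and optionally keep Russian or Korean text readable; the helper name line_to_json is made up for illustration and is not part of either answer:

```python
import json


def line_to_json(line, ensure_ascii=True):
    """Serialize one text line as a JSON object with a single "data" field.

    line_to_json is a hypothetical helper used only for this illustration.
    """
    # Drop only the line terminator; other whitespace in the line is kept.
    text = line.rstrip("\r\n")
    # ensure_ascii=True (the json default) writes non-ASCII as \uXXXX escapes;
    # ensure_ascii=False keeps the characters as they are.
    return json.dumps({"data": text}, ensure_ascii=ensure_ascii)


if __name__ == "__main__":
    sample = "2:username:someemail@gstuff:\\^c:привет\n"
    print(line_to_json(sample))
    print(line_to_json(sample, ensure_ascii=False))
```

Both forms are valid JSON and decode to the same string, so the choice mostly affects readability and output size. When reading the source files, passing an explicit encoding to open() (for example encoding="utf-8", errors="replace") avoids depending on the platform default; that the files are UTF-8 is an assumption here.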
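Since the goal is to push the lines into Elasticsearch, one option is to emit the newline-delimited format that its bulk API accepts: an action line followed by the document line, grouped into batches small enough to send as separate requests. The sketch below only builds the request bodies with the standard library; the index name "lines", the batch size, and the function name bulk_bodies are assumptions, and actually sending each body to the cluster is left out.

```python
import json
from itertools import islice


def bulk_bodies(lines, index_name="lines", batch_size=1000):
    """Yield bulk-request bodies (NDJSON strings) for an iterable of lines.

    index_name and batch_size are illustrative values, not recommendations.
    """
    it = iter(lines)
    # Each action line asks Elasticsearch to index the document that follows it.
    action = json.dumps({"index": {"_index": index_name}})
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        parts = []
        for line in batch:
            parts.append(action)
            parts.append(json.dumps({"data": line.rstrip("\r\n")}))
        # The bulk format needs a newline after every line, including the last.
        yield "\n".join(parts) + "\n"


if __name__ == "__main__":
    sample = [
        "2:username:someemail@gstuff:647b61ba8f0965e762c579e5b3da9eca:hUr\n",
        "2:username:someemail@gstuff:9df5783228asdasddas796e18cb12e44da:,M|\n",
    ]
    for body in bulk_bodies(sample, batch_size=1):
        print(body, end="")
```

Because bulk_bodies consumes any iterable lazily, an open file object can be passed in directly, so only one batch is held in memory at a time.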