Convert a TSV file to multiple JSON arrays using jq
I want to use some data, but it is provided in TSV format, which is not very convenient. I want to convert the TSV data to JSON:
[
  {
    "tconst": "tt0000247",
    "directors": [
      "nm0005690",
      "nm0002504",
      "nm2156608"
    ],
    "writers": [
      "nm0000636",
      "nm0002504"
    ]
  },
  {
    "tconst": "tt0000248",
    "directors": [
      "nm0808310"
    ],
    "writers": [
      "\\N"
    ]
  }
]
I can do this with the following command:
jq -rRs 'split("\n")[1:-1] |
map([split("\t")[]|split(",")] | {
"tconst":.[0][0],
"directors":.[1],
"writers":.[2]
}
)' ./title.crew.tsv > ./title.crew.json
However, the file is very large, and I get an out-of-memory error.
1. How can I split this TSV file into several JSON files, each containing 1000 records?
./title.crew.page1.json
./title.crew.page2.json
./title.crew.page3.json
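2. How can I exclude empty fields? A missing value currently produces an array containing "\\N", where I want an empty array:
"writers": ["\\N"]
-> "writers": []
UPD (the second question is solved):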
jq -rRs 'split("\n")[1:-1] |
map([split("\t")[]|split(",")] |
.[2] |= if .[0] == "\\N" then [] else . end | {
"tconst":.[0][0],
"directors":.[1],
"writers":.[2]
}
)' ./title.crew.tsv > ./title.crew.json
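Thanks for your answers.
"they provide TSV format, which is not very convenient"
Actually, jq and TSV work very well together, and processing a TSV file with jq certainly does not require the -s ("slurp") option, which is usually (but by no means always) best avoided.
If your goal is just to produce a stream of "tconst" objects, you can process the TSV file line by line. If you want to assemble that stream into an array, you can use jq with the -c option to produce a stream with one JSON object per line, and then use a tool such as awk to assemble them (that is, simply add the opening and closing brackets and the separating commas).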
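The assembly step described above (adding the brackets and the separating commas) could be sketched in Python instead of awk roughly as follows. This is only an illustration: the function name `assemble_array` is made up here, and the sketch assumes that every non-blank input line already holds one complete JSON value, as produced by jq -c.

```python
def assemble_array(json_lines, out):
    """Write the JSON values found one per line in json_lines as one JSON array.

    Nothing is parsed or buffered: each line is copied through as soon as it
    is read, so memory use stays flat even for very large inputs.
    """
    out.write("[")
    first = True
    for line in json_lines:
        line = line.strip()
        if not line:              # ignore blank lines
            continue
        if not first:
            out.write(",")        # separating comma between items
        out.write("\n" + line)
        first = False
    out.write("\n]\n")            # closing bracket
```

Called as `assemble_array(sys.stdin, sys.stdout)` at the end of a pipeline, it would turn the object stream into a single array without loading the whole file.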
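In your case, though, the simplest approach is probably to split the TSV file first (e.g. using the unix/linux/mac split command, see below), and then to process each file along the lines of your jq program. Since the chunks are quite small (1000 objects each), you could even use jq with the -s option, but it is just as easy to use inputs together with the -n command-line option: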
jq -n '[inputs]'
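Or you could combine these strategies: split the file into chunks, process each chunk using jq with the -c option to produce a stream, and assemble each such stream into a JSON array.
Splitting
For information on splitting a file into chunks, see for example: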
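As one possible illustration of the splitting step, here is a minimal Python sketch that mimics what the split command does. The names `chunk_lines` and `split_file` and the `.partN` file suffix are hypothetical and chosen only for this example. Note that the header row is dropped, so a jq program applied to the chunks should not skip the first line again.

```python
from itertools import islice


def chunk_lines(lines, size=1000, skip_header=True):
    """Yield lists of at most `size` lines, optionally dropping the first line."""
    it = iter(lines)
    if skip_header:
        next(it, None)            # discard the header row, if there is one
    while True:
        chunk = list(islice(it, size))
        if not chunk:             # input exhausted
            return
        yield chunk


def split_file(path, size=1000):
    """Split the file at `path` into <path>.part1, <path>.part2, ... (hypothetical naming)."""
    written = []
    with open(path, encoding="utf-8") as src:
        for number, chunk in enumerate(chunk_lines(src, size), start=1):
            part = f"{path}.part{number}"
            with open(part, "w", encoding="utf-8") as dst:
                dst.writelines(chunk)
            written.append(part)
    return written
```

Only one chunk is held in memory at a time, which is the point of splitting before converting.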
If python is an option for you, how about making use of it? Python's data structures map very closely onto JSON. Would you please try:
#!/usr/bin/env python3
import json

ary = []                                            # declare an empty array
with open('./title.crew.tsv') as f:
    header = f.readline().rstrip('\n').split('\t')  # read the header line and split
    for line in f:                                  # iterate the following lines
        body = line.rstrip('\n').split('\t')
        d = {}                                      # empty dictionary
        for i in range(len(header)):
            if ',' in body[i]:                      # if the value contains ","
                b = body[i].split(',')              # then split the value on it
            else:
                b = body[i]
            if b == '\\N':                          # if the value is the literal \N
                b = []                              # then replace with an empty array
            d[header[i]] = b                        # generate an object
        ary.append(d)                               # append the object to the array
print(json.dumps(ary, indent=2, sort_keys=True))    # sorted keys, as in the output below
Output:
[
  {
    "directors": "nm0349785",
    "tconst": "tt0000238",
    "writers": []
  },
  {
    "directors": "nm0349785",
    "tconst": "tt0000239",
    "writers": []
  },
  {
    "directors": [],
    "tconst": "tt0000240",
    "writers": []
  },
<..SNIPPED..>
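The script above collects every record in one list before printing, which may run into the same memory limit mentioned in the question. A streaming variant could look like the sketch below. It is an assumption-laden illustration, not part of the original answer: the names `iter_records` and `dump_lines` are invented here, the column named tconst is assumed to be the only scalar field, and, unlike the script above, single values are always wrapped in a list so that the result matches the layout requested in the question.

```python
import json

MISSING = "\\N"   # backslash followed by N, exactly as it appears in the TSV


def iter_records(lines):
    """Yield one dict per data row; the first line is taken as the header."""
    it = iter(lines)
    first = next(it, None)
    if first is None:                     # empty input: nothing to yield
        return
    header = first.rstrip("\r\n").split("\t")
    for line in it:
        fields = line.rstrip("\r\n").split("\t")
        record = {}
        for name, value in zip(header, fields):
            if name == "tconst":          # assumption: the identifier stays a string
                record[name] = value
            elif value == MISSING:
                record[name] = []         # missing value becomes an empty array
            else:
                record[name] = value.split(",")   # always a list, even for one item
        yield record


def dump_lines(lines, out):
    """Write each record as one compact JSON object per line."""
    for record in iter_records(lines):
        out.write(json.dumps(record) + "\n")
```

Because `iter_records` is a generator, only one row is held in memory at a time; the per-line output can then be grouped into files of 1000 records by whatever tool is convenient.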
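Since Python is a general-purpose programming language, it offers a lot of flexibility in processing the input. It is also easy to split the result into multiple JSON files.
Since 1000 is a small number in the present context, here is a solution that does not use split; instead, it boils down to a two-step pipeline.
The first part of the pipeline consists of a call to jq with the -c option (to convert the TSV into a stream of JSON arrays, one array per chunk); this is described below.
The second part of the pipeline converts this stream of arrays into the desired set of files, one array per file; this part can easily be implemented using awk or a similar tool of your choice, and is not discussed further below.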
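For readers who prefer Python to awk for the second part of the pipeline, the following is a minimal sketch under a few assumptions: each non-blank input line is one complete JSON array (as jq -c would emit), the function name `write_pages` is hypothetical, and the file name pattern is taken from the examples in the question.

```python
import json
import os


def write_pages(array_lines, directory=".", template="title.crew.page{}.json"):
    """Write each JSON array (one per input line) to its own numbered file."""
    paths = []
    page = 0
    for line in array_lines:
        line = line.strip()
        if not line:                      # ignore blank lines
            continue
        page += 1
        records = json.loads(line)        # parse so the file can be pretty-printed
        path = os.path.join(directory, template.format(page))
        with open(path, "w", encoding="utf-8") as out:
            json.dump(records, out, indent=2)
            out.write("\n")
        paths.append(path)
    return paths
```

Calling `write_pages(sys.stdin)` in a small script placed after the jq command in the pipeline would produce title.crew.page1.json, title.crew.page2.json, and so on in the current directory.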
program.jq
# Assemble the items in the (possibly empty) stream into a
# (possibly empty) stream of arrays of length $n or less.
# $n can be any integer greater than 0;
# emit nothing if `stream` is empty.
def assemble(stream; $n):
  # box the input to detect eos
  foreach ((stream|[.]), null) as $item ({};
    (.array|length) as $l
    | if $item == null # eos
      then .emit = (0 < $l and $l < $n)
      else if $l == $n
           then .array = $item
           else .array += $item
           end
        | .emit = (.array|length == $n)
      end;
    if .emit then .array else empty end) ;

def stream:
  inputs
  | split("\t")
  | map_values(if . == "\\N" then "" else . end)
  | map(split(","))
  | { tconst: .[0][0],
      directors: .[1],
      writers: .[2] };

assemble(stream; 1000)
jq -Rc -f program.jq input.tsv