Converting a TSV file into multiple JSON arrays using jq


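I want to use a dataset, but it is only provided in TSV format, which is not very convenient.

I would like to convert the TSV data to JSON: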

[
  {
    "tconst": "tt0000247",
    "directors": [
      "nm0005690",
      "nm0002504",
      "nm2156608"
    ],
    "writers": [
      "nm0000636",
      "nm0002504"
    ]
  },
  {
    "tconst": "tt0000248",
    "directors": [
      "nm0808310"
    ],
    "writers": [
      "\\N"
    ]
  }
]

I can do this with the following command:

jq -rRs 'split("\n")[1:-1] |
         map([split("\t")[]|split(",")] | {
                 "tconst":.[0][0],
                 "directors":.[1],
                 "writers":.[2]
             }
    )' ./title.crew.tsv > ./title.crew.json

However, the file is very large, and I get an out-of-memory error.

1. How can I split this TSV file into several JSON files, each containing 1000 records?

./title.crew.page1.json
./title.crew.page2.json
./title.crew.page3.json

2. How can I exclude empty fields? Instead of

"writers": ["\\N"]

I would like to get an empty array:

"writers": []

UPD (the second question is solved):

jq -rRs 'split("\n")[1:-1] |
         map([split("\t")[]|split(",")] | 
         .[2] |= if .[0] == "\\N" then [] else . end | {
                 "tconst":.[0][0],
                 "directors":.[1],
                 "writers":.[2]
             }
    )' ./title.crew.tsv > ./title.crew.json

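Thank you for your answers.

"they provide it in TSV format, which is not very convenient"

Actually, jq and TSV go together quite well, and processing a TSV file with jq certainly does not require the -s ("slurp") option, which is usually (but by no means always) best avoided.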

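If your goal were simply to produce a stream of the "tconst" objects, you could process the TSV file line by line. If you wanted to assemble that stream into a single array, you could use jq with the -c option to produce one JSON object per line, and then use a tool such as awk to assemble them (that is, by simply adding the opening and closing brackets and the separating commas).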
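The assembly step described above (adding the brackets and the separating commas) is mechanical enough to sketch in a few lines. The answer suggests awk for this; the following is an illustrative Python equivalent, and the function name `ndjson_to_array` is made up for this sketch. It assumes each non-blank input line holds exactly one complete JSON value, which is what jq -c produces.

```python
import io
import json


def ndjson_to_array(lines, out):
    """Copy JSON values (one per line) from `lines` to `out` as one JSON array.

    Only one line is held in memory at a time, so the input may be large.
    """
    out.write("[")
    first = True
    for line in lines:
        line = line.strip()
        if not line:  # ignore blank lines
            continue
        # The first element follows the opening bracket; later ones follow a comma.
        out.write("\n" if first else ",\n")
        out.write(line)
        first = False
    out.write("\n]\n")


if __name__ == "__main__":
    demo = io.StringIO('{"tconst":"tt0000247"}\n{"tconst":"tt0000248"}\n')
    result = io.StringIO()
    ndjson_to_array(demo, result)
    print(json.loads(result.getvalue()))
```

Because the function only needs an iterable of lines and a writable object, it could equally be called with `sys.stdin` and `sys.stdout` at the end of a shell pipeline.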

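In your case, though, it would probably be simplest to split the TSV file first (for example, using the unix/linux/mac split command, see below), and then process each file along the lines of your jq program. Since the chunks are quite small (1000 objects each), you could even use jq with the -s option, but it is just as easy to use inputs together with the -n command-line option: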

jq -n '[inputs]'

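Or you could combine these strategies: split the file into chunks, process each chunk using jq with the -c option to produce a stream, and assemble each such stream into a JSON array.

split

For details on splitting a file into chunks, see the documentation of the split command.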
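One detail to watch when chunking: a plain line-based split leaves the header line only in the first chunk, whereas the jq program in the question drops the first line of every file it reads. The following Python sketch is a stand-in for the split command that repeats the header in each chunk. The function name `split_tsv` and the `chunkN.tsv` naming scheme are assumptions made for illustration.

```python
import itertools


def split_tsv(path, chunk_size=1000, prefix="chunk"):
    """Split the TSV file at `path` into files of at most `chunk_size` data rows.

    The header line is repeated at the top of every chunk.
    Returns the list of file names that were written.
    """
    written = []
    with open(path, encoding="utf-8") as src:
        header = src.readline()
        for number in itertools.count(1):
            # Read the next block of rows without loading the whole file.
            rows = list(itertools.islice(src, chunk_size))
            if not rows:
                break
            name = f"{prefix}{number}.tsv"
            with open(name, "w", encoding="utf-8") as dst:
                dst.write(header)
                dst.writelines(rows)
            written.append(name)
    return written
```

Each resulting chunk starts with the header, so a per-file program that skips its first line would treat every chunk the same way.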

If python is an option for you, how about making use of it? Python's data structures map very closely onto json. Would you please try:

#!/usr/bin/python

import json

ary = []                                        # declare an empty array
with open('./title.crew.tsv') as f:
    header = f.readline().rstrip('\n').split('\t')  # read the header line and split
    for line in f:                              # iterate the following lines
        body = line.rstrip('\n').split('\t')    # strip only the newline, keep empty fields
        d = {}                                  # empty dictionary
        for i in range(0, len(header)):
            if ',' in body[i]:                  # if the value contains ","
                b = body[i].split(',')          # then split the value on it
            else:
                b = body[i]
            if b == '\\N':                      # if the value is "\N" (backslash escaped;
                                                # a bare '\N' is a syntax error in Python 3)
                b = []                          # then replace with an empty array
            d[header[i]] = b                    # generate an object
        ary.append(d)                           # append the object to the array
print(json.dumps(ary, indent=2))

Output:

[
  {
    "directors": "nm0349785", 
    "tconst": "tt0000238", 
    "writers": []
  }, 
  {
    "directors": "nm0349785", 
    "tconst": "tt0000239", 
    "writers": []
  }, 
  {
    "directors": [], 
    "tconst": "tt0000240", 
    "writers": []
  }, 
<..SNIPPED..>

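Since python is a general-purpose programming language, it offers a lot of flexibility in how the input is processed. It would also be easy to split the result into multiple json files.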
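As a rough illustration of that last point, here is a sketch that writes pages of 1000 records instead of building one large array in memory. It is not the answerer's code. It assumes the three columns shown in the question (tconst, directors, writers), always emits lists for the last two (as in the question's desired output), and reuses the question's `./title.crew.pageN.json` naming. The function names are made up for this sketch.

```python
import csv
import itertools
import json


def parse_field(value):
    # "\N" marks a missing value; everything else is a comma-separated list.
    return [] if value == "\\N" else value.split(",")


def records(tsv_lines):
    """Yield one dict per data row, skipping the header line."""
    reader = csv.reader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header
    for row in reader:
        if len(row) != 3:  # assumption: rows that are not 3 columns wide are skipped
            continue
        tconst, directors, writers = row
        yield {
            "tconst": tconst,
            "directors": parse_field(directors),
            "writers": parse_field(writers),
        }


def write_pages(tsv_path, page_size=1000, pattern="./title.crew.page{}.json"):
    """Write the records to numbered JSON files; return the number of files."""
    with open(tsv_path, newline="", encoding="utf-8") as src:
        rows = records(src)
        for page in itertools.count(1):
            chunk = list(itertools.islice(rows, page_size))
            if not chunk:
                return page - 1
            with open(pattern.format(page), "w", encoding="utf-8") as dst:
                json.dump(chunk, dst, indent=2)
```

Calling `write_pages('./title.crew.tsv')` would then produce `./title.crew.page1.json`, `./title.crew.page2.json`, and so on, holding at most one page of records in memory at a time.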

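Since 1000 is a small number in the present context, here is a solution that does not use split; instead, it boils down to a two-step pipeline.

The first part of the pipeline consists of a call to jq with the -c option (to convert the TSV into a stream of JSON arrays, one per chunk); this is described below.

The second part of the pipeline converts this stream of arrays into the desired set of files, one array per file; this part can easily be implemented using awk or a similar tool of your choice, and is not discussed further below.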

program.jq

# Assemble the items in the (possibly empty) stream into a 
# (possibly empty) stream of arrays of length $n or less.
# $n can be any integer greater than 0;
# emit nothing if `stream` is empty.
def assemble(stream; $n):
  # box the input to detect eos
  foreach ((stream|[.]), null) as $item ({};
     (.array|length) as $l
     | if $item == null # eos
       then .emit = (0 < $l and $l < $n)
       else if $l == $n
            then .array = $item
            else .array += $item
            end
       | .emit = (.array|length == $n)
       end;

     if .emit then .array else empty end) ;


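# Convert each remaining TSV line into an object.
# Note: jq is invoked without -n, so the first line (the header) is consumed
# as the program's main input and `inputs` yields only the data rows.
def stream:
  inputs
  | split("\t")
  # "\N" marks a missing value; "" | split(",") yields [], i.e. an empty array
  | map_values(if . == "\\N" then "" else . end)
  | map(split(","))
  | { tconst: .[0][0],
      directors: .[1],
      writers:   .[2] };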
      
assemble(stream; 1000)

jq -Rc -f program.jq input.tsv
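The second part of the pipeline, which the answer leaves to awk or a similar tool, could look roughly like the following Python sketch. It assumes the jq command above prints one JSON array per line, and it reuses the question's `./title.crew.pageN.json` naming. The function name `write_arrays` and the script name used below are hypothetical.

```python
import json
import sys


def write_arrays(lines, pattern="./title.crew.page{}.json"):
    """Write each line (one JSON array per line) to its own numbered file.

    Returns the number of files written.
    """
    count = 0
    for line in lines:
        line = line.strip()
        if not line:  # ignore blank lines
            continue
        array = json.loads(line)  # also validates the line as JSON
        count += 1
        with open(pattern.format(count), "w", encoding="utf-8") as dst:
            json.dump(array, dst, indent=2)
    return count


# Usage: pass "-" to read from standard input, or a file name to read from a file.
if __name__ == "__main__" and len(sys.argv) == 2:
    if sys.argv[1] == "-":
        write_arrays(sys.stdin)
    else:
        with open(sys.argv[1], encoding="utf-8") as src:
            write_arrays(src)
```

Saved as, say, write_arrays.py, it would be run as `jq -Rc -f program.jq input.tsv | python3 write_arrays.py -`.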